MindLLM: Lightweight large language model pre-training, evaluation and domain application

Cited by: 0
|
Authors
Yang, Yizhe
Sun, Huashan
Li, Jiawei
Liu, Runheng
Li, Yinghao
Liu, Yuhang
Gao, Yang
Huang, Heyan [1]
Institution
[1] Beijing Institute of Technology, School of Computer Science, Beijing, People's Republic of China
Source
AI OPEN | 2024, Vol. 5
Funding
National Natural Science Foundation of China;
Keywords
Large language model; Lightweight; Bilingual;
DOI
10.1016/j.aiopen.2024.08.001
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Large Language Models (LLMs) have demonstrated remarkable performance across various natural language tasks, marking significant strides towards general artificial intelligence. While progress towards general artificial intelligence is commonly pursued by training ever larger models, a complementary direction is to develop lightweight custom models that better serve specific domains, given the high cost of training and deploying LLMs and the scarcity of resources. In this paper, we present MindLLM, a novel series of bilingual lightweight large language models trained from scratch, which alleviates these burdens by offering models with 1.3 billion and 3 billion parameters. We give a thorough account of the experience accrued during model development, covering every step of the process, including data construction, model architecture, evaluation, and application, and we hope these insights prove valuable to fellow academics and developers. MindLLM consistently matches or surpasses the performance of larger open-source models on some public benchmarks. We also introduce an instruction tuning framework tailored to smaller models that enhances their capabilities efficiently. Moreover, we explore the application of MindLLM in vertical domains such as law and finance, underscoring the agility and adaptability of our lightweight models.
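The abstract describes 1.3B and 3B parameter bilingual, instruction-tuned models applied to vertical domains such as law and finance. As a purely illustrative sketch (the checkpoint identifier below is a placeholder and not the authors' released model name), loading such a lightweight causal LM with the Hugging Face transformers library and issuing a domain-flavoured instruction might look like this:

# Illustrative only: load a small bilingual causal LM and run an
# instruction-style prompt. The model identifier is hypothetical;
# substitute the actual released MindLLM checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "org/mindllm-1b3-instruct"  # placeholder identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A Chinese legal-domain instruction, reflecting the bilingual, vertical-domain focus.
prompt = "请简要说明合同违约的主要法律后果。"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))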
Pages: 155-180
Page count: 26
Related Papers
50 records in total
  • [1] Lightweight Model Pre-Training via Language Guided Knowledge Distillation
    Li, Mingsheng
    Zhang, Lin
    Zhu, Mingzhen
    Huang, Zilong
    Yu, Gang
    Fan, Jiayuan
    Chen, Tao
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 10720 - 10730
  • [2] Subset selection for domain adaptive pre-training of language model
    Hwang, Junha
    Lee, Seungdong
    Kim, Haneul
    Jeong, Young-Seob
    SCIENTIFIC REPORTS, 2025, 15 (01):
  • [3] Pre-training and Evaluation of Numeracy-oriented Language Model
    Feng, Fuli
    Rui, Xilin
    Wang, Wenjie
    Cao, Yixin
    Chua, Tat-Seng
    ICAIF 2021: THE SECOND ACM INTERNATIONAL CONFERENCE ON AI IN FINANCE, 2021,
  • [4] Evaluation of pre-training large language models on leadership-class supercomputers
    Yin, Junqi
    Dash, Sajal
    Gounley, John
    Wang, Feiyi
    Tourassi, Georgia
    JOURNAL OF SUPERCOMPUTING, 2023, 79 (18): 20747 - 20768
  • [5] QUERT: Continual Pre-training of Language Model for Query Understanding in Travel Domain Search
    Xie, Jian
    Liang, Yidan
    Liu, Jingping
    Xiao, Yanghua
    Wu, Baohua
    Ni, Shenghua
    PROCEEDINGS OF THE 29TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2023, 2023, : 5282 - 5291
  • [6] An Empirical Investigation Towards Efficient Multi-Domain Language Model Pre-training
    Arumae, Kristjan
    Sun, Qing
    Bhatia, Parminder
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 4854 - 4864
  • [7] Domain-Specific Language Model Pre-Training for Korean Tax Law Classification
    Gu, Yeong Hyeon
    Piao, Xianghua
    Yin, Helin
    Jin, Dong
    Zheng, Ri
    Yoo, Seong Joon
    IEEE ACCESS, 2022, 10 : 46342 - 46353
  • [8] FlauBERT: Unsupervised Language Model Pre-training for French
    Le, Hang
    Vial, Loic
    Frej, Jibril
    Segonne, Vincent
    Coavoux, Maximin
    Lecouteux, Benjamin
    Allauzen, Alexandre
    Crabbe, Benoit
    Besacier, Laurent
    Schwab, Didier
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 2479 - 2490
  • [9] Soft Language Clustering for Multilingual Model Pre-training
    Zeng, Jiali
    Jiang, Yufan
    Yin, Yongjing
    Jing, Yi
    Meng, Fandong
    Lin, Binghuai
    Cao, Yunbo
    Zhou, Jie
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 7021 - 7035