Sparse Mixture of Experts Language Models Excel in Knowledge Distillation

Cited by: 0
Authors
Xu, Haiyang [1 ]
Liu, Haoxiang [2 ]
Gong, Wei [1 ]
Wang, Hai [3 ]
Deng, Xianjun [4 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Wuhan, Peoples R China
[4] Southeast Univ, Sch Comp Sci & Engn, Nanjing, Peoples R China
Keywords
Mixture of experts; Knowledge distillation; Language models
DOI
10.1007/978-981-97-9437-9_7
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Knowledge distillation is an effective method for reducing the computational overhead of large language models. However, recent work on distilling large language models has focused primarily on loss functions and training methodologies, with limited attention to structural improvements of the student model. This is largely due to the challenges posed by cross-architecture distillation and the substantial computational resources required to modify model structures. To address these issues, we introduce a novel method that integrates a sparse mixture-of-experts (MoE) architecture with low-rank adaptation (LoRA). This combination not only strengthens the student model but also enables knowledge distillation with an MoE student without requiring continued pretraining. Experimental results indicate that our approach improves on dense-model distillation, achieving superior performance across a wide range of tasks. We will release our code at https://github.com/sprogxhy/MoE-KD-release.git.
Pages: 80-91 (12 pages)
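
The abstract describes the approach only at a high level; the sketch below is an illustrative reading of it, not the authors' released implementation (see the GitHub link above). It shows, in PyTorch, a student projection layer whose frozen pretrained weight is augmented with a sparse top-k mixture of LoRA experts, trained with a standard knowledge-distillation objective. All names and hyperparameters here (LoRAExpert, MoELoRALinear, distillation_loss, r=8, top_k=2, T=2.0) are assumptions made for illustration.

```python
# Minimal sketch: sparse MoE of LoRA adapters on top of a frozen dense layer,
# plus a standard KD loss. Hypothetical names/hyperparameters, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One low-rank adapter: x -> (alpha / r) * B(A(x))."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.A = nn.Linear(d_in, r, bias=False)
        self.B = nn.Linear(r, d_out, bias=False)
        nn.init.zeros_(self.B.weight)          # adapters start as a zero update
        self.scale = alpha / r

    def forward(self, x):
        return self.B(self.A(x)) * self.scale


class MoELoRALinear(nn.Module):
    """Frozen dense projection plus a sparse (top-k) mixture of LoRA experts."""
    def __init__(self, d_in, d_out, n_experts=4, top_k=2):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():        # pretrained weights stay fixed
            p.requires_grad_(False)
        self.experts = nn.ModuleList(LoRAExpert(d_in, d_out) for _ in range(n_experts))
        self.router = nn.Linear(d_in, n_experts)
        self.top_k = top_k

    def forward(self, x):                                    # x: (B, S, d_in)
        gates = F.softmax(self.router(x), dim=-1)            # (B, S, E)
        top_val, top_idx = gates.topk(self.top_k, dim=-1)
        sparse = torch.zeros_like(gates).scatter_(-1, top_idx, top_val)
        sparse = sparse / sparse.sum(dim=-1, keepdim=True)   # renormalize top-k gates
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, S, d_out, E)
        return self.base(x) + (expert_out * sparse.unsqueeze(-2)).sum(dim=-1)


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    """Cross-entropy on labels plus KL to the temperature-softened teacher."""
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits.reshape(-1, student_logits.size(-1)),
                         labels.reshape(-1))
    return lam * ce + (1 - lam) * kd
```

For clarity, this sketch evaluates every expert on every token and only the gating weights are sparse; an efficiency-oriented implementation would dispatch each token only to its selected experts and typically adds a load-balancing term to the router loss.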