Sparse Mixture of Experts Language Models Excel in Knowledge Distillation

Cited by: 0
Authors
Xu, Haiyang [1 ]
Liu, Haoxiang [2 ]
Gong, Wei [1 ]
Wang, Hai [3 ]
Deng, Xianjun [4 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Wuhan, Peoples R China
[4] Southeast Univ, Sch Comp Sci & Engn, Nanjing, Peoples R China
Keywords
Mixture of experts; Knowledge distillation; Language models;
DOI
10.1007/978-981-97-9437-9_7
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Knowledge distillation is an effective method for reducing the computational overhead of large language models. However, recent optimization efforts in distilling large language models have primarily focused on loss functions and training methodologies, with limited attention given to structural improvements of student models. This is largely due to the challenges posed by cross-architecture distillation and the substantial computational resources required for modifying model structures. To address these issues, we introduce a novel method that integrates a sparse mixture of experts (MoE) architecture with low-rank adaptation (LoRA). This combination not only bolsters the capabilities of the student model but also facilitates knowledge distillation using MoE without the necessity of continued pretraining. Experimental results indicate that our approach enhances the model's capabilities compared to dense model distillation, achieving superior performance across a multitude of tasks. We will release our code at https://github.com/sprogxhy/MoE-KD-release.git.
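The abstract does not spell out the architecture, but the core idea it describes (a dense student whose feed-forward block is kept frozen and wrapped by a sparse router over LoRA-adapter "experts", trained with a response-distillation loss against the teacher) can be sketched as below. This is a minimal illustrative sketch in PyTorch: all class names, hyperparameters, and the exact loss are assumptions of this sketch, not the authors' released implementation.

# Minimal sketch: each "expert" is a low-rank (LoRA) residual adapter over a
# shared frozen feed-forward block, and a sparse router mixes the top-k
# adapters per token. Names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One expert = a rank-r residual adapter (LoRA-style down/up projection)."""

    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a no-op around the base FFN

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class SparseLoRAMoE(nn.Module):
    """Frozen base FFN plus a router that mixes top-k LoRA experts per token."""

    def __init__(self, base_ffn: nn.Module, d_model: int,
                 num_experts: int = 8, top_k: int = 2, rank: int = 8):
        super().__init__()
        self.base_ffn = base_ffn
        for p in self.base_ffn.parameters():
            p.requires_grad_(False)  # only the router and adapters are trained
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            LoRAExpert(d_model, rank) for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        gate = F.softmax(self.router(x), dim=-1)        # (B, S, num_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)    # sparse routing
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = self.base_ffn(x)
        # For readability every expert runs on all tokens; a real implementation
        # would dispatch only the tokens routed to each expert.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)  # tokens routed to e
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out


def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Standard temperature-scaled KL response distillation; the paper may add
    further terms that the abstract does not specify."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean") * (t * t)

Because only the router and the low-rank adapters carry gradients while the dense student weights stay frozen, a layer like this adds MoE capacity without the continued pretraining a from-scratch MoE student would require, which is the motivation stated in the abstract.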
Pages: 80-91
Page count: 12