Sparse Mixture of Experts Language Models Excel in Knowledge Distillation

Cited by: 0
Authors
Xu, Haiyang [1 ]
Liu, Haoxiang [2 ]
Gong, Wei [1 ]
Wang, Hai [3 ]
Deng, Xianjun [4 ]
Affiliations
[1] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China
[2] Alibaba Grp, Hangzhou, Peoples R China
[3] Huazhong Univ Sci & Technol, Sch Cyber Sci & Engn, Wuhan, Peoples R China
[4] Southeast Univ, Sch Comp Sci & Engn, Nanjing, Peoples R China
Keywords
Mixture of experts; Knowledge distillation; Language models;
DOI
10.1007/978-981-97-9437-9_7
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Knowledge distillation is an effective method for reducing the computational overhead of large language models. However, recent optimization efforts in distilling large language models have primarily focused on loss functions and training methodologies, with limited attention given to structural improvements of student models. This is largely due to the challenges posed by cross-architecture distillation and the substantial computational resources required for modifying model structures. To address these issues, we introduce a novel method that integrates a sparse mixture of experts (MoE) architecture with low-rank adaptation (LoRA). This combination not only bolsters the capabilities of the student model but also facilitates knowledge distillation using MoE without the necessity of continued pretraining. Experimental results indicate that our approach enhances the model's capabilities compared to dense model distillation, achieving superior performance across a multitude of tasks. We will release our code at https://github.com/sprogxhy/MoE-KD-release.git.
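The abstract does not spell out the architecture, but the core idea it describes (a dense student whose feed-forward block is kept frozen and wrapped by a sparse router over LoRA-adapter "experts", trained with a response-distillation loss against the teacher) can be sketched as below. This is a minimal illustrative sketch in PyTorch: all class names, hyperparameters, and the exact loss are assumptions of this sketch, not the authors' released implementation.

# Minimal sketch: each "expert" is a low-rank (LoRA) residual adapter over a
# shared frozen feed-forward block, and a sparse router mixes the top-k
# adapters per token. Names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One expert = a rank-r residual adapter (LoRA-style down/up projection)."""

    def __init__(self, d_model: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a no-op around the base FFN

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class SparseLoRAMoE(nn.Module):
    """Frozen base FFN plus a router that mixes top-k LoRA experts per token."""

    def __init__(self, base_ffn: nn.Module, d_model: int,
                 num_experts: int = 8, top_k: int = 2, rank: int = 8):
        super().__init__()
        self.base_ffn = base_ffn
        for p in self.base_ffn.parameters():
            p.requires_grad_(False)  # only the router and adapters are trained
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            LoRAExpert(d_model, rank) for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        gate = F.softmax(self.router(x), dim=-1)        # (B, S, num_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)    # sparse routing
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = self.base_ffn(x)
        # For readability every expert runs on all tokens; a real implementation
        # would dispatch only the tokens routed to each expert.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)  # tokens routed to e
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out


def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Standard temperature-scaled KL response distillation; the paper may add
    further terms that the abstract does not specify."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean") * (t * t)

Because only the router and the low-rank adapters carry gradients while the dense student weights stay frozen, a layer like this adds MoE capacity without the continued pretraining a from-scratch MoE student would require, which is the motivation stated in the abstract.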
Pages: 80-91
Page count: 12