LocMoE: A Low-Overhead MoE for Large Language Model Training

Cited by: 0
Authors
Li, Jing [1 ]
Sun, Zhijie [1 ]
He, Xuan [1 ]
Zeng, Li [1 ]
Lin, Yi [1 ]
Li, Entong [1 ]
Zheng, Binfan [1 ]
Zhao, Rongqian [1 ]
Chen, Xin [1 ]
Affiliations
[1] Huawei Technologies Co., Ltd., Shenzhen, Guangdong, People's Republic of China
Source
Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI 2024), 2024
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The Mixture-of-Experts (MoE) model is a widely used distributed and ensemble learning method for large language models (LLMs), favored for its ability to sparsify and scale models efficiently. However, MoE performance is limited by load imbalance, the high latency of All-to-All communication, and relatively redundant computation caused by large expert capacity. Load imbalance arises because existing routing policies consistently tend to select certain experts, while frequent inter-node communication in the All-to-All procedure significantly prolongs training time. To alleviate these problems, we propose a novel routing strategy that combines load balance and locality by converting part of the inter-node communication into intra-node communication. Notably, we show that there is a minimum threshold for expert capacity, calculated from the maximal angular deviation between the gating weights of the experts and the assigned tokens. We port these modifications onto the PanGu-Σ model based on the MindSpore framework with multi-level routing and conduct experiments on Ascend clusters. The experimental results demonstrate that the proposed LocMoE reduces training time per epoch by 12.68% to 22.24% compared with classical routers, such as the hash router and the switch router, without impacting model accuracy.
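The abstract's central mechanism, a router that prefers experts co-located on the sending node while enforcing a fixed per-expert capacity, can be illustrated with a small sketch. The code below is a minimal, self-contained illustration of that general idea only; the function name route_tokens, the additive locality bias, and the greedy overflow handling are hypothetical choices made for exposition and are not taken from the LocMoE paper or the PanGu-Σ/MindSpore implementation.

```python
# Illustrative sketch only: a locality-biased top-1 router with a hard expert
# capacity. Names and the bias form are assumptions, NOT the LocMoE algorithm.
import numpy as np

def route_tokens(token_logits, expert_node, local_node, capacity, locality_bias=1.0):
    """Assign each token to one expert, preferring experts on the local node.

    token_logits : (num_tokens, num_experts) raw gating scores
    expert_node  : (num_experts,) node id hosting each expert
    local_node   : node id where the tokens currently reside
    capacity     : maximum number of tokens any single expert may accept
    """
    num_tokens, num_experts = token_logits.shape
    # Bias scores toward experts on the same node, reducing inter-node All-to-All traffic.
    biased = token_logits + locality_bias * (expert_node == local_node)
    # Softmax gating weights (used later to scale expert outputs; combine step not shown).
    gates = np.exp(biased - biased.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)

    assignment = np.full(num_tokens, -1, dtype=int)   # -1 means the token was dropped
    load = np.zeros(num_experts, dtype=int)
    # Greedy top-1 dispatch under a hard capacity limit, most confident tokens first.
    for t in np.argsort(-gates.max(axis=1)):
        for e in np.argsort(-gates[t]):               # try experts in preference order
            if load[e] < capacity:
                assignment[t] = e
                load[e] += 1
                break
    return assignment, load

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(16, 4))                 # 16 tokens, 4 experts
    nodes = np.array([0, 0, 1, 1])                    # experts 0-1 on node 0, experts 2-3 on node 1
    assign, load = route_tokens(logits, nodes, local_node=0, capacity=6)
    print("per-expert load:", load)                   # locality bias skews load toward node-0 experts
```

Raising locality_bias trades some routing freedom for fewer cross-node transfers, which is the trade-off the abstract describes; the paper's actual gating function, auxiliary losses, and capacity lower bound (derived from the angular deviation between gating weights and assigned tokens) differ from this sketch.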
Pages: 6377-6387
Page count: 11