Efficient Inference Offloading for Mixture-of-Experts Large Language Models in Internet of Medical Things

Cited by: 1
Authors
Yuan, Xiaoming [1 ,2 ]
Kong, Weixuan [1 ]
Luo, Zhenyu [1 ]
Xu, Minrui [3 ]
Affiliations
[1] Northeastern Univ Qinhuangdao, Hebei Key Lab Marine Percept Network & Data Proc, Qinhuangdao 066004, Peoples R China
[2] Xidian Univ, State Key Lab Integrated Serv Networks, Xian 710071, Peoples R China
[3] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore 639798, Singapore
Funding
National Natural Science Foundation of China;
Keywords
large language models; efficient inference offloading; mixture-of-experts; Internet of Medical Things;
DOI
10.3390/electronics13112077
CLC Classification Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Despite significant recent advances in large language models (LLMs) for medical services, the difficulty of deploying LLMs in e-healthcare hinders complex medical applications in the Internet of Medical Things (IoMT). Users are also increasingly concerned about e-healthcare risks and privacy protection. Existing LLMs struggle to provide accurate answers to medical questions (Q&A) and to meet the deployment resource constraints of the IoMT. To address these challenges, we propose MedMixtral 8x7B, a new medical LLM based on the mixture-of-experts (MoE) architecture with an offloading strategy, enabling deployment within the IoMT and improving privacy protection for users. Additionally, we find that the main factors affecting latency are the device interconnection method, the location of the offloading servers, and disk speed.
Pages: 17
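
To make the offloading idea from the abstract concrete, below is a minimal, illustrative sketch of per-expert offloading in an MoE feed-forward layer, assuming PyTorch. It is not the paper's MedMixtral 8x7B implementation; the names ExpertCache, OffloadedMoELayer, and gpu_capacity, and the LRU eviction policy, are assumptions chosen for illustration. The idea: expert weights reside in host memory, and only the experts selected by the router are copied to the accelerator, with a small cache bounding on-device memory.

```python
# Illustrative sketch only: per-expert offloading for an MoE layer.
# NOT the paper's MedMixtral 8x7B code; ExpertCache, OffloadedMoELayer,
# and gpu_capacity are hypothetical names.
import collections

import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertCache:
    """Bounds how many experts live on the fast device; the rest stay on
    CPU and are copied in on demand (LRU eviction moves them back)."""

    def __init__(self, experts, device, capacity):
        self.experts = experts              # all experts, initially on CPU
        self.device = device
        self.capacity = capacity
        self.resident = collections.OrderedDict()  # expert_id -> module

    def get(self, idx):
        if idx in self.resident:
            self.resident.move_to_end(idx)  # LRU hit
            return self.resident[idx]
        if len(self.resident) >= self.capacity:
            _, evicted = self.resident.popitem(last=False)
            evicted.to("cpu")               # offload least-recently-used expert
        expert = self.experts[idx].to(self.device)
        self.resident[idx] = expert
        return expert


class OffloadedMoELayer(nn.Module):
    """Top-k routed MoE feed-forward layer whose experts are fetched
    through an ExpertCache instead of living on the device permanently."""

    def __init__(self, dim, hidden, num_experts=8, top_k=2, gpu_capacity=2,
                 device="cuda" if torch.cuda.is_available() else "cpu"):
        super().__init__()
        self.device = device
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts).to(device)
        experts = [nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(),
                                 nn.Linear(hidden, dim))
                   for _ in range(num_experts)]
        self.cache = ExpertCache(experts, device, gpu_capacity)

    def forward(self, x):                   # x: (num_tokens, dim) on device
        weights, idx = torch.topk(
            F.softmax(self.router(x), dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e in idx.unique().tolist():     # fetch each needed expert once
            expert = self.cache.get(e)
            hit = (idx == e)                # which tokens chose expert e
            rows = hit.any(dim=-1)
            w = (weights * hit).sum(dim=-1, keepdim=True)[rows]
            out[rows] += w * expert(x[rows])
        return out


layer = OffloadedMoELayer(dim=64, hidden=256)
tokens = torch.randn(10, 64, device=layer.device)
print(layer(tokens).shape)  # torch.Size([10, 64])
```

In such a design, the cache-miss path (the `.to(self.device)` copy) is where transfer costs concentrate, which is consistent with the abstract's finding that interconnection method, offloading-server location, and disk speed dominate latency.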