Efficient Inference Offloading for Mixture-of-Experts Large Language Models in Internet of Medical Things

Cited by: 1
Authors
Yuan, Xiaoming [1 ,2 ]
Kong, Weixuan [1 ]
Luo, Zhenyu [1 ]
Xu, Minrui [3 ]
Affiliations
[1] Northeastern Univ Qinhuangdao, Hebei Key Lab Marine Percept Network & Data Proc, Qinhuangdao 066004, Peoples R China
[2] Xidian Univ, State Key Lab Integrated Serv Networks, Xian 710071, Peoples R China
[3] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore 639798, Singapore
Funding
National Natural Science Foundation of China;
Keywords
large language models; efficient inference offloading; mixture-of-experts; Internet of Medical Things;
DOI
10.3390/electronics13112077
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Despite recent significant advances in large language models (LLMs) for medical services, the difficulty of deploying LLMs in e-healthcare hinders complex medical applications in the Internet of Medical Things (IoMT). Users are also increasingly concerned about e-healthcare risks and privacy protection. Existing LLMs struggle both to provide accurate answers to medical questions (Q&As) and to meet the resource demands of deployment in the IoMT. To address these challenges, we propose MedMixtral 8x7B, a new medical LLM based on the mixture-of-experts (MoE) architecture with an offloading strategy, which enables deployment within the IoMT and improves privacy protection for users. Additionally, we find that the main factors affecting latency are the device interconnection method, the location of the offloading servers, and disk speed.
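The expert-offloading idea described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the paper's MedMixtral 8x7B implementation; the module sizes, top-2 routing, and on-demand CPU-to-accelerator copying below are illustrative assumptions only. Each MoE layer keeps its small router resident on the accelerator while the large expert feed-forward networks stay offloaded in host memory and are copied in only when the router selects them.

```python
# Minimal sketch (NOT the paper's implementation) of MoE expert offloading:
# experts live in host (CPU) memory and are copied to the accelerator only
# when the router selects them for the current tokens. Sizes, top-2 routing,
# and class/parameter names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffloadedMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        # The router is small, so it stays resident on the accelerator.
        self.router = nn.Linear(d_model, n_experts).to(self.device)
        # Experts are the bulk of the parameters; they stay on CPU ("offloaded").
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (tokens, d_model) on self.device
        weights, idx = torch.topk(F.softmax(self.router(x), dim=-1), self.top_k)
        out = torch.zeros_like(x)
        for e in idx.unique().tolist():
            mask = (idx == e).any(dim=-1)             # tokens routed to expert e
            expert = self.experts[e].to(self.device)  # copy weights in on demand
            w = weights[mask][idx[mask] == e].unsqueeze(-1)
            out[mask] += w * expert(x[mask])          # weighted expert output
            self.experts[e].to("cpu")                 # evict to free device memory
        return out

if __name__ == "__main__":
    layer = OffloadedMoELayer()
    tokens = torch.randn(4, 64, device=layer.device)
    print(layer(tokens).shape)  # torch.Size([4, 64])
```

Under this scheme, every on-demand copy crosses the host-accelerator interconnect (or hits disk if expert weights are memory-mapped rather than cached in RAM), which is consistent with the abstract's finding that interconnection method, offloading-server placement, and disk speed dominate inference latency.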
Pages: 17
Related Papers
50 records in total
  • [41] MoE-SLU: Towards ASR-Robust Spoken Language Understanding via Mixture-of-Experts
    Cheng, Xuxin
    Zhu, Zhihong
    Zhuang, Xianwei
    Chen, Zhanpeng
    Huang, Zhiqi
    Zou, Yuexian
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 14868 - 14879
  • [42] Efficient Deweather Mixture-of-Experts with Uncertainty-Aware Feature-Wise Linear Modulation
    Zhang, Rongyu
    Luo, Yulin
    Liu, Jiaming
    Yang, Huanrui
    Dong, Zhen
    Gudovskiy, Denis
    Okuno, Tomoyuki
    Nakata, Yohei
    Keutzer, Kurt
    Du, Yuan
    Zhang, Shanghang
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 15, 2024, : 16812 - 16820
  • [43] Efficient Bayesian inference for dynamic mixture models
    Gerlach, R
    Carter, C
    Kohn, R
    JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2000, 95 (451) : 819 - 828
  • [44] An internet of things malware classification method based on mixture of experts neural network
    Wang, Fangwei
    Yang, Shaojie
    Li, Qingru
    Wang, Changguang
    TRANSACTIONS ON EMERGING TELECOMMUNICATIONS TECHNOLOGIES, 2021, 32 (05)
  • [45] Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks
    Chowdhury, Mohammed Nowaz Rabbani
    Zhang, Shuai
    Wang, Meng
    Liu, Sijia
    Chen, Pin-Yu
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 202, 2023
  • [46] Adapted large language models can outperform medical experts in clinical text summarization
    Van Veen, Dave
    Van Uden, Cara
    Blankemeier, Louis
    Delbrouck, Jean-Benoit
    Aali, Asad
    Bluethgen, Christian
    Pareek, Anuj
    Polacin, Malgorzata
    Reis, Eduardo Pontes
    Seehofnerova, Anna
    Rohatgi, Nidhi
    Hosamani, Poonam
    Collins, William
    Ahuja, Neera
    Langlotz, Curtis P.
    Hom, Jason
    Gatidis, Sergios
    Pauly, John
    Chaudhari, Akshay S.
    NATURE MEDICINE, 2024, 30 (04) : 1134 - 1142
  • [47] Layer-Condensed KV Cache for Efficient Inference of Large Language Models
    Wu, Haoyi
    Tu, Kewei
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 11175 - 11188
  • [48] Tabi: An Efficient Multi-Level Inference System for Large Language Models
    Wang, Yiding
    Chen, Kai
    Tan, Haisheng
    Guo, Kun
    PROCEEDINGS OF THE EIGHTEENTH EUROPEAN CONFERENCE ON COMPUTER SYSTEMS, EUROSYS 2023, 2023, : 233 - 248
  • [49] An efficient quantized GEMV implementation for large language models inference with matrix core
    Zhang, Yu
    Lu, Lu
    Zhao, Rong
    Guo, Yijie
    Yang, Zhanyu
    JOURNAL OF SUPERCOMPUTING, 2025, 81 (03):
  • [50] Generative Inference of Large Language Models in Edge Computing: An Energy Efficient Approach
    Yuan, Xingyu
    Li, He
    Ota, Kaoru
    Dong, Mianxiong
    20TH INTERNATIONAL WIRELESS COMMUNICATIONS & MOBILE COMPUTING CONFERENCE, IWCMC 2024, 2024, : 244 - 249