Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models

Authors
Jiang, Wenqi [1]
Zeller, Marco [1]
Waleffe, Roger [2]
Hoefler, Torsten [3]
Alonso, Gustavo [1]
Affiliations
[1] Swiss Fed Inst Technol, Syst Grp, Zurich, Switzerland
[2] Univ Wisconsin Madison, Madison, WI USA
[3] Swiss Fed Inst Technol, SPCL, Zurich, Switzerland
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2024 / Vol. 18 / Issue 01
Keywords
NEAREST-NEIGHBOR SEARCH; SIMILARITY SEARCH; QUANTIZATION; ENGINE; VECTOR;
DOI
10.14778/3696435.3696439
Chinese Library Classification
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
A Retrieval-Augmented Language Model (RALM) combines a large language model (LLM) with a vector database to retrieve context-specific knowledge during text generation. This strategy facilitates impressive generation quality even with smaller models, thus reducing computational demands by orders of magnitude. To serve RALMs efficiently and flexibly, we propose Chameleon, a heterogeneous accelerator system integrating both LLM and vector search accelerators in a disaggregated architecture. The heterogeneity ensures efficient serving for both inference and retrieval, while the disaggregation allows independent scaling of LLM and vector search accelerators to fulfill diverse RALM requirements. Our Chameleon prototype implements vector search accelerators on FPGAs and assigns LLM inference to GPUs, with CPUs as cluster coordinators. Evaluated on various RALMs, Chameleon exhibits up to 2.16x reduction in latency and 3.18x speedup in throughput compared to the hybrid CPU-GPU architecture. The promising results pave the way for adopting heterogeneous accelerators for not only LLM inference but also vector search in future RALM systems.
Pages: 42-52
Page count: 11