Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models

Cited by: 0

Authors:

Jiang, Wenqi [1]

Zeller, Marco [1]

Waleffe, Roger [2]

Hoefler, Torsten [3]

Alonso, Gustavo [1]

Affiliations:

[1] Swiss Fed Inst Technol, Syst Grp, Zurich, Switzerland

[2] Univ Wisconsin Madison, Madison, WI USA

[3] Swiss Fed Inst Technol, SPCL, Zurich, Switzerland

Source:

PROCEEDINGS OF THE VLDB ENDOWMENT | 2024 / Vol. 18 / No. 01

Keywords:

NEAREST-NEIGHBOR SEARCH; SIMILARITY SEARCH; QUANTIZATION; ENGINE; VECTOR;

DOI:

10.14778/3696435.3696439

Chinese Library Classification:

TP [Automation Technology, Computer Technology];

Subject classification code:

0812;

Abstract:

A Retrieval-Augmented Language Model (RALM) combines a large language model (LLM) with a vector database to retrieve context-specific knowledge during text generation. This strategy facilitates impressive generation quality even with smaller models, thus reducing computational demands by orders of magnitude. To serve RALMs efficiently and flexibly, we propose Chameleon, a heterogeneous accelerator system integrating both LLM and vector search accelerators in a disaggregated architecture. The heterogeneity ensures efficient serving for both inference and retrieval, while the disaggregation allows independent scaling of LLM and vector search accelerators to fulfill diverse RALM requirements. Our Chameleon prototype implements vector search accelerators on FPGAs and assigns LLM inference to GPUs, with CPUs as cluster coordinators. Evaluated on various RALMs, Chameleon exhibits up to 2.16x reduction in latency and 3.18x speedup in throughput compared to the hybrid CPU-GPU architecture. The promising results pave the way for adopting heterogeneous accelerators for not only LLM inference but also vector search in future RALM systems.


Pages: 42 - 52

Page count: 11

