Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models

Cited by: 0

Authors:

Jiang, Wenqi [1]

Zeller, Marco [1]

Waleffe, Roger [2]

Hoefler, Torsten [3]

Alonso, Gustavo [1]

Affiliations:

[1] Swiss Fed Inst Technol, Syst Grp, Zurich, Switzerland

[2] Univ Wisconsin Madison, Madison, WI USA

[3] Swiss Fed Inst Technol, SPCL, Zurich, Switzerland

Source:

PROCEEDINGS OF THE VLDB ENDOWMENT | 2024, Vol. 18, No. 1

Keywords:

NEAREST-NEIGHBOR SEARCH; SIMILARITY SEARCH; QUANTIZATION; ENGINE; VECTOR

DOI:

10.14778/3696435.3696439

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

A Retrieval-Augmented Language Model (RALM) combines a large language model (LLM) with a vector database to retrieve context-specific knowledge during text generation. This strategy facilitates impressive generation quality even with smaller models, thus reducing computational demands by orders of magnitude. To serve RALMs efficiently and flexibly, we propose Chameleon, a heterogeneous accelerator system integrating both LLM and vector search accelerators in a disaggregated architecture. The heterogeneity ensures efficient serving for both inference and retrieval, while the disaggregation allows independent scaling of LLM and vector search accelerators to fulfill diverse RALM requirements. Our Chameleon prototype implements vector search accelerators on FPGAs and assigns LLM inference to GPUs, with CPUs as cluster coordinators. Evaluated on various RALMs, Chameleon exhibits up to 2.16x reduction in latency and 3.18x speedup in throughput compared to the hybrid CPU-GPU architecture. The promising results pave the way for adopting heterogeneous accelerators for not only LLM inference but also vector search in future RALM systems.


Pages: 42-52

Page count: 11
