WActiGrad: Structured Pruning for Efficient Finetuning and Inference of Large Language Models on AI Accelerators

Cited by: 0

Authors:

Chitty-Venkata, Krishna Teja [1]

Sastry, Varuni Katti [1]

Emani, Murali [1]

Vishwanath, Venkatram [1]

Shanmugavelu, Sanjif [2]

Howland, Sylvia [3]

Affiliations:

[1] Argonne Natl Lab, Lemont, IL 60439 USA

[2] Groq Inc, Mountain View, CA USA

[3] Cerebras Syst, Sunnyvale, CA USA

Source:

EURO-PAR 2024: PARALLEL PROCESSING, PART II, EURO-PAR 2024 | 2024 / Vol. 14802

Keywords:

DOI:

10.1007/978-3-031-69766-1_22

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Large Language Models (LLMs) have shown remarkable performance across various language processing applications. Nevertheless, their extensive computational requirements can hinder their deployment in real-time applications or resource-constrained environments. Pruning is a powerful technique to reduce the model size and make it computationally efficient. In this paper, we propose a structured pruning algorithm, Weight Activation and Gradient (WActiGrad), to obtain smaller LLMs from large pre-trained models. We investigate the level of granularity at which structured pruning techniques can be applied to an LLM and identify the challenges in applying these techniques across different parts of the transformer. Based on these observations, we develop a pruning methodology that is adaptable to various attention and feedforward network modules. We comprehensively assess our WActiGrad method on the state-of-the-art LLMs LLaMA (7B and 13B), LLaMA-2 (7B and 13B), and Mistral-7B across several language benchmarks for post-pretraining. This approach can prune close to 20% of the original model size without significantly compromising the model validation accuracy. We evaluate the hardware performance of our structurally pruned LLMs on different AI accelerators, such as the NVIDIA A100 GPU, Groq LPU, Cerebras CS-2, and Graphcore Bow systems, to show the effectiveness of the structured pruning technique. The findings presented in this paper offer insights into integrating structured pruning techniques with deployment on AI accelerators.
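
The abstract names weights, activations, and gradients as the ingredients of the pruning criterion but does not give the scoring formula. The following is a minimal sketch of structured (whole-channel) pruning under an assumed score, namely the channel's activation norm multiplied by the sum of |weight x gradient| over that channel. The functions `channel_scores` and `prune_channels`, the score itself, and the toy numbers are all hypothetical illustrations, not the paper's method.

```python
# Hypothetical sketch: the scoring rule below is an assumption, not the
# formula from the WActiGrad paper. Each row of `weights` is one channel.

def channel_scores(weights, grads, act_norms):
    """Score each channel as act_norm * sum(|w * g|) over its entries."""
    return [
        a * sum(abs(w * g) for w, g in zip(w_row, g_row))
        for w_row, g_row, a in zip(weights, grads, act_norms)
    ]

def prune_channels(weights, grads, act_norms, ratio=0.2):
    """Remove the lowest-scoring `ratio` of channels (whole rows).

    Returns (kept_indices, pruned_weights). Removing whole rows keeps the
    result a smaller dense matrix, which is what makes the pruning structured.
    """
    scores = channel_scores(weights, grads, act_norms)
    n_prune = int(len(scores) * ratio)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    dropped = set(order[:n_prune])
    kept = [i for i in range(len(scores)) if i not in dropped]
    return kept, [weights[i] for i in kept]

# Toy 5-channel layer with made-up values.
weights = [[0.5, -0.2], [0.01, 0.02], [1.0, 1.0], [-0.3, 0.4], [0.2, 0.1]]
grads = [[0.1, 0.1], [0.1, 0.1], [0.2, 0.1], [0.05, 0.05], [0.3, 0.2]]
act_norms = [1.0, 1.0, 2.0, 1.0, 0.5]

kept, pruned = prune_channels(weights, grads, act_norms, ratio=0.2)
print(kept)  # indices of the channels that survive pruning
```

With a ratio of 0.2 on five channels, exactly one channel is removed, mirroring the roughly 20% reduction mentioned in the abstract. In a real transformer the matching rows or columns of adjacent layers would also have to be removed so that shapes stay consistent.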

Pages: 317-331

Page count: 15
