WActiGrad: Structured Pruning for Efficient Finetuning and Inference of Large Language Models on AI Accelerators

Citations: 0
Authors:
Chitty-Venkata, Krishna Teja [1 ]
Sastry, Varuni Katti [1 ]
Emani, Murali [1 ]
Vishwanath, Venkatram [1 ]
Shanmugavelu, Sanjif [2 ]
Howland, Sylvia [3 ]
Affiliations:
[1] Argonne National Laboratory, Lemont, IL 60439, USA
[2] Groq Inc., Mountain View, CA, USA
[3] Cerebras Systems, Sunnyvale, CA, USA
DOI: 10.1007/978-3-031-69766-1_22
Chinese Library Classification (CLC): TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract:
Large Language Models (LLMs) have shown remarkable performance across a wide range of language processing applications. Nevertheless, their extensive computational requirements can hinder deployment in real-time applications or resource-constrained environments. Pruning is a powerful technique for reducing model size and making computation more efficient. In this paper, we propose a structured pruning algorithm, Weight, Activation and Gradient (WActiGrad), to obtain smaller LLMs from large pre-trained models. We investigate the granularities at which structured pruning can be applied to an LLM and identify the challenges of applying these techniques across different parts of the transformer. Based on these observations, we develop a pruning methodology that is adaptable to various attention and feedforward network modules. We comprehensively assess WActiGrad on state-of-the-art LLMs, LLaMA (7B and 13B), LLaMA-2 (7B and 13B), and Mistral-7B, across several language benchmarks in the post-pretraining setting. The approach can prune close to 20% of the original model size without significantly compromising validation accuracy. We evaluate the hardware performance of the structurally pruned LLMs on different AI accelerators, including the Nvidia A100 GPU, Groq LPU, Cerebras CS-2, and Graphcore Bow systems, to show the effectiveness of structured pruning. The findings presented in this paper offer insights into deploying structured pruning techniques on AI accelerators.
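Note: The abstract states that WActiGrad combines weight, activation, and gradient information to remove structured components from attention and feedforward modules, but this record does not give the exact criterion. The PyTorch sketch below is a minimal illustration of that general idea on a single linear layer, assuming a per-output-channel score that multiplies the three signals and removal of the lowest-scoring ~20% of channels; the functions channel_importance and prune_channels and the fusion rule are hypothetical and should not be read as the paper's method.

import torch
import torch.nn as nn

def channel_importance(layer: nn.Linear, activations: torch.Tensor,
                       gradients: torch.Tensor) -> torch.Tensor:
    # Assumed fusion of the three signals named by WActiGrad: weight magnitude,
    # activation magnitude, and gradient magnitude, aggregated per output channel.
    w_score = layer.weight.abs().sum(dim=1)   # (out_features,) weight magnitude
    a_score = activations.abs().mean(dim=0)   # (out_features,) mean activation magnitude
    g_score = gradients.abs().sum(dim=1)      # (out_features,) gradient magnitude
    return w_score * a_score * g_score        # hypothetical combined importance score

def prune_channels(layer: nn.Linear, scores: torch.Tensor, ratio: float = 0.2) -> nn.Linear:
    # Structured pruning: drop whole output channels with the lowest scores.
    keep = scores.argsort(descending=True)[: int(layer.out_features * (1 - ratio))]
    keep, _ = keep.sort()
    pruned = nn.Linear(layer.in_features, keep.numel(), bias=layer.bias is not None)
    pruned.weight.data = layer.weight.data[keep].clone()
    if layer.bias is not None:
        pruned.bias.data = layer.bias.data[keep].clone()
    return pruned

# Toy usage: one forward/backward pass supplies activations and gradients.
layer = nn.Linear(64, 128)
x = torch.randn(32, 64)
out = layer(x)
out.sum().backward()
scores = channel_importance(layer, out.detach(), layer.weight.grad)
smaller = prune_channels(layer, scores, ratio=0.2)
print(layer.out_features, "->", smaller.out_features)   # 128 -> 102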
Pages: 317-331
Page count: 15
Related Papers (50 in total)
  • [1] Resource-Efficient Transformer Pruning for Finetuning of Large Models
    Ilhan, Fatih
    Su, Gong
    Tekin, Selim Furkan
    Huang, Tiansheng
    Hu, Sihao
    Liu, Ling
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 16206 - 16215
  • [2] Structured Pruning of Large Language Models
    Wang, Ziheng
    Wohlwend, Jeremy
    Lei, Tao
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 6151 - 6162
  • [3] On Finetuning Large Language Models
    Wang, Yu
    POLITICAL ANALYSIS, 2023,
  • [4] ZipLM: Inference-Aware Structured Pruning of Language Models
    Kurtic, Eldar
    Frantar, Elias
    Alistarh, Dan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [5] FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences
    Xu, Mengwei
    Cai, Dongqi
    Wu, Yaozong
    Li, Xiang
    Wang, Shangguang
    PROCEEDINGS OF THE 2024 USENIX ANNUAL TECHNICAL CONFERENCE, ATC 2024, 2024, : 579 - 596
  • [6] Finetuning Large Language Models for Vulnerability Detection
    Shestov, Aleksei
    Levichev, Rodion
    Mussabayev, Ravil
    Maslov, Evgeny
    Zadorozhny, Pavel
    Cheshkov, Anton
    Mussabayev, Rustam
    Toleu, Alymzhan
    Tolegen, Gulmira
    Krassovitskiy, Alexander
    IEEE ACCESS, 2025, 13 : 38889 - 38900
  • [7] Structured Pruning for Efficient Generative Pre-trained Language Models
    Tao, Chaofan
    Hou, Lu
    Bai, Haoli
    Wei, Jiansheng
    Jiang, Xin
    Liu, Qun
    Luo, Ping
    Wong, Ngai
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023, : 10880 - 10895
  • [8] Fluctuation-Based Adaptive Structured Pruning for Large Language Models
    An, Yongqi
    Zhao, Xu
    Yu, Tao
    Tang, Ming
    Wang, Jinqiao
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 10, 2024, : 10865 - 10873
  • [9] Masking as an Efficient Alternative to Finetuning for Pretrained Language Models
    Zhao, Mengjie
    Lin, Tao
    Mi, Fei
    Jaggi, Martin
    Schütze, Hinrich
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 2226 - 2241
  • [10] EBERT: Efficient BERT Inference with Dynamic Structured Pruning
    Liu, Zejian
    Li, Fanrong
    Li, Gang
    Cheng, Jian
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 4814 - 4823