WActiGrad: Structured Pruning for Efficient Finetuning and Inference of Large Language Models on AI Accelerators

Cited by: 0
Authors
Chitty-Venkata, Krishna Teja [1 ]
Sastry, Varuni Katti [1 ]
Emani, Murali [1 ]
Vishwanath, Venkatram [1 ]
Shanmugavelu, Sanjif [2 ]
Howland, Sylvia [3 ]
Affiliations
[1] Argonne National Laboratory, Lemont, IL 60439, USA
[2] Groq Inc., Mountain View, CA, USA
[3] Cerebras Systems, Sunnyvale, CA, USA
Keywords
DOI
10.1007/978-3-031-69766-1_22
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large Language Models (LLMs) have shown remarkable performance across a wide range of language processing applications. Nevertheless, their extensive computational requirements can hinder their deployment in real-time applications or resource-constrained environments. Pruning is a powerful technique for reducing model size and improving computational efficiency. In this paper, we propose a structured pruning algorithm, Weight, Activation, and Gradient (WActiGrad), to obtain smaller LLMs from large pre-trained models. We investigate the levels of granularity at which structured pruning can be applied to an LLM and identify the challenges of applying these techniques across different parts of the transformer. Based on these observations, we develop a pruning methodology that is adaptable to various attention and feedforward network modules. We comprehensively assess our WActiGrad method on state-of-the-art LLMs, LLaMA (7B and 13B), LLaMA-2 (7B and 13B), and Mistral-7B, on several language benchmarks in the post-pretraining setting. Our approach can prune close to 20% of the original model size without significantly compromising validation accuracy. We evaluate the hardware performance of our structurally pruned LLMs on different AI accelerators, including the Nvidia A100 GPU, Groq LPU, Cerebras CS-2, and Graphcore Bow systems, to show the effectiveness of the structured pruning technique. The findings presented in this paper offer insights into deploying structurally pruned LLMs on AI accelerators.
Pages: 317-331
Page count: 15
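
The abstract describes WActiGrad only at a high level; the exact scoring function is not given in this record. Below is a minimal, hypothetical PyTorch sketch of how a combined weight-activation-gradient importance score could drive structured pruning of a feedforward block. The elementwise product of the three magnitudes, the module shapes, and the roughly 20% channel-pruning ratio are illustrative assumptions matching the compression level reported in the abstract, not the authors' published formulation.

# Hypothetical sketch of weight-activation-gradient structured pruning,
# in the spirit of WActiGrad as summarized in the abstract above.
# Scoring function, granularity, and pruning ratio are assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy feedforward block: prune whole hidden-dimension channels (structured).
d_model, d_ff = 64, 256
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

# Calibration batch and a simple loss, just to obtain gradients.
x = torch.randn(8, d_model)
hidden = ffn[1](ffn[0](x))          # hidden-channel activations, shape [8, d_ff]
out = ffn[2](hidden)
loss = out.pow(2).mean()
loss.backward()

# Per-channel importance: combine weight, activation, and gradient magnitudes
# for each hidden channel (one possible combination of the three signals).
w_score = ffn[0].weight.detach().abs().mean(dim=1)    # [d_ff]
a_score = hidden.detach().abs().mean(dim=0)           # [d_ff]
g_score = ffn[0].weight.grad.abs().mean(dim=1)        # [d_ff]
importance = w_score * a_score * g_score

# Keep the top 80% of channels (~20% structured sparsity).
keep = int(0.8 * d_ff)
kept_idx = torch.topk(importance, keep).indices.sort().values

# Materialize the smaller FFN by slicing rows/columns of the two linears.
pruned_fc1 = nn.Linear(d_model, keep)
pruned_fc2 = nn.Linear(keep, d_model)
with torch.no_grad():
    pruned_fc1.weight.copy_(ffn[0].weight[kept_idx])
    pruned_fc1.bias.copy_(ffn[0].bias[kept_idx])
    pruned_fc2.weight.copy_(ffn[2].weight[:, kept_idx])
    pruned_fc2.bias.copy_(ffn[2].bias)

# Sanity check: the pruned block still maps d_model -> d_model.
pruned_out = pruned_fc2(ffn[1](pruned_fc1(x)))
print(pruned_out.shape, (out - pruned_out).abs().mean().item())

In a full pipeline, the pruned layers would replace the originals and the smaller model would then be finetuned to recover accuracy before being deployed on the accelerators listed in the abstract.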