WActiGrad: Structured Pruning for Efficient Finetuning and Inference of Large Language Models on AI Accelerators

Citations: 0
Authors:
Chitty-Venkata, Krishna Teja [1 ]
Sastry, Varuni Katti [1 ]
Emani, Murali [1 ]
Vishwanath, Venkatram [1 ]
Shanmugavelu, Sanjif [2 ]
Howland, Sylvia [3 ]
Affiliations:
[1] Argonne National Laboratory, Lemont, IL 60439, USA
[2] Groq Inc., Mountain View, CA, USA
[3] Cerebras Systems, Sunnyvale, CA, USA
DOI: 10.1007/978-3-031-69766-1_22
Chinese Library Classification (CLC): TP18 [Artificial Intelligence Theory]
Discipline Codes: 081104; 0812; 0835; 1405
Abstract:
Large Language Models (LLMs) have shown remarkable performance across a wide range of language processing applications. Nevertheless, their extensive computational requirements can hinder deployment in real-time applications or resource-constrained environments. Pruning is a powerful technique for reducing model size and making computation more efficient. In this paper, we propose a structured pruning algorithm, Weight, Activation and Gradient (WActiGrad), to obtain smaller LLMs from large pre-trained models. We investigate the granularities at which structured pruning can be applied to an LLM and identify the challenges of applying these techniques across different parts of the transformer. Based on these observations, we develop a pruning methodology that is adaptable to various attention and feedforward network modules. We comprehensively assess WActiGrad on state-of-the-art LLMs, LLaMA (7B and 13B), LLaMA-2 (7B and 13B), and Mistral-7B, across several language benchmarks in the post-pretraining setting. The approach can prune close to 20% of the original model size without significantly compromising validation accuracy. We evaluate the hardware performance of the structurally pruned LLMs on different AI accelerators, including the Nvidia A100 GPU, Groq LPU, Cerebras CS-2, and Graphcore Bow systems, to show the effectiveness of structured pruning. The findings presented in this paper offer insights into deploying structured pruning techniques on AI accelerators.
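Note: The abstract states that WActiGrad combines weight, activation, and gradient information to remove structured components from attention and feedforward modules, but this record does not give the exact criterion. The PyTorch sketch below is a minimal illustration of that general idea on a single linear layer, assuming a per-output-channel score that multiplies the three signals and removal of the lowest-scoring ~20% of channels; the functions channel_importance and prune_channels and the fusion rule are hypothetical and should not be read as the paper's method.

import torch
import torch.nn as nn

def channel_importance(layer: nn.Linear, activations: torch.Tensor,
                       gradients: torch.Tensor) -> torch.Tensor:
    # Assumed fusion of the three signals named by WActiGrad: weight magnitude,
    # activation magnitude, and gradient magnitude, aggregated per output channel.
    w_score = layer.weight.abs().sum(dim=1)   # (out_features,) weight magnitude
    a_score = activations.abs().mean(dim=0)   # (out_features,) mean activation magnitude
    g_score = gradients.abs().sum(dim=1)      # (out_features,) gradient magnitude
    return w_score * a_score * g_score        # hypothetical combined importance score

def prune_channels(layer: nn.Linear, scores: torch.Tensor, ratio: float = 0.2) -> nn.Linear:
    # Structured pruning: drop whole output channels with the lowest scores.
    keep = scores.argsort(descending=True)[: int(layer.out_features * (1 - ratio))]
    keep, _ = keep.sort()
    pruned = nn.Linear(layer.in_features, keep.numel(), bias=layer.bias is not None)
    pruned.weight.data = layer.weight.data[keep].clone()
    if layer.bias is not None:
        pruned.bias.data = layer.bias.data[keep].clone()
    return pruned

# Toy usage: one forward/backward pass supplies activations and gradients.
layer = nn.Linear(64, 128)
x = torch.randn(32, 64)
out = layer(x)
out.sum().backward()
scores = channel_importance(layer, out.detach(), layer.weight.grad)
smaller = prune_channels(layer, scores, ratio=0.2)
print(layer.out_features, "->", smaller.out_features)   # 128 -> 102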
Pages: 317-331
Page count: 15
Related Papers (50 in total)
  • [1] Resource-Efficient Transformer Pruning for Finetuning of Large Models
    Ilhan, Fatih
    Su, Gong
    Tekin, Selim Furkan
    Huang, Tiansheng
    Hu, Sihao
    Liu, Ling
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 16206 - 16215
  • [2] Structured Pruning of Large Language Models
    Wang, Ziheng
    Wohlwend, Jeremy
    Lei, Tao
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 6151 - 6162
  • [3] On Finetuning Large Language Models
    Wang, Yu
    POLITICAL ANALYSIS, 2023,
  • [4] ZipLM: Inference-Aware Structured Pruning of Language Models
    Kurtic, Eldar
    Frantar, Elias
    Alistarh, Dan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [5] FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences
    Xu, Mengwei
    Cai, Dongqi
    Wu, Yaozong
    Li, Xiang
    Wang, Shangguang
    PROCEEDINGS OF THE 2024 USENIX ANNUAL TECHNICAL CONFERENCE, ATC 2024, 2024, : 579 - 596
  • [6] Finetuning Large Language Models for Vulnerability Detection
    Shestov, Aleksei
    Levichev, Rodion
    Mussabayev, Ravil
    Maslov, Evgeny
    Zadorozhny, Pavel
    Cheshkov, Anton
    Mussabayev, Rustam
    Toleu, Alymzhan
    Tolegen, Gulmira
    Krassovitskiy, Alexander
    IEEE ACCESS, 2025, 13 : 38889 - 38900
  • [7] Structured Pruning for Efficient Generative Pre-trained Language Models
    Tao, Chaofan
    Hou, Lu
    Bai, Haoli
    Wei, Jiansheng
    Jiang, Xin
    Liu, Qun
    Luo, Ping
    Wong, Ngai
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023, : 10880 - 10895
  • [8] Fluctuation-Based Adaptive Structured Pruning for Large Language Models
    An, Yongqi
    Zhao, Xu
    Yu, Tao
    Tang, Ming
    Wang, Jinqiao
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 10, 2024, : 10865 - 10873
  • [9] Masking as an Efficient Alternative to Finetuning for Pretrained Language Models
    Zhao, Mengjie
    Lin, Tao
    Mi, Fei
    Jaggi, Martin
    Schütze, Hinrich
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 2226 - 2241
  • [10] EBERT: Efficient BERT Inference with Dynamic Structured Pruning
    Liu, Zejian
    Li, Fanrong
    Li, Gang
    Cheng, Jian
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 4814 - 4823