WActiGrad: Structured Pruning for Efficient Finetuning and Inference of Large Language Models on AI Accelerators

Authors
Chitty-Venkata, Krishna Teja [1]
Sastry, Varuni Katti [1]
Emani, Murali [1]
Vishwanath, Venkatram [1]
Shanmugavelu, Sanjif [2]
Howland, Sylvia [3]
Affiliations
[1] Argonne Natl Lab, Lemont, IL 60439 USA
[2] Groq Inc, Mountain View, CA USA
[3] Cerebras Syst, Sunnyvale, CA USA
DOI
10.1007/978-3-031-69766-1_22
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Large Language Models (LLMs) have shown remarkable performance across various language processing applications. Nevertheless, their extensive computational requirements can hinder their deployment in real-time applications or resource-constrained environments. Pruning is a powerful technique for reducing model size and making a model computationally efficient. In this paper, we propose a structured pruning algorithm, Weight Activation and Gradient (WActiGrad), to obtain smaller LLMs from large pre-trained models. We investigate the level of granularity at which structured pruning techniques can be applied to an LLM and identify the challenges in applying these techniques across different parts of the transformer. Based on these observations, we develop a pruning methodology that is adaptable to various attention and feedforward network modules. We comprehensively assess WActiGrad on state-of-the-art LLMs, namely LLaMA (7B and 13B), LLaMA-2 (7B and 13B), and Mistral-7B, across several language benchmarks in the post-pretraining setting. The approach can prune close to 20% of the original model size without significantly compromising model validation accuracy. We evaluate the hardware performance of our structurally pruned LLMs on different AI accelerators, including the Nvidia A100 GPU, Groq LPU, Cerebras CS-2, and Graphcore Bow systems, to show the effectiveness of the structured pruning technique. The findings presented in this paper offer insights into deploying structurally pruned models on AI accelerators.
Pages: 317-331
Number of pages: 15