WActiGrad: Structured Pruning for Efficient Finetuning and Inference of Large Language Models on AI Accelerators

Cited by: 0
Authors
Chitty-Venkata, Krishna Teja [1 ]
Sastry, Varuni Katti [1 ]
Emani, Murali [1 ]
Vishwanath, Venkatram [1 ]
Shanmugavelu, Sanjif [2 ]
Howland, Sylvia [3 ]
Affiliations
[1] Argonne National Laboratory, Lemont, IL 60439, USA
[2] Groq Inc., Mountain View, CA, USA
[3] Cerebras Systems, Sunnyvale, CA, USA
Keywords
DOI
10.1007/978-3-031-69766-1_22
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large Language Models (LLMs) have shown remarkable performance across a wide range of language processing applications. Nevertheless, their extensive computational requirements can hinder their deployment in real-time applications or resource-constrained environments. Pruning is a powerful technique for reducing model size and improving computational efficiency. In this paper, we propose a structured pruning algorithm, Weight, Activation, and Gradient (WActiGrad), to obtain smaller LLMs from large pre-trained models. We investigate the levels of granularity at which structured pruning can be applied to an LLM and identify the challenges of applying these techniques across different parts of the transformer. Based on these observations, we develop a pruning methodology that is adaptable to various attention and feedforward network modules. We comprehensively assess our WActiGrad method on state-of-the-art LLMs, LLaMA (7B and 13B), LLaMA-2 (7B and 13B), and Mistral-7B, on several language benchmarks in the post-pretraining setting. Our approach can prune close to 20% of the original model size without significantly compromising validation accuracy. We evaluate the hardware performance of our structurally pruned LLMs on different AI accelerators, including the Nvidia A100 GPU, Groq LPU, Cerebras CS-2, and Graphcore Bow systems, to show the effectiveness of the structured pruning technique. The findings presented in this paper offer insights into deploying structurally pruned LLMs on AI accelerators.
Pages: 317-331
Page count: 15
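
The abstract describes WActiGrad only at a high level; the exact scoring function is not given in this record. Below is a minimal, hypothetical PyTorch sketch of how a combined weight-activation-gradient importance score could drive structured pruning of a feedforward block. The elementwise product of the three magnitudes, the module shapes, and the roughly 20% channel-pruning ratio are illustrative assumptions matching the compression level reported in the abstract, not the authors' published formulation.

# Hypothetical sketch of weight-activation-gradient structured pruning,
# in the spirit of WActiGrad as summarized in the abstract above.
# Scoring function, granularity, and pruning ratio are assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy feedforward block: prune whole hidden-dimension channels (structured).
d_model, d_ff = 64, 256
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

# Calibration batch and a simple loss, just to obtain gradients.
x = torch.randn(8, d_model)
hidden = ffn[1](ffn[0](x))          # hidden-channel activations, shape [8, d_ff]
out = ffn[2](hidden)
loss = out.pow(2).mean()
loss.backward()

# Per-channel importance: combine weight, activation, and gradient magnitudes
# for each hidden channel (one possible combination of the three signals).
w_score = ffn[0].weight.detach().abs().mean(dim=1)    # [d_ff]
a_score = hidden.detach().abs().mean(dim=0)           # [d_ff]
g_score = ffn[0].weight.grad.abs().mean(dim=1)        # [d_ff]
importance = w_score * a_score * g_score

# Keep the top 80% of channels (~20% structured sparsity).
keep = int(0.8 * d_ff)
kept_idx = torch.topk(importance, keep).indices.sort().values

# Materialize the smaller FFN by slicing rows/columns of the two linears.
pruned_fc1 = nn.Linear(d_model, keep)
pruned_fc2 = nn.Linear(keep, d_model)
with torch.no_grad():
    pruned_fc1.weight.copy_(ffn[0].weight[kept_idx])
    pruned_fc1.bias.copy_(ffn[0].bias[kept_idx])
    pruned_fc2.weight.copy_(ffn[2].weight[:, kept_idx])
    pruned_fc2.bias.copy_(ffn[2].bias)

# Sanity check: the pruned block still maps d_model -> d_model.
pruned_out = pruned_fc2(ffn[1](pruned_fc1(x)))
print(pruned_out.shape, (out - pruned_out).abs().mean().item())

In a full pipeline, the pruned layers would replace the originals and the smaller model would then be finetuned to recover accuracy before being deployed on the accelerators listed in the abstract.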