WActiGrad: Structured Pruning for Efficient Finetuning and Inference of Large Language Models on AI Accelerators

Cited by: 0
Authors
Chitty-Venkata, Krishna Teja [1 ]
Sastry, Varuni Katti [1 ]
Emani, Murali [1 ]
Vishwanath, Venkatram [1 ]
Shanmugavelu, Sanjif [2 ]
Howland, Sylvia [3 ]
Affiliations
[1] Argonne Natl Lab, Lemont, IL 60439 USA
[2] Groq Inc, Mountain View, CA USA
[3] Cerebras Syst, Sunnyvale, CA USA
Keywords
DOI
10.1007/978-3-031-69766-1_22
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Large Language Models (LLMs) have shown remarkable performance across various language processing applications. Nevertheless, their extensive computational requirements can hinder their deployment in real-time applications or resource-constrained environments. Pruning is a powerful technique to reduce model size and make inference computationally efficient. In this paper, we propose a structured pruning algorithm, Weight Activation and Gradient (WActiGrad), to obtain smaller LLMs from large pre-trained models. We investigate the levels of granularity at which structured pruning can be applied to an LLM and identify the challenges of applying these techniques across different parts of the transformer. Based on these observations, we develop a pruning methodology that is adaptable to various attention and feedforward network modules. We comprehensively assess our WActiGrad method on state-of-the-art LLMs, LLaMA (7B and 13B), LLaMA-2 (7B and 13B), and Mistral-7B, across several post-pretraining language benchmarks. The approach can prune close to 20% of the original model size without significantly compromising validation accuracy. We evaluate the hardware performance of our structurally pruned LLMs on different AI accelerators, including the Nvidia A100 GPU, Groq LPU, Cerebras CS-2, and Graphcore Bow systems, to show the effectiveness of the structured pruning technique. The findings presented in this paper offer insights into deploying structured pruning techniques on AI accelerators.
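The abstract does not spell out the WActiGrad scoring formula. As an illustration only, the sketch below assumes a per-output-row importance score built from the element-wise product of weight and gradient magnitudes, scaled by average input-activation magnitudes; the function names, the keep_ratio parameter, and the row-level granularity are assumptions for this example, not the paper's exact method.

```python
# Hedged sketch: one plausible weight-activation-gradient importance score for
# structured (row-level) pruning of a linear layer. Not the paper's exact formula.
import torch
import torch.nn as nn

def wactigrad_row_scores(layer: nn.Linear, act_norm: torch.Tensor) -> torch.Tensor:
    """Score each output row of `layer` with |W| * |dL/dW|, weighted by the
    mean absolute input activation per input feature (`act_norm`, shape [in])."""
    assert layer.weight.grad is not None, "run a backward pass first"
    saliency = layer.weight.abs() * layer.weight.grad.abs()   # [out, in]
    return (saliency * act_norm.unsqueeze(0)).sum(dim=1)      # [out]

def prune_rows(layer: nn.Linear, scores: torch.Tensor, keep_ratio: float) -> nn.Linear:
    """Return a smaller Linear that keeps only the highest-scoring output rows."""
    n_keep = max(1, int(layer.out_features * keep_ratio))
    keep = scores.topk(n_keep).indices.sort().values
    new = nn.Linear(layer.in_features, n_keep, bias=layer.bias is not None)
    with torch.no_grad():
        new.weight.copy_(layer.weight[keep])
        if layer.bias is not None:
            new.bias.copy_(layer.bias[keep])
    return new

# Usage on a toy layer: gather activation statistics with a forward hook,
# backprop a loss to populate gradients, then keep ~80% of the rows.
layer = nn.Linear(64, 128)
acts = []
hook = layer.register_forward_hook(lambda m, inp, out: acts.append(inp[0].abs().mean(0)))
x = torch.randn(32, 64)
layer(x).sum().backward()
hook.remove()
scores = wactigrad_row_scores(layer, acts[0])
pruned = prune_rows(layer, scores, keep_ratio=0.8)
print(pruned)  # Linear(in_features=64, out_features=102, bias=True)
```

In a real transformer, the same idea would be applied per attention head or per feedforward channel rather than to an isolated linear layer, so that whole structures can be removed and the model remains dense for accelerator execution.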
Pages: 317-331
Page count: 15