Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference

Cited by: 20
Authors
Hawks, Benjamin [1]
Duarte, Javier [2]
Fraser, Nicholas J. [3]
Pappalardo, Alessandro [3]
Tran, Nhan [1,4]
Umuroglu, Yaman [3]
Affiliations
[1] Fermilab Natl Accelerator Lab, POB 500, Batavia, IL 60510 USA
[2] Univ Calif San Diego, La Jolla, CA 92093 USA
[3] Xilinx Res, Dublin, Ireland
[4] Northwestern Univ, Evanston, IL USA
Source
Frontiers in Artificial Intelligence
Funding
U.S. Department of Energy
Keywords
pruning; quantization; neural networks; generalizability; regularization; batch normalization; MODEL COMPRESSION; ACCELERATION;
DOI
10.3389/frai.2021.676564
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruning, removing insignificant synapses, and quantization, reducing the precision of the calculations. In this work, we explore the interplay between pruning and quantization during the training of neural networks for ultra-low-latency applications targeting high energy physics use cases. Techniques developed for this study have potential applications across many other domains. We study various configurations of pruning during quantization-aware training, which we term quantization-aware pruning, and the effect of techniques such as regularization, batch normalization, and different pruning schemes on performance, computational complexity, and information content metrics. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task. Further, quantization-aware pruning typically performs similarly to, or better than, other neural architecture search techniques such as Bayesian optimization in terms of computational efficiency. Surprisingly, while networks with different training configurations can have similar performance on the benchmark application, the information content in the network can vary significantly, affecting its generalizability.
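To make the technique concrete, the sketch below illustrates quantization-aware pruning: iterative magnitude pruning interleaved with quantization-aware training, here implemented with a straight-through-estimator fake-quantizer and PyTorch's pruning utilities. This is a minimal reconstruction under stated assumptions, not the authors' implementation (the paper's workflow builds on quantization-aware training tools such as Brevitas); the network shape, 6-bit weight precision, pruning schedule, and random placeholder data are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.utils.prune as prune

class FakeQuant(torch.autograd.Function):
    """Symmetric uniform fake quantization with a straight-through estimator."""
    @staticmethod
    def forward(ctx, x, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = x.abs().max() / qmax + 1e-12          # per-tensor scale
        return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                      # STE: identity gradient

class QuantLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized on every forward pass."""
    def __init__(self, in_features, out_features, bits=6):
        super().__init__(in_features, out_features)
        self.bits = bits

    def forward(self, x):
        return F.linear(x, FakeQuant.apply(self.weight, self.bits), self.bias)

# Toy three-layer MLP; the 16-input, five-class shape loosely mirrors a small
# jet-tagging classifier, but the exact sizes here are assumptions.
model = nn.Sequential(QuantLinear(16, 64), nn.ReLU(),
                      QuantLinear(64, 32), nn.ReLU(),
                      QuantLinear(32, 5))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_epochs(n):
    for _ in range(n):
        x = torch.randn(256, 16)                      # placeholder random data
        y = torch.randint(0, 5, (256,))
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

# Iterative QAP: alternate quantization-aware training with magnitude pruning.
# The schedule (20% steps up to 80% sparsity) is illustrative, not the paper's.
sparsity = 0.0
for target in (0.2, 0.4, 0.6, 0.8):
    train_epochs(5)
    step = (target - sparsity) / (1.0 - sparsity)     # fraction of *remaining* weights
    for m in model:
        if isinstance(m, QuantLinear):
            prune.l1_unstructured(m, "weight", amount=step)
    sparsity = target
train_epochs(5)                                       # fine-tune at final sparsity
for m in model:
    if isinstance(m, QuantLinear):
        prune.remove(m, "weight")                     # make the masks permanent
```

Keeping the pruning masks active through the later training rounds (calling prune.remove only at the very end) lets the surviving quantized weights retrain to compensate for the removed ones, which is the essence of the iterative scheme.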
Pages: 15
Related Papers
50 records in total
  • [21] GAZELLE: A Low Latency Framework for Secure Neural Network Inference
    Juvekar, Chiraag
    Vaikuntanathan, Vinod
    Chandrakasan, Anantha
PROCEEDINGS OF THE 27TH USENIX SECURITY SYMPOSIUM, 2018: 1651-1668
  • [22] Latency-Aware Inference on Convolutional Neural Network Over Homomorphic Encryption
    Ishiyama, Takumi
    Suzuki, Takuya
    Yamana, Hayato
INFORMATION INTEGRATION AND WEB INTELLIGENCE, IIWAS 2022, 2022, 13635: 324-337
  • [23] Communication-efficient ADMM using quantization-aware Gaussian process regression
    Duarte, Aldo
    Nghiem, Truong X.
    Wei, Shuangqing
    EURO JOURNAL ON COMPUTATIONAL OPTIMIZATION, 2024, 12
  • [24] Overflow Aware Quantization: Accelerating Neural Network Inference by Low-bit Multiply-Accumulate Operations
    Xie, Hongwei
    Song, Yafei
    Cai, Ling
    Li, Mingyang
PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020: 868-875
  • [25] Pruning-Aware Merging for Efficient Multitask Inference
    He, Xiaoxi
    Gao, Dawei
    Zhou, Zimu
    Tong, Yongxin
    Thiele, Lothar
KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021: 585-595
  • [26] Phase-limited quantization-aware training for diffractive deep neural networks
    Wang, Yu
    Sha, Qi
    Qi, Feng
APPLIED OPTICS, 2025, 64(6): 1413-1419
  • [27] γ-Razor: Hardness-Aware Dataset Pruning for Efficient Neural Network Training
    Liu, Lei
    Zhang, Peng
    Liang, Yunji
    Liu, Junrui
    Morra, Lia
    Guo, Bin
    Yu, Zhiwen
    Zhang, Yanyong
    Zeng, Daniel D.
IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS, 2024
  • [28] Quantization-Aware Neural Architecture Search with Hyperparameter Optimization for Industrial Predictive Maintenance Applications
    van de Waterlaat, Nick
    Vogel, Sebastian
    Rodriguez, Hiram Rayo Torres
    Sanberg, Willem
    Daalderop, Gerardo
2023 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION, DATE, 2023
  • [29] Pruning and Quantization Enhanced Densely Connected Neural Network for Efficient Acoustic Echo Cancellation
    Chen, Chen
    Yan, Sheng
    Hao, Chengpeng
MAN-MACHINE SPEECH COMMUNICATION, NCMMSC 2024, 2025, 2312: 200-211
  • [30] Quantization-Aware Interval Bound Propagation for Training Certifiably Robust Quantized Neural Networks
    Lechner, Mathias
    Zikelic, Dorde
    Chatterjee, Krishnendu
    Henzinger, Thomas A.
    Rus, Daniela
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 12, 2023: 14964-14973