Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference

Cited: 20
Authors
Hawks, Benjamin [1]
Duarte, Javier [2]
Fraser, Nicholas J. [3]
Pappalardo, Alessandro [3]
Tran, Nhan [1,4]
Umuroglu, Yaman [3]
Affiliations
[1] Fermilab Natl Accelerator Lab, POB 500, Batavia, IL 60510 USA
[2] Univ Calif San Diego, La Jolla, CA 92093 USA
[3] Xilinx Res, Dublin, Ireland
[4] Northwestern Univ, Evanston, IL USA
Source
FRONTIERS IN ARTIFICIAL INTELLIGENCE
Funding
U.S. Department of Energy;
Keywords
pruning; quantization; neural networks; generalizability; regularization; batch normalization; MODEL COMPRESSION; ACCELERATION;
DOI
10.3389/frai.2021.676564
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruning, removing insignificant synapses, and quantization, reducing the precision of the calculations. In this work, we explore the interplay between pruning and quantization during the training of neural networks for ultra-low-latency applications targeting high energy physics use cases. Techniques developed for this study have potential applications across many other domains. We study various configurations of pruning during quantization-aware training, which we term quantization-aware pruning, and the effect of techniques like regularization, batch normalization, and different pruning schemes on performance, computational complexity, and information content metrics. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task. Further, quantization-aware pruning typically performs similarly to, or better than, other neural architecture search techniques such as Bayesian optimization in terms of computational efficiency. Surprisingly, while networks with different training configurations can have similar performance for the benchmark application, the information content in the network can vary significantly, affecting its generalizability.
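The technique the abstract describes, quantization-aware pruning, interleaves iterative magnitude pruning with quantization-aware training so that the weights that survive pruning are the ones that work well at reduced precision. The sketch below illustrates this interplay in PyTorch; it is an illustration of the idea only, not the authors' pipeline (the paper's experiments use dedicated quantization-aware training tooling), and every numeric choice in it (6-bit weights, 20% pruning per round, the toy model and dummy data) is an assumption made for the example.

```python
# Minimal sketch of quantization-aware pruning: iterative magnitude pruning
# interleaved with quantization-aware training, so the surviving weights
# adapt to the reduced precision. Bit width, layer sizes, pruning fraction,
# and the dummy data are placeholder assumptions, not the paper's settings.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


def fake_quantize(x, bits=6):
    """Uniform fake quantization with a straight-through estimator (STE)."""
    scale = x.detach().abs().max() / (2 ** (bits - 1) - 1) + 1e-12
    q = torch.round(x / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    # STE: quantized values in the forward pass, identity gradient backward.
    return x + (q * scale - x).detach()


class QuantLinear(nn.Linear):
    """Linear layer that trains against quantized weights."""

    def forward(self, x):
        return nn.functional.linear(x, fake_quantize(self.weight), self.bias)


model = nn.Sequential(QuantLinear(16, 64), nn.ReLU(), QuantLinear(64, 5))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Alternate fine-tuning under quantization with magnitude pruning; the mask
# compounds across rounds, removing 20% of the *remaining* weights each time.
for round_idx in range(5):
    for _ in range(100):  # placeholder number of training batches per round
        x, y = torch.randn(32, 16), torch.randint(0, 5, (32,))  # dummy data
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    for layer in model:
        if isinstance(layer, QuantLinear):
            prune.l1_unstructured(layer, name="weight", amount=0.2)
```

Calling prune.l1_unstructured repeatedly compounds the pruning masks, so each round removes a fraction of the weights that are still unpruned; this mirrors the iterative prune-then-retrain schedule that the abstract contrasts with pruning or quantization applied alone.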
Pages: 15
Related Papers
(50 records)
  • [31] QPA: A Quantization-Aware Piecewise Polynomial Approximation Methodology for Hardware-Efficient Implementations
    Geng, Haoran
    Chen, Xiaoliang
    Zhao, Ning
    Du, Yuan
    Du, Li
    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2023, 31 (07) : 931 - 944
  • [32] CNNBooster: Accelerating CNN Inference with Latency-aware Channel Pruning for GPU
    Zhu, Yuting
Jiang, Hongxu
    Zhang, Runhua
    Zhang, Yonghua
    Dong, Dong
    2022 IEEE INTL CONF ON PARALLEL & DISTRIBUTED PROCESSING WITH APPLICATIONS, BIG DATA & CLOUD COMPUTING, SUSTAINABLE COMPUTING & COMMUNICATIONS, SOCIAL COMPUTING & NETWORKING, ISPA/BDCLOUD/SOCIALCOM/SUSTAINCOM, 2022, : 355 - 362
  • [33] Intermittent-Aware Neural Network Pruning
    Lin, Chih-Chia
    Liu, Chia-Yin
    Yen, Chih-Hsuan
    Kuo, Tei-Wei
    Hsiu, Pi-Cheng
    2023 60TH ACM/IEEE DESIGN AUTOMATION CONFERENCE, DAC, 2023,
  • [34] Channel Pruning in Quantization-aware Training: an Adaptive Projection-gradient Descent-shrinkage-splitting Method
    Li, Zhijian
    Xin, Jack
    2022 5TH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE FOR INDUSTRIES, AI4I, 2022, : 31 - 34
  • [35] Crossbar-Aware Neural Network Pruning
    Liang, Ling
    Deng, Lei
    Zeng, Yueling
    Hu, Xing
    Ji, Yu
    Ma, Xin
    Li, Guoqi
    Xie, Yuan
    IEEE ACCESS, 2018, 6 : 58324 - 58337
  • [36] DQI: A Dynamic Quantization Method for Efficient Convolutional Neural Network Inference Accelerators
    Wang, Yun
    Liu, Qiang
    Yan, Shun
    2022 IEEE 30TH INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES (FCCM 2022), 2022, : 231 - 231
  • [37] Pruning and quantization for deep neural network acceleration: A survey
    Liang, Tailin
    Glossner, John
    Wang, Lei
    Shi, Shaobo
    Zhang, Xiaotong
    NEUROCOMPUTING, 2021, 461 : 370 - 403
  • [38] Latency and accuracy optimization for binary neural network inference with locality-aware operation skipping
    Lee, S. -J.
    Kim, T. -H.
    ELECTRONICS LETTERS, 2024, 60 (02)
  • [39] Dynamic Network Quantization for Efficient Video Inference
    Sun, Ximeng
    Panda, Rameswar
    Chen, Chun-Fu
    Oliva, Aude
    Feris, Rogerio
    Saenko, Kate
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 7355 - 7365
  • [40] EdgeDRNN: Enabling Low-latency Recurrent Neural Network Edge Inference
    Gao, Chang
    Rios-Navarro, Antonio
    Chen, Xi
    Delbruck, Tobi
    Liu, Shih-Chii
    2020 2ND IEEE INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE CIRCUITS AND SYSTEMS (AICAS 2020), 2020, : 41 - 45