Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference

Cited by: 20
Authors
Hawks, Benjamin [1]
Duarte, Javier [2]
Fraser, Nicholas J. [3]
Pappalardo, Alessandro [3]
Tran, Nhan [1,4]
Umuroglu, Yaman [3]
Affiliations
[1] Fermilab Natl Accelerator Lab, POB 500, Batavia, IL 60510 USA
[2] Univ Calif San Diego, La Jolla, CA 92093 USA
[3] Xilinx Res, Dublin, Ireland
[4] Northwestern Univ, Evanston, IL USA
Source
FRONTIERS IN ARTIFICIAL INTELLIGENCE, 2021, 4
Funding
U.S. Department of Energy;
Keywords
pruning; quantization; neural networks; generalizability; regularization; batch normalization; MODEL COMPRESSION; ACCELERATION;
DOI
10.3389/frai.2021.676564
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruning, removing insignificant synapses, and quantization, reducing the precision of the calculations. In this work, we explore the interplay between pruning and quantization during the training of neural networks for ultra-low-latency applications targeting high energy physics use cases. Techniques developed for this study have potential applications across many other domains. We study various configurations of pruning during quantization-aware training, which we term quantization-aware pruning, and the effect of techniques like regularization, batch normalization, and different pruning schemes on performance, computational complexity, and information content metrics. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task. Further, quantization-aware pruning typically performs similarly to, or better than, other neural architecture search techniques such as Bayesian optimization in terms of computational efficiency. Surprisingly, while networks with different training configurations can have similar performance for the benchmark application, the information content in the network can vary significantly, affecting its generalizability.
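To make the idea of quantization-aware pruning concrete, the sketch below combines iterative magnitude pruning with quantization-aware training in plain PyTorch. It is not the authors' implementation (the paper builds on Brevitas and hls4ml); the layer sizes, bit width, pruning schedule, and random stand-in data are illustrative assumptions only.

# Minimal sketch of quantization-aware pruning (QAP): iterative magnitude
# pruning applied while the network trains with fake-quantized weights.
# NOT the authors' code; architecture, bit width, and schedule are assumed.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def fake_quantize(w, bits=6):
    # Uniform symmetric fake quantization with a straight-through estimator.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    return w + (w_q - w).detach()  # forward uses w_q; gradients flow to w

class QuantLinear(nn.Linear):
    # Linear layer whose weights are fake-quantized in every forward pass.
    def __init__(self, in_features, out_features, bits=6):
        super().__init__(in_features, out_features)
        self.bits = bits

    def forward(self, x):
        return nn.functional.linear(x, fake_quantize(self.weight, self.bits), self.bias)

# Small fully connected classifier; 16 inputs and 5 classes loosely mirror a
# jet-tagging benchmark, but the hidden-layer sizes here are assumptions.
model = nn.Sequential(
    QuantLinear(16, 64), nn.ReLU(),
    QuantLinear(64, 32), nn.ReLU(),
    QuantLinear(32, 5),
)
layers = [m for m in model if isinstance(m, QuantLinear)]
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Random stand-in data; replace with the real training set.
x = torch.randn(512, 16)
y = torch.randint(0, 5, (512,))

for epoch in range(30):
    # Every 5 epochs, prune 20% of the remaining smallest-magnitude weights;
    # training continues so the quantized, sparser network can recover.
    if epoch > 0 and epoch % 5 == 0:
        for layer in layers:
            prune.l1_unstructured(layer, name="weight", amount=0.2)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

# Make the pruning masks permanent before export to hardware tools.
for layer in layers:
    if prune.is_pruned(layer):
        prune.remove(layer, "weight")

Because the pruning masks are applied while the weights are trained under fake quantization, the surviving weights adapt to the reduced precision, which is the interaction the abstract describes.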
Pages: 15
Related Papers
50 records in total
  • [41] Class-Aware Pruning for Efficient Neural Networks
    Jiang, Mengnan
    Wang, Jingcun
    Eldebiky, Amro
    Yin, Xunzhao
    Zhuo, Cheng
    Lin, Ing-Chao
    Li Zhang, Grace
    2024 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION, DATE, 2024,
  • [42] Value-Aware Quantization for Training and Inference of Neural Networks
    Park, Eunhyeok
    Yoo, Sungjoo
    Vajda, Peter
    COMPUTER VISION - ECCV 2018, PT IV, 2018, 11208 : 608 - 624
  • [43] Regularizing Activation Distribution for Ultra Low-bit Quantization-Aware Training of MobileNets
    Park, Seongmin
    Sung, Wonyong
    Choi, Jungwook
    2022 IEEE WORKSHOP ON SIGNAL PROCESSING SYSTEMS (SIPS), 2022, : 138 - 143
  • [44] Pruning of Rule Base of a Neural Fuzzy Inference Network
    Reel, Smarti
    Goel, Ashok Kumar
    CONTEMPORARY COMPUTING, 2011, 168 : 541 - +
  • [45] Membership Inference Attacks and Defenses in Neural Network Pruning
    Yuan, Xiaoyong
    Zhang, Lan
    PROCEEDINGS OF THE 31ST USENIX SECURITY SYMPOSIUM, 2022, : 4561 - 4578
  • [46] Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors
    Claudionor N. Coelho
    Aki Kuusela
    Shan Li
    Hao Zhuang
    Jennifer Ngadiuba
    Thea Klaeboe Aarrestad
    Vladimir Loncar
    Maurizio Pierini
    Adrian Alan Pol
    Sioni Summers
    Nature Machine Intelligence, 2021, 3 : 675 - 686
  • [47] Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors
    Coelho, Claudionor N., Jr.
    Kuusela, Aki
    Li, Shan
    Zhuang, Hao
    Ngadiuba, Jennifer
    Aarrestad, Thea Klaeboe
    Loncar, Vladimir
    Pierini, Maurizio
    Pol, Adrian Alan
    Summers, Sioni
    NATURE MACHINE INTELLIGENCE, 2021, 3 (08) : 675 - +
  • [48] Low-Latency Neural Network for Efficient Hyperspectral Image Classification
    Li, Chunchao
    Li, Jun
    Peng, Mingrui
    Rasti, Behnood
    Duan, Puhong
    Tang, Xuebin
    Ma, Xiaoguang
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2025, 18 : 7374 - 7390
  • [49] Once Quantization-Aware Training: High Performance Extremely Low-bit Architecture Search
    Shen, Mingzhu
    Liang, Feng
    Gong, Ruihao
    Li, Yuhang
    Li, Chuming
    Lin, Chen
    Yu, Fengwei
    Yan, Junjie
    Ouyang, Wanli
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 5320 - 5329
  • [50] Inertial Measurement Unit Self-Calibration by Quantization-Aware and Memory-Parsimonious Neural Networks
    Cardoni, Matteo
    Pau, Danilo Pietro
    Rezaei, Kiarash
    Mura, Camilla
    ELECTRONICS, 2024, 13 (21)