Edge-MPQ: Layer-Wise Mixed-Precision Quantization With Tightly Integrated Versatile Inference Units for Edge Computing

Cited by: 0
Authors
Zhao, Xiaotian [1 ]
Xu, Ruge [1 ]
Gao, Yimin [2 ]
Verma, Vaibhav [3 ,4 ]
Stan, Mircea R. [2 ]
Guo, Xinfei [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Univ Michigan & Shanghai Jiao Tong Univ Joint Inst, Shanghai 200240, Peoples R China
[2] Univ Virginia, Dept Elect & Comp Engn, Charlottesville, VA 22903 USA
[3] Univ Virginia, Charlottesville, VA 22903 USA
[4] Qualcomm, San Diego, CA 92121 USA
Keywords
Quantization (signal); Hardware; Accuracy; Computer architecture; Virtual machine monitors; Training; Pipelines; Edge inference; mixed-precision quantization; ASIP; PTQ; QAT;
DOI
10.1109/TC.2024.3441860
CLC Classification Number
TP3 [Computing Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
As one of the prevailing deep neural network compression techniques, layer-wise mixed-precision quantization (MPQ) strikes a better balance between accuracy and efficiency than uniform quantization schemes. However, existing MPQ strategies either lack hardware awareness or incur huge computation costs, limiting their deployment at the edge. Additionally, researchers usually make a one-time decision between post-training quantization (PTQ) and quantization-aware training (QAT) based on the quantized bit-width or hardware requirements. In this paper, we propose the tight integration of versatile MPQ inference units supporting INT2-INT8 and INT16 precisions, which feature a hierarchical multiplier architecture, into a RISC-V processor pipeline through micro-architecture and Instruction Set Architecture (ISA) co-design. Synthesized with a 14 nm technology, the design delivers a speedup of 15.50x to 47.67x over the baseline RV64IMA core when running a single convolution layer kernel and achieves up to 2.86 GOPS. This work also achieves an energy efficiency of 20.51 TOPS/W, which not only exceeds contemporary state-of-the-art MPQ hardware solutions at the edge but also marks a significant advancement in the field. We also propose a novel MPQ search algorithm that incorporates both hardware awareness and training necessity. The algorithm samples layer-wise sensitivities using a set of newly proposed metrics and runs a heuristic search. Evaluation results show that this search algorithm achieves 2.2% to 6.7% higher inference accuracy under similar hardware constraints compared to state-of-the-art MPQ strategies. Furthermore, we expand the search space using a dynamic programming (DP) strategy to perform the search with more fine-grained accuracy intervals and to support multi-dimensional search. This further improves inference accuracy by over 1.3% compared to a greedy-based search.
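The abstract describes a hardware-aware heuristic that assigns per-layer bit-widths from sampled sensitivities under a hardware budget. The paper's actual metrics and cost model are not given here, so the sketch below is purely illustrative: `sensitivities`, the weight-storage cost proxy, and the greedy promote-most-sensitive-first policy are all assumptions, not the authors' algorithm.

```python
# Illustrative sketch of a sensitivity-driven greedy MPQ search.
# All names (sensitivities, cost proxy, budget) are hypothetical stand-ins
# for the paper's newly proposed metrics and hardware constraints.

def greedy_mpq_search(sensitivities, layer_sizes, budget, bitwidths=(2, 4, 8)):
    """Start every layer at the lowest precision, then promote the most
    sensitive layers to higher bit-widths while the budget allows."""
    # Initial assignment: smallest supported precision everywhere.
    assign = {layer: min(bitwidths) for layer in sensitivities}

    def cost(a):
        # Simple hardware-cost proxy: total weight-storage bits.
        return sum(layer_sizes[l] * b for l, b in a.items())

    # Visit layers from most to least sensitive; raise precision one
    # step at a time as long as the trial stays within budget.
    for layer in sorted(sensitivities, key=sensitivities.get, reverse=True):
        for b in sorted(bitwidths):
            if b <= assign[layer]:
                continue
            trial = dict(assign, **{layer: b})
            if cost(trial) <= budget:
                assign = trial
    return assign


# Toy usage: three layers with made-up sensitivities and sizes.
plan = greedy_mpq_search(
    sensitivities={"conv1": 0.9, "conv2": 0.3, "fc": 0.6},
    layer_sizes={"conv1": 1000, "conv2": 4000, "fc": 2000},
    budget=30000,
)
# The most sensitive layer (conv1) ends up at the highest precision the
# budget permits; the least sensitive (conv2) stays at the minimum.
```

The paper's DP extension would replace this greedy loop with an exact search over fine-grained accuracy intervals, which is beyond what this sketch attempts.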
Pages: 2504-2519
Page count: 16
Related Papers
15 items
  • [1] Design Space Exploration of Layer-Wise Mixed-Precision Quantization with Tightly Integrated Edge Inference Units
    Zhao, Xiaotian
    Gao, Yimin
    Verma, Vaibhav
    Xu, Ruge
    Stan, Mircea
    Guo, Xinfei
    PROCEEDINGS OF THE GREAT LAKES SYMPOSIUM ON VLSI 2023, GLSVLSI 2023, 2023, : 467 - 471
  • [2] Mixed-Precision Neural Network Quantization via Learned Layer-Wise Importance
    Tang, Chen
    Ouyang, Kai
    Wang, Zhi
    Zhu, Yifei
    Ji, Wen
    Wang, Yaowei
    Zhu, Wenwu
    COMPUTER VISION, ECCV 2022, PT XI, 2022, 13671 : 259 - 275
  • [3] MPQ-YOLO: Ultra low mixed-precision quantization of YOLO for edge devices deployment
    Liu, Xinyu
    Wang, Tao
    Yang, Jiaming
    Tang, Chenwei
    Lv, Jiancheng
    NEUROCOMPUTING, 2024, 574
  • [4] Channel-wise Mixed-precision Assignment for DNN Inference on Constrained Edge Nodes
    Risso, Matteo
    Burrello, Alessio
    Benini, Luca
    Macii, Enrico
    Poncino, Massimo
    Pagliari, Daniele Jahier
    2022 IEEE 13TH INTERNATIONAL GREEN AND SUSTAINABLE COMPUTING CONFERENCE (IGSC), 2022, : 33 - 38
  • [5] AMED: Automatic Mixed-Precision Quantization for Edge Devices
    Kimhi, Moshe
    Rozen, Tal
    Mendelson, Avi
    Baskin, Chaim
    MATHEMATICS, 2024, 12 (12)
  • [6] Dynamic Split Computing-Aware Mixed-Precision Quantization for Efficient Deep Edge Intelligence
    Nagamatsu, Naoki
    Hara-Azumi, Yuko
    2023 IEEE 22ND INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, BIGDATASE, CSE, EUC, ISCI 2023, 2024, : 2538 - 2545
  • [7] SQNR-based Layer-wise Mixed-Precision Schemes with Computational Complexity Consideration
    Kim, Ha-Na
    Eun, Hyun
    Choi, Jung Hwan
    Kim, Ji-Hoon
    2022 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS 22), 2022, : 234 - 235
  • [8] ILD-MPQ: Learning-Free Mixed-Precision Quantization with Inter-Layer Dependency Awareness
    Xu, Ruge
    Duan, Qiang
    Chen, Qibin
    Guo, Xinfei
    2024 IEEE 6TH INTERNATIONAL CONFERENCE ON AI CIRCUITS AND SYSTEMS, AICAS 2024, 2024, : 512 - 516
  • [9] Complexity-Aware Layer-Wise Mixed-Precision Schemes With SQNR-Based Fast Analysis
    Kim, Hana
    Eun, Hyun
    Choi, Jung Hwan
    Kim, Ji-Hoon
    IEEE ACCESS, 2023, 11 : 117800 - 117809
  • [10] A Mixed-Precision RISC-V Processor for Extreme-Edge DNN Inference
    Ottavi, Gianmarco
    Garofalo, Angelo
    Tagliavini, Giuseppe
    Conti, Francesco
    Benini, Luca
    Rossi, Davide
    2020 IEEE COMPUTER SOCIETY ANNUAL SYMPOSIUM ON VLSI (ISVLSI 2020), 2020, : 512 - 517