Edge-MPQ: Layer-Wise Mixed-Precision Quantization With Tightly Integrated Versatile Inference Units for Edge Computing

Cited by: 0
Authors
Zhao, Xiaotian [1 ]
Xu, Ruge [1 ]
Gao, Yimin [2 ]
Verma, Vaibhav [3 ,4 ]
Stan, Mircea R. [2 ]
Guo, Xinfei [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Univ Michigan & Shanghai Jiao Tong Univ Joint Inst, Shanghai 200240, Peoples R China
[2] Univ Virginia, Dept Elect & Comp Engn, Charlottesville, VA 22903 USA
[3] Univ Virginia, Charlottesville, VA 22903 USA
[4] Qualcomm, San Diego, CA 92121 USA
Keywords
Quantization (signal); Hardware; Accuracy; Computer architecture; Virtual machine monitors; Training; Pipelines; Edge inference; mixed-precision quantization; ASIP; PTQ; QAT;
DOI
10.1109/TC.2024.3441860
CLC Classification Number
TP3 [Computing Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
As one of the prevailing deep neural network compression techniques, layer-wise mixed-precision quantization (MPQ) strikes a better balance between accuracy and efficiency than uniform quantization schemes. However, existing MPQ strategies either lack hardware awareness or incur huge computation costs, limiting their deployment at the edge. Additionally, researchers usually make a one-time decision between post-training quantization (PTQ) and quantization-aware training (QAT) based on the quantized bit-width or hardware requirements. In this paper, we propose the tight integration of versatile MPQ inference units supporting INT2-INT8 and INT16 precisions, which feature a hierarchical multiplier architecture, into a RISC-V processor pipeline through micro-architecture and Instruction Set Architecture (ISA) co-design. Synthesized with a 14 nm technology, the design delivers a speedup of 15.50x to 47.67x over the baseline RV64IMA core when running a single convolution layer kernel and achieves up to 2.86 GOPS. This work also achieves an energy efficiency of 20.51 TOPS/W, which not only exceeds contemporary state-of-the-art MPQ hardware solutions at the edge but also marks a significant advancement in the field. We also propose a novel MPQ search algorithm that incorporates both hardware awareness and training necessity. The algorithm samples layer-wise sensitivities using a set of newly proposed metrics and runs a heuristic search. Evaluation results show that this search algorithm achieves 2.2% to 6.7% higher inference accuracy under similar hardware constraints compared to state-of-the-art MPQ strategies. Furthermore, we expand the search space using a dynamic programming (DP) strategy to perform the search with more fine-grained accuracy intervals and to support multi-dimensional search. This further improves inference accuracy by over 1.3% compared to a greedy-based search.
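The abstract describes a hardware-aware heuristic that assigns per-layer bit-widths from sampled sensitivities under a hardware budget. The paper's actual metrics and cost model are not given here, so the sketch below is purely illustrative: `sensitivities`, the weight-storage cost proxy, and the greedy promote-most-sensitive-first policy are all assumptions, not the authors' algorithm.

```python
# Illustrative sketch of a sensitivity-driven greedy MPQ search.
# All names (sensitivities, cost proxy, budget) are hypothetical stand-ins
# for the paper's newly proposed metrics and hardware constraints.

def greedy_mpq_search(sensitivities, layer_sizes, budget, bitwidths=(2, 4, 8)):
    """Start every layer at the lowest precision, then promote the most
    sensitive layers to higher bit-widths while the budget allows."""
    # Initial assignment: smallest supported precision everywhere.
    assign = {layer: min(bitwidths) for layer in sensitivities}

    def cost(a):
        # Simple hardware-cost proxy: total weight-storage bits.
        return sum(layer_sizes[l] * b for l, b in a.items())

    # Visit layers from most to least sensitive; raise precision one
    # step at a time as long as the trial stays within budget.
    for layer in sorted(sensitivities, key=sensitivities.get, reverse=True):
        for b in sorted(bitwidths):
            if b <= assign[layer]:
                continue
            trial = dict(assign, **{layer: b})
            if cost(trial) <= budget:
                assign = trial
    return assign


# Toy usage: three layers with made-up sensitivities and sizes.
plan = greedy_mpq_search(
    sensitivities={"conv1": 0.9, "conv2": 0.3, "fc": 0.6},
    layer_sizes={"conv1": 1000, "conv2": 4000, "fc": 2000},
    budget=30000,
)
# The most sensitive layer (conv1) ends up at the highest precision the
# budget permits; the least sensitive (conv2) stays at the minimum.
```

The paper's DP extension would replace this greedy loop with an exact search over fine-grained accuracy intervals, which is beyond what this sketch attempts.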
Pages: 2504-2519
Page count: 16
Related Papers
15 items
  • [1] Design Space Exploration of Layer-Wise Mixed-Precision Quantization with Tightly Integrated Edge Inference Units
    Zhao, Xiaotian
    Gao, Yimin
    Verma, Vaibhav
    Xu, Ruge
    Stan, Mircea
    Guo, Xinfei
    PROCEEDINGS OF THE GREAT LAKES SYMPOSIUM ON VLSI 2023, GLSVLSI 2023, 2023, : 467 - 471
  • [2] Mixed-Precision Neural Network Quantization via Learned Layer-Wise Importance
    Tang, Chen
    Ouyang, Kai
    Wang, Zhi
    Zhu, Yifei
    Ji, Wen
    Wang, Yaowei
    Zhu, Wenwu
    COMPUTER VISION, ECCV 2022, PT XI, 2022, 13671 : 259 - 275
  • [3] MPQ-YOLO: Ultra low mixed-precision quantization of YOLO for edge devices deployment
    Liu, Xinyu
    Wang, Tao
    Yang, Jiaming
    Tang, Chenwei
    Lv, Jiancheng
    NEUROCOMPUTING, 2024, 574
  • [4] Channel-wise Mixed-precision Assignment for DNN Inference on Constrained Edge Nodes
    Risso, Matteo
    Burrello, Alessio
    Benini, Luca
    Macii, Enrico
    Poncino, Massimo
    Pagliari, Daniele Jahier
    2022 IEEE 13TH INTERNATIONAL GREEN AND SUSTAINABLE COMPUTING CONFERENCE (IGSC), 2022, : 33 - 38
  • [5] AMED: Automatic Mixed-Precision Quantization for Edge Devices
    Kimhi, Moshe
    Rozen, Tal
    Mendelson, Avi
    Baskin, Chaim
    MATHEMATICS, 2024, 12 (12)
  • [6] Dynamic Split Computing-Aware Mixed-Precision Quantization for Efficient Deep Edge Intelligence
    Nagamatsu, Naoki
    Hara-Azumi, Yuko
    2023 IEEE 22ND INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, BIGDATASE, CSE, EUC, ISCI 2023, 2024, : 2538 - 2545
  • [7] SQNR-based Layer-wise Mixed-Precision Schemes with Computational Complexity Consideration
    Kim, Ha-Na
    Eun, Hyun
    Choi, Jung Hwan
    Kim, Ji-Hoon
    2022 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS 22), 2022, : 234 - 235
  • [8] ILD-MPQ: Learning-Free Mixed-Precision Quantization with Inter-Layer Dependency Awareness
    Xu, Ruge
    Duan, Qiang
    Chen, Qibin
    Guo, Xinfei
    2024 IEEE 6TH INTERNATIONAL CONFERENCE ON AI CIRCUITS AND SYSTEMS, AICAS 2024, 2024, : 512 - 516
  • [9] Complexity-Aware Layer-Wise Mixed-Precision Schemes With SQNR-Based Fast Analysis
    Kim, Hana
    Eun, Hyun
    Choi, Jung Hwan
    Kim, Ji-Hoon
    IEEE ACCESS, 2023, 11 : 117800 - 117809
  • [10] A Mixed-Precision RISC-V Processor for Extreme-Edge DNN Inference
    Ottavi, Gianmarco
    Garofalo, Angelo
    Tagliavini, Giuseppe
    Conti, Francesco
    Benini, Luca
    Rossi, Davide
    2020 IEEE COMPUTER SOCIETY ANNUAL SYMPOSIUM ON VLSI (ISVLSI 2020), 2020, : 512 - 517