4-bit CNN Quantization Method With Compact LUT-Based Multiplier Implementation on FPGA

Times cited: 3
Authors
Zhao, Bingrui [1]
Wang, Yaonan [1]
Zhang, Hui [1]
Zhang, Jinzhou [2]
Chen, Yurong [1]
Yang, Yimin [3]
Affiliations
[1] Hunan Univ, Natl Engn Res Ctr Robot Visual Percept & Control T, Sch Robot, Changsha 410082, Hunan, Peoples R China
[2] Changsha Univ Sci & Technol, Sch Elect & Informat Engn, Changsha 410114, Hunan, Peoples R China
[3] Western Univ, Dept Elect & Comp Engn, London, ON N6A 3K7, Canada
Funding
National Natural Science Foundation of China
Keywords
Quantization (signal); Field programmable gate arrays; Convolutional neural networks; Hardware; Convolution; Complexity theory; Power demand; Convolutional neural network (CNN); digital circuit; field-programmable gate array (FPGA); lookup table (LUT)-based multiplier; low-precision quantization; Neural networks; YOLO CNN; Acceleration
DOI
10.1109/TIM.2023.3324357
Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline code
0808; 0809
Abstract
To address the challenge of deploying convolutional neural networks (CNNs) on resource-constrained edge devices, this article presents an effective 4-bit quantization scheme for CNNs and proposes a DSP-free multiplier design for deploying quantized neural networks on field-programmable gate array (FPGA) devices. Specifically, we first introduce a threshold-aware quantization (TAQ) method with a mixed rounding strategy that compresses the model while maintaining the accuracy of the original full-precision model. Experimental results demonstrate that the proposed quantization method preserves high classification accuracy in 4-bit quantized CNN models. In addition, we propose a compact lookup table-based multiplier (CLM) that replaces numerical multiplication with a lookup table (LUT) of precomputed 4-bit multiplication results. Because it relies on LUT6 resources instead of scarce DSP blocks, the CLM improves the scalability of FPGAs for multiplication-intensive CNN algorithms. The proposed 4-bit CLM consumes only 13 LUT6s, fewer than existing LUT-based multipliers (LMULs). Together, the proposed CNN quantization and CLM schemes reduce FPGA resource consumption in image classification tasks, supporting deep learning algorithms in unmanned systems, industrial inspection, and other vision and measurement scenarios running on DSP-constrained edge devices.
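The core idea of the CLM, replacing a 4-bit multiplier with a table of precomputed products, can be illustrated in software. The Python sketch below is hypothetical and is not the authors' method: the names `quantize_4bit`, `PRODUCT_LUT`, `lut_multiply` and `lut_dot` are illustrative, operands are assumed to be signed 4-bit integers in [-8, 7], and the quantizer is a plain symmetric round-to-nearest scheme swapped in for TAQ, whose threshold selection and mixed rounding strategy are not specified in this record. It does not model the CLM circuit or its 13-LUT6 mapping.

```python
import numpy as np

# Signed 4-bit integer range (an assumption; the paper's encoding may differ).
QMIN, QMAX = -8, 7


def quantize_4bit(x, threshold):
    """Plain symmetric 4-bit quantizer with a fixed clipping threshold.

    Hypothetical helper: this is NOT the paper's TAQ method or its mixed
    rounding strategy; it rounds to nearest (ties to even, as np.round does).
    """
    scale = threshold / QMAX
    q = np.clip(np.round(np.asarray(x) / scale), QMIN, QMAX).astype(np.int8)
    return q, scale


# Every signed 4-bit x 4-bit product, precomputed once (16 x 16 = 256 entries).
# Products span [-56, 64], so int16 holds them comfortably.
PRODUCT_LUT = np.array(
    [[a * b for b in range(QMIN, QMAX + 1)] for a in range(QMIN, QMAX + 1)],
    dtype=np.int16,
)


def lut_multiply(a, b):
    """Return a * b for signed 4-bit ints by table lookup, not arithmetic."""
    return int(PRODUCT_LUT[a - QMIN, b - QMIN])


def lut_dot(qa, qb):
    """Integer dot product built only from table lookups and additions."""
    return sum(lut_multiply(int(a), int(b)) for a, b in zip(qa, qb))


# Usage: quantize real-valued weights and activations, then accumulate.
weights = np.array([0.42, -0.91, 0.10, 0.75])
acts = np.array([0.30, 0.55, -0.20, 0.95])
qw, sw = quantize_4bit(weights, threshold=1.0)  # qw = [3, -6, 1, 5]
qa, sa = quantize_4bit(acts, threshold=1.0)     # qa = [2, 4, -1, 7]
acc = lut_dot(qw, qa)                           # 16
approx = acc * sw * sa                          # dequantized result
exact = float(np.dot(weights, acts))            # full-precision reference
```

In hardware, the same 256-entry table would be realized in FPGA logic rather than indexed in memory; how the CLM maps it onto 13 LUT6s is not reproduced here.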
Pages: 1-10
Page count: 10