A power efficiency enhancements of a multi-bit accelerator for memory prohibitive deep neural networks

Cited by: 6

Authors:

Shivapakash S. [1]

Jain H. [2]

Hellwich O. [2]

Gerfers F. [1]

Affiliations:

[1] Department of Computer Engineering and Microelectronics, Chair of Mixed Signal Circuit Design, Technical University of Berlin, Berlin

[2] Department of Computer Engineering and Microelectronics, Computer Vision and Remote Sensing, Technical University of Berlin, Berlin

Source:

IEEE Open Journal of Circuits and Systems | 2021, Vol. 2

Keywords:

AlexNet; ASIC; Deep neural network; EfficientNet; FPGA; MobileNet; multi-bit accelerator; SqueezeNet; truncation

DOI:

10.1109/OJCAS.2020.3047225

Abstract:

Convolutional neural networks (CNNs) are widely employed in contemporary artificial intelligence systems. However, these models have millions of connections between layers, which makes them both memory prohibitive and computationally expensive. Deploying them in embedded mobile applications is constrained by limited resources, high power consumption, and the significant bandwidth required to access data from off-chip DRAM. Reducing data movement between on-chip memory and off-chip DRAM is the main criterion for achieving high throughput and better overall energy efficiency. Our proposed multi-bit accelerator achieves these goals by truncating the partial sum (Psum) results of the preceding layer before feeding them into the next layer. We demonstrate the architecture by running inference with 32 bits for the first convolution layers and sequentially truncating bits at the MSB/LSB of the integer and fractional parts, without any further training of the original network. At the last fully connected layer, top-1 accuracy is maintained with a reduced bit width of 14, and top-5 accuracy down to a bit width of 10. The computation engine consists of a systolic array of 1024 processing elements (PEs). Large CNNs such as AlexNet, MobileNet, SqueezeNet and EfficientNet were used as benchmark models, and a Virtex UltraScale FPGA was used to test the architecture. Compared with the 32-bit architecture, the proposed truncation scheme reduces power by 49% and resource utilization by 73.25% for look-up tables (LUTs), 68.76% for flip-flops (FFs), 74.60% for block RAMs (BRAMs) and 79.425% for digital signal processors (DSPs). The design achieves a performance of 223.69 GOPS on a Virtex UltraScale FPGA, an overall throughput gain of 3.63× over prior FPGA accelerators. In addition, the overall power consumption is 4.5× lower than that of prior architectures. The ASIC version of the accelerator was designed in a 22 nm FDSOI CMOS process and achieves an overall throughput of 2.03 TOPS/W with a total power consumption of 791 mW and an area of 1 mm × 1.2 mm. © 2020 IEEE.
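
The Psum truncation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the Q16.16 starting format, the choice of which bits to drop, and the use of saturation when integer MSBs are removed are all assumptions made for illustration.

```python
import math


def to_fixed(x: float, frac_bits: int) -> int:
    """Quantize a real value to a signed fixed-point integer (rounds toward -inf)."""
    return math.floor(x * (1 << frac_bits))


def to_float(q: int, frac_bits: int) -> float:
    """Interpret a signed fixed-point integer with `frac_bits` fractional bits."""
    return q / (1 << frac_bits)


def truncate_psum(q: int, total_bits: int, drop_lsb: int, drop_msb: int) -> int:
    """Narrow a signed `total_bits`-wide Psum word (hypothetical helper).

    drop_lsb: fractional LSBs discarded with an arithmetic right shift.
    drop_msb: integer MSBs discarded; out-of-range values saturate
              (assumption: the abstract does not say how overflow is handled).
    """
    new_bits = total_bits - drop_lsb - drop_msb
    if new_bits < 2:
        raise ValueError("result must keep at least a sign bit and one data bit")
    shifted = q >> drop_lsb  # Python's >> on ints is an arithmetic shift
    lo = -(1 << (new_bits - 1))
    hi = (1 << (new_bits - 1)) - 1
    return max(lo, min(hi, shifted))


if __name__ == "__main__":
    # Assumed 32-bit Q16.16 Psum narrowed to 14 bits (8 integer incl. sign, 6 fractional).
    psum = to_fixed(3.14159, frac_bits=16)
    narrow = truncate_psum(psum, total_bits=32, drop_lsb=10, drop_msb=8)
    print(narrow, to_float(narrow, frac_bits=6))
```

With these assumed parameters, a 32-bit Psum is passed to the next layer as a 14-bit word, matching the bit width at which the abstract reports that top-1 accuracy is maintained; the per-layer split between integer and fractional bits would in practice depend on the network.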

Pages: 156-169

Number of pages: 13
