SAF-CNN: A Sparse Acceleration Framework of Convolutional Neural Network for Embedded FPGAs

Cited by: 0
Authors
Xie K. [1 ,4 ]
Yi D. [2 ,4 ]
Liu Y. [2 ,4 ]
Liu H. [1 ,4 ]
He X. [2 ,4 ]
Gong C. [3 ]
Lu Y. [1 ,2 ,4 ,5 ]
Affiliations
[1] College of Computer Science, Nankai University, Tianjin
[2] College of Cyber Science, Nankai University, Tianjin
[3] College of Software, Nankai University, Tianjin
[4] Tianjin Key Laboratory of Network and Data Security Technology, Nankai University, Tianjin
[5] State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
Funding
National Natural Science Foundation of China
Keywords
accelerator design; computational graph; convolutional neural network; inference framework; model compression
DOI
10.7544/issn1000-1239.202220735
Abstract
When deploying models on resource-constrained FPGAs, traditional convolutional neural network accelerators and inference frameworks face challenges such as diverse device types, extremely limited resources, underutilized data bandwidth, and complex operator types that make operator matching and computing-task scheduling difficult. This paper proposes SAF-CNN, a sparse acceleration framework for convolutional neural networks on embedded FPGAs. Through software-hardware co-design, SAF-CNN is jointly optimized from the two perspectives of hardware accelerator design and software inference framework. First, SAF-CNN constructs a parallel computing array and designs a parallel encoding and decoding scheme that transmits multiple data items per cycle, effectively reducing communication cost. Second, a fine-grained structured block-partitioning pruning algorithm is designed that prunes along the input-channel dimension within each block, yielding a sparse yet regular weight matrix and significantly reducing both the computation scale and DSP multiplier utilization. Third, a dynamic input-channel expansion method and a runtime scheduling strategy compatible with depthwise separable convolution are proposed to enable flexible adaptation of input-channel parameters and resource reuse between point-wise and depth-wise convolutions. Finally, computational graph reconstruction and hardware operator fusion are used to improve hardware execution efficiency. Experiments on two resource-limited low-end FPGA heterogeneous platforms, Intel Cyclone V and Xilinx ZU3EG, show that the SAF-CNN accelerator achieves computational performance of 76.3 GOPS and 494.3 GOPS, respectively. Compared with a multi-core CPU, SAF-CNN achieves 3.5x and 2.2x performance improvements on the SSD_MobileNetV1 object detection model, with inference speed reaching 26.5 fps. © 2023 Science Press. All rights reserved.
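The pruning step described in the abstract can be made concrete with a small sketch. The Python snippet below is a minimal illustration, not the authors' implementation: the convolution weight tensor is partitioned into blocks along the output-channel dimension, and within each block the least important input channels (ranked by L1 magnitude) are zeroed, so every block retains the same number of surviving input channels and the sparsity pattern stays regular. The function name, `block_size`, and `keep_ratio` are illustrative assumptions.

```python
import numpy as np

def block_channel_prune(weight, block_size=8, keep_ratio=0.5):
    """Illustrative sketch of fine-grained structured block pruning.

    weight: conv weight of shape (C_out, C_in, kH, kW).
    Within each block of `block_size` output channels, input channels
    are ranked by L1 norm and only the top `keep_ratio` fraction is
    kept; the rest are zeroed, giving a block-regular sparse matrix.
    This is an assumed formulation, not the paper's exact algorithm.
    """
    c_out, c_in, kh, kw = weight.shape
    pruned = weight.copy()
    keep = max(1, int(c_in * keep_ratio))
    for start in range(0, c_out, block_size):
        block = pruned[start:start + block_size]    # view: (b, C_in, kH, kW)
        # Importance of each input channel within this block (L1 norm).
        scores = np.abs(block).sum(axis=(0, 2, 3))  # (C_in,)
        drop = np.argsort(scores)[:c_in - keep]     # least important channels
        block[:, drop, :, :] = 0.0                  # zero whole input channels
    return pruned

# Tiny usage example: prune a random 16x32x3x3 conv layer.
w = np.random.randn(16, 32, 3, 3).astype(np.float32)
w_sparse = block_channel_prune(w, block_size=8, keep_ratio=0.5)
print("nonzero fraction:", np.count_nonzero(w_sparse) / w_sparse.size)
```

Because every block keeps the same channel count, such a pattern maps naturally onto fixed-width parallel compute arrays, which is consistent with the abstract's stated goal of reducing DSP multiplier utilization without irregular sparsity.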
Pages: 1053-1072
Page count: 19