Improving Data Reuse in NPU On-chip Memory with Interleaved Gradient Order for DNN Training

Cited by: 0
Authors
Kim, Jungwoo [1 ]
Na, Seonjin [1 ,2 ]
Lee, Sanghyeon [1 ]
Lee, Sunho [1 ]
Huh, Jaehyuk [1 ]
Affiliations
[1] Korea Adv Inst Sci & Technol, Daejeon, South Korea
[2] Georgia Inst Technol, Atlanta, GA USA
Source
56TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE, MICRO 2023 | 2023
Keywords
DNN training; accelerators; on-chip memory; scheduling;
DOI
10.1145/3613424.3614299
CLC Classification Number
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
During training of machine learning models on neural processing units (NPUs), the most time-consuming part is the backward pass, which incurs significant overheads due to off-chip memory accesses. To mitigate the long latency and limited bandwidth of such off-chip DRAM accesses, NPUs rely on software-managed on-chip scratchpad memory (SPM). Since the backward-pass computation must be optimized to improve the effectiveness of SPM, this study identifies a new data reuse pattern specific to the backward computation. In each layer, the backward pass includes independent input-gradient and weight-gradient computations that share the same output gradient. Conventional sequential processing does not exploit this potential inter-operation data reuse within SPM. Building on this new data reuse opportunity in the backward pass, this study proposes a novel data flow transformation scheme called interleaved gradient order, consisting of three techniques to enhance the utilization of NPU scratchpad memory. The first technique interleaves the input-gradient and weight-gradient computations into a single fused operation to reduce redundant output-gradient accesses. The second technique adjusts the tile access order for the interleaved gradient computations to maximize data locality; since the best order is not fixed for all tensors, a selection algorithm chooses the most suitable order based on the tensor dimensions. The final technique further improves data reuse by selecting the best partitioning and mapping scheme for the two gradient computations on single-core and multi-core NPUs. Simulation-based evaluation with single-core edge and server NPUs shows that the combined techniques improve performance by 29.3% and 14.5% for edge and server NPUs, respectively. Furthermore, on a quad-core server NPU, the proposed techniques reduce execution time by 23.7%.
Pages: 438-451
Page count: 14
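
To make the abstract's central data-reuse idea concrete, below is a minimal NumPy sketch for one fully connected layer: the conventional sequential order streams the output gradient dY through on-chip memory twice (once per gradient computation), while an interleaved order loads each dY tile once and reuses it for both the input-gradient and weight-gradient computations. The tile size, function names, and tiling along the batch dimension are illustrative assumptions, not the paper's actual NPU data flow, tile-order selection, or partitioning scheme.

```python
import numpy as np

TILE = 128  # assumed SPM tile size along the batch dimension (illustrative)

def backward_sequential(dY, W, X):
    """Conventional order: dX and dW are computed in two separate passes,
    so every dY tile is effectively fetched from off-chip memory twice."""
    dX = np.empty_like(X)
    dW = np.zeros_like(W)
    for i in range(0, dY.shape[0], TILE):
        dX[i:i+TILE] = dY[i:i+TILE] @ W.T          # input-gradient pass
    for i in range(0, dY.shape[0], TILE):
        dW += X[i:i+TILE].T @ dY[i:i+TILE]         # weight-gradient pass re-fetches dY
    return dX, dW

def backward_interleaved(dY, W, X):
    """Interleaved gradient order (sketch): each dY tile is brought on-chip once
    and reused by both gradient computations before moving to the next tile."""
    dX = np.empty_like(X)
    dW = np.zeros_like(W)
    for i in range(0, dY.shape[0], TILE):
        dy_tile = dY[i:i+TILE]                     # single fetch, kept in SPM
        dX[i:i+TILE] = dy_tile @ W.T               # input gradient
        dW += X[i:i+TILE].T @ dy_tile              # weight gradient reuses dy_tile
    return dX, dW

# Tiny usage check: both orders produce identical gradients.
rng = np.random.default_rng(0)
X, W = rng.standard_normal((512, 256)), rng.standard_normal((256, 64))
dY = rng.standard_normal((512, 64))
assert all(np.allclose(a, b) for a, b in zip(backward_sequential(dY, W, X),
                                             backward_interleaved(dY, W, X)))
```

The sketch only illustrates the first technique (fusing the two gradient computations around a shared dY tile); the paper's remaining techniques concern which tile order to traverse and how to partition the fused work across NPU cores.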