Warped-MC: An Efficient Memory Controller Scheme for Massively Parallel Processors

Cited by: 0
Authors
Jeong, Jong Hyun [1]
Yoon, Myung Kuk [2]
Oh, Yunho [1]
Koo, Gunjae [1]
Affiliations
[1] Korea Univ, Seoul, South Korea
[2] Ewha Womans Univ, Seoul, South Korea
Funding
National Research Foundation, Singapore
Keywords
GPU Architecture; Memory System; Memory Controller; Cache Management; Suite
DOI
10.1145/3605573.3605645
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The performance of a GPU's external memory is becoming more critical because a modern GPU runs thousands of concurrent threads that demand a huge volume of data. To use the resources in the memory hierarchy more efficiently, GPUs employ a memory coalescing scheme that reduces the number of demand requests created by a group of threads (i.e., a warp). However, memory coalescing does not work well for applications with irregular memory access patterns, so a single warp can generate multiple memory transactions. Because these requests are serviced by different hierarchy levels and/or memory partitions, the outstanding requests of a single warp exhibit divergent fetch latencies. Since the execution time of a load warp is determined by its slowest memory transaction, memory latency divergence within a warp is a critical performance factor for load warps. In this paper, we propose a warp-aware memory controller scheme, called Warped-MC, to mitigate memory latency divergence. Through in-depth analysis, we show that memory latency divergence within a warp is caused mainly by GPU memory controllers. Although the conventional FR-FCFS memory controller can maximize the effective bandwidth of DRAM channels, its scheduling scheme can exacerbate the memory latency divergence of a warp. Warped-MC employs a warp-aware scheduling scheme to alleviate this divergence, and can thus shorten the long tail of load warp execution time and improve the performance of memory-intensive applications. We implement Warped-MC on GPGPU-Sim configured with a modern GPU architecture, and our evaluation shows that Warped-MC improves the performance of memory-intensive applications by 8.9% on average and by up to 45.8%.
Pages: 546-555
Page count: 10