Warped-MC: An Efficient Memory Controller Scheme for Massively Parallel Processors

Cited by: 0

Authors:

Jeong, Jong Hyun [1]

Yoon, Myung Kuk [2]

Oh, Yunho [1]

Koo, Gunjae [1]

Affiliations:

[1] Korea Univ, Seoul, South Korea

[2] Ewha Womans Univ, Seoul, South Korea

Source:

PROCEEDINGS OF THE 52ND INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2023 | 2023

Funding:

National Research Foundation, Singapore;

Keywords:

GPU Architecture; Memory System; Memory Controller; CACHE MANAGEMENT; SUITE;

DOI:

10.1145/3605573.3605645

Chinese Library Classification number:

TP301 [Theory, methods];

Discipline classification code:

081202;

Abstract:

The performance of a GPU's external memory is becoming more critical because a modern GPU runs thousands of concurrent threads that demand a huge volume of data. To use the resources of the memory hierarchy more efficiently, GPUs employ a memory coalescing scheme that reduces the number of demand requests generated by a group of threads (i.e., a warp). However, memory coalescing does not work well for applications with irregular memory access patterns, so a single warp can generate multiple memory transactions. Because these requests are serviced by different hierarchy levels and/or memory partitions, the outstanding requests of a single warp exhibit divergent fetch latencies. Since the execution time of a load warp is determined by its slowest memory transaction, memory latency divergence within a warp is a critical performance factor for load warps. In this paper, we propose a warp-aware memory controller scheme, called Warped-MC, to mitigate memory latency divergence. Our in-depth analysis reveals that the memory latency divergence within a warp is mainly caused by the GPU memory controllers. While the conventional FR-FCFS memory controller can maximize the effective bandwidth of DRAM channels, its scheduling scheme can exacerbate the memory latency divergence of a warp. Warped-MC employs a warp-aware scheduling scheme to alleviate this divergence, so it can tackle the long tail of load warp execution time and improve the performance of memory-intensive applications. We implement Warped-MC on GPGPU-Sim configured with a modern GPU architecture, and our evaluation shows that Warped-MC improves the performance of memory-intensive applications by 8.9% on average and by up to 45.8%.
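The mechanisms named in the abstract can be illustrated with a small Python sketch. It is not the paper's implementation and does not model Warped-MC itself; it only shows generic memory coalescing, the rule that a load warp waits for its slowest transaction, and the well-known FR-FCFS policy (row hits first, then oldest) for a single bank. The 128-byte transaction size, the `Request` fields, the latency values, and all function names are assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional

LINE_BYTES = 128  # assumed transaction size; not taken from the paper


class Request(NamedTuple):
    arrival: int  # arrival order at the memory controller
    warp: int     # id of the warp that issued the request
    row: int      # DRAM row the request targets


def coalesce(addresses: List[int], line_bytes: int = LINE_BYTES) -> List[int]:
    """Merge one warp's byte addresses into distinct line-sized transactions."""
    return sorted({addr // line_bytes * line_bytes for addr in addresses})


def warp_load_latency(latencies: List[int]) -> int:
    """A load warp completes only when its slowest transaction returns."""
    return max(latencies)


def fr_fcfs_order(queue: List[Request], open_row: Optional[int]) -> List[Request]:
    """Service order under FR-FCFS for one bank: row hits first, then oldest."""
    pending, order = list(queue), []
    while pending:
        hits = [r for r in pending if r.row == open_row]
        nxt = min(hits or pending, key=lambda r: r.arrival)
        pending.remove(nxt)
        order.append(nxt)
        open_row = nxt.row  # the row just serviced stays open
    return order


regular = [4 * t for t in range(32)]      # 32 threads, consecutive 4-byte words
irregular = [512 * t for t in range(32)]  # 32 threads, 512-byte stride

print(len(coalesce(regular)), len(coalesce(irregular)))
print(warp_load_latency([200, 210, 650]))  # hypothetical latencies in cycles

queue = [Request(0, warp=0, row=7), Request(1, warp=1, row=3), Request(2, warp=1, row=3)]
print([r.warp for r in fr_fcfs_order(queue, open_row=3)])
```

In this toy run the regular pattern collapses into a single transaction while the strided pattern needs 32, and the request from warp 0 is serviced last even though it arrived first, because the two row-hit requests from warp 1 are prioritized. That kind of reordering is what can stretch the slowest transaction of a warp.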


Pages: 546-555

Page count: 10

