Locality-aware Optimizations for Improving Remote Memory Latency in Multi-GPU Systems

Cited by: 2

Authors:

Belayneh, Leul [1]

Ye, Haojie [1]

Chen, Kuan-Yu [1]

Blaauw, David [1]

Mudge, Trevor [1]

Dreslinski, Ronald [1]

Talati, Nishil [1]

Affiliations:

[1] Univ Michigan, Comp Sci & Engn, Ann Arbor, MI 48109 USA

Source:

PROCEEDINGS OF THE 2022 31ST INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES, PACT 2022 | 2022

Keywords:

GPGPU; multi-GPU; data movement; GPU cache management; CACHE;

DOI:

10.1145/3559009.3569649

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

With generational gains from transistor scaling, GPUs have been able to accelerate traditional computation-intensive workloads. But with the obsolescence of Moore's Law, single GPU systems are no longer able to satisfy the computational and memory requirements of emerging workloads. To remedy this, prior works have proposed tightly-coupled multi-GPU systems. However, multi-GPU systems are hampered from efficiently utilizing their compute resources due to the Non-Uniform Memory Access (NUMA) bottleneck. In this paper, we propose DualOpt, a lightweight hardware-only solution that reduces the remote memory access latency by delivering optimizations catered to a workload's locality profile. DualOpt uses the spatio-temporal locality of remote memory accesses as a metric to classify workloads as cache insensitive and cache-friendly. Cache insensitive workloads exhibit low spatio-temporal locality, while cache-friendly workloads have ample locality that is not exploited well by the conventional cache subsystem of the GPU. For cache insensitive workloads, DualOpt proposes a fine-granularity transfer of remote data instead of the conventional cache line transfer. These remote data are then coalesced so as to efficiently utilize inter-GPU bandwidth. For cache-friendly workloads, DualOpt adds a remote-only cache that can exploit locality in remote accesses. Finally, a decision engine automatically identifies the class of workload and delivers the corresponding optimization, which improves overall performance by 2.5x on a 4-GPU system, with a small hardware overhead of 0.032%.


Pages: 304-316

Number of pages: 13

50 items in total

[31] Sphynx: A parallel multi-GPU graph partitioner for distributed-memory systems
Acer, Seher
Boman, Erik G.
Glusa, Christian A.
Rajamanickam, Sivasankaran
PARALLEL COMPUTING, 2021, 106
[32] Data Parallel Skeletons for GPU Clusters and Multi-GPU Systems
Ernsting, Steffen
Kuchen, Herbert
APPLICATIONS, TOOLS AND TECHNIQUES ON THE ROAD TO EXASCALE COMPUTING, 2012, 22 : 509 - 518
[33] ScaleDNN: Data Movement Aware DNN Training on Multi-GPU
Xu, Weizheng
Pattnaik, Ashutosh
Yuan, Geng
Wang, Yanzhi
Zhang, Youtao
Tang, Xulong
2021 IEEE/ACM INTERNATIONAL CONFERENCE ON COMPUTER AIDED DESIGN (ICCAD), 2021
[34] NUMA-aware image compositing on multi-GPU platform
Pan Wang
Zhiquan Cheng
Ralph Martin
Huahai Liu
Xun Cai
Sikun Li
The Visual Computer, 2013, 29 : 639 - 649
[35] NUMA-aware image compositing on multi-GPU platform
Wang, Pan
Cheng, Zhiquan
Martin, Ralph
Liu, Huahai
Cai, Xun
Li, Sikun
VISUAL COMPUTER, 2013, 29 (6-8) : 639 - 649
[36] Tiresias: Optimizing NUMA Performance with CXL Memory and Locality-Aware Process Scheduling
Tang, Wenda
Ai, Tianxiang
Wu, Jie
PROCEEDINGS OF THE ACM TURING AWARD CELEBRATION CONFERENCE-CHINA 2024, ACM-TURC 2024, 2024 : 6 - 11
[37] Suffix Array Construction on Multi-GPU Systems
Bueren, Florian
Juenger, Daniel
Kobus, Robin
Hundt, Christian
Schmidt, Bertil
HPDC'19: PROCEEDINGS OF THE 28TH INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE PARALLEL AND DISTRIBUTED COMPUTING, 2019 : 183 - 194
[38] Multi-GPU codes for spin systems simulations
Bernaschi, M.
Fatica, M.
Parisi, G.
Parisi, L.
COMPUTER PHYSICS COMMUNICATIONS, 2012, 183 (07) : 1416 - 1421
[39] Multi-user predictive rendering on remote multi-GPU clusters
Randrianandrasana, J.
Chanonier, A.
Deleau, H.
Muller, T.
Porral, P.
Krajecki, M.
Lucas, L.
2018 IEEE FOURTH VR INTERNATIONAL WORKSHOP ON COLLABORATIVE VIRTUAL ENVIRONMENTS (3DCVE), 2018
[40] A Multi-GPU PCISPH Implementation with Efficient Memory Transfers
Verma, Kevin
Peng, Chong
Szewc, Kamil
Wille, Robert
2018 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC), 2018
