Locality-aware Optimizations for Improving Remote Memory Latency in Multi-GPU Systems

Cited by: 2
Authors
Belayneh, Leul [1 ]
Ye, Haojie [1 ]
Chen, Kuan-Yu [1 ]
Blaauw, David [1 ]
Mudge, Trevor [1 ]
Dreslinski, Ronald [1 ]
Talati, Nishil [1 ]
Affiliations
[1] Univ Michigan, Comp Sci & Engn, Ann Arbor, MI 48109 USA
Keywords
GPGPU; multi-GPU; data movement; GPU cache management; CACHE
DOI
10.1145/3559009.3569649
Chinese Library Classification (CLC)
TP3 [computing technology, computer technology]
Discipline Code
0812
Abstract
With generational gains from transistor scaling, GPUs have been able to accelerate traditional computation-intensive workloads. But with the end of Moore's Law, single-GPU systems can no longer satisfy the computational and memory requirements of emerging workloads. To remedy this, prior works have proposed tightly-coupled multi-GPU systems. However, multi-GPU systems are prevented from efficiently utilizing their compute resources by the Non-Uniform Memory Access (NUMA) bottleneck. In this paper, we propose DualOpt, a lightweight hardware-only solution that reduces remote memory access latency by delivering optimizations catered to a workload's locality profile. DualOpt uses the spatio-temporal locality of remote memory accesses as a metric to classify workloads as cache-insensitive or cache-friendly. Cache-insensitive workloads exhibit low spatio-temporal locality, while cache-friendly workloads have ample locality that is not exploited well by the GPU's conventional cache subsystem. For cache-insensitive workloads, DualOpt transfers remote data at a fine granularity instead of in conventional cache-line units; these fine-grained requests are then coalesced to use inter-GPU bandwidth efficiently. For cache-friendly workloads, DualOpt adds a remote-only cache that exploits the locality in remote accesses. Finally, a decision engine automatically identifies a workload's class and applies the corresponding optimization, improving overall performance by 2.5x on a 4-GPU system with a small hardware overhead of 0.032%.
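To make the workload classification concrete, below is a minimal software sketch of how a locality-based decision engine could operate. It is not the paper's hardware design: the reuse metric, window size, threshold, and all identifiers (DecisionEngine, observeRemoteAccess, choosePolicy) are illustrative assumptions.

```cpp
// Illustrative sketch only: the paper's decision engine is a hardware unit;
// this models one plausible policy in software. The reuse metric, window
// size, and threshold below are assumptions, not values from the paper.
#include <cstdint>
#include <deque>
#include <unordered_map>

enum class Policy { FineGrainedCoalesced, RemoteOnlyCache };

class DecisionEngine {
public:
    DecisionEngine(std::size_t window = 4096, double threshold = 0.25)
        : window_(window), threshold_(threshold) {}

    // Record one remote access at cache-line granularity.
    void observeRemoteAccess(std::uint64_t lineAddr) {
        if (counts_.count(lineAddr)) ++reuseHits_;    // line re-touched within window
        ++total_;
        recent_.push_back(lineAddr);
        ++counts_[lineAddr];
        if (recent_.size() > window_) {               // slide the observation window
            std::uint64_t old = recent_.front();
            recent_.pop_front();
            if (--counts_[old] == 0) counts_.erase(old);
        }
    }

    // Low reuse -> cache-insensitive: fetch only the requested words and coalesce them.
    // High reuse -> cache-friendly: route remote lines through a remote-only cache.
    Policy choosePolicy() const {
        double reuse = total_ ? static_cast<double>(reuseHits_) / total_ : 0.0;
        return reuse < threshold_ ? Policy::FineGrainedCoalesced
                                  : Policy::RemoteOnlyCache;
    }

private:
    std::size_t window_;
    double threshold_;
    std::size_t reuseHits_ = 0, total_ = 0;
    std::deque<std::uint64_t> recent_;
    std::unordered_map<std::uint64_t, std::size_t> counts_;
};
```

In hardware, the same decision could be reached with a few sampling counters read periodically; the sketch only conveys the shape of the classification (reuse below a threshold selects fine-grained coalesced transfers, reuse above it enables the remote-only cache), not its implementation.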
Pages: 304-316
Page count: 13