Locality-aware Optimizations for Improving Remote Memory Latency in Multi-GPU Systems

Cited by: 2
Authors
Belayneh, Leul [1]
Ye, Haojie [1]
Chen, Kuan-Yu [1]
Blaauw, David [1]
Mudge, Trevor [1]
Dreslinski, Ronald [1]
Talati, Nishil [1]
Affiliations
[1] Univ Michigan, Comp Sci & Engn, Ann Arbor, MI 48109 USA
Keywords
GPGPU; multi-GPU; data movement; GPU cache management; CACHE
DOI
10.1145/3559009.3569649
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
With generational gains from transistor scaling, GPUs have been able to accelerate traditional computation-intensive workloads. But with the obsolescence of Moore's Law, single-GPU systems can no longer satisfy the computational and memory requirements of emerging workloads. To remedy this, prior works have proposed tightly coupled multi-GPU systems. However, multi-GPU systems cannot use their compute resources efficiently because of the Non-Uniform Memory Access (NUMA) bottleneck. In this paper, we propose DualOpt, a lightweight hardware-only solution that reduces remote memory access latency by delivering optimizations tailored to a workload's locality profile. DualOpt uses the spatio-temporal locality of remote memory accesses as a metric to classify workloads as either cache-insensitive or cache-friendly. Cache-insensitive workloads exhibit low spatio-temporal locality, while cache-friendly workloads have ample locality that the conventional cache subsystem of the GPU does not exploit well. For cache-insensitive workloads, DualOpt proposes a fine-granularity transfer of remote data instead of the conventional cache-line transfer. These remote data are then coalesced to use inter-GPU bandwidth efficiently. For cache-friendly workloads, DualOpt adds a remote-only cache that can exploit locality in remote accesses. Finally, a decision engine automatically identifies the class of workload and delivers the corresponding optimization, which improves overall performance by 2.5x on a 4-GPU system, with a small hardware overhead of 0.032%.
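The classification step described in the abstract can be illustrated with a small software sketch. This is not DualOpt's hardware decision engine; the abstract does not specify the metric's exact definition, so everything here is an assumption made for illustration: the 128-byte line size, the 64-line tracking window, the 0.5 threshold, and the function names `remote_line_reuse` and `choose_optimization` are all hypothetical. The sketch estimates locality as the fraction of remote accesses that land on a recently touched cache line, then picks one of the two optimizations named in the abstract.

```python
from collections import OrderedDict

LINE_BYTES = 128  # assumed cache-line size, not taken from the paper


def remote_line_reuse(addresses, window=64, line_bytes=LINE_BYTES):
    """Fraction of remote accesses that hit one of the last `window`
    distinct cache lines (a hypothetical stand-in for the paper's
    spatio-temporal locality metric)."""
    recent = OrderedDict()  # tracks recently used lines in LRU order
    hits = 0
    for addr in addresses:
        line = addr // line_bytes
        if line in recent:
            hits += 1
            recent.move_to_end(line)  # refresh recency on reuse
        else:
            recent[line] = True
            if len(recent) > window:
                recent.popitem(last=False)  # evict least recently used line
    return hits / len(addresses) if addresses else 0.0


def choose_optimization(addresses, threshold=0.5):
    """Pick an optimization class; the threshold is an assumed value."""
    if remote_line_reuse(addresses) >= threshold:
        return "remote-only cache"          # cache-friendly workload
    return "fine-granularity transfer"      # cache-insensitive workload


# Sparse accesses touch a new line every time; dense accesses reuse lines.
sparse = [i * 4096 for i in range(100)]
dense = [i * 4 for i in range(256)]
print(choose_optimization(sparse))
print(choose_optimization(dense))
```

In this toy model, a strided pattern that never revisits a line is routed to fine-granularity transfer, since fetching a whole line would mostly move unused bytes, while a dense pattern is routed to the remote-only cache.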
Pages: 304-316
Number of pages: 13