FinePack: Transparently Improving the Efficiency of Fine-Grained Transfers in Multi-GPU Systems

Cited by: 4
Authors
Muthukrishnan, Harini [1,2]
Lustig, Daniel [1]
Villa, Oreste [1]
Wenisch, Thomas [2]
Nellans, David [1]
Affiliations
[1] NVIDIA, Santa Clara, CA 95051 USA
[2] Univ Michigan, Ann Arbor, MI 48109 USA
Keywords
MEMORY; MANAGEMENT; PLACEMENT
DOI
10.1109/HPCA56546.2023.10070949
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Recent studies have shown that using fine-grained peer-to-peer (P2P) stores to communicate among devices in multi-GPU systems is a promising path to achieve strong performance scaling. In many irregular applications, such as graph algorithms and sparse linear algebra, small sub-cache-line (4-32B) stores arise naturally when using the P2P paradigm. This is particularly problematic in multi-GPU systems because inter-GPU interconnects are optimized for bulk transfers rather than small operations. As a consequence, application developers either resort to complex programming techniques to work around this small-transfer inefficiency or fall back to bulk inter-GPU DMA transfers that have limited performance scalability. We propose FinePack, a set of limited I/O interconnect and GPU hardware enhancements that enable small peer-to-peer stores to achieve interconnect efficiency that rivals bulk transfers while maintaining the simplicity of a peer-to-peer memory access programming model. Exploiting the GPU's weak memory model, FinePack dynamically coalesces and compresses small writes into a larger I/O message that reduces link-level protocol overhead. FinePack is fully transparent to software and requires no changes to the GPU's virtual memory system. We evaluate FinePack on a system comprising 4 Volta GPUs on a PCIe 4.0 interconnect to show FinePack improves interconnect efficiency for small peer-to-peer stores by 3x. This results in 4-GPU strong scaling performance 1.4x better than traditional DMA-based multi-GPU programming and comes within 71% of the maximum achievable strong scaling performance.
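The abstract contrasts two inter-GPU communication styles: fine-grained peer-to-peer stores issued directly from a kernel, and bulk DMA copies between device buffers. The sketch below is not taken from the paper; it is a minimal CUDA illustration of what those two styles look like on a P2P-capable two-GPU system, with the kernel names, buffer sizes, and index pattern chosen purely for illustration.

// Minimal CUDA sketch (illustrative, not from the paper): fine-grained P2P
// stores from a kernel on GPU 0 into GPU 1's memory, versus a bulk DMA copy.
#include <cuda_runtime.h>
#include <cstdio>

// Fill the index and value arrays; the index pattern is an arbitrary
// in-bounds scatter standing in for irregular (graph/sparse) accesses.
__global__ void initInputs(int* indices, int* values, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        indices[i] = (int)((1LL * i * 7919) % n);
        values[i]  = i;
    }
}

// Fine-grained path: each thread issues a single 4B store directly to the
// peer GPU's memory. These sub-cache-line P2P stores are the traffic that
// FinePack coalesces and compresses on the interconnect.
__global__ void scatterToPeer(int* __restrict__ peerBuf,
                              const int* __restrict__ indices,
                              const int* __restrict__ values,
                              int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        peerBuf[indices[i]] = values[i];  // 4B store over the inter-GPU link
    }
}

int main() {
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Assumes two P2P-capable GPUs (devices 0 and 1).
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);

    int *values0, *indices0, *dst1;
    cudaMalloc(&values0, n * sizeof(int));
    cudaMalloc(&indices0, n * sizeof(int));
    cudaSetDevice(1);
    cudaMalloc(&dst1, n * sizeof(int));
    cudaSetDevice(0);

    initInputs<<<blocks, threads>>>(indices0, values0, n);

    // (1) Fine-grained peer-to-peer stores issued from a kernel on GPU 0.
    scatterToPeer<<<blocks, threads>>>(dst1, indices0, values0, n);

    // (2) Bulk DMA alternative: one large contiguous copy to the peer GPU.
    cudaMemcpyPeer(dst1, 1, values0, 0, n * sizeof(int));

    cudaDeviceSynchronize();
    printf("done\n");

    cudaFree(values0);
    cudaFree(indices0);
    cudaSetDevice(1);
    cudaFree(dst1);
    return 0;
}

Because FinePack is described as fully transparent to software, the fine-grained path in (1) would need no source changes; the coalescing and compression of the small stores happen in the proposed hardware beneath the GPU's memory system.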
Pages: 516-529
Number of pages: 14
Related Papers
50 records in total
  • [31] GPUPool: A Holistic Approach to Fine-Grained GPU Sharing in the Cloud
    Tan, Xiaodan Serina
    Golikov, Pavel
    Vijaykumar, Nandita
    Pekhimenko, Gennady
    PROCEEDINGS OF THE 2022 31ST INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES, PACT 2022, 2022, : 317 - 332
  • [32] Fine-Grained Parallelization of a Vlasov-Poisson Application on GPU
    Latu, Guillaume
    EURO-PAR 2010 PARALLEL PROCESSING WORKSHOPS, 2011, 6586 : 127 - 135
  • [33] cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on a GPU
    Zhang, Jing
    Wang, Hao
    Lin, Heshan
    Feng, Wu-Chun
    2014 IEEE 28TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM, 2014,
  • [34] A Fine-grained Parallel Intra Prediction for HEVC Based on GPU
    Jiang, Wenbin
    Chi, Ye
    Jin, Hai
    Liao, Xiaofei
    Zhang, Yangsong
    Ye, Geyan
    2016 IEEE 22ND INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS), 2016, : 778 - 784
  • [35] Pagoda: Fine-Grained GPU Resource Virtualization for Narrow Tasks
    Yeh, Tsung Tai
    Sabne, Amit
    Sakdhnagool, Putt
    Eigenmann, Rudolf
    Rogers, Timothy G.
    ACM SIGPLAN NOTICES, 2017, 52 (08) : 221 - 233
  • [36] Accelerating a Lossy Compression Method with Fine-Grained Parallelism on a GPU
    Wu, Yifan
    Shen, Jingcheng
    Okita, Masao
    Ino, Fumihiko
    PAAP 2021: 2021 12TH INTERNATIONAL SYMPOSIUM ON PARALLEL ARCHITECTURES, ALGORITHMS AND PROGRAMMING, 2021, : 76 - 81
  • [37] Locality-aware Optimizations for Improving Remote Memory Latency in Multi-GPU Systems
    Belayneh, Leul
    Ye, Haojie
    Chen, Kuan-Yu
    Blaauw, David
    Mudge, Trevor
    Dreslinski, Ronald
    Talati, Nishil
    PROCEEDINGS OF THE 2022 31ST INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES, PACT 2022, 2022, : 304 - 316
  • [38] Simultaneous Multikernel GPU: Multi-tasking Throughput Processors via Fine-Grained Sharing
    Wang, Zhenning
    Yang, Jun
    Melhem, Rami
    Childers, Bruce
    Zhang, Youtao
    Guo, Minyi
    PROCEEDINGS OF THE 2016 IEEE INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE (HPCA-22), 2016, : 358 - 369
  • [39] Fine-grained multiagent systems for the Internet
    Nangsue, P
    Conry, SE
    INTERNATIONAL CONFERENCE ON MULTI-AGENT SYSTEMS, PROCEEDINGS, 1998, : 198 - 205
  • [40] Improving Energy Efficiency of CGRAs with Low-Overhead Fine-Grained Power Domains
    Nayak, Ankita
    Zhang, Keyi
    Setaluri, Rajsekhar
    Carsello, Alex
    Mann, Makai
    Torng, Christopher
    Richardson, Stephen
    Bahr, Rick
    Hanrahan, Pat
    Horowitz, Mark
    Raina, Priyanka
    ACM TRANSACTIONS ON RECONFIGURABLE TECHNOLOGY AND SYSTEMS, 2023, 16 (02)