FinePack: Transparently Improving the Efficiency of Fine-Grained Transfers in Multi-GPU Systems

Cited by: 4
Authors
Muthukrishnan, Harini [1 ,2 ]
Lustig, Daniel [1 ]
Villa, Oreste [1 ]
Wenisch, Thomas [2 ]
Nellans, David [1 ]
Affiliations
[1] NVIDIA, Santa Clara, CA 95051 USA
[2] Univ Michigan, Ann Arbor, MI 48109 USA
Keywords
MEMORY; MANAGEMENT; PLACEMENT
DOI
10.1109/HPCA56546.2023.10070949
CLC number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
Recent studies have shown that using fine-grained peer-to-peer (P2P) stores to communicate among devices in multi-GPU systems is a promising path to achieve strong performance scaling. In many irregular applications, such as graph algorithms and sparse linear algebra, small sub-cache-line (4-32 B) stores arise naturally when using the P2P paradigm. This is particularly problematic in multi-GPU systems because inter-GPU interconnects are optimized for bulk transfers rather than small operations. As a consequence, application developers either resort to complex programming techniques to work around this small-transfer inefficiency or fall back to bulk inter-GPU DMA transfers that have limited performance scalability. We propose FinePack, a set of limited I/O interconnect and GPU hardware enhancements that enable small peer-to-peer stores to achieve interconnect efficiency that rivals bulk transfers while maintaining the simplicity of a peer-to-peer memory access programming model. Exploiting the GPU's weak memory model, FinePack dynamically coalesces and compresses small writes into a larger I/O message that reduces link-level protocol overhead. FinePack is fully transparent to software and requires no changes to the GPU's virtual memory system. We evaluate FinePack on a system comprising 4 Volta GPUs on a PCIe 4.0 interconnect and show that FinePack improves interconnect efficiency for small peer-to-peer stores by 3x. This results in 4-GPU strong scaling performance 1.4x better than traditional DMA-based multi-GPU programming and comes within 71% of the maximum achievable strong scaling performance.
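The fine-grained P2P store pattern the abstract refers to can be made concrete with a short CUDA sketch (illustrative only, not part of the paper; the kernel and variable names scatter_to_peer, idx, val, and peer_buf are assumptions). FinePack itself is a hardware/interconnect mechanism that is transparent to this code: the kernel below simply issues 4 B stores into a peer GPU's memory, which is exactly the traffic FinePack would coalesce and compress on the link.
```
// Illustrative sketch of fine-grained peer-to-peer (P2P) stores between GPUs.
// A kernel on GPU 0 scatters small 4 B updates directly into GPU 1's memory,
// the traffic pattern FinePack coalesces and compresses in hardware.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scatter_to_peer(const int* idx, const float* val,
                                float* peer_buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Each store is a sub-cache-line write that crosses the inter-GPU link.
        peer_buf[idx[i]] = val[i];
    }
}

int main() {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, /*device=*/0, /*peerDevice=*/1);
    if (!can_access) { printf("P2P not supported between GPU 0 and GPU 1\n"); return 0; }

    // Allocate the destination buffer on GPU 1.
    const int n = 1 << 16;
    cudaSetDevice(1);
    float* peer_buf = nullptr;
    cudaMalloc(&peer_buf, n * sizeof(float));

    // Map GPU 1's memory into GPU 0's address space and launch from GPU 0.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);

    int* idx = nullptr;
    float* val = nullptr;
    cudaMalloc(&idx, n * sizeof(int));
    cudaMalloc(&val, n * sizeof(float));
    // A real irregular application (graph algorithm, SpMV) would fill idx/val
    // with a data-dependent pattern; zero-fill keeps this sketch self-contained.
    cudaMemset(idx, 0, n * sizeof(int));
    cudaMemset(val, 0, n * sizeof(float));

    scatter_to_peer<<<(n + 255) / 256, 256>>>(idx, val, peer_buf, n);
    cudaDeviceSynchronize();

    // The bulk-transfer alternative the abstract contrasts against would instead
    // stage updates locally and move them with cudaMemcpyPeerAsync(...).
    cudaFree(idx); cudaFree(val);
    cudaSetDevice(1); cudaFree(peer_buf);
    return 0;
}
```
Because FinePack coalesces and compresses these stores below the software level, the same source would run unchanged on a FinePack-enabled system; the interconnect efficiency gain comes entirely from hardware.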
Pages: 516-529
Page count: 14
Related Papers
50 records in total
  • [41] Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization
    Lustig, Daniel
    Martonosi, Margaret
    19TH IEEE INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE (HPCA2013), 2013, : 354 - 365
  • [42] Dynamic Fine-Grained Workload Partitioning for Irregular Applications on Discrete CPU-GPU Systems
    Xiao, Chunhua
    Ran, Wei
    Lin, Fangzhu
    Zhang, Lin
    19TH IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED PROCESSING WITH APPLICATIONS (ISPA/BDCLOUD/SOCIALCOM/SUSTAINCOM 2021), 2021, : 1067 - 1074
  • [43] GPU-NEST: Characterizing Energy Efficiency of Multi-GPU Inference Servers
    Jahanshahi, Ali
    Sabzi, Hadi Zamani
    Lau, Chester
    Wong, Daniel
    IEEE COMPUTER ARCHITECTURE LETTERS, 2020, 19 (02) : 139 - 142
  • [44] Simulating cortical networks on heterogeneous multi-GPU systems
    Nere, Andrew
    Franey, Sean
    Hashmi, Atif
    Lipasti, Mikko
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2013, 73 (07) : 953 - 971
  • [45] Efficient Solving of Scan Primitive on Multi-GPU Systems
    Dieguez, Adrian P.
    Amor, Margarita
    Doallo, Ramon
    Nukada, Akira
    Matsuoka, Satoshi
    2018 32ND IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2018, : 794 - 803
  • [47] Fine-Grained Multi-human Parsing
    Zhao, Jian
    Li, Jianshu
    Liu, Hengzhu
    Yan, Shuicheng
    Feng, Jiashi
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2020, 128 (8-9) : 2185 - 2203
  • [48] On the Fine-Grained Distributed Routing and Data Scheduling for Interplanetary Data Transfers
    Tian, Xiaojian
    Zhu, Zuqing
    IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, 2024, 21 (01): : 451 - 462
  • [49] Accelerated MR Physics Simulations on multi-GPU systems
    Xanthis, Christos G.
    Venetis, Ioannis E.
    Aletras, Anthony H.
    2013 IEEE 13TH INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOENGINEERING (BIBE), 2013
  • [50] Performance Optimization of Allreduce Operation for Multi-GPU Systems
    Nukada, Akira
    2021 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2021, : 3107 - 3112