FinePack: Transparently Improving the Efficiency of Fine-Grained Transfers in Multi-GPU Systems

Cited by: 4
Authors
Muthukrishnan, Harini [1 ,2 ]
Lustig, Daniel [1 ]
Villa, Oreste [1 ]
Wenisch, Thomas [2 ]
Nellans, David [1 ]
Affiliations
[1] NVIDIA, Santa Clara, CA 95051 USA
[2] Univ Michigan, Ann Arbor, MI 48109 USA
Keywords
MEMORY; MANAGEMENT; PLACEMENT
DOI
10.1109/HPCA56546.2023.10070949
CLC number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
Recent studies have shown that using fine-grained peer-to-peer (P2P) stores to communicate among devices in multi-GPU systems is a promising path to achieve strong performance scaling. In many irregular applications, such as graph algorithms and sparse linear algebra, small sub-cache-line (4-32 B) stores arise naturally when using the P2P paradigm. This is particularly problematic in multi-GPU systems because inter-GPU interconnects are optimized for bulk transfers rather than small operations. As a consequence, application developers either resort to complex programming techniques to work around this small-transfer inefficiency or fall back to bulk inter-GPU DMA transfers that have limited performance scalability. We propose FinePack, a set of limited I/O interconnect and GPU hardware enhancements that enable small peer-to-peer stores to achieve interconnect efficiency that rivals bulk transfers while maintaining the simplicity of a peer-to-peer memory access programming model. Exploiting the GPU's weak memory model, FinePack dynamically coalesces and compresses small writes into a larger I/O message that reduces link-level protocol overhead. FinePack is fully transparent to software and requires no changes to the GPU's virtual memory system. We evaluate FinePack on a system comprising 4 Volta GPUs on a PCIe 4.0 interconnect and show that FinePack improves interconnect efficiency for small peer-to-peer stores by 3x. This results in 4-GPU strong scaling performance 1.4x better than traditional DMA-based multi-GPU programming and comes within 71% of the maximum achievable strong scaling performance.
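The fine-grained P2P store pattern the abstract refers to can be made concrete with a short CUDA sketch (illustrative only, not part of the paper; the kernel and variable names scatter_to_peer, idx, val, and peer_buf are assumptions). FinePack itself is a hardware/interconnect mechanism that is transparent to this code: the kernel below simply issues 4 B stores into a peer GPU's memory, which is exactly the traffic FinePack would coalesce and compress on the link.
```
// Illustrative sketch of fine-grained peer-to-peer (P2P) stores between GPUs.
// A kernel on GPU 0 scatters small 4 B updates directly into GPU 1's memory,
// the traffic pattern FinePack coalesces and compresses in hardware.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scatter_to_peer(const int* idx, const float* val,
                                float* peer_buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Each store is a sub-cache-line write that crosses the inter-GPU link.
        peer_buf[idx[i]] = val[i];
    }
}

int main() {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, /*device=*/0, /*peerDevice=*/1);
    if (!can_access) { printf("P2P not supported between GPU 0 and GPU 1\n"); return 0; }

    // Allocate the destination buffer on GPU 1.
    const int n = 1 << 16;
    cudaSetDevice(1);
    float* peer_buf = nullptr;
    cudaMalloc(&peer_buf, n * sizeof(float));

    // Map GPU 1's memory into GPU 0's address space and launch from GPU 0.
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);

    int* idx = nullptr;
    float* val = nullptr;
    cudaMalloc(&idx, n * sizeof(int));
    cudaMalloc(&val, n * sizeof(float));
    // A real irregular application (graph algorithm, SpMV) would fill idx/val
    // with a data-dependent pattern; zero-fill keeps this sketch self-contained.
    cudaMemset(idx, 0, n * sizeof(int));
    cudaMemset(val, 0, n * sizeof(float));

    scatter_to_peer<<<(n + 255) / 256, 256>>>(idx, val, peer_buf, n);
    cudaDeviceSynchronize();

    // The bulk-transfer alternative the abstract contrasts against would instead
    // stage updates locally and move them with cudaMemcpyPeerAsync(...).
    cudaFree(idx); cudaFree(val);
    cudaSetDevice(1); cudaFree(peer_buf);
    return 0;
}
```
Because FinePack coalesces and compresses these stores below the software level, the same source would run unchanged on a FinePack-enabled system; the interconnect efficiency gain comes entirely from hardware.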
Pages: 516-529
Page count: 14
Related Papers
50 records in total
  • [41] Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization
    Lustig, Daniel
    Martonosi, Margaret
    19TH IEEE INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE COMPUTER ARCHITECTURE (HPCA2013), 2013, : 354 - 365
  • [42] Dynamic Fine-Grained Workload Partitioning for Irregular Applications on Discrete CPU-GPU Systems
    Xiao, Chunhua
    Ran, Wei
    Lin, Fangzhu
    Zhang, Lin
    19TH IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED PROCESSING WITH APPLICATIONS (ISPA/BDCLOUD/SOCIALCOM/SUSTAINCOM 2021), 2021, : 1067 - 1074
  • [43] GPU-NEST: Characterizing Energy Efficiency of Multi-GPU Inference Servers
    Jahanshahi, Ali
    Sabzi, Hadi Zamani
    Lau, Chester
    Wong, Daniel
    IEEE COMPUTER ARCHITECTURE LETTERS, 2020, 19 (02) : 139 - 142
  • [44] Simulating cortical networks on heterogeneous multi-GPU systems
    Nere, Andrew
    Franey, Sean
    Hashmi, Atif
    Lipasti, Mikko
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2013, 73 (07) : 953 - 971
  • [45] Efficient Solving of Scan Primitive on Multi-GPU Systems
    Dieguez, Adrian P.
    Amor, Margarita
    Doallo, Ramon
    Nukada, Akira
    Matsuoka, Satoshi
    2018 32ND IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2018, : 794 - 803
  • [47] Fine-Grained Multi-human Parsing
    Zhao, Jian
    Li, Jianshu
    Liu, Hengzhu
    Yan, Shuicheng
    Feng, Jiashi
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2020, 128 (8-9) : 2185 - 2203
  • [48] On the Fine-Grained Distributed Routing and Data Scheduling for Interplanetary Data Transfers
    Tian, Xiaojian
    Zhu, Zuqing
    IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, 2024, 21 (01): : 451 - 462
  • [49] Accelerated MR Physics Simulations on multi-GPU systems
    Xanthis, Christos G.
    Venetis, Ioannis E.
    Aletras, Anthony H.
    2013 IEEE 13TH INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOENGINEERING (BIBE), 2013
  • [50] Performance Optimization of Allreduce Operation for Multi-GPU Systems
    Nukada, Akira
    2021 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2021, : 3107 - 3112