FinePack: Transparently Improving the Efficiency of Fine-Grained Transfers in Multi-GPU Systems

Cited by: 4
Authors
Muthukrishnan, Harini [1,2]
Lustig, Daniel [1]
Villa, Oreste [1]
Wenisch, Thomas [2]
Nellans, David [1]
Affiliations
[1] NVIDIA, Santa Clara, CA 95051 USA
[2] Univ Michigan, Ann Arbor, MI 48109 USA
Keywords
MEMORY; MANAGEMENT; PLACEMENT
DOI
10.1109/HPCA56546.2023.10070949
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Recent studies have shown that using fine-grained peer-to-peer (P2P) stores to communicate among devices in multi-GPU systems is a promising path to achieve strong performance scaling. In many irregular applications, such as graph algorithms and sparse linear algebra, small sub-cache-line (4-32B) stores arise naturally when using the P2P paradigm. This is particularly problematic in multi-GPU systems because inter-GPU interconnects are optimized for bulk transfers rather than small operations. As a consequence, application developers either resort to complex programming techniques to work around this small-transfer inefficiency or fall back to bulk inter-GPU DMA transfers that have limited performance scalability. We propose FinePack, a set of limited I/O interconnect and GPU hardware enhancements that enable small peer-to-peer stores to achieve interconnect efficiency that rivals bulk transfers while maintaining the simplicity of a peer-to-peer memory access programming model. Exploiting the GPU's weak memory model, FinePack dynamically coalesces and compresses small writes into a larger I/O message that reduces link-level protocol overhead. FinePack is fully transparent to software and requires no changes to the GPU's virtual memory system. We evaluate FinePack on a system comprising 4 Volta GPUs on a PCIe 4.0 interconnect to show FinePack improves interconnect efficiency for small peer-to-peer stores by 3x. This results in 4-GPU strong scaling performance 1.4x better than traditional DMA-based multi-GPU programming and comes within 71% of the maximum achievable strong scaling performance.
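The abstract contrasts two inter-GPU communication styles: fine-grained peer-to-peer stores issued directly from a kernel, and bulk DMA copies between device buffers. The sketch below is not taken from the paper; it is a minimal CUDA illustration of what those two styles look like on a P2P-capable two-GPU system, with the kernel names, buffer sizes, and index pattern chosen purely for illustration.

// Minimal CUDA sketch (illustrative, not from the paper): fine-grained P2P
// stores from a kernel on GPU 0 into GPU 1's memory, versus a bulk DMA copy.
#include <cuda_runtime.h>
#include <cstdio>

// Fill the index and value arrays; the index pattern is an arbitrary
// in-bounds scatter standing in for irregular (graph/sparse) accesses.
__global__ void initInputs(int* indices, int* values, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        indices[i] = (int)((1LL * i * 7919) % n);
        values[i]  = i;
    }
}

// Fine-grained path: each thread issues a single 4B store directly to the
// peer GPU's memory. These sub-cache-line P2P stores are the traffic that
// FinePack coalesces and compresses on the interconnect.
__global__ void scatterToPeer(int* __restrict__ peerBuf,
                              const int* __restrict__ indices,
                              const int* __restrict__ values,
                              int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        peerBuf[indices[i]] = values[i];  // 4B store over the inter-GPU link
    }
}

int main() {
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Assumes two P2P-capable GPUs (devices 0 and 1).
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);

    int *values0, *indices0, *dst1;
    cudaMalloc(&values0, n * sizeof(int));
    cudaMalloc(&indices0, n * sizeof(int));
    cudaSetDevice(1);
    cudaMalloc(&dst1, n * sizeof(int));
    cudaSetDevice(0);

    initInputs<<<blocks, threads>>>(indices0, values0, n);

    // (1) Fine-grained peer-to-peer stores issued from a kernel on GPU 0.
    scatterToPeer<<<blocks, threads>>>(dst1, indices0, values0, n);

    // (2) Bulk DMA alternative: one large contiguous copy to the peer GPU.
    cudaMemcpyPeer(dst1, 1, values0, 0, n * sizeof(int));

    cudaDeviceSynchronize();
    printf("done\n");

    cudaFree(values0);
    cudaFree(indices0);
    cudaSetDevice(1);
    cudaFree(dst1);
    return 0;
}

Because FinePack is described as fully transparent to software, the fine-grained path in (1) would need no source changes; the coalescing and compression of the small stores happen in the proposed hardware beneath the GPU's memory system.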
Pages: 516-529
Number of pages: 14
Related Papers
50 records in total
  • [31] GPUPool: A Holistic Approach to Fine-Grained GPU Sharing in the Cloud
    Tan, Xiaodan Serina
    Golikov, Pavel
    Vijaykumar, Nandita
    Pekhimenko, Gennady
    PROCEEDINGS OF THE 2022 31ST INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES, PACT 2022, 2022, : 317 - 332
  • [32] Fine-Grained Parallelization of a Vlasov-Poisson Application on GPU
    Latu, Guillaume
    EURO-PAR 2010 PARALLEL PROCESSING WORKSHOPS, 2011, 6586 : 127 - 135
  • [33] cuBLASTP: Fine-Grained Parallelization of Protein Sequence Search on a GPU
    Zhang, Jing
    Wang, Hao
    Lin, Heshan
    Feng, Wu-Chun
    2014 IEEE 28TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM, 2014,
  • [34] A Fine-grained Parallel Intra Prediction for HEVC Based on GPU
    Jiang, Wenbin
    Chi, Ye
    Jin, Hai
    Liao, Xiaofei
    Zhang, Yangsong
    Ye, Geyan
    2016 IEEE 22ND INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS), 2016, : 778 - 784
  • [35] Pagoda: Fine-Grained GPU Resource Virtualization for Narrow Tasks
    Yeh, Tsung Tai
    Sabne, Amit
    Sakdhnagool, Putt
    Eigenmann, Rudolf
    Rogers, Timothy G.
    ACM SIGPLAN NOTICES, 2017, 52 (08) : 221 - 233
  • [36] Accelerating a Lossy Compression Method with Fine-Grained Parallelism on a GPU
    Wu, Yifan
    Shen, Jingcheng
    Okita, Masao
    Ino, Fumihiko
    PAAP 2021: 2021 12TH INTERNATIONAL SYMPOSIUM ON PARALLEL ARCHITECTURES, ALGORITHMS AND PROGRAMMING, 2021, : 76 - 81
  • [37] Locality-aware Optimizations for Improving Remote Memory Latency in Multi-GPU Systems
    Belayneh, Leul
    Ye, Haojie
    Chen, Kuan-Yu
    Blaauw, David
    Mudge, Trevor
    Dreslinski, Ronald
    Talati, Nishil
    PROCEEDINGS OF THE 2022 31ST INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES, PACT 2022, 2022, : 304 - 316
  • [38] Simultaneous Multikernel GPU: Multi-tasking Throughput Processors via Fine-Grained Sharing
    Wang, Zhenning
    Yang, Jun
    Melhem, Rami
    Childers, Bruce
    Zhang, Youtao
    Guo, Minyi
    PROCEEDINGS OF THE 2016 IEEE INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE (HPCA-22), 2016, : 358 - 369
  • [39] Fine-grained multiagent systems for the Internet
    Nangsue, P
    Conry, SE
    INTERNATIONAL CONFERENCE ON MULTI-AGENT SYSTEMS, PROCEEDINGS, 1998, : 198 - 205
  • [40] Improving Energy Efficiency of CGRAs with Low-Overhead Fine-Grained Power Domains
    Nayak, Ankita
    Zhang, Keyi
    Setaluri, Rajsekhar
    Carsello, Alex
    Mann, Makai
    Torng, Christopher
    Richardson, Stephen
    Bahr, Rick
    Hanrahan, Pat
    Horowitz, Mark
    Raina, Priyanka
    ACM TRANSACTIONS ON RECONFIGURABLE TECHNOLOGY AND SYSTEMS, 2023, 16 (02)