Evaluating Multi-GPU Sorting with Modern Interconnects

Cited by: 5

Authors:

Maltenberger, Tobias [1]

Ilic, Ivan [1]

Tolovski, Ilin [1]

Rabl, Tilmann [1]

Affiliation:

[1] Univ Potsdam, Hasso Plattner Inst, Potsdam, Germany

Source:

PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA (SIGMOD '22) | 2022

Keywords:

multi-GPU sorting; high-speed interconnects; database acceleration; ALGORITHM; JOINS; CORE;

DOI:

10.1145/3514221.3517842

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

GPUs have become a mainstream accelerator for database operations such as sorting. Most GPU sorting algorithms are single-GPU approaches. They neither harness the full computational power nor exploit the high-bandwidth P2P interconnects of modern multi-GPU platforms. The latest NVLink 2.0 and NVLink 3.0-based NVSwitch interconnects promise unparalleled multi-GPU acceleration. So far, multi-GPU sorting has only been evaluated on systems with PCIe 3.0. In this paper, we analyze serial, parallel, and bidirectional data transfer rates to, from, and between multiple GPUs on systems with PCIe 3.0/4.0, NVLink 2.0/3.0, and NVSwitch. We measure up to 35x higher parallel P2P throughput with NVLink 3.0-based NVSwitch over PCIe 3.0. To study GPU-accelerated sorting on today's hardware, we implement a P2P-based GPU-only (P2P sort) and a heterogeneous (HET sort) multi-GPU sorting algorithm and evaluate them on three modern platforms. We observe speedups over state-of-the-art parallel CPU radix sort of up to 14x for P2P sort and 9x for HET sort. On systems with fast P2P interconnects, P2P sort outperforms HET sort by up to 1.65x. Finally, we show that overlapping GPU copy/compute operations does not mitigate the transfer bottleneck when sorting large out-of-core data.

Pages: 1795 - 1809

Number of pages: 15