Evaluating Multi-GPU Sorting with Modern Interconnects

Cited by: 5
Authors
Maltenberger, Tobias [1]
Ilic, Ivan [1]
Tolovski, Ilin [1]
Rabl, Tilmann [1]
Affiliations
[1] Univ Potsdam, Hasso Plattner Inst, Potsdam, Germany
Keywords
multi-GPU sorting; high-speed interconnects; database acceleration; ALGORITHM; JOINS; CORE
DOI
10.1145/3514221.3517842
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
GPUs have become a mainstream accelerator for database operations such as sorting. Most GPU sorting algorithms are single-GPU approaches. They neither harness the full computational power nor exploit the high-bandwidth P2P interconnects of modern multi-GPU platforms. The latest NVLink 2.0 and NVLink 3.0-based NVSwitch interconnects promise unparalleled multi-GPU acceleration. So far, multi-GPU sorting has only been evaluated on systems with PCIe 3.0. In this paper, we analyze serial, parallel, and bidirectional data transfer rates to, from, and between multiple GPUs on systems with PCIe 3.0/4.0, NVLink 2.0/3.0, and NVSwitch. We measure up to 35x higher parallel P2P throughput with NVLink 3.0-based NVSwitch over PCIe 3.0. To study GPU-accelerated sorting on today's hardware, we implement a P2P-based GPU-only (P2P sort) and a heterogeneous (HET sort) multi-GPU sorting algorithm and evaluate them on three modern platforms. We observe speedups over state-of-the-art parallel CPU radix sort of up to 14x for P2P sort and 9x for HET sort. On systems with fast P2P interconnects, P2P sort outperforms HET sort by up to 1.65x. Finally, we show that overlapping GPU copy/compute operations does not mitigate the transfer bottleneck when sorting large out-of-core data.
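The abstract names a heterogeneous multi-GPU sort but does not describe its steps. The following is a minimal CPU-only sketch of one common shape such an algorithm can take: split the keys into one chunk per device, sort each chunk independently, then merge the sorted chunks on the host. This shape is an assumption made for illustration, not the authors' implementation. The function name `chunked_sort_then_merge` and the `num_devices` parameter are hypothetical, and Python's built-in `sorted` stands in for a per-GPU sort.

```python
import heapq
from typing import List, Sequence


def chunked_sort_then_merge(keys: Sequence[int], num_devices: int) -> List[int]:
    """Sort keys by sorting one chunk per (simulated) device, then merging.

    Assumption: each chunk would be sorted on its own GPU; here the
    built-in sorted() stands in for that device-side sort.
    """
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")

    # Ceiling division so that at most num_devices chunks are produced.
    chunk_size = max(1, -(-len(keys) // num_devices))

    # Stand-in for the per-device sort phase.
    sorted_chunks = [
        sorted(keys[start:start + chunk_size])
        for start in range(0, len(keys), chunk_size)
    ]

    # Host-side multiway merge of the already sorted chunks.
    return list(heapq.merge(*sorted_chunks))


if __name__ == "__main__":
    print(chunked_sort_then_merge([5, 3, 9, 1, 7, 2], num_devices=3))
```

In this sketch the merge phase runs entirely on the host. A GPU-only variant would instead exchange data between devices, which is where the P2P interconnect bandwidth discussed in the abstract would matter.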
Pages: 1795-1809
Page count: 15