Maximizing Parallelism and GPU Utilization For Direct GPU Compilation Through Ensemble Execution

Cited by: 0
Authors
Tian, Shilei [1 ]
Chapman, Barbara [1 ]
Doerfert, Johannes [2 ]
Affiliations
[1] SUNY Stony Brook, Stony Brook, NY 11794 USA
[2] Lawrence Livermore Natl Lab, Livermore, CA USA
Source
PROCEEDINGS OF THE 52ND INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING WORKSHOPS, ICPP-W 2023, 2023
Keywords
LLVM; OpenMP; accelerator offloading; GPU; ensemble execution;
DOI
10.1145/3605731.3606016
Chinese Library Classification: TP301 [Theory, Methods]
Discipline Code: 081202
Abstract
GPUs are renowned for their exceptional computational acceleration capabilities achieved through massive parallelism. However, utilizing GPUs for computation requires manual identification of code regions suitable for offloading, data transfer management, and synchronization. Recent advancements have capitalized on the LLVM/OpenMP portable target offloading interface, elevating GPU acceleration to new heights. This approach, known as direct GPU compilation, involves compiling the entire host application for execution on the GPU, eliminating the need for explicit offloading directives. However, direct GPU compilation is limited to the thread parallelism a CPU application exposes, which is often not enough to saturate a modern GPU. This paper explores an alternative approach to enhance parallelism by enabling ensemble execution. We introduce a proof-of-concept implementation that maps each invocation of an application on a different input to an individual team executed by the same GPU kernel. Our enhanced GPU loader can read command line arguments for different instances from a file to simplify usage. Through extensive evaluation using four benchmarks, we observe up to 51X speedup for 64 instances. This demonstrates the effectiveness of ensemble execution in improving parallelism and optimizing GPU utilization for CPU programs compiled and executed directly on the GPU.
Pages: 112-118 (7 pages)