Maximizing Parallelism and GPU Utilization For Direct GPU Compilation Through Ensemble Execution

Cited by: 0
Authors
Tian, Shilei [1 ]
Chapman, Barbara [1 ]
Doerfert, Johannes [2 ]
Affiliations
[1] SUNY Stony Brook, Stony Brook, NY 11794 USA
[2] Lawrence Livermore Natl Lab, Livermore, CA USA
Source
PROCEEDINGS OF THE 52ND INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING WORKSHOPS, ICPP-W 2023, 2023
Keywords
LLVM; OpenMP; accelerator offloading; GPU; ensemble execution;
DOI
10.1145/3605731.3606016
Chinese Library Classification: TP301 [Theory, Methods]
Discipline Code: 081202
Abstract
GPUs are renowned for their exceptional computational acceleration capabilities achieved through massive parallelism. However, utilizing GPUs for computation requires manual identification of code regions suitable for offloading, data transfer management, and synchronization. Recent advancements have capitalized on the LLVM/OpenMP portable target offloading interface, elevating GPU acceleration to new heights. This approach, known as direct GPU compilation, involves compiling the entire host application for execution on the GPU, eliminating the need for explicit offloading directives. However, direct GPU compilation is limited to the thread parallelism a CPU application exposes, which is often not enough to saturate a modern GPU. This paper explores an alternative approach to enhance parallelism by enabling ensemble execution. We introduce a proof-of-concept implementation that maps each invocation of an application on a different input to an individual team executed by the same GPU kernel. Our enhanced GPU loader can read command line arguments for different instances from a file to simplify usage. Through extensive evaluation using four benchmarks, we observe up to 51X speedup for 64 instances. This demonstrates the effectiveness of ensemble execution in improving parallelism and optimizing GPU utilization for CPU programs compiled and executed directly on the GPU.
Pages: 112-118 (7 pages)