CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Cited by: 11
Authors
Yang, Yi [1 ]
Li, Chao [2 ]
Zhou, Huiyang [2 ]
Affiliations
[1] NEC Labs Amer, Dept Comp Syst Architecture, Princeton, NJ 08540 USA
[2] N Carolina State Univ, Dept Elect & Comp Engn, Raleigh, NC 27606 USA
Funding
U.S. National Science Foundation;
Keywords
GPGPU; nested parallelism; compiler; local memory; OPENMP; PERFORMANCE; COMPILER; OPTIMIZATION; FRAMEWORK; DESIGN;
DOI
10.1007/s11390-015-1500-y
Chinese Library Classification
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Parallel programs consist of a series of code sections with different degrees of thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to launch another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop count or high degree of TLP. Consequently, the benefits of leveraging such parallel loops with dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implement the proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections; our CUDA-NP compiler then automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations across threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks that have already been optimized and contain nested parallelism, the proposed CUDA-NP framework further improves performance by up to 6.69 times, and by 2.01 times on average.
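To make the idea described in the abstract concrete, below is a minimal CUDA sketch of the kernel shape it describes, not the framework's actual generated code or pragma syntax: the kernel is launched with extra threads per original thread, only one "master" thread per group executes the sequential section, and control flow activates the whole group for the parallel loop, whose per-thread partial results are combined with a reduction in on-chip shared memory. The names NP_FACTOR and kernel_np, the loop body, and the chosen constants are illustrative assumptions.

#define NP_FACTOR 8   // slave threads per original (master) thread -- illustrative value

// Hypothetical hand-written equivalent of a CUDA-NP-style transformation.
// Launch with n * NP_FACTOR threads in total; blockDim.x is assumed to be
// a multiple of NP_FACTOR and at most 256.
__global__ void kernel_np(const float* in, float* out, int n, int loop_count) {
    int tid     = blockIdx.x * blockDim.x + threadIdx.x;
    int group   = tid / NP_FACTOR;        // index of the original thread
    int lane    = tid % NP_FACTOR;        // position within the group
    bool master = (lane == 0);
    bool active = (group < n);

    __shared__ float partial[256];        // per-thread partial results in on-chip memory

    // Parallel loop: iterations are distributed across the NP_FACTOR threads of a group.
    float local = 0.0f;
    if (active) {
        for (int i = lane; i < loop_count; i += NP_FACTOR)
            local += in[group] * i;       // placeholder loop body
    }
    partial[threadIdx.x] = local;
    __syncthreads();                      // every thread in the block reaches this barrier

    if (active && master) {
        // Sequential section plus reduction: only the group's master thread is active.
        float sum = in[group] * 2.0f;     // stand-in for the sequential work
        for (int l = 0; l < NP_FACTOR; ++l)
            sum += partial[threadIdx.x + l];
        out[group] = sum;
    }
}

In the source program, the developer would instead annotate the original parallel loop with an OpenMP-like pragma and let the CUDA-NP compiler emit code of roughly this shape; the exact directive syntax, the scan primitive, and the iteration-distribution strategies are described in the paper itself, not here.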
Pages: 3-19 (17 pages)
Related Papers (50 in total)
  • [41] OpenPro: A Dynamic Profiling Tool Set for Exploring Thread-Level Speculation Parallelism
    Wang, Yaobin
    An, Hong
    Liang, Bo
    Wang, Li
    Guo, Rui
    ICCEE 2008: PROCEEDINGS OF THE 2008 INTERNATIONAL CONFERENCE ON COMPUTER AND ELECTRICAL ENGINEERING, 2008, : 256 - +
  • [42] Dual-thread speculation: A simple approach to uncover thread-level parallelism on a simultaneous multithreaded processor
    Warg, Fredrik
    Stenstrom, Per
    INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 2008, 36 (02) : 166 - 183
  • [43] A Stall-Aware Warp Scheduling for Dynamically Optimizing Thread-level Parallelism in GPGPUs
    Yu, Yulong
    Xiao, Weijun
    He, Xubin
    Guo, He
    Wang, Yuxin
    Chen, Xin
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING (ICS'15), 2015, : 15 - 24
  • [44] Exploring Thread-level Parallelism Based on Cost-Driven Model for Irregular Programs
    Li, Yuancheng
    Liu, Bin
    2017 IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, COMMUNICATIONS AND COMPUTING (ICSPCC), 2017,
  • [45] Optimizing Dynamic Programming on Graphics Processing Units via Adaptive Thread-Level Parallelism
    Wu, Chao-Chin
    Ke, Jenn-Yang
    Lin, Heshan
    Feng, Wu-chun
    2011 IEEE 17TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS), 2011, : 96 - 103
  • [46] P-DRAMSim2: Exploiting thread-level parallelism in DRAMSim2
    Han, Miseon
    Kim, Seon Wook
    Kim, Minseong
    Han, Youngsun
    IEICE ELECTRONICS EXPRESS, 2017, 14 (15):
  • [47] Improving Thread-level Parallelism in GPUs Through Expanding Register File to Scratchpad Memory
    Yu, Chao
    Bai, Yuebin
    Sun, Qingxiao
    Yang, Hailong
    ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2019, 15 (04)
  • [48] Balancing Thread-level and Task-level Parallelism for Data-Intensive Workloads on Clusters and Clouds
    Choudhury, Olivia
    Rajan, Dinesh
    Hazekamp, Nicholas
    Gesing, Sandra
    Thain, Douglas
    Emrich, Scott
    2015 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING - CLUSTER 2015, 2015, : 390 - 393
  • [49] Thread-level Value Speculation for Image-processing Applications
    Wu, Jun-Si
    Sheiue, Yuan-Fu
    Chen, Peng-Sheng
    2015 44TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING WORKSHOPS, 2015, : 74 - 80
  • [50] MPI Thread-Level Checking for MPI+OpenMP Applications
    Saillard, Emmanuelle
    Carribault, Patrick
    Barthou, Denis
    EURO-PAR 2015: PARALLEL PROCESSING, 2015, 9233 : 31 - 42