CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Cited by: 11
Authors
Yang, Yi [1 ]
Li, Chao [2 ]
Zhou, Huiyang [2 ]
Affiliations
[1] NEC Labs Amer, Dept Comp Syst Architecture, Princeton, NJ 08540 USA
[2] N Carolina State Univ, Dept Elect & Comp Engn, Raleigh, NC 27606 USA
Funding
U.S. National Science Foundation
Keywords
GPGPU; nested parallelism; compiler; local memory; OpenMP; performance; optimization; framework; design
DOI
10.1007/s11390-015-1500-y
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
Parallel programs consist of a series of code sections with different degrees of thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a thread of a GPU kernel in a CUDA program, still contains both sequential code and parallel loops. To leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to launch another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop trip count or a high degree of TLP. Consequently, the benefits of leveraging such parallel loops using dynamic parallelism are too limited to offset its overhead. We then present our proposed solution for exploiting nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implement the proposed CUDA-NP framework using a directive-based compiler approach: for a GPU kernel, an application developer only needs to add OpenMP-like pragmas to parallelizable code sections, and our CUDA-NP compiler automatically generates the optimized GPU kernel. The framework supports both the reduction and the scan primitives, explores different ways of distributing parallel loop iterations across threads, and efficiently manages on-chip resources. Our experiments show that, for a set of GPGPU benchmarks that have already been optimized and contain nested parallelism, our proposed CUDA-NP framework further improves performance by up to 6.69 times (2.01 times on average).
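To make the control-flow idea in the abstract concrete, below is a minimal CUDA sketch of the transformation, written for illustration and not taken from the paper: the kernel name, the NP factor, the pragma spelling, and the reduction code are all assumptions. Each original ("master") thread is paired with NP-1 helper threads; control flow keeps the helpers idle during sequential code and activates them to split a parallel loop's iterations, with a shared-memory reduction combining the partial sums.

    #define NP 8              // helper threads per logical thread (assumed tuning knob)
    #define BLOCK_SIZE 256    // assumed block size; must be a multiple of NP

    // Transformed kernel: launched with NP times as many threads as the
    // original kernel. 'loop_cnt' is the trip count of the parallel loop.
    __global__ void kernel_np(const float* in, float* out, int loop_cnt)
    {
        int tid    = blockIdx.x * blockDim.x + threadIdx.x;
        int master = tid / NP;   // index of the original logical thread
        int lane   = tid % NP;   // 0 = master, 1..NP-1 = its helpers

        // Sequential section: control flow keeps the helper threads idle.
        if (lane == 0)
            out[master] = 0.0f;

        // Parallel loop section (what a developer would tag with an
        // OpenMP-like pragma, e.g. "#pragma np parallel for" -- spelling
        // assumed): iterations are distributed cyclically over the group.
        float acc = 0.0f;
        for (int i = lane; i < loop_cnt; i += NP)
            acc += in[master * loop_cnt + i];

        // Reduction primitive, sketched here with shared memory: the
        // master thread combines its helpers' partial sums.
        __shared__ float partial[BLOCK_SIZE];
        partial[threadIdx.x] = acc;
        __syncthreads();
        if (lane == 0) {
            float sum = 0.0f;
            for (int k = 0; k < NP; ++k)
                sum += partial[threadIdx.x + k];
            out[master] += sum;
        }
    }

Unlike dynamic parallelism, the helpers here live in the same thread block as their master, so they can communicate through registers and shared memory and no nested kernel launch is needed; the actual CUDA-NP compiler additionally handles the scan primitive, alternative iteration-distribution schemes, and on-chip resource management, per the abstract.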
Pages: 3-19 (17 pages)