CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Cited by: 11
Authors
Yang, Yi [1 ]
Li, Chao [2 ]
Zhou, Huiyang [2 ]
Affiliations
[1] NEC Labs Amer, Dept Comp Syst Architecture, Princeton, NJ 08540 USA
[2] N Carolina State Univ, Dept Elect & Comp Engn, Raleigh, NC 27606 USA
Funding
U.S. National Science Foundation;
Keywords
GPGPU; nested parallelism; compiler; local memory; OPENMP; PERFORMANCE; COMPILER; OPTIMIZATION; FRAMEWORK; DESIGN;
DOI
10.1007/s11390-015-1500-y
Chinese Library Classification
TP3 [Computing Technology, Computer Technology];
Discipline Code
0812;
Abstract
Parallel programs consist of a series of code sections with different degrees of thread-level parallelism (TLP). As a result, it is rather common that a thread in a parallel program, such as a GPU kernel in CUDA programs, still contains both sequential code and parallel loops. In order to leverage such parallel loops, the latest NVIDIA Kepler architecture introduces dynamic parallelism, which allows a GPU thread to launch another GPU kernel, thereby reducing the overhead of launching kernels from a CPU. However, with dynamic parallelism, a parent thread can only communicate with its child threads through global memory, and the overhead of launching GPU kernels is non-trivial even within GPUs. In this paper, we first study a set of GPGPU benchmarks that contain parallel loops, and highlight that these benchmarks do not have a very high loop count or high degree of TLP. Consequently, the benefits of leveraging such parallel loops with dynamic parallelism are too limited to offset its overhead. We then present our proposed solution to exploit nested parallelism in CUDA, referred to as CUDA-NP. With CUDA-NP, we initially enable a high number of threads when a GPU program starts, and use control flow to activate different numbers of threads for different code sections. We implement the proposed CUDA-NP framework using a directive-based compiler approach. For a GPU kernel, an application developer only needs to add OpenMP-like pragmas for parallelizable code sections; our CUDA-NP compiler then automatically generates the optimized GPU kernels. It supports both the reduction and the scan primitives, explores different ways to distribute parallel loop iterations across threads, and efficiently manages on-chip resources. Our experiments show that for a set of GPGPU benchmarks that have already been optimized and contain nested parallelism, the proposed CUDA-NP framework further improves performance by up to 6.69 times, and by 2.01 times on average.
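To make the idea described in the abstract concrete, below is a minimal CUDA sketch of the kernel shape it describes, not the framework's actual generated code or pragma syntax: the kernel is launched with extra threads per original thread, only one "master" thread per group executes the sequential section, and control flow activates the whole group for the parallel loop, whose per-thread partial results are combined with a reduction in on-chip shared memory. The names NP_FACTOR and kernel_np, the loop body, and the chosen constants are illustrative assumptions.

#define NP_FACTOR 8   // slave threads per original (master) thread -- illustrative value

// Hypothetical hand-written equivalent of a CUDA-NP-style transformation.
// Launch with n * NP_FACTOR threads in total; blockDim.x is assumed to be
// a multiple of NP_FACTOR and at most 256.
__global__ void kernel_np(const float* in, float* out, int n, int loop_count) {
    int tid     = blockIdx.x * blockDim.x + threadIdx.x;
    int group   = tid / NP_FACTOR;        // index of the original thread
    int lane    = tid % NP_FACTOR;        // position within the group
    bool master = (lane == 0);
    bool active = (group < n);

    __shared__ float partial[256];        // per-thread partial results in on-chip memory

    // Parallel loop: iterations are distributed across the NP_FACTOR threads of a group.
    float local = 0.0f;
    if (active) {
        for (int i = lane; i < loop_count; i += NP_FACTOR)
            local += in[group] * i;       // placeholder loop body
    }
    partial[threadIdx.x] = local;
    __syncthreads();                      // every thread in the block reaches this barrier

    if (active && master) {
        // Sequential section plus reduction: only the group's master thread is active.
        float sum = in[group] * 2.0f;     // stand-in for the sequential work
        for (int l = 0; l < NP_FACTOR; ++l)
            sum += partial[threadIdx.x + l];
        out[group] = sum;
    }
}

In the source program, the developer would instead annotate the original parallel loop with an OpenMP-like pragma and let the CUDA-NP compiler emit code of roughly this shape; the exact directive syntax, the scan primitive, and the iteration-distribution strategies are described in the paper itself, not here.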
Pages: 3-19 (17 pages)
Related Papers (50 in total)
  • [41] OpenPro: A Dynamic Profiling Tool Set for Exploring Thread-Level Speculation Parallelism
    Wang, Yaobin
    An, Hong
    Liang, Bo
    Wang, Li
    Guo, Rui
    ICCEE 2008: PROCEEDINGS OF THE 2008 INTERNATIONAL CONFERENCE ON COMPUTER AND ELECTRICAL ENGINEERING, 2008, : 256 - +
  • [42] Dual-thread speculation: A simple approach to uncover thread-level parallelism on a simultaneous multithreaded processor
    Warg, Fredrik
    Stenstrom, Per
    INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 2008, 36 (02) : 166 - 183
  • [43] A Stall-Aware Warp Scheduling for Dynamically Optimizing Thread-level Parallelism in GPGPUs
    Yu, Yulong
    Xiao, Weijun
    He, Xubin
    Guo, He
    Wang, Yuxin
    Chen, Xin
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING (ICS'15), 2015, : 15 - 24
  • [44] Exploring Thread-level Parallelism Based on Cost-Driven Model for Irregular Programs
    Li, Yuancheng
    Liu, Bin
    2017 IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, COMMUNICATIONS AND COMPUTING (ICSPCC), 2017,
  • [45] Optimizing Dynamic Programming on Graphics Processing Units via Adaptive Thread-Level Parallelism
    Wu, Chao-Chin
    Ke, Jenn-Yang
    Lin, Heshan
    Feng, Wu-chun
    2011 IEEE 17TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS), 2011, : 96 - 103
  • [46] P-DRAMSim2: Exploiting thread-level parallelism in DRAMSim2
    Han, Miseon
    Kim, Seon Wook
    Kim, Minseong
    Han, Youngsun
    IEICE ELECTRONICS EXPRESS, 2017, 14 (15):
  • [47] Improving Thread-level Parallelism in GPUs Through Expanding Register File to Scratchpad Memory
    Yu, Chao
    Bai, Yuebin
    Sun, Qingxiao
    Yang, Hailong
    ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2019, 15 (04)
  • [48] Balancing Thread-level and Task-level Parallelism for Data-Intensive Workloads on Clusters and Clouds
    Choudhury, Olivia
    Rajan, Dinesh
    Hazekamp, Nicholas
    Gesing, Sandra
    Thain, Douglas
    Emrich, Scott
    2015 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING - CLUSTER 2015, 2015, : 390 - 393
  • [49] Thread-level Value Speculation for Image-processing Applications
    Wu, Jun-Si
    Sheiue, Yuan-Fu
    Chen, Peng-Sheng
    2015 44TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING WORKSHOPS, 2015, : 74 - 80
  • [50] MPI Thread-Level Checking for MPI+OpenMP Applications
    Saillard, Emmanuelle
    Carribault, Patrick
    Barthou, Denis
    EURO-PAR 2015: PARALLEL PROCESSING, 2015, 9233 : 31 - 42