Exploiting Hyper-Loop Parallelism in Vectorization to Improve Memory Performance on CUDA GPGPU

Cited by: 3
Authors
Xu, Shixiong [1]
Gregg, David
Affiliations
[1] Univ Dublin, Trinity Coll, Dept Comp Sci, Software Tools Grp, Dublin, Ireland
Keywords
vectorization; hyper-loop parallelism; thread coarsening; memory performance; CUDA GPU
DOI
10.1109/Trustcom.2015.612
Chinese Library Classification (CLC)
TP3 [Computing Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Memory performance is critical to achieving high performance on NVIDIA CUDA GPUs. Previous work has proposed specific optimizations such as thread coarsening, caching data in shared memory, and global data layout transformation. We argue that vectorization based on hyper-loop parallelism can serve as a unified technique for optimizing memory performance. In this paper, we present a compiler framework, built on the Cetus source-to-source compiler, that improves memory performance on CUDA GPUs by efficiently exploiting hyper-loop parallelism in vectorization. We introduce abstractions of SIMD vectors and SIMD operations that match the execution and memory models of the CUDA GPU, along with three execution mapping strategies for efficiently offloading vectorized code to CUDA GPUs. In addition, because we apply vectorization within C-to-CUDA automatic parallelization, our technique further refines the mapping granularity between coarse-grain loop parallelism and GPU threads. We evaluated the proposed technique on two platforms: an embedded GPU system (Jetson TK1) and a desktop GPU (GeForce GTX 645). The experimental results demonstrate that vectorization based on hyper-loop parallelism yields speedups of up to 2.5x over the direct coarse-grain mapping of loop parallelism.
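As a rough illustration of the idea the abstract describes: the direct coarse-grain mapping turns each parallel loop iteration into one GPU thread, whereas a vectorized mapping lets each thread own a short SIMD vector of iterations. The minimal CUDA sketch below is a hand-written approximation of that contrast under stated assumptions, not the output of the authors' Cetus-based framework; the kernel names (saxpy_direct, saxpy_vec4) and the SAXPY workload are hypothetical. Coarsening by a 4-wide float4 vector keeps loads and stores coalesced at warp granularity while cutting the thread count and per-element address arithmetic by a factor of 4.

#include <cuda_runtime.h>
#include <cstdio>

// Direct coarse-grain mapping: loop iteration i -> GPU thread i.
// Shown for contrast only; not launched below.
__global__ void saxpy_direct(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Coarsened, vector-style mapping: each thread owns one 4-wide SIMD vector.
// float4 accesses remain coalesced while each thread does 4 elements' work.
__global__ void saxpy_vec4(int n4, float a, const float4* x, float4* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 xv = x[i];
        float4 yv = y[i];
        yv.x = a * xv.x + yv.x;
        yv.y = a * xv.y + yv.y;
        yv.z = a * xv.z + yv.z;
        yv.w = a * xv.w + yv.w;
        y[i] = yv;
    }
}

int main() {
    const int n = 1 << 20;  // assumed divisible by 4 for the vectorized kernel
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    const int threads = 256;
    const int blocks = (n / 4 + threads - 1) / threads;
    saxpy_vec4<<<blocks, threads>>>(n / 4, 3.0f,
                                    reinterpret_cast<const float4*>(x),
                                    reinterpret_cast<float4*>(y));
    cudaDeviceSynchronize();
    printf("y[0] = %f (expected 5.0)\n", y[0]);  // 3*1 + 2 = 5
    cudaFree(x);
    cudaFree(y);
    return 0;
}

Whether such coarsening pays off depends on occupancy and the memory system of the target GPU, which is consistent with the paper's motivation for providing three different execution mapping strategies rather than a single fixed one.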
Pages: 53-60
Number of pages: 8