Compiler and runtime support for enabling reduction computations on heterogeneous systems

Times cited: 3
Authors
Ravi, Vignesh T. [1]
Ma, Wenjing [1]
Chiu, David [1]
Agrawal, Gagan [1]
Affiliation
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
Source
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE | 2012 / Vol. 24 / No. 5
DOI
10.1002/cpe.1848
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
A trend that has materialized, and has attracted much attention, is the move toward increasingly heterogeneous computing platforms. It has become very common for a desktop or notebook computer to come equipped with both a multi-core CPU and a graphics processing unit (GPU). Exploiting the full computational power of such architectures (i.e., by using the multi-core CPU and the GPU simultaneously), starting from a high-level API, is a critical challenge. We believe it is highly desirable to give programmers a simple way to realize the full potential of today's heterogeneous machines. This paper describes a compiler and runtime framework that maps a class of applications, namely those characterized by generalized reductions, to a system with a multi-core CPU and a GPU. Starting from simple C functions with added annotations, we automatically generate the middleware API code for the multi-core CPU, as well as CUDA code to exploit the GPU simultaneously. The runtime system provides efficient schemes for dynamically partitioning the work between the CPU cores and the GPU. Our experimental results from two applications, namely k-means clustering and principal component analysis (PCA), show that by effectively harnessing the heterogeneous architecture we can achieve significantly higher performance than by using only the GPU or only the multi-core CPU. For k-means clustering, the heterogeneous version with eight CPU cores and a GPU achieved a speedup of about 32.09x relative to the single-threaded CPU version; compared with the faster of the CPU-only and GPU-only executions, it achieved a performance gain of about 60%. For PCA, the heterogeneous version attained a speedup of 10.4x relative to the single-threaded CPU version and a performance gain of about 63.8% over the faster of the CPU-only and GPU-only versions. Copyright (C) 2011 John Wiley & Sons, Ltd.
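To make the two ideas in the abstract concrete (a generalized reduction, and dynamic partitioning of work between CPU cores and a GPU), here is a minimal Python sketch of one k-means iteration. It is an illustration under stated assumptions, not the authors' framework or generated code: the function names `kmeans_reduce_chunk`, `merge`, and `run_iteration`, the shared-counter chunk scheduler, and the per-worker chunk sizes are all hypothetical, and plain Python threads stand in for CPU cores and the GPU (so the sketch shows the structure, not any speedup). Each worker repeatedly claims the next chunk of input and accumulates into its own private reduction object; the objects are merged at the end, which is valid because the update (addition of sums and counts) is associative and commutative.

```python
import threading

# Hypothetical sketch: k-means assignment as a generalized reduction, with
# chunks of input handed out dynamically to workers that stand in for
# CPU cores and a GPU. Not the paper's actual runtime or API.

def kmeans_reduce_chunk(points, centroids, robj):
    """Assign each point to its nearest centroid and accumulate
    (coordinate sums, count) into this worker's private reduction object."""
    for p in points:
        best = min(range(len(centroids)),
                   key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
        sums, count = robj[best]
        robj[best] = ([s + x for s, x in zip(sums, p)], count + 1)

def merge(robjs, k, dim):
    """Combine private reduction objects; merge order does not matter
    because the update is associative and commutative."""
    total = [([0.0] * dim, 0) for _ in range(k)]
    for robj in robjs:
        for i in range(k):
            s, c = total[i]
            rs, rc = robj[i]
            total[i] = ([a + b for a, b in zip(s, rs)], c + rc)
    return total

def run_iteration(points, centroids, chunk_sizes):
    """One k-means iteration. Worker w claims chunks of chunk_sizes[w]
    points from a shared position counter until the input is exhausted
    (a device assumed to be faster, such as the GPU, would get larger chunks)."""
    k, dim = len(centroids), len(centroids[0])
    lock = threading.Lock()
    next_index = [0]
    robjs = [[([0.0] * dim, 0) for _ in range(k)] for _ in chunk_sizes]
    processed = [0] * len(chunk_sizes)

    def worker(wid):
        while True:
            with lock:  # claim the next chunk atomically
                start = next_index[0]
                if start >= len(points):
                    return
                end = min(start + chunk_sizes[wid], len(points))
                next_index[0] = end
            kmeans_reduce_chunk(points[start:end], centroids, robjs[wid])
            processed[wid] += end - start  # each worker writes only its own slot

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(len(chunk_sizes))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    total = merge(robjs, k, dim)
    # Empty clusters keep their previous centroid.
    new_centroids = [[s / c for s in sums] if c else list(centroids[i])
                     for i, (sums, c) in enumerate(total)]
    return new_centroids, processed

if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    new, done = run_iteration(pts, [(0.0, 0.0), (10.0, 10.0)], [1, 2])
    print(new)   # [[0.0, 0.5], [10.0, 10.5]]
    print(done)  # per-worker point counts; split varies from run to run
```

The per-worker split in `processed` depends on thread timing, which is the point of dynamic partitioning: a worker that finishes its chunks sooner simply claims more of them. The merged result is the same regardless of how the chunks were distributed.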
Pages: 463-480
Page count: 18