Improving performance of SYCL applications on CPU architectures using LLVM-directed compilation flow

Authors
Ghiglio, Pietro [1]
Dolinsky, Uwe [1]
Goli, Mehdi [1]
Narasimhan, Kumudha [1]
Affiliation
[1] Codeplay Software Ltd, Edinburgh, Scotland
Source
Funding
Innovate UK project;
Keywords
compiler optimizations; multi-cores; parallel programming; portability; software acceleration; standards; SYCL;
DOI
10.1002/cpe.7810
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
The wide adoption of SYCL as an open-standard API for accelerating C++ software in domains such as HPC, automotive, artificial intelligence, and machine learning necessitates efficient compiler and runtime support for a growing number of platforms. Existing SYCL implementations support devices such as CPUs, GPUs, DSPs, and FPGAs, typically via OpenCL or CUDA backends. While accelerators have significantly increased the performance of user applications, employing CPU devices for further performance improvement is beneficial because of the significant presence of CPUs in existing data centers. SYCL applications on CPUs currently go through an OpenCL backend. Although an OpenCL backend is valuable for supporting accelerators, it may introduce additional overhead on CPUs, since the host and device are the same. Overheads such as run-time compilation of the kernel, transfer of input/output memory to and from the OpenCL device, and invocation of the OpenCL kernel may not be necessary when running on the CPU. While some of these overheads (such as data transfer) can be avoided by modifying the application, doing so can compromise the SYCL application's ability to achieve performance portability on other devices. In this article, we propose an alternative approach to running SYCL applications on CPUs. We bypass OpenCL and use a CPU-directed compilation flow, along with the integration of whole function vectorization, to generate optimized host and device code together in the same translation unit. We compare the performance of our approach, the CPU-directed compilation flow, with that of an OpenCL backend for existing SYCL-based applications, with no code modification, for the BabelStream benchmark, Matmul from the ComputeCpp SDK, N-body simulation benchmarks, and SYCL-BLAS (Aliaga et al. Proceedings of the 5th International Workshop on OpenCL; 2017.), on CPUs from different vendors and architectures.
We report a performance improvement of up to 72% on BabelStream benchmarks, up to 63% on Matmul, up to 21% on the N-body simulation benchmark, and up to 16% on SYCL-BLAS.
Pages: 19