Improving performance of SYCL applications on CPU architectures using LLVM-directed compilation flow

Cited by: 0
Authors
Ghiglio, Pietro [1]
Dolinsky, Uwe [1]
Goli, Mehdi [1]
Narasimhan, Kumudha [1]
Affiliation
[1] Codeplay Software Ltd, Edinburgh, Scotland
Funding
Innovate UK project
Keywords
compiler optimizations; multi-cores; parallel programming; portability; software acceleration; standards; SYCL
DOI
10.1002/cpe.7810
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
The wide adoption of SYCL as an open-standard API for accelerating C++ software in domains such as HPC, automotive, artificial intelligence, machine learning and other areas necessitates efficient compiler and runtime support for a growing number of platforms. Existing SYCL implementations support devices such as CPUs, GPUs, DSPs and FPGAs, typically via OpenCL or CUDA backends. While accelerators have significantly increased the performance of user applications, using CPU devices for further performance improvement is beneficial because CPUs are so widely present in existing data centers. SYCL applications on CPUs currently go through an OpenCL backend. Although an OpenCL backend is valuable for supporting accelerators, it may introduce additional overhead on CPUs, where the host and the device are the same. Overheads such as run-time compilation of the kernel, transferring input/output memory to and from the OpenCL device, and invoking the OpenCL kernel may be unnecessary when running on the CPU. Some of these overheads (such as data transfer) can be avoided by modifying the application, but doing so can undermine the application's performance portability to other devices. In this article, we propose an alternative approach to running SYCL applications on CPUs. We bypass OpenCL and use a CPU-directed compilation flow, integrated with whole function vectorization, to generate optimized host and device code together in the same translation unit. We compare the performance of our approach, the CPU-directed compilation flow, against an OpenCL backend on existing SYCL-based applications with no code modification: the BabelStream benchmark, Matmul from the ComputeCpp SDK, N-body simulation benchmarks and SYCL-BLAS (Aliaga et al., Proceedings of the 5th International Workshop on OpenCL, 2017). The comparison is run on CPUs from different vendors and architectures.
We report performance improvements of up to 72% on BabelStream, up to 63% on Matmul, up to 21% on the N-body simulation benchmark and up to 16% on SYCL-BLAS.
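The abstract names two ingredients: invoking kernels directly on the host instead of through OpenCL, and whole function vectorization. The C++ sketch below is a hand-written illustration of those ideas under stated assumptions. It is not the authors' compiler output or any ComputeCpp API. The names `triad_item`, `triad_widened` and `run_triad`, the BabelStream-style triad kernel, and the fixed width of 8 are hypothetical. In the actual flow the widened code would be derived automatically, and a real CPU backend would also split the range across cores, which this single-threaded sketch does not do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-work-item kernel body, as a SYCL kernel lambda would see it:
// one call computes one element of the triad a = b + scalar * c.
inline void triad_item(std::size_t i, float* a, const float* b, const float* c,
                       float scalar) {
    a[i] = b[i] + scalar * c[i];
}

// Assumed vector width (e.g. 8 floats in a 256-bit SIMD register).
constexpr std::size_t kWidth = 8;

// Illustration of the goal of whole function vectorization: the kernel body is
// "widened" so one call handles kWidth consecutive work-items. The fixed-trip
// inner loop has no cross-lane dependencies, so a compiler can map it to SIMD lanes.
inline void triad_widened(std::size_t base, float* a, const float* b,
                          const float* c, float scalar) {
    for (std::size_t lane = 0; lane < kWidth; ++lane) {
        a[base + lane] = b[base + lane] + scalar * c[base + lane];
    }
}

// Direct host-side "launch": because host and device are the same CPU, the kernel
// runs as an ordinary loop over the index space on the host buffers in place,
// with no run-time kernel compilation, no buffer copies and no OpenCL enqueue.
void run_triad(std::vector<float>& a, const std::vector<float>& b,
               const std::vector<float>& c, float scalar) {
    const std::size_t n = a.size();
    assert(b.size() == n && c.size() == n);  // all buffers must cover the range
    std::size_t i = 0;
    for (; i + kWidth <= n; i += kWidth) {  // full vector-width chunks
        triad_widened(i, a.data(), b.data(), c.data(), scalar);
    }
    for (; i < n; ++i) {  // scalar remainder for the last n % kWidth items
        triad_item(i, a.data(), b.data(), c.data(), scalar);
    }
}
```

With 19 elements, for example, two widened calls cover items 0 to 15 and the scalar remainder loop handles the last three. This split into a vector body plus a scalar epilogue is a common way to handle ranges that are not a multiple of the vector width.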
Pages: 19