Improving performance of SYCL applications on CPU architectures using LLVM-directed compilation flow

Authors
Ghiglio, Pietro [1]
Dolinsky, Uwe [1]
Goli, Mehdi [1]
Narasimhan, Kumudha [1]
Affiliations
[1] Codeplay Software Ltd, Edinburgh, Scotland
Source
Funding
Innovate UK project;
Keywords
compiler optimizations; multi-cores; parallel programming; portability; software acceleration; standards; SYCL;
DOI
10.1002/cpe.7810
Chinese Library Classification number
TP31 [Computer software];
Discipline classification code
081202; 0835;
Abstract
The wide adoption of SYCL as an open-standard API for accelerating C++ software in domains such as HPC, automotive, artificial intelligence, and machine learning necessitates efficient compiler and runtime support for a growing number of platforms. Existing SYCL implementations support various devices such as CPUs, GPUs, DSPs, and FPGAs, typically via OpenCL or CUDA backends. While accelerators have increased the performance of user applications significantly, employing CPU devices for further performance improvement is beneficial due to the significant presence of CPUs in existing data centers. SYCL applications on CPUs currently go through an OpenCL backend. Though an OpenCL backend is valuable in supporting accelerators, it may introduce additional overhead on CPUs, since the host and device are the same. Overheads such as run-time compilation of the kernel, transfer of input/output memory to and from the OpenCL device, and invocation of the OpenCL kernel may not be necessary when running on the CPU. While some of these overheads (such as data transfer) can be avoided by modifying the application, doing so can reduce the SYCL application's ability to achieve performance portability on other devices. In this article, we propose an alternative approach to running SYCL applications on CPUs. We bypass OpenCL and use a CPU-directed compilation flow, along with the integration of whole function vectorization, to generate optimized host and device code together in the same translation unit. We compare the performance of our approach, the CPU-directed compilation flow, with an OpenCL backend on existing SYCL-based applications without code modification: the BabelStream benchmark, Matmul from the ComputeCpp SDK, N-body simulation benchmarks, and SYCL-BLAS (Aliaga et al. Proceedings of the 5th International Workshop on OpenCL; 2017.), on CPUs from different vendors and architectures.
We report a performance improvement of up to 72% on BabelStream benchmarks, up to 63% on Matmul, up to 21% on the N-body simulation benchmark, and up to 16% on SYCL-BLAS.
Pages: 19