Improving performance of SYCL applications on CPU architectures using LLVM-directed compilation flow

Cited by: 0

Authors:

Ghiglio, Pietro [1]

Dolinsky, Uwe [1]

Goli, Mehdi [1]

Narasimhan, Kumudha [1]

Affiliation:

[1] Codeplay Software Ltd, Edinburgh, Scotland

Source:

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE | 2023 / Vol. 35 / Issue 27

Funding:

Innovate UK project;

Keywords:

compiler optimizations; multi-cores; parallel programming; portability; software acceleration; standards; SYCL;

DOI:

10.1002/cpe.7810

Chinese Library Classification (CLC) number:

TP31 [Computer software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

The wide adoption of SYCL as an open-standard API for accelerating C++ software in domains such as HPC, automotive, artificial intelligence, machine learning, and other areas necessitates efficient compiler and runtime support for a growing number of different platforms. Existing SYCL implementations provide support for various devices such as CPUs, GPUs, DSPs, and FPGAs, typically via OpenCL or CUDA backends. While accelerators have increased the performance of user applications significantly, employing CPU devices for further performance improvement is beneficial due to the significant presence of CPUs in existing data centers. SYCL applications on CPUs currently go through an OpenCL backend. Though an OpenCL backend is valuable in supporting accelerators, it may introduce additional overhead for CPUs, since the host and device are the same. Overheads such as run-time compilation of the kernel, transfer of input/output memory to and from the OpenCL device, and invocation of the OpenCL kernel may not be necessary when running on the CPU. While some of these overheads (such as data transfer) can be avoided by modifying the application, doing so can undermine the SYCL application's ability to achieve performance portability on other devices. In this article, we propose an alternate approach to running SYCL applications on CPUs. We bypass OpenCL and use a CPU-directed compilation flow, along with the integration of whole function vectorization, to generate optimized host and device code together in the same translation unit. We compare the performance of our approach, the CPU-directed compilation flow, with an OpenCL backend for existing SYCL-based applications with no code modification: the BabelStream benchmark, Matmul from the ComputeCpp SDK, N-body simulation benchmarks, and SYCL-BLAS (Aliaga et al. Proceedings of the 5th International Workshop on OpenCL; 2017.), on CPUs from different vendors and architectures. We report a performance improvement of up to 72% on BabelStream benchmarks, up to 63% on Matmul, up to 21% on the N-body simulation benchmark, and up to 16% on SYCL-BLAS.
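
To make the idea in the abstract concrete, the following is a minimal conceptual sketch, not the authors' implementation, of what it means for a kernel to live in the same translation unit as the host code: the per-work-item body is an ordinary function, and the host iterates over work-items in fixed-width groups that a compiler may turn into vector instructions. The names `triad_work_item`, `launch_triad_on_host`, and the width `kWidth = 8` are hypothetical and chosen only for illustration; the triad operation is modelled on the kind of kernel found in BabelStream.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-work-item kernel body: a[i] = b[i] + scalar * c[i].
// In this sketch it is a plain host function, so no run-time kernel
// compilation or host/device memory transfer is involved.
inline void triad_work_item(std::size_t i, float* a, const float* b,
                            const float* c, float scalar) {
    a[i] = b[i] + scalar * c[i];
}

// Hypothetical host-side launcher. Work-items are processed in groups of
// kWidth (an assumed vector width); the inner loop has a fixed trip count,
// which gives an optimizing compiler the opportunity to vectorize it.
// A scalar tail loop handles the remaining work-items.
void launch_triad_on_host(std::vector<float>& a, const std::vector<float>& b,
                          const std::vector<float>& c, float scalar) {
    assert(a.size() == b.size() && b.size() == c.size());
    constexpr std::size_t kWidth = 8;  // assumption, not taken from the article
    const std::size_t n = a.size();
    std::size_t i = 0;
    for (; i + kWidth <= n; i += kWidth) {
        for (std::size_t lane = 0; lane < kWidth; ++lane) {
            triad_work_item(i + lane, a.data(), b.data(), c.data(), scalar);
        }
    }
    for (; i < n; ++i) {
        triad_work_item(i, a.data(), b.data(), c.data(), scalar);
    }
}
```

The grouping by `kWidth` only imitates, at source level, the effect that whole function vectorization aims for; the actual transformation described in the article is performed by the compiler on the kernel function itself.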


Pages: 19
