Compiler-Assisted Compaction/Restoration of SIMD Instructions

Cited by: 1

Authors:

Cebrian, Juan M. [1]

Balem, Thibaud [2]

Barredo, Adrian [3]

Casas, Marc [3]

Moreto, Miquel [3]

Ros, Alberto [1]

Jimborean, Alexandra [1]

Affiliations:

[1] Univ Murcia, Comp Engn Dept, E-30100 Murcia, Spain

[2] ENS Rennes, F-35170 Rennes, France

[3] Barcelona Supercomp Ctr, Barcelona 08034, Spain

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS | 2022 / Vol. 33 / No. 04

Funding:

European Research Council; European Union Seventh Framework Programme;

Keywords:

Registers; Parallel processing; Hardware; Computer architecture; Out of order; Delays; Energy consumption; SIMD; predication; LLVM; density-time performance;

DOI:

10.1109/TPDS.2021.3091015

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

Vector processors (e.g., SIMD or GPUs) are ubiquitous in high performance systems. All the supercomputers in the world exploit data-level parallelism (DLP), for example by using single instructions to operate over several data elements. Improving vector processing is therefore key for exascale computing. However, despite its potential, vector code generation and execution have significant challenges. Among these challenges, control flow divergence is one of the main performance limiting factors. Most modern vector instruction sets, including SIMD, rely on predication to support divergence control. Nevertheless, the performance and energy consumption in predicated codes is usually insensitive to the number of active elements in a predicated mask. Since the trend is that vector register size increases, the energy efficiency of exascale computing systems will become sub-optimal. This article proposes a novel approach to improve execution efficiency in predicated vector codes, the Compiler-Assisted Compaction/Restoration (CACR) technique. Baseline CR delays predicated SIMD instructions with inactive elements, compacting active elements from instances of the same instruction of consecutive loop iterations. Compacted elements form an equivalent dense vector instruction. After executing the dense instructions, their results are restored to the original instructions. However, CR has a significant performance and energy penalty when it fails to find active elements, either due to lack of resources when unrolling or because of inter-loop dependencies. In CACR, the compiler analyzes the code looking for key information required to configure CR. Then, it passes this information to the processor via new instructions inserted in the code. This prevents CR from waiting for active elements on scenarios when it would fail to form dense instructions. 
Simulated results (gem5) show that CACR improves performance by up to 29 percent and reduces dynamic energy by up to 24.2 percent on average, for a set of applications with predicated execution. The baseline CR only achieves 18.6 percent performance and 14 percent energy improvements for the same configuration and applications.
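
The compaction/restoration idea described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware mechanism or compiler pass from the article: the vector length `VLEN`, the function name `compact_execute_restore`, and the sample data are all hypothetical, and the "dense vector operation" is simulated with a plain Python list.

```python
from typing import Callable, List, Tuple

VLEN = 4  # assumed vector length (elements per SIMD register); hypothetical


def compact_execute_restore(
    values: List[List[int]],
    masks: List[List[bool]],
    op: Callable[[int], int],
) -> Tuple[List[List[int]], int]:
    """Apply op to active elements only, packing them into dense groups.

    values[i] and masks[i] model the operand and predicate mask of the same
    instruction in loop iteration i. Returns the restored per-iteration
    results and the number of dense groups that were executed.
    """
    # Compaction: collect the (iteration, lane) slot of every active element.
    active = [
        (i, lane)
        for i, mask in enumerate(masks)
        for lane, on in enumerate(mask)
        if on
    ]
    results = [row[:] for row in values]  # inactive lanes keep their old value
    dense_ops = 0
    for start in range(0, len(active), VLEN):
        group = active[start:start + VLEN]
        # One simulated dense vector operation over up to VLEN active elements.
        dense = [op(values[i][lane]) for i, lane in group]
        dense_ops += 1
        # Restoration: write each result back to its original slot.
        for (i, lane), result in zip(group, dense):
            results[i][lane] = result
    return results, dense_ops


values = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
masks = [
    [True, False, False, False],
    [False, True, False, True],
    [False, False, False, False],
    [True, False, False, False],
]
restored, dense_ops = compact_execute_restore(values, masks, lambda x: x * 10)
```

In this toy input, four loop iterations contain four active elements in total, so the model executes a single dense group and then scatters the results back to their original positions.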


Pages: 779-791

Page count: 13

