Outer-Loop Vectorization - Revisited for Short SIMD Architectures

Cited by: 61
Authors
Nuzman, Dorit [1]
Zaks, Ayal [1]
Affiliation
[1] IBM Corp, Haifa Res Lab, Haifa, Israel
Keywords
SIMD; vectorization; subword parallelism; data reuse
DOI
10.1145/1454115.1454119
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Vectorization has been an important method of exploiting data-level parallelism to accelerate scientific workloads on vector machines such as Cray for the past three decades. In the last decade it has also proven useful for accelerating multimedia and embedded applications on short SIMD architectures such as MMX, SSE and AltiVec. Most of the focus has been directed at innermost loops, effectively executing their iterations concurrently as much as possible. Outer-loop vectorization refers to vectorizing a level of a loop nest other than the innermost, which can be beneficial if the outer loop exhibits greater data-level parallelism and locality than the innermost loop. Outer-loop vectorization has traditionally been performed by interchanging an outer loop with the innermost loop, followed by vectorizing it at the innermost position. A more direct unroll-and-jam approach can be used to vectorize an outer loop without involving loop interchange, which can be especially suitable for short SIMD architectures. In this paper we revisit the method of outer-loop vectorization, paying special attention to properties of modern short SIMD architectures. We show that even though current optimizing compilers for such targets do not apply outer-loop vectorization in general, it can provide significant performance improvements over innermost-loop vectorization. Our implementation of direct outer-loop vectorization, available in GCC 4.3, achieves speedup factors of 3.13 and 2.77 on average across a set of benchmarks, compared to 1.53 and 1.39 achieved by innermost-loop vectorization, when running on Cell BE SPU and PowerPC 970 processors, respectively. Moreover, outer-loop vectorization provides new reuse opportunities that can be vital for such short SIMD architectures, including efficient handling of alignment. We present an optimization tapping such opportunities, capable of further boosting the performance obtained by outer-loop vectorization to achieve average speedup factors of 5.26 and 3.64.
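The unroll-and-jam approach mentioned in the abstract can be illustrated with a minimal sketch in plain C. This is not the GCC 4.3 implementation described in the paper; the FIR-style loop nest, the function names `fir_scalar` and `fir_outer`, and the vectorization factor `VF = 4` are assumptions chosen for illustration. The outer loop over `i` is unrolled by `VF` and the copies are jammed into the inner loop, so each lane accumulates a separate output element and no cross-lane reduction is needed. The caller is assumed to supply at least `n + taps - 1` input elements.

```c
#include <stddef.h>

#define VF 4 /* assumed vectorization factor, e.g. four 32-bit lanes */

/* Scalar reference: out[i] = sum over j of in[i + j] * coef[j]. */
void fir_scalar(const int *in, const int *coef, int *out,
                size_t n, size_t taps)
{
    for (size_t i = 0; i < n; i++) {
        int acc = 0;
        for (size_t j = 0; j < taps; j++)
            acc += in[i + j] * coef[j];
        out[i] = acc;
    }
}

/* Outer loop unrolled by VF and jammed into the inner loop.
 * The short loops over `lane` stand in for single SIMD operations. */
void fir_outer(const int *in, const int *coef, int *out,
               size_t n, size_t taps)
{
    size_t i = 0;
    for (; i + VF <= n; i += VF) {
        int acc[VF] = {0};              /* one accumulator per lane */
        for (size_t j = 0; j < taps; j++) {
            int c = coef[j];            /* same coefficient in every lane */
            for (size_t lane = 0; lane < VF; lane++)
                acc[lane] += in[i + lane + j] * c;
        }
        for (size_t lane = 0; lane < VF; lane++)
            out[i + lane] = acc[lane];  /* contiguous store of VF results */
    }
    /* Scalar epilogue for the remaining n % VF outer iterations. */
    for (; i < n; i++) {
        int acc = 0;
        for (size_t j = 0; j < taps; j++)
            acc += in[i + j] * coef[j];
        out[i] = acc;
    }
}
```

In this sketch the vector load of `in[i + j .. i + j + VF - 1]` shifts by one element on each inner iteration, so its alignment changes as `j` advances and consecutive loads overlap. That is the kind of access pattern where the reuse and alignment handling mentioned in the abstract would plausibly matter.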
Pages: 2-11
Number of pages: 10