Optimizing parallel GEMM routines using auto-tuning with Intel AVX-512

Cited by: 16
Authors
Kim, Raehyun [1 ]
Choi, Jaeyoung [1 ]
Lee, Myungho [2 ]
Affiliations
[1] Soongsil Univ, Seoul, South Korea
[2] Myongji Univ, Yongin, Gyeonggi, South Korea
Keywords
Manycore; Intel Xeon; Intel Xeon Phi; Auto-tuning; Matrix-matrix multiplication; AVX-512
DOI
10.1145/3293320.3293334
CLC Number (Chinese Library Classification)
TP301 [Theory and Methods]
Subject Classification Code
081202
Abstract
This paper presents optimal implementations of single- and double-precision general matrix-matrix multiplication (GEMM) routines for the Intel Xeon Phi processor code-named Knights Landing (KNL) and the Intel Xeon Scalable processors, based on an auto-tuning approach using Intel AVX-512 intrinsic functions. Our auto-tuning approach precisely determines parameters that reflect the target architecture's features. It significantly reduces the search space and derives optimal parameter sets, including submatrix sizes, prefetch distances, loop unrolling depths, and the parallelization scheme. Without a single line of assembly code, our GEMM kernels perform comparably to the Intel MKL and outperform other open-source BLAS libraries.
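The abstract describes AVX-512 intrinsic-based GEMM kernels whose blocking, prefetching, and unrolling parameters are chosen by an auto-tuner. The following is a minimal sketch of what such a double-precision micro-kernel can look like; the 8x8 register block, the packing layout, the prefetch distance PF_DIST, and the name dgemm_ukernel_8x8 are illustrative assumptions, not the code or parameters selected by the paper's auto-tuner.

#include <immintrin.h>   /* AVX-512 intrinsics; compile with -mavx512f */
#include <stddef.h>

#define MR       8    /* rows of C kept in registers (assumed value)      */
#define NR       8    /* one __m512d register holds 8 doubles             */
#define PF_DIST 64    /* software prefetch distance in k (assumed value)  */

/* C[MR x NR] += A[MR x K] * B[K x NR].
 * A_pack stores A column by column (A_pack[k*MR + i] = A[i][k]) and
 * B_pack stores B row by row, as is usual for packed GEMM panels. */
void dgemm_ukernel_8x8(size_t K, const double *A_pack,
                       const double *B_pack, double *C, size_t ldc)
{
    __m512d acc[MR];
    for (int i = 0; i < MR; ++i)                 /* load the C block       */
        acc[i] = _mm512_loadu_pd(&C[i * ldc]);

    for (size_t k = 0; k < K; ++k) {
        /* prefetch an upcoming row of the packed B panel */
        _mm_prefetch((const char *)&B_pack[(k + PF_DIST) * NR], _MM_HINT_T0);

        __m512d b = _mm512_loadu_pd(&B_pack[k * NR]);
        for (int i = 0; i < MR; ++i) {
            /* broadcast one A element and fuse multiply-add into C row i */
            __m512d a = _mm512_set1_pd(A_pack[k * MR + i]);
            acc[i] = _mm512_fmadd_pd(a, b, acc[i]);
        }
    }

    for (int i = 0; i < MR; ++i)                 /* write the C block back */
        _mm512_storeu_pd(&C[i * ldc], acc[i]);
}

An auto-tuner in this style would treat the register-block shape, the prefetch distance, and the unrolling of the k-loop as search parameters, benchmark candidate kernels on the target processor (KNL or Xeon Scalable), and keep the fastest combination.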
Pages: 101 - 110
Page count: 10
Related papers
50 records in total
  • [31] SWIMM 2.0: Enhanced Smith-Waterman on Intel's Multicore and Manycore Architectures Based on AVX-512 Vector Extensions
    Rucci, Enzo
    Garcia Sanchez, Carlos
    Botella Juan, Guillermo
    De Giusti, Armando
    Naiouf, Marcelo
    Prieto-Matias, Manuel
    INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 2019, 47 (02) : 296 - 316
  • [32] Optimizing and Auto-Tuning Scale-Free Sparse Matrix-Vector Multiplication on Intel Xeon Phi
    Tang, Wai Teng
    Zhao, Ruizhe
    Lu, Mian
    Liang, Yun
    Huynh Phung Huyng
    Li, Xibai
    Goh, Rick Siow Mong
    2015 IEEE/ACM INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION (CGO), 2015, : 136 - 145
  • [33] Accelerating Large Integer Multiplication Using Intel AVX-512IFMA
    Edamatsu, Takuya
    Takahashi, Daisuke
    ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING (ICA3PP 2019), PT I, 2020, 11944 : 60 - 74
  • [34] Improving blocked matrix-matrix multiplication routine by utilizing AVX-512 instructions on Intel Knights Landing and Xeon Scalable processors
    Park, Yoosang
    Kim, Raehyun
    Nguyen, Thi My Tuyen
    Choi, Jaeyoung
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2023, 26 (05) : 2539 - 2549
  • [36] Optimization of a sparse grid-based data mining kernel for architectures using AVX-512
    Sarbu, Paul-Cristian
    Bungartz, Hans-Joachim
    2018 30TH INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD 2018), 2018, : 364 - 371
  • [37] Formal Techniques for Development and Auto-tuning of Parallel Programs
    Doroshenko A.
    Ivanenko P.
    Yatsenko O.
    SN Computer Science, 4 (2)
  • [38] Taming Parallel I/O Complexity with Auto-Tuning
    Behzad, Babak
    Huong Vu Thanh Luu
    Huchette, Joseph
    Byna, Surendra
    Prabhat
    Aydt, Ruth
    Koziol, Quincey
    Snir, Marc
    2013 INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC), 2013,
  • [39] Knowledge discovery in auto-tuning parallel numerical library
    Kuroda, Hisayasu
    Katagiri, Takahiro
    Kanada, Yasumasa
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2002, 2281 : 628 - 639
  • [40] Auto-tuning Mapping Strategy for Parallel CFD Program
    Liu Fang
    Wang Zhenghua
    Che Yonggang
    2012 FIFTH INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE AND DESIGN (ISCID 2012), VOL 1, 2012, : 222 - 226