Optimizing parallel GEMM routines using auto-tuning with Intel AVX-512

Cited by: 16
Authors
Kim, Raehyun [1 ]
Choi, Jaeyoung [1 ]
Lee, Myungho [2 ]
Affiliations
[1] Soongsil Univ, Seoul, South Korea
[2] Myongji Univ, Yongin, Gyeonggi, South Korea
Keywords
Manycore; Intel Xeon; Intel Xeon Phi; Auto-tuning; Matrix-matrix multiplication; AVX-512
DOI
10.1145/3293320.3293334
CLC Number (Chinese Library Classification)
TP301 [Theory and Methods]
Subject Classification Code
081202
Abstract
This paper presents optimal implementations of single- and double-precision general matrix-matrix multiplication (GEMM) routines for the Intel Xeon Phi processor code-named Knights Landing (KNL) and the Intel Xeon Scalable processors, based on an auto-tuning approach using Intel AVX-512 intrinsic functions. Our auto-tuning approach precisely determines parameters that reflect the target architecture's features. It significantly reduces the search space and derives optimal parameter sets, including submatrix sizes, prefetch distances, loop unrolling depths, and the parallelization scheme. Without a single line of assembly code, our GEMM kernels perform comparably to the Intel MKL and outperform other open-source BLAS libraries.
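The abstract describes AVX-512 intrinsic-based GEMM kernels whose blocking, prefetching, and unrolling parameters are chosen by an auto-tuner. The following is a minimal sketch of what such a double-precision micro-kernel can look like; the 8x8 register block, the packing layout, the prefetch distance PF_DIST, and the name dgemm_ukernel_8x8 are illustrative assumptions, not the code or parameters selected by the paper's auto-tuner.

#include <immintrin.h>   /* AVX-512 intrinsics; compile with -mavx512f */
#include <stddef.h>

#define MR       8    /* rows of C kept in registers (assumed value)      */
#define NR       8    /* one __m512d register holds 8 doubles             */
#define PF_DIST 64    /* software prefetch distance in k (assumed value)  */

/* C[MR x NR] += A[MR x K] * B[K x NR].
 * A_pack stores A column by column (A_pack[k*MR + i] = A[i][k]) and
 * B_pack stores B row by row, as is usual for packed GEMM panels. */
void dgemm_ukernel_8x8(size_t K, const double *A_pack,
                       const double *B_pack, double *C, size_t ldc)
{
    __m512d acc[MR];
    for (int i = 0; i < MR; ++i)                 /* load the C block       */
        acc[i] = _mm512_loadu_pd(&C[i * ldc]);

    for (size_t k = 0; k < K; ++k) {
        /* prefetch an upcoming row of the packed B panel */
        _mm_prefetch((const char *)&B_pack[(k + PF_DIST) * NR], _MM_HINT_T0);

        __m512d b = _mm512_loadu_pd(&B_pack[k * NR]);
        for (int i = 0; i < MR; ++i) {
            /* broadcast one A element and fuse multiply-add into C row i */
            __m512d a = _mm512_set1_pd(A_pack[k * MR + i]);
            acc[i] = _mm512_fmadd_pd(a, b, acc[i]);
        }
    }

    for (int i = 0; i < MR; ++i)                 /* write the C block back */
        _mm512_storeu_pd(&C[i * ldc], acc[i]);
}

An auto-tuner in this style would treat the register-block shape, the prefetch distance, and the unrolling of the k-loop as search parameters, benchmark candidate kernels on the target processor (KNL or Xeon Scalable), and keep the fastest combination.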
Pages: 101 - 110
Page count: 10
Related papers
50 records in total
  • [31] SWIMM 2.0: Enhanced Smith-Waterman on Intel's Multicore and Manycore Architectures Based on AVX-512 Vector Extensions
    Rucci, Enzo
    Garcia Sanchez, Carlos
    Botella Juan, Guillermo
    De Giusti, Armando
    Naiouf, Marcelo
    Prieto-Matias, Manuel
    INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 2019, 47 (02) : 296 - 316
  • [32] Optimizing and Auto-Tuning Scale-Free Sparse Matrix-Vector Multiplication on Intel Xeon Phi
    Tang, Wai Teng
    Zhao, Ruizhe
    Lu, Mian
    Liang, Yun
    Huynh Phung Huyng
    Li, Xibai
    Goh, Rick Siow Mong
    2015 IEEE/ACM INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION (CGO), 2015, : 136 - 145
  • [33] Accelerating Large Integer Multiplication Using Intel AVX-512IFMA
    Edamatsu, Takuya
    Takahashi, Daisuke
    ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING (ICA3PP 2019), PT I, 2020, 11944 : 60 - 74
  • [34] Improving blocked matrix-matrix multiplication routine by utilizing AVX-512 instructions on Intel Knights Landing and Xeon Scalable processors
    Park, Yoosang
    Kim, Raehyun
    Nguyen, Thi My Tuyen
    Choi, Jaeyoung
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2023, 26 (05) : 2539 - 2549
  • [36] Optimization of a sparse grid-based data mining kernel for architectures using AVX-512
    Sarbu, Paul-Cristian
    Bungartz, Hans-Joachim
    2018 30TH INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD 2018), 2018, : 364 - 371
  • [37] Formal Techniques for Development and Auto-tuning of Parallel Programs
    Doroshenko A.
    Ivanenko P.
    Yatsenko O.
    SN Computer Science, 4 (2)
  • [38] Taming Parallel I/O Complexity with Auto-Tuning
    Behzad, Babak
    Huong Vu Thanh Luu
    Huchette, Joseph
    Byna, Surendra
    Prabhat
    Aydt, Ruth
    Koziol, Quincey
    Snir, Marc
    2013 INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC), 2013,
  • [39] Knowledge discovery in auto-tuning parallel numerical library
    Kuroda, Hisayasu
    Katagiri, Takahiro
    Kanada, Yasumasa
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2002, 2281 : 628 - 639
  • [40] Auto-tuning Mapping Strategy for Parallel CFD Program
    Liu Fang
    Wang Zhenghua
    Che Yonggang
    2012 FIFTH INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE AND DESIGN (ISCID 2012), VOL 1, 2012, : 222 - 226