Design of a high-performance tensor-matrix multiplication with BLAS

Cited by: 0

Authors:

Bassoy, Cem Savas [1]

Affiliations:

[1] Hamburg Univ Technol, Schwarzenberg str 95, D-21071 Hamburg, Germany

Source:

JOURNAL OF COMPUTATIONAL SCIENCE | 2025 / Vol. 87

Keywords:

Tensor contraction; Tensor-times-matrix multiplication; High-performance computing; Tensor methods;

DOI:

10.1016/j.jocs.2025.102568

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Subject classification code:

081203 ; 0835 ;

Abstract:

The tensor-matrix multiplication (TTM) is a basic tensor operation required by various tensor methods such as the HOSVD. This paper presents flexible high-performance algorithms that compute the tensor-matrix product according to the Loops-over-GEMM (LOG) approach. The proposed algorithms can process dense tensors with any linear tensor layout and with arbitrary tensor order and dimensions, all of which can be specified at runtime. The paper discusses two slicing methods with orthogonal parallelization strategies and proposes four algorithms that call BLAS with subtensors or tensor slices. It also provides a simple heuristic that selects one of the four proposed algorithms at runtime. All algorithms have been evaluated on a large set of tensors with various tensor shapes and linear tensor layouts. For large tensor slices, the best-performing algorithm achieves a median performance of 2.47 TFLOPS on an Intel Xeon Gold 5318Y and 2.93 TFLOPS on an AMD EPYC 9354. Furthermore, it outperforms the batched GEMM implementation of Intel MKL by a factor of 2.57 with large tensor slices. The runtime tests show that the best-performing algorithm is, on average, at least 6.21% and up to 334.31% faster than frameworks implementing state-of-the-art approaches, including actively developed libraries such as Libtorch and Eigen. For the majority of tensor shapes, it is on par with TBLIS, which uses optimized kernels for the TTM computation. The algorithm performs better than all other competing implementations for the majority of real-world tensors from the SDRBench, reaching a speedup of 2x or more for some tensor instances. This work is an extended version of "Fast and Layout-Oblivious Tensor-Matrix Multiplication with BLAS" (Başsoy 2024).
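
The Loops-over-GEMM idea named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a dense tensor stored in first-order (column-major) layout only, uses a zero-based mode index q, and replaces the BLAS call with a pure-Python stand-in. The names `gemm_colmajor` and `ttm_log` are hypothetical and chosen for illustration.

```python
from math import prod


def gemm_colmajor(a, a_off, b, c, c_off, rows, inner, cols):
    """Stand-in for a BLAS GEMM call (assumption): computes C = A * B^T.

    A (rows x inner) and C (rows x cols) are column-major views into the
    flat lists a and c, starting at a_off / c_off, with leading dimension
    `rows`. B is given as a list of `cols` rows, each of length `inner`.
    """
    for j in range(cols):
        for i in range(rows):
            acc = 0.0
            for k in range(inner):
                acc += a[a_off + i + k * rows] * b[j][k]
            c[c_off + i + j * rows] = acc


def ttm_log(a, shape, b, q):
    """Mode-q tensor-times-matrix product as a loop over GEMM calls.

    a     : flat list holding the tensor in column-major order
    shape : tuple of tensor dimensions (n_0, ..., n_{p-1})
    b     : matrix with m rows and n_q columns
    q     : zero-based contraction mode
    """
    shape = tuple(shape)
    m, nq = len(b), shape[q]
    left = prod(shape[:q])        # elements spanned by the modes before q
    right = prod(shape[q + 1:])   # number of slices, one GEMM call each
    c_shape = shape[:q] + (m,) + shape[q + 1:]
    c = [0.0] * (left * m * right)
    for r in range(right):
        # Slice r of A is a (left x n_q) matrix, slice r of C is (left x m).
        gemm_colmajor(a, r * left * nq, b, c, r * left * m, left, nq, m)
    return c, c_shape
```

For example, `ttm_log([float(v) for v in range(12)], (2, 3, 2), [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]], 1)` contracts the second mode of a 2 x 3 x 2 tensor with a 2 x 3 matrix and returns a flat result of shape (2, 2, 2). In a practical setting the inner stand-in would be replaced by a real GEMM routine, and the loop over slices is where parallelization could be applied.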


Pages: 13

50 entries in total

[1] Fast and Layout-Oblivious Tensor-Matrix Multiplication with BLAS
Bassoy, Cem Savas
COMPUTATIONAL SCIENCE, ICCS 2024, PT I, 2024, 14832 : 256 - 271
[2] Design of a High-Performance Tensor-Vector Multiplication with BLAS
Bassoy, Cem
COMPUTATIONAL SCIENCE - ICCS 2019, PT I, 2019, 11536 : 32 - 45
[3] SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication
Smith, Shaden
Ravindran, Niranjay
Sidiropoulos, Nicholas D.
Karypis, George
2015 IEEE 29TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2015, : 61 - 70
[4] A Pipelined Implementation of the n-mode Tensor-Matrix Multiplication
Ragusa, Edoardo
Gianoglio, Christian
Zunino, Rodolfo
Valle, Maurizio
Gastaldo, Paolo
2022 29TH IEEE INTERNATIONAL CONFERENCE ON ELECTRONICS, CIRCUITS AND SYSTEMS (IEEE ICECS 2022), 2022,
[5] Design of a High-Performance GEMM-like Tensor-Tensor Multiplication
Springer, Paul
Bientinesi, Paolo
ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2018, 44 (03):
[6] Efficient Digital Implementation of n-mode Tensor-Matrix Multiplication
Gianoglio, Christian
Ragusa, Edoardo
Zunino, Rodolfo
Gastaldo, Paolo
2021 IEEE 3RD INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE CIRCUITS AND SYSTEMS (AICAS), 2021,
[7] Anatomy of high-performance matrix multiplication
Goto, Kazushige
Van De Geijn, Robert A.
ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2008, 34 (03):
[8] A family of high-performance matrix multiplication algorithms
Gunnels, JA
Gustavson, FG
Henry, GM
van de Geijn, RA
APPLIED PARALLEL COMPUTING: STATE OF THE ART IN SCIENTIFIC COMPUTING, 2006, 3732 : 256 - 265
[9] The Design and Performance of Batched BLAS on Modern High-Performance Computing Systems
Dongarra, Jack
Hammarling, Sven
Higham, Nicholas J.
Relton, Samuel D.
Valero-Lara, Pedro
Zounon, Mawussi
INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE (ICCS 2017), 2017, 108 : 495 - 504
[10] Design of three high-performance concurrent systolic arrays for band matrix multiplication
Yang, Y
Zhao, WQ
CHINESE JOURNAL OF ELECTRONICS, 2005, 14 (04): : 559 - 563
