Design of a high-performance tensor-matrix multiplication with BLAS

Cited by: 0
Authors
Bassoy, Cem Savas [1]
Affiliation
[1] Hamburg Univ Technol, Schwarzenbergstr. 95, D-21071 Hamburg, Germany
Keywords
Tensor contraction; Tensor-times-matrix multiplication; High-performance computing; Tensor methods
DOI
10.1016/j.jocs.2025.102568
Chinese Library Classification
TP39 [Computer applications]
Discipline codes
081203; 0835
Abstract
The tensor-matrix multiplication (TTM) is a basic tensor operation required by various tensor methods such as the HOSVD. This paper presents flexible high-performance algorithms that compute the tensor-matrix product according to the Loops-over-GEMM (LOG) approach. The proposed algorithms can process dense tensors with any linear tensor layout, arbitrary tensor order and dimensions, all of which can be runtime variable. The paper discusses two slicing methods with orthogonal parallelization strategies and proposes four algorithms that call BLAS with subtensors or tensor slices. It also provides a simple heuristic that selects one of the four proposed algorithms at runtime. All algorithms have been evaluated on a large set of tensors with various tensor shapes and linear tensor layouts. For large tensor slices, our best-performing algorithm achieves a median performance of 2.47 TFLOPS on an Intel Xeon Gold 5318Y and 2.93 TFLOPS on an AMD EPYC 9354. Furthermore, it outperforms the batched GEMM implementation of Intel MKL by a factor of 2.57 with large tensor slices. Our runtime tests show that our best-performing algorithm is, on average, at least 6.21% and up to 334.31% faster than frameworks implementing state-of-the-art approaches, including actively developed libraries such as Libtorch and Eigen. For the majority of tensor shapes, it is on par with TBLIS, which uses optimized kernels for the TTM computation. Our algorithm performs better than all other competing implementations for the majority of real-world tensors from the SDRBench, reaching a speedup of 2x or more for some tensor instances. This work is an extended version of "Fast and Layout-Oblivious Tensor-Matrix Multiplication with BLAS" (Başsoy 2024).
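For readers unfamiliar with the operation, a minimal sketch of a mode-q tensor-times-matrix product computed in the Loops-over-GEMM style follows: all modes except the contraction mode q and one paired free mode p are fixed, and each resulting two-dimensional slice is handled by a single GEMM call. The function name ttm_log, the choice of the paired mode p, and the use of numpy.matmul as a stand-in for a BLAS GEMM are illustrative assumptions; this is not the paper's implementation, which handles arbitrary linear tensor layouts and selects among several slicing and parallelization variants.

    # Illustrative sketch only: mode-q tensor-times-matrix product C = A x_q B,
    # C[..., m, ...] = sum_k B[m, k] * A[..., k, ...], computed Loops-over-GEMM style.
    # ttm_log and the mode pairing (p, q) are assumptions for this example;
    # numpy.matmul stands in for a BLAS GEMM call on each two-dimensional slice.
    import itertools
    import numpy as np

    def ttm_log(A: np.ndarray, B: np.ndarray, q: int) -> np.ndarray:
        assert A.ndim >= 2 and A.shape[q] == B.shape[1]
        out_shape = list(A.shape)
        out_shape[q] = B.shape[0]
        C = np.empty(out_shape, dtype=np.result_type(A, B))
        # Pair the contraction mode q with one free mode p; every slice obtained by
        # fixing all remaining indices is then a matrix that one GEMM can process.
        p = 0 if q != 0 else 1
        loop_modes = [r for r in range(A.ndim) if r not in (p, q)]
        for idx in itertools.product(*(range(A.shape[r]) for r in loop_modes)):
            sel = [slice(None)] * A.ndim
            for r, i in zip(loop_modes, idx):
                sel[r] = i
            slab = A[tuple(sel)]  # 2-d slice; its axes keep the relative order of p and q
            if q < p:
                C[tuple(sel)] = B @ slab       # (m, n_q) @ (n_q, n_p)
            else:
                C[tuple(sel)] = slab @ B.T     # (n_p, n_q) @ (n_q, m)
        return C

For example, with A of shape (4, 5, 6) and B of shape (7, 5), ttm_log(A, B, 1) returns a tensor of shape (4, 7, 6) and agrees with np.einsum('ikl,jk->ijl', A, B). Parallelizing the outer loop versus relying on a multithreaded GEMM inside each slice illustrates the kind of orthogonal parallelization strategies the abstract refers to.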
Pages: 13