Design of a high-performance tensor-matrix multiplication with BLAS

Authors
Bassoy, Cem Savas [1]
Affiliation
[1] Hamburg Univ Technol, Schwarzenberg str 95, D-21071 Hamburg, Germany
Keywords
Tensor contraction; Tensor-times-matrix multiplication; High-performance computing; Tensor methods
DOI
10.1016/j.jocs.2025.102568
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The tensor-matrix multiplication (TTM) is a basic tensor operation required by various tensor methods such as the HOSVD. This paper presents flexible high-performance algorithms that compute the tensor-matrix product according to the Loops-over-GEMM (LOG) approach. The proposed algorithms can process dense tensors with any linear tensor layout and with arbitrary tensor order and dimensions, all of which can be runtime variable. The paper discusses two slicing methods with orthogonal parallelization strategies and proposes four algorithms that call BLAS with subtensors or tensor slices. It also provides a simple heuristic that selects one of the four proposed algorithms at runtime. All algorithms have been evaluated on a large set of tensors with various tensor shapes and linear tensor layouts. For large tensor slices, our best-performing algorithm achieves a median performance of 2.47 TFLOPS on an Intel Xeon Gold 5318Y and 2.93 TFLOPS on an AMD EPYC 9354. Furthermore, it outperforms the batched GEMM implementation of Intel MKL by a factor of 2.57 with large tensor slices. Our runtime tests show that our best-performing algorithm is, on average, at least 6.21% and up to 334.31% faster than frameworks implementing state-of-the-art approaches, including actively developed libraries such as Libtorch and Eigen. For the majority of tensor shapes, it is on par with TBLIS, which uses optimized kernels for the TTM computation. Our algorithm performs better than all other competing implementations for the majority of real-world tensors from the SDRBench, reaching a speedup of 2x or more for some tensor instances. This work is an extended version of "Fast and Layout-Oblivious Tensor-Matrix Multiplication with BLAS" (Başsoy 2024).
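To make the Loops-over-GEMM idea in the abstract concrete, the following is a minimal NumPy sketch and not the paper's implementation. It assumes a row-major (C-order) tensor, a zero-based mode index `q`, and a hypothetical function name `ttm_log`. The tensor is viewed as a stack of matrix slices, and each slice is multiplied by the matrix with `np.matmul`, which typically dispatches to a BLAS GEMM when NumPy is built against a BLAS library. The four algorithms, the parallelization strategies, and the runtime heuristic described in the paper are not reproduced here.

```python
import numpy as np

def ttm_log(A, B, q):
    """Mode-q tensor-times-matrix product computed as a loop over matrix products.

    A: dense tensor of shape (n_0, ..., n_{p-1}), assumed row-major (C-order).
    B: matrix of shape (m, n_q).
    q: zero-based contraction mode (an assumption of this sketch).
    Returns C with C.shape == A.shape except that axis q has length m.
    """
    n = A.shape
    m, nq = B.shape
    assert nq == n[q], "B must have as many columns as A has entries along mode q"

    left = int(np.prod(n[:q]))       # product of the dimensions before mode q
    right = int(np.prod(n[q + 1:]))  # product of the dimensions after mode q

    # View the tensor as `left` matrix slices of shape (n_q, right).
    A3 = np.ascontiguousarray(A).reshape(left, nq, right)
    C3 = np.empty((left, m, right), dtype=np.result_type(A, B))

    # One matrix product (GEMM) per slice: (m x n_q) @ (n_q x right).
    for l in range(left):
        np.matmul(B, A3[l], out=C3[l])

    return C3.reshape(n[:q] + (m,) + n[q + 1:])


# Usage: compare against a reference contraction for every mode.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 5))
for q in range(A.ndim):
    B = rng.standard_normal((2, A.shape[q]))
    reference = np.moveaxis(np.tensordot(B, A, axes=(1, q)), 0, q)
    assert np.allclose(ttm_log(A, B, q), reference)
```

With this layout, the loop runs once for `q == 0`, so the whole product is a single large GEMM, while for the last mode it degenerates into many small products. That imbalance hints at why slicing and algorithm selection matter for performance.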
Pages: 13