Design of a high-performance tensor-matrix multiplication with BLAS

Cited: 0
Authors:
Bassoy, Cem Savas [1]
Affiliation:
[1] Hamburg Univ Technol, Schwarzenbergstr. 95, D-21071 Hamburg, Germany
Keywords:
Tensor contraction; Tensor-times-matrix multiplication; High-performance computing; Tensor methods
DOI:
10.1016/j.jocs.2025.102568
Chinese Library Classification:
TP39 [Computer Applications]
Discipline Codes:
081203; 0835
Abstract:
The tensor-matrix multiplication (TTM) is a basic tensor operation required by various tensor methods such as the HOSVD. This paper presents flexible high-performance algorithms that compute the tensor-matrix product according to the Loops-over-GEMM (LOG) approach. The proposed algorithms can process dense tensors with any linear tensor layout, arbitrary tensor order, and arbitrary dimensions, all of which can be runtime variable. The paper discusses two slicing methods with orthogonal parallelization strategies and proposes four algorithms that call BLAS with subtensors or tensor slices. It also provides a simple heuristic that selects one of the four proposed algorithms at runtime. All algorithms have been evaluated on a large set of tensors with various tensor shapes and linear tensor layouts. In the case of large tensor slices, our best-performing algorithm achieves a median performance of 2.47 TFLOPS on an Intel Xeon Gold 5318Y and 2.93 TFLOPS on an AMD EPYC 9354. Furthermore, it outperforms the batched GEMM implementation of Intel MKL by a factor of 2.57 with large tensor slices. Our runtime tests show that our best-performing algorithm is, on average, at least 6.21% and up to 334.31% faster than frameworks implementing state-of-the-art approaches, including actively developed libraries such as Libtorch and Eigen. For the majority of tensor shapes, it is on par with TBLIS, which uses optimized kernels for the TTM computation. Our algorithm performs better than all other competing implementations for the majority of real-world tensors from SDRBench, reaching a speedup of 2x or more for some tensor instances. This work is an extended version of "Fast and Layout-Oblivious Tensor-Matrix Multiplication with BLAS" (Başsoy, 2024).
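To make the LOG approach concrete, the following is a minimal sketch of a mode-2 TTM for an order-3 tensor in the column-major first-order layout, where every lateral slice A(:,:,k) is contiguous and can be handed to BLAS without copying. This is an illustration under stated assumptions, not the paper's implementation: the function name ttm_mode2_log, the fixed mode and layout, the CBLAS interface (e.g., OpenBLAS), and the all-ones test data are ours; the paper's algorithms handle arbitrary order, contraction mode, and linear layout.

// Illustrative LOG-style mode-2 TTM sketch: C = A x_2 B.
// Assumes a column-major ("first-order") layout, order-3 tensor A (n1 x n2 x n3),
// matrix B (m x n2), result C (n1 x m x n3). Names and the fixed mode are
// hypothetical; they are not taken from the paper.
#include <cblas.h>
#include <cstddef>
#include <vector>

void ttm_mode2_log(const double* a, const double* b, double* c,
                   int n1, int n2, int n3, int m)
{
    // One GEMM per lateral slice A(:,:,k); slices are contiguous in this
    // layout, so no copies are needed. The loop could be parallelized with
    // OpenMP (calling single-threaded GEMMs) or run sequentially with a
    // multithreaded BLAS -- the two orthogonal strategies the abstract mentions.
    for (int k = 0; k < n3; ++k) {
        const double* a_k = a + static_cast<std::size_t>(k) * n1 * n2; // A(:,:,k)
        double*       c_k = c + static_cast<std::size_t>(k) * n1 * m;  // C(:,:,k)
        // C(:,:,k) = A(:,:,k) * B^T : (n1 x m) = (n1 x n2) * (n2 x m)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                    n1, m, n2,
                    1.0, a_k, n1, b, m,
                    0.0, c_k, n1);
    }
}

int main()
{
    const int n1 = 4, n2 = 3, n3 = 2, m = 5;
    std::vector<double> a(static_cast<std::size_t>(n1) * n2 * n3, 1.0);
    std::vector<double> b(static_cast<std::size_t>(m) * n2, 1.0);
    std::vector<double> c(static_cast<std::size_t>(n1) * m * n3, 0.0);
    ttm_mode2_log(a.data(), b.data(), c.data(), n1, n2, n3, m);
    // Each entry of C now equals n2 (= 3.0), since A and B are all-ones.
    return 0;
}

Note that a mode-1 product in this layout degenerates to a single GEMM on the n1 x (n2*n3) matricization of A; such fused, larger GEMM calls correspond to the subtensor (as opposed to slice-wise) variant of BLAS invocation mentioned in the abstract.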
Pages: 13
Related Papers (50 total)
  • [21] Fault-tolerant high-performance matrix multiplication: Theory and practice
    Gunnels, JA
    Katz, DS
    Quintana-Ortí, ES
    van de Geijn, RA
    INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS, PROCEEDINGS, 2001, : 47 - 56
  • [22] High-performance FIR filter design based on sharing multiplication
    Park, J
    Muhammad, K
    Roy, K
    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2003, 11 (02) : 244 - 253
  • [23] DESIGN CONSIDERATIONS FOR HIGH-PERFORMANCE AVALANCHE PHOTODIODE MULTIPLICATION LAYERS
    CHANDRAMOULI, V
    MAZIAR, CM
    CAMPBELL, JC
    IEEE TRANSACTIONS ON ELECTRON DEVICES, 1994, 41 (05) : 648 - 654
  • [24] Matrix Multiplication with Guaranteed Accuracy by Level 3 BLAS
    Ozaki, Katsuhisa
    Ogita, Takeshi
    Oishi, Shin'ichi
    INTERNATIONAL CONFERENCE OF COMPUTATIONAL METHODS IN SCIENCES AND ENGINEERING 2009 (ICCMSE 2009), 2012, 1504 : 1128 - 1133
  • [25] Design Patterns for High-Performance Matrix Computations
    Son, Hoang M.
    MODELING, SIMULATION AND OPTIMIZATION OF COMPLEX PROCESSES, 2008, : 509 - 519
  • [26] High-performance implementation of the level-3 BLAS
    Goto, Kazushige
    Van De Geijn, Robert
    ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2008, 35 (01) : 1 - 14
  • [27] Is search really necessary to generate high-performance BLAS?
    Yotov, K
    Li, XM
    Ren, G
    Garzarán, M
    Padua, D
    Pingali, K
    Stodghill, P
    PROCEEDINGS OF THE IEEE, 2005, 93 (02) : 358 - 386
  • [28] SpWMM: A High-Performance Sparse-Winograd Matrix-Matrix Multiplication Accelerator for CNNs
    Wu, Di
    Cao, Wei
    Wang, Lingli
    2019 INTERNATIONAL CONFERENCE ON FIELD-PROGRAMMABLE TECHNOLOGY (ICFPT 2019), 2019, : 255 - 258
  • [29] SWM: A High-Performance Sparse-Winograd Matrix Multiplication CNN Accelerator
    Wu, Di
    Fan, Xitian
    Cao, Wei
    Wang, Lingli
    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2021, 29 (05) : 936 - 949
  • [30] Automatically Generating High-performance Matrix Multiplication Kernels on the Latest Sunway Processor
    Tao, Xiaohan
    Zhu, Yu
    Wang, Boyang
    Xu, Jinlong
    Pang, Jianmin
    Zhao, Jie
    51ST INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2022, 2022,