On the Performance Prediction of BLAS-based Tensor Contractions

Cited by: 11
Authors
Peise, Elmar [1 ]
Fabregat-Traver, Diego [1 ]
Bientinesi, Paolo [1 ]
Affiliation
[1] Rhein Westfal TH Aachen, AICES, D-52062 Aachen, Germany
Keywords
SET;
DOI
10.1007/978-3-319-17248-4_10
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Tensor operations are surging as the computational building blocks for a variety of scientific simulations, and the development of high-performance kernels for such operations is known to be a challenging task. While standardized interfaces and highly optimized libraries (BLAS) exist for operations on one- and two-dimensional tensors, neither standards nor highly tuned implementations exist yet for higher-dimensional tensors. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating high-performance implementations by resorting to sequences of BLAS kernels. The approach consists of breaking the contraction down into operations that involve only matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in this family without executing them. This goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.
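To make the decomposition idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes an illustrative contraction C[a,b,c] = sum_i A[a,i] * B[i,b,c] and shows two of the many possible ways to express it with matrix or vector kernels. The helpers `gemm` and `gemv` are hypothetical plain-Python stand-ins for the BLAS kernels of the same name; a real implementation would call an optimized BLAS library instead.

```python
# Sketch: two decompositions of C[a,b,c] = sum_i A[a,i] * B[i,b,c]
# into matrix/vector kernels. Tensors are nested lists; gemm and gemv
# are plain-Python stand-ins (assumptions) for the BLAS kernels.

def gemv(M, x):
    """Matrix-vector product y = M x (stand-in for BLAS gemv)."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def gemm(M, N):
    """Matrix-matrix product M N (stand-in for BLAS gemm)."""
    cols = list(zip(*N))
    return [[sum(m * n for m, n in zip(row, col)) for col in cols] for row in M]

def zeros(na, nb, nc):
    """Allocate an na x nb x nc tensor filled with 0.0."""
    return [[[0.0] * nc for _ in range(nb)] for _ in range(na)]

def contract_gemm(A, B):
    """Algorithm 1: one gemm per slice c, C[:, :, c] = A * B[:, :, c]."""
    na, ni, nb, nc = len(A), len(B), len(B[0]), len(B[0][0])
    C = zeros(na, nb, nc)
    for c in range(nc):
        B_c = [[B[i][b][c] for b in range(nb)] for i in range(ni)]
        P = gemm(A, B_c)
        for a in range(na):
            for b in range(nb):
                C[a][b][c] = P[a][b]
    return C

def contract_gemv(A, B):
    """Algorithm 2: one gemv per pair (b, c), C[:, b, c] = A * B[:, b, c]."""
    na, ni, nb, nc = len(A), len(B), len(B[0]), len(B[0][0])
    C = zeros(na, nb, nc)
    for b in range(nb):
        for c in range(nc):
            x = [B[i][b][c] for i in range(ni)]
            y = gemv(A, x)
            for a in range(na):
                C[a][b][c] = y[a]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]        # indices (a, i)
B = [[[1.0, 0.0], [0.0, 1.0]],      # indices (i, b, c)
     [[2.0, 1.0], [1.0, 2.0]]]
C_gemm = contract_gemm(A, B)
C_gemv = contract_gemv(A, B)
```

Both functions compute the same tensor but issue different kernel sequences (a few large matrix products versus many small matrix-vector products). The abstract's point is that such alternatives can differ widely in speed, and that their ranking can be predicted from kernel micro-benchmarks rather than by running each one.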
Pages: 193-212
Page count: 20