A High-dimensional Algorithm-Based Fault Tolerance Scheme

Cited by: 1
Authors
Fu, Xiang [1]
Tang, Hao [1]
Liao, Huimin [1]
Huang, Xin [1]
Xu, Wubiao [1]
Meng, Shiman [1]
Zhang, Weiping [1]
Guo, Luanzheng [2]
Sato, Kento [3]
Affiliations
[1] Nanchang Hangkong University, Nanchang, Jiangxi, People's Republic of China
[2] Pacific Northwest National Laboratory, Richland, WA, USA
[3] RIKEN, RCCS, Kobe, Hyogo, Japan
DOI
10.1109/IPDPSW59300.2023.00061
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Tensor algebra is a powerful tool for high-order data analytics in scientific applications such as finite element analysis, N-body simulation, and quantum chemistry. Many of these applications are critical in terms of correctness and safety. Because they often run on High Performance Computing (HPC) systems, which are susceptible to soft errors caused by cosmic rays, unstable voltage, and similar sources, their execution must be reliable and resilient, and their results must be highly trustworthy. However, traditional fault tolerance methods such as error-correcting codes cannot protect computations. Checkpointing and redundancy techniques such as triple modular redundancy (TMR) incur high performance overhead, while existing algorithm-based fault tolerance (ABFT) approaches target only 2D linear algebra computations and are therefore inefficient for tensor algebra computations. High-level tensor algebra computations can be decomposed into 2D linear algebra computations and protected by existing ABFT methods, but this often introduces unacceptable performance overhead. Hence, for the first time, we propose a collection of ABFT algorithms for three fundamental tensor algebra operations. These algorithms exploit the algorithmic semantics of the tensor algebra computations to achieve better performance.
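For readers unfamiliar with ABFT, the following is a minimal sketch of the classic checksum scheme for 2D matrix multiplication, the kind of existing method the abstract contrasts with. It is not the tensor algorithms proposed in the paper, which the abstract does not describe. The function names, the tolerance `tol`, and the injected error are all assumptions made for illustration.

```python
def matmul(a, b):
    """Plain product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def with_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append one column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def verify(cf, tol=1e-9):
    """Return (bad_rows, bad_cols): data rows and columns of the
    full-checksum product cf whose sums disagree with their checksums."""
    m, n = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(m)
                if abs(sum(cf[i][:n]) - cf[i][n]) > tol]
    bad_cols = [j for j in range(n)
                if abs(sum(cf[i][j] for i in range(m)) - cf[m][j]) > tol]
    return bad_rows, bad_cols


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

# Multiplying the checksum-extended inputs yields a product that carries
# its own row and column checksums.
cf = matmul(with_column_checksum(a), with_row_checksum(b))
clean = verify(cf)    # no mismatches expected

cf[0][1] += 1.0       # simulated soft error in one result element
faulty = verify(cf)   # the mismatching row and column locate the error
print(clean, faulty)
```

In this sketch a single corrupted element shows up as exactly one mismatching row and one mismatching column, so their intersection locates the error. Applying such a 2D scheme to a tensor operation requires first decomposing it into matrix products, which is the overhead the abstract refers to.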
Pages: 326-330
Number of pages: 5