A High-dimensional Algorithm-Based Fault Tolerance Scheme

Cited by: 1
Authors
Fu, Xiang [1]
Tang, Hao [1]
Liao, Huimin [1]
Huang, Xin [1]
Xu, Wubiao [1]
Meng, Shiman [1]
Zhang, Weiping [1]
Guo, Luanzheng [2]
Sato, Kento [3]
Affiliations
[1] Nanchang Hangkong Univ, Nanchang, Jiangxi, Peoples R China
[2] Pacific Northwest Natl Lab, Richland, WA USA
[3] RIKEN, RCCS, Kobe, Hyogo, Japan
DOI
10.1109/IPDPSW59300.2023.00061
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Tensor algebra is a powerful tool for high-order data analytics in scientific applications such as finite element analysis, N-body simulation, and quantum chemistry. Many of these applications are critical in terms of correctness and safety. Because they often run on High Performance Computing (HPC) systems, which are susceptible to soft errors caused by cosmic rays, unstable voltage, and similar sources, their execution must be reliable and resilient, and their results highly trustworthy. However, traditional fault tolerance methods such as error-correcting codes cannot protect computations. Checkpointing and redundancy techniques such as triple modular redundancy (TMR) incur high performance overhead, while existing algorithm-based fault tolerance (ABFT) approaches target only 2D linear algebra computations and are inefficient for tensor algebra computations. High-level tensor algebra computations can be decomposed into 2D linear algebra computations and then protected by existing ABFT methods, but this often introduces unacceptable performance overhead. Hence, for the first time, we propose a collection of ABFT algorithms for three fundamental tensor algebra operations. We exploit the algorithmic semantics of these tensor algebra computations to achieve better performance.
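For orientation, the "existing ABFT approaches" for 2D linear algebra mentioned in the abstract are commonly illustrated with checksum-protected matrix multiplication. The sketch below shows that classical 2D idea only; it is not the tensor scheme proposed in this paper. All function names, the tolerance `tol`, and the injected error are assumptions made for illustration.

```python
# Illustrative sketch of classical 2D checksum-based ABFT for C = A @ B.
# Assumption: this is NOT the paper's tensor algorithm; all names are hypothetical.

def matmul(a, b):
    """Plain matrix product on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def with_column_checksum(a):
    """Append one row holding the sum of each column of `a`."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append one column holding the sum of each row of `b`."""
    return [row + [sum(row)] for row in b]

def locate_error(c_full, tol=1e-9):
    """Return (row, col) of a single corrupted data element, else None."""
    m, n = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(m)
                if abs(sum(c_full[i][:n]) - c_full[i][n]) > tol]
    bad_cols = [j for j in range(n)
                if abs(sum(c_full[i][j] for i in range(m)) - c_full[m][j]) > tol]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None

def correct_error(c_full, pos):
    """Rebuild the corrupted element from its row checksum."""
    i, j = pos
    n = len(c_full[0]) - 1
    others = sum(c_full[i][k] for k in range(n) if k != j)
    c_full[i][j] = c_full[i][n] - others

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]

# The product of the checksum-extended inputs carries checksums of C itself.
C_full = matmul(with_column_checksum(A), with_row_checksum(B))
C_full[0][1] += 100.0          # simulated soft error in one element of C
pos = locate_error(C_full)     # mismatching row and column pinpoint the element
correct_error(C_full, pos)
C = [row[:2] for row in C_full[:2]]
```

The checksum row and column cost one extra row of A and one extra column of B. Applying this per 2D slice after unfolding a tensor repeats that cost for every slice, which is the kind of overhead the abstract says motivates tensor-specific schemes.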
Pages: 326-330
Page count: 5
Related papers
50 in total
  • [31] Combinatorial analysis of check set construction for algorithm-based fault tolerance systems
    Wang, DQ
    Zhao, LC
    JOURNAL OF ELECTRONIC TESTING-THEORY AND APPLICATIONS, 1998, 12 (03): : 255 - 260
  • [32] FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks
    Zhao, Kai
    Di, Sheng
    Li, Sihuan
    Liang, Xin
    Zhai, Yujia
    Chen, Jieyang
    Ouyang, Kaiming
    Cappello, Franck
    Chen, Zizhong
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2021, 32 (07) : 1677 - 1689
  • [33] Generalized Algorithm-Based Fault Tolerance: Error correction via Kalman estimation
    Redinbo, GR
    IEEE TRANSACTIONS ON COMPUTERS, 1998, 47 (06) : 639 - 655
  • [34] Combinatorial Analysis of Check Set Construction for Algorithm-Based Fault Tolerance Systems
    De-Qiang Wang
    Lian-Chang Zhao
    Journal of Electronic Testing, 1998, 12 : 255 - 260
  • [35] Numerical Defect Correction as an Algorithm-Based Fault Tolerance Technique for Iterative Solvers
    Oboril, Fabian
    Tahoori, Mehdi B.
    Heuveline, Vincent
    Lukarski, Dimitar
    Weiss, Jan-Philipp
    2011 IEEE 17TH PACIFIC RIM INTERNATIONAL SYMPOSIUM ON DEPENDABLE COMPUTING (PRDC), 2011, : 144 - 153
  • [36] Online Algorithm-Based Fault Tolerance for Cholesky Decomposition on Heterogeneous Systems with GPUs
    Chen, Jieyang
    Liang, Xin
    Chen, Zizhong
    2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS 2016), 2016, : 993 - 1002
  • [37] Experimental Evaluation of GPUs Radiation Sensitivity and Algorithm-Based Fault Tolerance Efficiency
    Rech, P.
    Carro, L.
    PROCEEDINGS OF THE 2013 IEEE 19TH INTERNATIONAL ON-LINE TESTING SYMPOSIUM (IOLTS), 2013, : 244 - 247
  • [38] Rethinking Algorithm-Based Fault Tolerance with a Cooperative Software-Hardware Approach
    Li, Dong
    Chen, Zizhong
    Wu, Panruo
    Vetter, Jeffrey S.
    2013 INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC), 2013,
  • [39] Towards Reliable AI Applications via Algorithm-Based Fault Tolerance on NVDLA
    Sanic, Mustafa Tarik
    Guo, Cong
    Leng, Jingwen
    Guo, Minyi
    Ma, Weiyin
    2022 18TH INTERNATIONAL CONFERENCE ON MOBILITY, SENSING AND NETWORKING, MSN, 2022, : 736 - 743
  • [40] Extending algorithm-based fault tolerance to tolerate fail-stop failures in high performance distributed environments
    Chen, Zizhong
    2008 IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL & DISTRIBUTED PROCESSING, VOLS 1-8, 2008, : 336 - +