Numerical Defect Correction as an Algorithm-Based Fault Tolerance Technique for Iterative Solvers

Cited by: 11
Authors
Oboril, Fabian [1]
Tahoori, Mehdi B. [1]
Heuveline, Vincent [2]
Lukarski, Dimitar [3]
Weiss, Jan-Philipp [3]
Affiliations
[1] Karlsruhe Inst Technol KIT, Chair Dependable Nano Comp CDNC, Karlsruhe, Germany
[2] Karlsruhe Inst Technol KIT, Engn Math & Comp Lab EMCL, Karlsruhe, Germany
[3] Karlsruhe Inst Technol KIT, Shared Res Grp New Frontiers High Performance, Comp Exploit Multicore & Coprocessor Technol, Karlsruhe, Germany
Keywords
algorithm-based fault tolerance; defect correction; conjugated gradient; triple modular redundancy; checkpointing;
DOI
10.1109/PRDC.2011.26
Chinese Library Classification (CLC)
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
As hardware devices like processor cores and memory sub-systems based on nano-scale technology nodes become more unreliable, the need for fault-tolerant numerical computing engines, as used in many critical applications with long computation/mission times, is becoming more pronounced. In this paper, we present an Algorithm-based Fault Tolerance (ABFT) scheme for an iterative linear solver engine based on the Conjugated Gradient method (CG) that takes advantage of numerical defect correction. This method is "pay as you go", meaning that there is practically only a runtime overhead if errors occur and a correction is performed. Our experimental comparison with software-based Triple Modular Redundancy (TMR) clearly shows the runtime benefit of the proposed approach, good fault tolerance and no occurrence of silent data corruption.
Pages: 144-153
Page count: 10
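
The abstract gives no algorithmic details, so the following is only a minimal sketch of the general idea it names, not the authors' scheme. It assumes a plain conjugate gradient loop in which the recursively updated residual is compared against the true residual `b - A x` every few iterations; when the two disagree, the iteration restarts from the current iterate with the true residual, which amounts to solving the defect equation `A e = b - A x` and adding `e` to `x`. The function name `cg_defect_correction`, the parameters `check_every`, `check_tol` and `fault_at`, and the injected fault are all hypothetical choices made for illustration. Unlike the scheme in the abstract, this sketch pays one extra matrix-vector product per check even when no error occurs.

```python
import math

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def true_residual(A, b, x):
    return [bi - axi for bi, axi in zip(b, matvec(A, x))]

def cg_defect_correction(A, b, tol=1e-10, check_tol=1e-8,
                         check_every=4, max_iter=200, fault_at=None):
    """CG for symmetric positive definite A and nonzero b (illustrative only).

    fault_at is a hypothetical hook: at that iteration one entry of x is
    corrupted to mimic a soft error in the solution vector.
    """
    norm_b = math.sqrt(dot(b, b))
    x = [0.0] * len(b)
    r = list(b)              # residual for the initial guess x = 0
    p = list(r)
    rs = dot(r, r)
    corrections = 0
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if k == fault_at:
            x[0] += 1.0e3    # injected fault (assumption for the demo)
        rs_new = dot(r, r)
        converged = math.sqrt(rs_new) <= tol * norm_b
        # Check periodically, and always before accepting convergence,
        # so a corrupted x is never returned silently.
        if converged or k % check_every == 0:
            rt = true_residual(A, b, x)
            gap = math.sqrt(sum((u - v) ** 2 for u, v in zip(rt, r)))
            if gap > check_tol * norm_b:
                # Defect correction: restart CG on A e = b - A x.
                r, p, rs = rt, list(rt), dot(rt, rt)
                corrections += 1
                if math.sqrt(rs) <= tol * norm_b:
                    return x, corrections
                continue
        if converged:
            return x, corrections
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x, corrections

# Demo system: 1D Poisson matrix (tridiagonal 2, -1), right-hand side of ones.
n = 20
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
      for j in range(n)] for i in range(n)]
b = [1.0] * n
x_clean, fixes_clean = cg_defect_correction(A, b)
x_fault, fixes_fault = cg_defect_correction(A, b, fault_at=3)
```

In this demo the fault-free run should report no corrections, while the run with the injected fault should detect the mismatch at the next check, restart once, and still converge to the same solution. The check interval trades detection latency against the cost of the extra residual evaluation.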