Numerical Defect Correction as an Algorithm-Based Fault Tolerance Technique for Iterative Solvers

Cited by: 11
Authors
Oboril, Fabian [1]
Tahoori, Mehdi B. [1]
Heuveline, Vincent [2]
Lukarski, Dimitar [3]
Weiss, Jan-Philipp [3]
Affiliations
[1] Karlsruhe Inst Technol KIT, Chair Dependable Nano Comp CDNC, Karlsruhe, Germany
[2] Karlsruhe Inst Technol KIT, Engn Math & Comp Lab EMCL, Karlsruhe, Germany
[3] Karlsruhe Inst Technol KIT, Shared Res Grp New Frontiers High Performance, Comp Exploit Multicore & Coprocessor Technol, Karlsruhe, Germany
Keywords
algorithm-based fault tolerance; defect correction; conjugated gradient; triple modular redundancy; checkpointing;
DOI
10.1109/PRDC.2011.26
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
As hardware devices such as processor cores and memory sub-systems based on nano-scale technology nodes become more unreliable, the need for fault-tolerant numerical computing engines, as used in many critical applications with long computation/mission times, is becoming more pronounced. In this paper, we present an Algorithm-Based Fault Tolerance (ABFT) scheme for an iterative linear solver engine based on the Conjugate Gradient (CG) method that takes advantage of numerical defect correction. The method is "pay as you go": there is practically no runtime overhead unless errors occur and a correction is performed. Our experimental comparison with software-based Triple Modular Redundancy (TMR) clearly shows the runtime benefit of the proposed approach, good fault tolerance, and no occurrence of silent data corruption.
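The general idea of pairing CG with defect correction can be illustrated with a minimal sketch. This is not the authors' implementation; it only shows the textbook form of defect correction (recompute the true defect d = b - A x, solve A e = d, update x with e) wrapped around a plain CG solver. The function names, the tolerances, and the fault model (a single additive corruption of the iterate, controlled by the hypothetical `fault_at` parameter) are assumptions made for illustration.

```python
def matvec(A, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cg(A, b, x0, tol=1e-12, max_iter=200, fault_at=None):
    """Plain CG for a symmetric positive definite A.

    fault_at (hypothetical parameter): iteration index at which the iterate
    is corrupted once, mimicking a soft error that the recursively updated
    residual does not see.
    """
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
    p = list(r)
    rs = dot(r, r)
    for k in range(max_iter):
        if rs ** 0.5 <= tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if k == fault_at:
            x[0] += 1.0  # assumed fault model: one corrupted entry of x
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def solve_with_defect_correction(A, b, tol=1e-8, max_corrections=5, fault_at=None):
    """Run CG, then check the true defect and correct only if it is too large."""
    n = len(b)
    x = cg(A, b, [0.0] * n, fault_at=fault_at)
    corrections = 0
    for _ in range(max_corrections):
        d = [bi - ai for bi, ai in zip(b, matvec(A, x))]  # true defect b - A x
        if dot(d, d) ** 0.5 <= tol:
            break  # no error detected: no correction solve is paid for
        e = cg(A, d, [0.0] * n)  # correction solve A e = d
        x = [xi + ei for xi, ei in zip(x, e)]
        corrections += 1
    return x, corrections


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
    b = [1.0, 2.0, 3.0]
    print(solve_with_defect_correction(A, b, fault_at=1))
```

In this sketch the fault-free path costs one extra matrix-vector product for the defect check, and a correction solve runs only when the check fails, which is the sense in which such a scheme can be "pay as you go".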
Pages: 144-153
Number of pages: 10