Numerical Defect Correction as an Algorithm-Based Fault Tolerance Technique for Iterative Solvers

Cited by: 11

Authors:

Oboril, Fabian [1]

Tahoori, Mehdi B. [1]

Heuveline, Vincent [2]

Lukarski, Dimitar [3]

Weiss, Jan-Philipp [3]

Affiliations:

[1] Karlsruhe Inst Technol KIT, Chair Dependable Nano Comp CDNC, Karlsruhe, Germany

[2] Karlsruhe Inst Technol KIT, Engn Math & Comp Lab EMCL, Karlsruhe, Germany

[3] Karlsruhe Inst Technol KIT, Shared Res Grp New Frontiers High Performance, Comp Exploit Multicore & Coprocessor Technol, Karlsruhe, Germany

Source:

2011 IEEE 17TH PACIFIC RIM INTERNATIONAL SYMPOSIUM ON DEPENDABLE COMPUTING (PRDC) | 2011

Keywords:

algorithm-based fault tolerance; defect correction; conjugated gradient; triple modular redundancy; checkpointing;

DOI:

10.1109/PRDC.2011.26

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

As hardware devices like processor cores and memory sub-systems based on nano-scale technology nodes become more unreliable, the need for fault-tolerant numerical computing engines, as used in many critical applications with long computation/mission times, is becoming pronounced. In this paper, we present an Algorithm-based Fault Tolerance (ABFT) scheme for an iterative linear solver engine based on the Conjugated Gradient method (CG) by taking advantage of numerical defect correction. This method is "pay as you go", meaning that there is practically only a runtime overhead if errors occur and a correction is performed. Our experimental comparison with software-based Triple Modular Redundancy (TMR) clearly shows the runtime benefit of the proposed approach, good fault tolerance and no occurrence of silent data corruption.


Pages: 144-153

Page count: 10
