Reduced-precision Algorithm-based Fault Tolerance for FPGA-implemented Accelerators

Cited by: 0

Authors:

Davis, James J. [1]

Cheung, Peter Y. K. [1]

Affiliations:

[1] Imperial College London, London SW7 2AZ, England

Source:

APPLIED RECONFIGURABLE COMPUTING, ARC 2016 | 2016

Funding:

Engineering and Physical Sciences Research Council (EPSRC), UK;

Keywords:

DOI:

10.1007/978-3-319-30481-6_31

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

As the threat of fault susceptibility caused by mechanisms including variation and degradation increases, engineers must give growing consideration to error detection and correction. While the use of common fault tolerance strategies frequently causes the incursion of significant overheads in area, performance and/or power consumption, options exist that buck these trends. In particular, algorithm-based fault tolerance embodies a proven family of low-overhead error mitigation techniques able to be built upon to create self-verifying circuitry. In this paper, we present our research into the application of algorithm-based fault tolerance (ABFT) in FPGA-implemented accelerators at reduced levels of precision. This allows for the introduction of a previously unexplored tradeoff: sacrificing the observability of faults associated with low-magnitude errors for gains in area, performance and efficiency by reducing the bit-widths of logic used for error detection. We describe the implementation of a novel checksum truncation technique, analysing its effects upon overheads and allowed error. Our findings include that bit-width reduction of ABFT circuitry within a fault-tolerant accelerator used for multiplying pairs of 32 × 32 matrices resulted in the reduction of incurred area overhead by 16.7% and recovery of 8.27% of timing model f_max. These came at the cost of introducing average and maximum absolute output errors of 0.430% and 0.927%, respectively, of the maximum absolute output value under transient fault injection.
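
The tradeoff named in the abstract (giving up visibility of low-magnitude errors by narrowing the error-detection logic) can be illustrated with a minimal software sketch. This is not the authors' implementation: it assumes a classic column-checksum ABFT check for integer matrix multiplication, and it models truncation simply by discarding the low-order bits of full-precision checksums before comparing them, rather than by modelling reduced-width hardware. All names (`matmul`, `column_sums`, `mismatched_columns`, `drop_bits`) are hypothetical.

```python
def matmul(a, b):
    """Plain integer matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def column_sums(m):
    """Sum each column of a matrix."""
    return [sum(col) for col in zip(*m)]


def mismatched_columns(a, b, c, drop_bits=0):
    """Return the columns of c whose checksum disagrees with A*B.

    Assumption for illustration: truncation is modelled by shifting out
    the drop_bits least-significant bits of both checksums.
    """
    # Expected column checksums of A*B: (column sums of A) times B.
    expected = matmul([column_sums(a)], b)[0]
    # Checksums actually observed in the (possibly faulty) result c.
    observed = column_sums(c)
    return [j for j, (e, o) in enumerate(zip(expected, observed))
            if (e >> drop_bits) != (o >> drop_bits)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(A, B)                      # [[19, 22], [43, 50]]

small = [row[:] for row in C]
small[0][1] += 1                      # low-magnitude injected error
large = [row[:] for row in C]
large[0][1] += 64                     # high-magnitude injected error

print(mismatched_columns(A, B, C))             # [] (fault-free)
print(mismatched_columns(A, B, small))         # [1] (full-width check)
print(mismatched_columns(A, B, small, 3))      # [] (error hidden by truncation)
print(mismatched_columns(A, B, large, 3))      # [1] (still detected)
```

In this simplified model, any checksum discrepancy of at least 2^drop_bits is always detected, while smaller discrepancies are detected only when they happen to cross a truncation boundary.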


Pages: 361-368

Number of pages: 8

