Reduced-precision Algorithm-based Fault Tolerance for FPGA-implemented Accelerators

Cited by: 0
Authors
Davis, James J. [1 ]
Cheung, Peter Y. K. [1 ]
Affiliations
[1] Imperial Coll London, London SW7 2AZ, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
DOI
10.1007/978-3-319-30481-6_31
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
As the threat of fault susceptibility caused by mechanisms including variation and degradation increases, engineers must give growing consideration to error detection and correction. While the use of common fault tolerance strategies frequently causes the incursion of significant overheads in area, performance and/or power consumption, options exist that buck these trends. In particular, algorithm-based fault tolerance embodies a proven family of low-overhead error mitigation techniques able to be built upon to create self-verifying circuitry. In this paper, we present our research into the application of algorithm-based fault tolerance (ABFT) in FPGA-implemented accelerators at reduced levels of precision. This allows for the introduction of a previously unexplored tradeoff: sacrificing the observability of faults associated with low-magnitude errors for gains in area, performance and efficiency by reducing the bit-widths of logic used for error detection. We describe the implementation of a novel checksum truncation technique, analysing its effects upon overheads and allowed error. Our findings include that bit-width reduction of ABFT circuitry within a fault-tolerant accelerator used for multiplying pairs of 32 × 32 matrices resulted in the reduction of incurred area overhead by 16.7% and recovery of 8.27% of timing model f_max. These came at the cost of introducing average and maximum absolute output errors of 0.430% and 0.927%, respectively, of the maximum absolute output value under transient fault injection.
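The tradeoff described in the abstract can be illustrated with a small software sketch. This is not the authors' circuit or their truncation technique, whose details the abstract does not give. It assumes the classic column-checksum form of ABFT for matrix multiplication, integer data, and a comparison that simply drops the low-order bits of both checksums. The function names and the `dropped_bits` parameter are hypothetical.

```python
def matmul(a, b):
    """Plain integer matrix product C = A * B (lists of lists)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def column_checksums(a, b):
    """Expected column sums of A * B, computed without forming the product.

    The column sums of A form a row vector; multiplying that vector by B
    gives the sum of each column of C.
    """
    inner, cols = len(b), len(b[0])
    col_sum_a = [sum(row[k] for row in a) for k in range(inner)]
    return [sum(col_sum_a[k] * b[k][j] for k in range(inner)) for j in range(cols)]


def flag_faulty_columns(c, expected, dropped_bits):
    """Return indices of columns whose checksum disagrees with `expected`.

    Assumption: reduced precision is modelled by discarding the lowest
    `dropped_bits` bits of both values before comparing. Errors of magnitude
    2**dropped_bits or more are always caught; smaller ones may go unseen.
    """
    flagged = []
    for j, exp in enumerate(expected):
        observed = sum(row[j] for row in c)
        if (observed >> dropped_bits) != (exp >> dropped_bits):
            flagged.append(j)
    return flagged


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(a, b)
expected = column_checksums(a, b)

c[0][1] += 1  # inject a low-magnitude error into one output element
print(flag_faulty_columns(c, expected, 0))  # full-width check: [1]
print(flag_faulty_columns(c, expected, 4))  # 4 bits dropped: []
```

With the full-width comparison the injected error is flagged, while the truncated comparison lets it through. In hardware, the narrower comparison would correspond to narrower checking logic, which is the kind of area and timing saving the abstract reports.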
Pages: 361-368
Number of pages: 8
Related papers
50 in total
  • [21] Algorithm-based fault tolerance applied to high performance computing
    Bosilca, George
    Delmas, Remi
    Dongarra, Jack
    Langou, Julien
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2009, 69 (04) : 410 - 416
  • [22] Efficacy and Efficiency of Algorithm-Based Fault-Tolerance on GPUs
    Wunderlich, Hans-Joachim
    Braun, Claus
    Raider, Sebastian
    PROCEEDINGS OF THE 2013 IEEE 19TH INTERNATIONAL ON-LINE TESTING SYMPOSIUM (IOLTS), 2013 : 240 - 243
  • [23] CONSTRUCTION OF CHECK SETS FOR ALGORITHM-BASED FAULT-TOLERANCE
    GU, DC
    ROSENKRANTZ, DJ
    RAVI, SS
    IEEE TRANSACTIONS ON COMPUTERS, 1994, 43 (06) : 641 - 650
  • [24] BOUNDS ON ALGORITHM-BASED FAULT TOLERANCE IN MULTIPLE PROCESSOR SYSTEMS
    BANERJEE, P
    ABRAHAM, JA
    IEEE TRANSACTIONS ON COMPUTERS, 1986, 35 (04) : 296 - 306
  • [25] Algorithm-Based Fault Tolerance for Fail-Stop Failures
    Chen, Zizhong
    Dongarra, Jack
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2008, 19 (12) : 1628 - 1641
  • [26] ALMOST CERTAIN FAULT-DIAGNOSIS THROUGH ALGORITHM-BASED FAULT-TOLERANCE
    BLOUGH, DM
    PELC, A
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 1994, 5 (05) : 532 - 539
  • [27] Algorithm-based Fault-Tolerance on Many-Core Architectures
    Braun, Claus
    Wunderlich, Hans-Joachim
    IT-INFORMATION TECHNOLOGY, 2010, 52 (04) : 209 - 215
  • [28] BOUNDS ON ALGORITHM-BASED FAULT TOLERANCE IN MULTIPLE PROCESSOR SYSTEMS.
    Banerjee, Prithviraj
    Abraham, Jacob A.
    IEEE Transactions on Computers, 1986, C-35 (04) : 296 - 306
  • [29] Efficient techniques for the analysis of algorithm-based fault tolerance (ABFT) schemes
    Nair, VSS
    Abraham, JA
    Banerjee, P
    IEEE TRANSACTIONS ON COMPUTERS, 1996, 45 (04) : 499 - 503
  • [30] ALGORITHM-BASED FAULT TOLERANCE FOR MATRIX-INVERSION WITH MAXIMUM PIVOTING
    YEH, YM
    FENG, TY
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 1992, 14 (04) : 373 - 389