Reduced-precision Algorithm-based Fault Tolerance for FPGA-implemented Accelerators

Cited by: 0
Authors
Davis, James J. [1 ]
Cheung, Peter Y. K. [1 ]
Affiliations
[1] Imperial Coll London, London SW7 2AZ, England
Funding
Engineering and Physical Sciences Research Council (UK)
DOI
10.1007/978-3-319-30481-6_31
Chinese Library Classification
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
As the threat of fault susceptibility caused by mechanisms including variation and degradation increases, engineers must give growing consideration to error detection and correction. While the use of common fault tolerance strategies frequently causes the incursion of significant overheads in area, performance and/or power consumption, options exist that buck these trends. In particular, algorithm-based fault tolerance embodies a proven family of low-overhead error mitigation techniques able to be built upon to create self-verifying circuitry. In this paper, we present our research into the application of algorithm-based fault tolerance (ABFT) in FPGA-implemented accelerators at reduced levels of precision. This allows for the introduction of a previously unexplored tradeoff: sacrificing the observability of faults associated with low-magnitude errors for gains in area, performance and efficiency by reducing the bit-widths of logic used for error detection. We describe the implementation of a novel checksum truncation technique, analysing its effects upon overheads and allowed error. Our findings include that bit-width reduction of ABFT circuitry within a fault-tolerant accelerator used for multiplying pairs of 32 × 32 matrices resulted in the reduction of incurred area overhead by 16.7% and recovery of 8.27% of timing model f_max. These came at the cost of introducing average and maximum absolute output errors of 0.430% and 0.927%, respectively, of the maximum absolute output value under transient fault injection.
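The tradeoff described in the abstract can be illustrated with a small software model. The sketch below is not the authors' hardware design: it assumes the classic column-checksum form of ABFT for integer matrix multiplication, and it models a narrower comparator by ignoring checksum discrepancies smaller than 2**drop_bits. The names matmul, column_checksums, abft_check and the drop_bits parameter are hypothetical, chosen only for this illustration.

```python
def matmul(a, b):
    """Plain integer matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def column_checksums(m):
    """Sum each column of m, i.e. e^T m for an all-ones vector e."""
    return [sum(col) for col in zip(*m)]


def abft_check(a, b, c, drop_bits=0):
    """Return True if c passes the column-checksum test for c = a * b.

    drop_bits is a hypothetical knob (an assumption of this sketch):
    discrepancies below 2**drop_bits are ignored, which stands in for
    error-detection logic built at a reduced bit-width.
    """
    # Expected checksums (e^T a) * b, computed without reading c.
    expected = matmul([column_checksums(a)], b)[0]
    # Checksums actually observed on the (possibly faulty) result.
    actual = column_checksums(c)
    return all((abs(e - s) >> drop_bits) == 0
               for e, s in zip(expected, actual))


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(a, b)

# Inject a low-magnitude error (+3) into one output element.
faulty = [row[:] for row in c]
faulty[0][1] += 3

full_width_detects = not abft_check(a, b, faulty)               # full-width check
reduced_detects = not abft_check(a, b, faulty, drop_bits=2)     # reduced-width check
```

In this model the full-width check flags the injected error, while the check with drop_bits=2 lets it through because its magnitude is below 4. That is the kind of loss of observability for low-magnitude errors that the abstract trades against area and timing gains; the overhead figures themselves come from the paper's FPGA implementation and cannot be reproduced by a software sketch.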
Pages: 361-368
Number of pages: 8