Reduced-precision Algorithm-based Fault Tolerance for FPGA-implemented Accelerators

Cited by: 0

Authors:

Davis, James J. [1]

Cheung, Peter Y. K. [1]

Affiliation:

[1] Imperial Coll London, London SW7 2AZ, England

Source:

APPLIED RECONFIGURABLE COMPUTING, ARC 2016 | 2016

Funding:

UK Engineering and Physical Sciences Research Council (EPSRC)

Keywords:

DOI:

10.1007/978-3-319-30481-6_31

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

As the threat of fault susceptibility caused by mechanisms including variation and degradation increases, engineers must give growing consideration to error detection and correction. While the use of common fault tolerance strategies frequently causes the incursion of significant overheads in area, performance and/or power consumption, options exist that buck these trends. In particular, algorithm-based fault tolerance embodies a proven family of low-overhead error mitigation techniques able to be built upon to create self-verifying circuitry. In this paper, we present our research into the application of algorithm-based fault tolerance (ABFT) in FPGA-implemented accelerators at reduced levels of precision. This allows for the introduction of a previously unexplored tradeoff: sacrificing the observability of faults associated with low-magnitude errors for gains in area, performance and efficiency by reducing the bit-widths of logic used for error detection. We describe the implementation of a novel checksum truncation technique, analysing its effects upon overheads and allowed error. Our findings include that bit-width reduction of ABFT circuitry within a fault-tolerant accelerator used for multiplying pairs of 32 × 32 matrices resulted in the reduction of incurred area overhead by 16.7% and recovery of 8.27% of timing model f_max. These came at the cost of introducing average and maximum absolute output errors of 0.430% and 0.927%, respectively, of the maximum absolute output value under transient fault injection.
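
The tradeoff described in the abstract can be illustrated with a small behavioural sketch. This is not the authors' circuit or their truncation scheme: it is a software model, under the assumption of integer matrices and a classic column-checksum ABFT check, in which checksum mismatches smaller than 2**dropped_bits are treated as unobservable. The function names and the `dropped_bits` parameter are hypothetical and chosen only for illustration.

```python
def matmul(a, b):
    """Plain integer matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def with_column_checksum(a):
    """Return a copy of a with one extra row holding the sum of each column."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def faulty_columns(c_checked, dropped_bits=0):
    """Return indices of columns whose checksum mismatch is observable.

    c_checked is the product of a column-checksummed A with B, so its last
    row should equal the column sums of the rows above it. Mismatches with
    magnitude below 2**dropped_bits are ignored; this is an assumed, coarse
    stand-in for checking with reduced bit-width logic.
    """
    *data, checksum = c_checked
    flagged = []
    for j, expected in enumerate(checksum):
        diff = abs(sum(row[j] for row in data) - expected)
        if diff >> dropped_bits:  # nonzero only when diff >= 2**dropped_bits
            flagged.append(j)
    return flagged


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(with_column_checksum(a), b)  # last row is the predicted checksum

c[0][1] += 3  # inject a low-magnitude error into one output element
full_precision = faulty_columns(c)                   # error is observable
reduced_precision = faulty_columns(c, dropped_bits=2)  # error goes unnoticed
```

With `dropped_bits=0` the injected error of 3 is flagged in column 1, while with `dropped_bits=2` it falls below the detection threshold of 4 and is missed. That is the kind of lost observability the abstract trades for smaller checking logic; the area and timing figures it reports cannot be reproduced by a software model like this one.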


Pages: 361-368

Number of pages: 8

