REPAIR: Hard-Error Recovery via Re-Execution

Cited by: 0

Authors:

Soman, Jyothish [1]

Miralaei, Negar [1]

Mycroft, Alan [1]

Jones, Timothy M. [1]

Affiliation:

[1] Univ Cambridge, Comp Lab, Cambridge CB2 1TN, England

Source:

PROCEEDINGS OF THE 2015 IEEE INTERNATIONAL SYMPOSIUM ON DEFECT AND FAULT TOLERANCE IN VLSI AND NANOTECHNOLOGY SYSTEMS (DFTS) | 2015

Funding:

UK Engineering and Physical Sciences Research Council (EPSRC)

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP301 [Theory, methods]

Discipline classification code:

081202

Abstract:

Processor reliability at upcoming technology nodes presents significant challenges to designers from increased manufacturing variability, parametric variation and transistor wearout leading to permanent faults. We present a design to tolerate this impact at the microarchitectural level: a chip with n cores together with one or more shared instruction re-execution units (IRUs). Instructions using a faulty component are identified and re-executed on an IRU. This design incurs no slowdown in the absence of errors and allows continued operation of all n cores after multiple hard errors on one or all cores in the structures protected by our scheme. Experiments show that a single-core chip experiences only a 23% slowdown with 1 error, rising to 43% in the presence of 5 errors. In a 4-core scenario with 4 errors on every core and a shared IRU, REPAIR achieves 0.68x the performance of a fully functioning system.


Pages: 76-79

Page count: 4

50 records in total

[21] High Performance Fault Tolerance Through Predictive Instruction Re-Execution
Soman, Jyothish
Jones, Timothy M.
2017 IEEE INTERNATIONAL SYMPOSIUM ON DEFECT AND FAULT TOLERANCE IN VLSI AND NANOTECHNOLOGY SYSTEMS (DFT), 2017, : 147 - 150
[22] Re-execution of distributed programs to detect bugs hidden by racing messages
Kilgore, R
Chase, C
THIRTIETH HAWAII INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES, VOL 1: SOFTWARE TECHNOLOGY AND ARCHITECTURE, 1997, : 423 - 432
[23] Finding missing synchronization in a distributed computation using controlled re-execution
Mittal, N
Garg, VK
DISTRIBUTED COMPUTING, 2004, 17 (02) : 107 - 130
[24] Partial re-execution: Reconciling transactions to increase concurrency in object-bases
Hadaegh, AR
Barker, K
INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED PROCESSING TECHNIQUES AND APPLICATIONS, VOLS I-V, PROCEEDINGS, 1999, : 1469 - 1475
[25] Using instruction result locality and re-execution to mitigate silent data corruptions
Tajary, Alireza
Zarandi, Hamid R.
MICROELECTRONICS RELIABILITY, 2016, 62 : 178 - 190
[26] Architectural Core Salvaging in a Multi-Core Processor for Hard-Error Tolerance
Powell, Michael D.
Biswas, Arijit
Gupta, Shantanu
Mukherjee, Shubhendu S.
ISCA 2009: 36TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, 2009, : 93 - 104
[27] Store Vulnerability Window (SVW): Re-execution filtering for enhanced load optimization
Roth, A
32ND INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, PROCEEDINGS, 2005, : 458 - 468
[28] Exploring the Potential for Collaborative Data Compression and Hard-Error Tolerance in PCM Memories
Jadidi, Amin
Arjomand, Mohammad
Tavana, Mohammad Khavari
Kaeli, David R.
Kandemir, Mahmut T.
Das, Chita R.
2017 47TH ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS (DSN), 2017, : 85 - 96
[29] HTFabric: A Fast Re-ordering and Parallel Re-execution Method for a High-Throughput Blockchain
Song, Jaeyub
Jeong, Juyeong
Lee, Jemin
Na, Inju
Kim, Min-Soo
PROCEEDINGS OF THE 33RD ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2024, 2024, : 2118 - 2127
[30] Task-Level Re-Execution Framework for Improving Fault Tolerance on Symmetry Multiprocessors
Baek, Hyeongboo
Lee, Jaewoo
SYMMETRY-BASEL, 2019, 11 (05):
