REPAIR: Hard-Error Recovery via Re-Execution

Cited by: 0
Authors
Soman, Jyothish [1]
Miralaei, Negar [1]
Mycroft, Alan [1]
Jones, Timothy M. [1]
Affiliations
[1] Univ Cambridge, Comp Lab, Cambridge CB2 1TN, England
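Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract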
Processor reliability at upcoming technology nodes presents significant challenges to designers: increased manufacturing variability, parametric variation and transistor wearout all lead to permanent faults. We present a design to tolerate this impact at the microarchitectural level: a chip with n cores together with one or more shared instruction re-execution units (IRUs). Instructions using a faulty component are identified and re-executed on an IRU. This design incurs no slowdown in the absence of errors and allows continued operation of all n cores after multiple hard errors on one or all cores in the structures protected by our scheme. Experiments show that a single-core chip experiences only a 23% slowdown with 1 error, rising to 43% in the presence of 5 errors. In a 4-core scenario with 4 errors on every core and a shared IRU, REPAIR enables performance of 0.68x of a fully functioning system.
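The routing idea in the abstract (instructions that use a faulty component are sent to a shared IRU, everything else runs on its own core) can be illustrated with a toy model. This is only a sketch under stated assumptions: the abstract does not describe how faulty components are identified or how the IRU is built, so the names `Core`, `dispatch`, `route` and the component labels such as `"alu0"` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Hypothetical core model: an id plus the set of components
    already known to have a permanent (hard) fault."""
    core_id: int
    faulty: set = field(default_factory=set)


def dispatch(core, components):
    """Return "IRU" if the instruction needs any faulty component
    on this core, otherwise "core" (normal execution)."""
    return "IRU" if core.faulty & set(components) else "core"


def route(cores, trace):
    """Count where each instruction in the trace would execute.

    trace is a list of (core_id, components_used) pairs; this format
    is an assumption made for the illustration.
    """
    by_id = {c.core_id: c for c in cores}
    counts = {"core": 0, "IRU": 0}
    for core_id, components in trace:
        counts[dispatch(by_id[core_id], components)] += 1
    return counts


if __name__ == "__main__":
    cores = [Core(0, {"alu0"}), Core(1)]
    trace = [(0, ["alu0"]), (0, ["alu1"]), (1, ["alu0"])]
    print(route(cores, trace))
```

In this toy trace only the first instruction touches a faulty component on its own core, so it alone is routed to the shared IRU. With no faults recorded, every instruction stays on its core, which mirrors the abstract's claim of no slowdown in the absence of errors.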
Pages: 76-79
Page count: 4