REPAIR: Hard-Error Recovery via Re-Execution

Authors
Soman, Jyothish [1]
Miralaei, Negar [1]
Mycroft, Alan [1]
Jones, Timothy M. [1]
Affiliations
[1] Univ Cambridge, Comp Lab, Cambridge CB2 1TN, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
DOI
Not available
Chinese Library Classification
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
Processor reliability at upcoming technology nodes presents significant challenges to designers: increased manufacturing variability, parametric variation and transistor wearout all lead to permanent faults. We present a design that tolerates this impact at the microarchitectural level: a chip with n cores together with one or more shared instruction re-execution units (IRUs). Instructions that use a faulty component are identified and re-executed on an IRU. This design incurs no slowdown in the absence of errors and allows continued operation of all n cores after multiple hard errors on one or all cores in the structures protected by our scheme. Experiments show that a single-core chip experiences only a 23% slowdown with 1 error, rising to 43% in the presence of 5 errors. In a 4-core scenario with 4 errors on every core and a shared IRU, REPAIR enables performance of 0.68x of a fully functioning system.
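The routing idea in the abstract (instructions that use a faulty component are identified and sent to a shared IRU) can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the hardware design from the paper: the `Core` class, the `dispatch` function, and the unit names such as `alu0` and `alu1` are all hypothetical, and the model ignores timing, queuing at the IRU, and how faults are detected.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    # Hypothetical model of one core; not taken from the paper.
    core_id: int
    # Names of components assumed to have a known hard fault.
    faulty_units: set = field(default_factory=set)


def dispatch(core, instructions):
    """Split (opcode, unit) pairs into locally executed opcodes and
    (core_id, opcode) pairs forwarded to the shared IRU."""
    local, to_iru = [], []
    for opcode, unit in instructions:
        if unit in core.faulty_units:
            # The instruction would use a faulty component: re-execute on the IRU.
            to_iru.append((core.core_id, opcode))
        else:
            # No faulty component involved: execute on the core as usual.
            local.append(opcode)
    return local, to_iru


core = Core(core_id=0, faulty_units={"alu1"})
program = [("add", "alu0"), ("mul", "alu1"), ("load", "lsu0"), ("sub", "alu1")]
local, to_iru = dispatch(core, program)
print(local)   # opcodes that stay on the core
print(to_iru)  # opcodes forwarded to the shared IRU
```

With an empty `faulty_units` set every instruction stays on the core, which mirrors the abstract's claim that the scheme adds no slowdown when no errors are present.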
Pages: 76-79
Page count: 4