REPAIR: Hard-Error Recovery via Re-Execution

Cited by: 0
Authors
Soman, Jyothish [1]
Miralaei, Negar [1]
Mycroft, Alan [1]
Jones, Timothy M. [1]
Affiliations
[1] Univ Cambridge, Comp Lab, Cambridge CB2 1TN, England
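Funding
Engineering and Physical Sciences Research Council (EPSRC), UK
Keywords
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract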
Processor reliability at upcoming technology nodes presents significant challenges to designers from increased manufacturing variability, parametric variation and transistor wearout leading to permanent faults. We present a design to tolerate this impact at the microarchitectural level: a chip with n cores together with one or more shared instruction re-execution units (IRUs). Instructions using a faulty component are identified and re-executed on an IRU. This design incurs no slowdown in the absence of errors and allows continued operation of all n cores after multiple hard errors on one or all cores in the structures protected by our scheme. Experiments show that a single-core chip experiences only a 23% slowdown with 1 error, rising to 43% in the presence of 5 errors. In a 4-core scenario with 4 errors on every core and a shared IRU, REPAIR enables performance of 0.68x of a fully functioning system.
Pages: 76-79
Page count: 4