Graceful degradation in algorithm-based fault tolerant multiprocessor systems

Cited by: 9
Authors
Yajnik, S [1 ]
Jha, NK [1 ]
Affiliation
[1] PRINCETON UNIV, DEPT ELECT ENGN, PRINCETON, NJ 08544 USA
Keywords
algorithm-based fault tolerance; concurrent error detection; concurrent fault location; fault diagnosis; graceful degradation; transient faults
DOI
10.1109/71.577256
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Algorithm-based fault tolerance (ABFT) is a technique that improves the reliability of a multiprocessor system by giving it concurrent error detection and fault location capability. It encodes data at the system level and modifies the algorithm to operate on the encoded data, so that both transient and permanent faults in any processor are exposed. Previous work in this area addresses only fault detection and location. However, if spare processors are not available, then once a faulty processor has been located, the work originally assigned to it must be remapped to nonfaulty processors in the system in such a way that the system's fault tolerance capability is maintained with as little performance degradation as possible. In this paper, we propose an integrated deterministic solution to this problem that combines concurrent error detection and fault location with graceful degradation. No previous deterministic ABFT method exists for designing general t-fault locating systems, even for t = 1. We propose a general method for designing one-fault locating/s-fault detecting systems. We use an extended model for representing ABFT systems. This model treats the processors that compute the checks as part of the ABFT system, so that faults in the check-computing processors can also be detected and located using a simple diagnosis algorithm, and the checks can be remapped to other nonfaulty processors in the system.
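To make the idea of "encoding data and operating on the encoded data" concrete, the following is a minimal sketch of the well-known row and column checksum scheme for matrix multiplication. It is an illustration only and not the design method proposed in the paper; the function names (`matmul`, `add_column_checksum`, `add_row_checksum`, `locate_error`) are hypothetical, and the sketch assumes integer data and at most one corrupted data element.

```python
# Illustrative ABFT sketch: checksum-encoded matrix multiplication.
# Assumption: this is the classic row/column checksum idea, NOT the
# paper's method. All names here are hypothetical.

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def add_column_checksum(a):
    """Append one extra row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append to every row one extra entry holding the sum of that row."""
    return [row + [sum(row)] for row in b]

def locate_error(c_full):
    """Return (row, col) of a single corrupted data element, else None.

    c_full is the product of a column-checksum matrix and a
    row-checksum matrix, so its last row and last column are checks.
    """
    n = len(c_full) - 1       # number of data rows
    m = len(c_full[0]) - 1    # number of data columns
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return (bad_rows[0], bad_cols[0])
    return None

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(add_column_checksum(a), add_row_checksum(b))
fault_free = locate_error(c)   # None: all checks agree
c[1][0] += 7                   # simulate a transient fault in one element
located = locate_error(c)      # (1, 0): failing row check meets failing column check
```

In a multiprocessor setting, each data element (or block) would be computed by some processor, so a failed row check intersecting a failed column check points at the processor responsible. The sketch stops at location; remapping that processor's work to nonfaulty processors, which is the subject of the paper, is not shown.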
Pages: 137-153
Page count: 17