MPI-FT: Portable fault tolerance scheme for MPI

Cited by: 0
Authors
Louca, S. [1]
Neophytou, N. [1]
Lachanas, A. [1]
Evripidou, P. [1]
Affiliation
[1] Dept. of Computer Science, University of Cyprus, P.O. Box 537, CY-1678 Nicosia, Cyprus
Keywords
Computer simulation - Computer system recovery - Data communication systems - Distributed computer systems - Fault tolerant computer systems - Monitoring - Telecommunication traffic
DOI
10.1142/s0129626400000342
Abstract
In this paper, we propose the design and development of a fault-tolerance and recovery scheme for the Message Passing Interface (MPI). The proposed scheme consists of a detection mechanism for process failures and a recovery mechanism. Two cases are considered, both assuming the existence of a monitoring process, the Observer, which triggers the recovery procedure when a failure occurs. In the first case, each process keeps a buffer of its own message traffic for use in case of failure, while the implementor uses periodic tests to check for failure notifications from the Observer. The recovery function simulates all communication between the surviving processes and the dead one by re-sending to the replacement process all messages that were destined for the dead process. In the second case, the Observer receives and stores all message traffic, and sends to the replacement process all the buffered messages destined for the dead process. Solutions are provided to the dead communicator problem caused by the death of a process. A description of the prototype is provided, along with the results of experiments evaluating its efficiency and performance.
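The first case described in the abstract (sender-side buffering with replay to a replacement process) can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the paper's implementation: it uses in-memory lists in place of real MPI calls, and the names `SenderLog`, `recover`, `mailboxes`, and `demo` are hypothetical.

```python
from collections import defaultdict


class SenderLog:
    """Hypothetical per-process buffer of outgoing messages, keyed by destination rank."""

    def __init__(self, rank):
        self.rank = rank
        self.sent = defaultdict(list)  # dest rank -> payloads, in send order

    def send(self, dest, payload, mailboxes):
        # Record the message locally before "delivering" it.
        self.sent[dest].append(payload)
        mailboxes[dest].append((self.rank, payload))

    def replay_to(self, dead_rank, mailboxes):
        # Re-send everything this process ever sent to the dead rank.
        for payload in self.sent[dead_rank]:
            mailboxes[dead_rank].append((self.rank, payload))


def recover(dead_rank, logs, mailboxes):
    """Stand-in for the recovery step triggered by the Observer (an assumption)."""
    mailboxes[dead_rank] = []  # the replacement process starts with an empty mailbox
    for log in logs.values():
        if log.rank != dead_rank:
            log.replay_to(dead_rank, mailboxes)


def demo():
    mailboxes = defaultdict(list)
    logs = {rank: SenderLog(rank) for rank in range(3)}
    logs[0].send(2, "a", mailboxes)
    logs[1].send(2, "b", mailboxes)
    logs[0].send(1, "c", mailboxes)
    logs[0].send(2, "d", mailboxes)
    before = list(mailboxes[2])      # what rank 2 had received before it died
    recover(2, logs, mailboxes)      # rank 2 is replaced and messages are replayed
    return before, mailboxes[2]
```

In this sketch the replacement receives the same set of messages as the dead process, and the order of messages from each individual sender is preserved. The interleaving across different senders may differ from the original run, which a real scheme would need to account for.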
Pages: 371 - 382
Related papers
50 in total
  • [1] A portable fault-tolerance scheme for MPI
    Louca, S
    Neophytou, N
    Evripidou, P
    INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED PROCESSING TECHNIQUES AND APPLICATIONS, VOLS I-IV, PROCEEDINGS, 1998, : 690 - 697
  • [2] Fault Tolerance for Conjugate Gradient Solver Based on FT-MPI
    Zhang, Weizhe
    He, Hui
    STUDIES IN INFORMATICS AND CONTROL, 2013, 22 (01): : 51 - 60
  • [3] Network fault tolerance in open MPI
    Shipman, Galen A.
    Graham, Richard L.
    Bosilca, George
    EURO-PAR 2007 PARALLEL PROCESSING, PROCEEDINGS, 2007, 4641 : 868 - +
  • [4] FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world
    Fagg, GE
    Dongarra, JJ
    RECENT ADVANCES IN PARALLEL VIRTUAL MACHINE AND MESSAGE PASSING INTERFACE, PROCEEDINGS, 2000, 1908 : 346 - 353
  • [5] MATCH: An MPI Fault Tolerance Benchmark Suite
    Guo, Luanzheng
    Georgakoudis, Giorgis
    Parasyris, Konstantinos
    Laguna, Ignacio
    Li, Dong
    2020 IEEE INTERNATIONAL SYMPOSIUM ON WORKLOAD CHARACTERIZATION (IISWC 2020), 2020, : 60 - 71
  • [6] Extending the MPI Stages Model of Fault Tolerance
    Schafer, Derek
    Laguna, Ignacio
    Skjellum, Anthony
    Sultana, Nawrin
    Mohror, Kathryn
    PROCEEDINGS OF THE EXASCALE MPI WORKSHOP (EXAMPI 2020), 2020, : 52 - 61
  • [7] Network fault tolerance in LA-MPI
    Aulwes, RT
    Daniel, DJ
    Desai, NN
    Graham, RL
    Risinger, LD
    Sukalski, MW
    Taylor, MA
    RECENT ADVANCES IN PARALLEL VIRTUAL MACHINE AND MESSAGE PASSING INTERFACE, 2003, 2840 : 344 - 351
  • [8] PC/MPI: Design and implementation of a portable MPI checkpointer
    Ahn, Sunil
    Kim, Junghwan
    Han, Sangyong
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2003, 2840 : 302 - 308
  • [9] PC/MPI: Design and implementation of a portable MPI checkpointer
    Ahn, S
    Kim, J
    Han, SY
    RECENT ADVANCES IN PARALLEL VIRTUAL MACHINE AND MESSAGE PASSING INTERFACE, 2003, 2840 : 302 - 308
  • [10] Scalable Distributed Consensus to Support MPI Fault Tolerance
    Buntinas, Darius
    2012 IEEE 26TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2012, : 1240 - 1249