MPI-FT: Portable fault tolerance scheme for MPI

Cited by: 0
Authors
Louca, S. [1]
Neophytou, N. [1]
Lachanas, A. [1]
Evripidou, P. [1]
Affiliation
[1] Dept. of Computer Science, University of Cyprus, P.O. Box 537, CY-1678 Nicosia, Cyprus
Keywords
Computer simulation - Computer system recovery - Data communication systems - Distributed computer systems - Fault tolerant computer systems - Monitoring - Telecommunication traffic
DOI
10.1142/s0129626400000342
Abstract
In this paper, we propose the design and development of a fault-tolerance and recovery scheme for the Message Passing Interface (MPI). The proposed scheme consists of a detection mechanism that identifies process failures and a recovery mechanism. Two cases are considered, both assuming the existence of a monitoring process, the Observer, which triggers the recovery procedure when a failure occurs. In the first case, each process keeps a buffer of its own message traffic for use in case of failure, while the implementor uses periodic tests to check for failure notifications from the Observer. The recovery function simulates all communication of the processes with the dead one by re-sending to the replacement process all the messages destined for the dead one. In the second case, the Observer receives and stores all message traffic and sends the replacement all the buffered messages destined for the dead process. Solutions are provided to the dead communicator problem caused by the death of a process. A description of the developed prototype is provided, along with the results of experiments on efficiency and performance.
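The first case described in the abstract (each process buffering its own outgoing traffic and replaying it to a replacement) can be illustrated with a minimal, single-machine sketch. This is not the authors' implementation and uses no MPI calls: the `Process` class, its `sent_log` buffer, and the `recover` function are hypothetical names introduced only for illustration, and message delivery is simulated with plain Python lists.

```python
from collections import defaultdict


class Process:
    """Hypothetical stand-in for an MPI rank; this is not an MPI API."""

    def __init__(self, rank):
        self.rank = rank
        self.inbox = []                    # (source rank, payload) pairs received
        self.sent_log = defaultdict(list)  # destination rank -> payloads sent

    def send(self, dest, payload):
        # Buffer a copy of the message before delivering it, so that it
        # can be replayed if the destination later fails.
        self.sent_log[dest.rank].append(payload)
        dest.inbox.append((self.rank, payload))

    def replay_to(self, replacement):
        # Re-send, in the original per-sender order, every message that
        # was destined for the rank the replacement now occupies.
        for payload in self.sent_log[replacement.rank]:
            replacement.inbox.append((self.rank, payload))


def recover(survivors, dead_rank):
    """Assumed Observer-side step: create a replacement, ask survivors to replay."""
    replacement = Process(dead_rank)
    for p in survivors:
        p.replay_to(replacement)
    return replacement


p0, p1, p2 = Process(0), Process(1), Process(2)
p0.send(p2, "a")
p1.send(p2, "b")
p0.send(p2, "c")

# Rank 2 dies; the Observer triggers recovery on the surviving ranks.
replacement = recover([p0, p1], dead_rank=2)
print(replacement.inbox)
```

In this sketch each sender's messages arrive at the replacement in their original order, but the interleaving between different senders is not reproduced. Whether and how the actual scheme handles cross-sender ordering is not stated in the abstract.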
Pages: 371-382