MPI-FT: Portable fault tolerance scheme for MPI

Cited by: 0
Authors
Louca, S. [1]
Neophytou, N. [1]
Lachanas, A. [1]
Evripidou, P. [1]
Affiliations
[1] Dept. of Computer Science, University of Cyprus, P.O. Box 537, CY-1678 Nicosia, Cyprus
Keywords
Computer simulation - Computer system recovery - Data communication systems - Distributed computer systems - Fault tolerant computer systems - Monitoring - Telecommunication traffic
DOI
10.1142/s0129626400000342
Abstract
In this paper, we propose the design and development of a fault tolerance and recovery scheme for the Message Passing Interface (MPI). The proposed scheme consists of a detection mechanism for detecting process failures, and a recovery mechanism. Two different cases are considered, both assuming the existence of a monitoring process, the Observer, which triggers the recovery procedure in case of failure. In the first case, each process keeps a buffer with its own message traffic to be used in case of failure, while the implementor uses periodic tests for notification of failure by the Observer. The recovery function simulates all the communication of the processes with the dead one by re-sending to the replacement process all the messages destined for the dead one. In the second case, the Observer receives and stores all message traffic, and sends to the replacement all the buffered messages destined for the dead process. Solutions are provided to the dead communicator problem caused by the death of a process. A description of the prototype developed is provided along with the results of the experiments performed for efficiency and performance.
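The two buffering cases described in the abstract can be illustrated with a toy, single-process simulation. This is only a sketch under stated assumptions, not the authors' prototype: it makes no MPI calls, and the names `Process`, `Observer`, `resend_to`, `recover`, and `demo` are hypothetical. It assumes the replacement process takes over the rank of the dead one and that failure detection has already happened.

```python
from collections import defaultdict


class Observer:
    """Hypothetical monitoring process that logs all traffic (second case)."""

    def __init__(self):
        self.traffic = []  # (source rank, destination rank, payload)

    def record(self, src, dest, payload):
        self.traffic.append((src, dest, payload))

    def recover(self, replacement):
        # Second case: the Observer replays every logged message
        # that was destined for the dead rank.
        for src, dest, payload in self.traffic:
            if dest == replacement.rank:
                replacement.inbox.append((src, payload))


class Process:
    """Hypothetical stand-in for an MPI rank; no real MPI is involved."""

    def __init__(self, rank):
        self.rank = rank
        self.inbox = []                    # messages delivered to this rank
        self.sent_log = defaultdict(list)  # first case: own outgoing traffic

    def send(self, dest, payload, observer=None):
        self.sent_log[dest.rank].append(payload)  # sender-side buffer
        if observer is not None:
            observer.record(self.rank, dest.rank, payload)
        dest.inbox.append((self.rank, payload))

    def resend_to(self, replacement):
        # First case: each surviving process replays what it sent
        # to the dead rank, in its own original sending order.
        for payload in self.sent_log[replacement.rank]:
            replacement.inbox.append((self.rank, payload))


def demo():
    observer = Observer()
    p0, p1, p2 = Process(0), Process(1), Process(2)
    p0.send(p2, "a", observer)
    p1.send(p2, "b", observer)
    p0.send(p2, "c", observer)

    # Assume rank 2 has died and a replacement takes over its rank.
    case1 = Process(2)
    for peer in (p0, p1):
        peer.resend_to(case1)

    case2 = Process(2)
    observer.recover(case2)
    return case1.inbox, case2.inbox
```

In this sketch the first case preserves message order only per sender, while the Observer-based case replays messages in the order the Observer recorded them; whether the actual scheme behaves this way is not stated in the abstract.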
Pages: 371-382