MPI-FT: Portable fault tolerance scheme for MPI

Authors
Louca, S. [1]
Neophytou, N. [1]
Lachanas, A. [1]
Evripidou, P. [1]
Affiliation
[1] Dept. of Computer Science, University of Cyprus, P.O. Box 537, CY-1678 Nicosia, Cyprus
Keywords
Computer simulation - Computer system recovery - Data communication systems - Distributed computer systems - Fault tolerant computer systems - Monitoring - Telecommunication traffic
DOI
10.1142/s0129626400000342
Abstract
In this paper, we propose the design and development of a fault tolerance and recovery scheme for the Message Passing Interface (MPI). The proposed scheme consists of a detection mechanism for detecting process failures and a recovery mechanism. Two different cases are considered, both assuming the existence of a monitoring process, the Observer, which triggers the recovery procedure in case of failure. In the first case, each process keeps a buffer with its own message traffic to be used in case of failure, while the implementor uses periodic tests for notification of failure by the Observer. The recovery function simulates all the communication of the processes with the dead one by re-sending to the replacement process all the messages destined for the dead one. In the second case, the Observer receives and stores all message traffic and sends to the replacement all the buffered messages destined for the dead process. Solutions are provided to the dead communicator problem caused by the death of a process. A description of the developed prototype is provided, along with the results of experiments on efficiency and performance.
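The first case described in the abstract (each process buffers its own outgoing traffic and replays it to a replacement) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' prototype: it uses plain Python objects instead of MPI, and the names `Process`, `sent_log`, `resend_to`, and `recover` are hypothetical. It also assumes that replaying messages in per-sender order is sufficient, which the abstract does not specify.

```python
from collections import defaultdict


class Process:
    """A simulated rank that logs every message it sends (hypothetical model)."""

    def __init__(self, rank):
        self.rank = rank
        self.inbox = []                    # (source rank, payload) pairs delivered here
        self.sent_log = defaultdict(list)  # destination rank -> payloads already sent

    def send(self, network, dest, payload):
        # Keep a local copy of the outgoing message so it can be replayed later.
        self.sent_log[dest].append(payload)
        network[dest].inbox.append((self.rank, payload))

    def resend_to(self, network, dead_rank):
        # Replay every logged message that was destined for the failed rank.
        for payload in self.sent_log[dead_rank]:
            network[dead_rank].inbox.append((self.rank, payload))


def recover(network, dead_rank):
    """Stand-in for the Observer's recovery step (assumed behaviour)."""
    network[dead_rank] = Process(dead_rank)  # replacement starts with an empty inbox
    for rank, proc in network.items():
        if rank != dead_rank:
            proc.resend_to(network, dead_rank)
    return network[dead_rank]


# Three ranks; ranks 0 and 1 send to rank 2, which then "fails" and is replaced.
network = {r: Process(r) for r in range(3)}
network[0].send(network, 2, "a")
network[1].send(network, 2, "b")
network[0].send(network, 2, "c")
replacement = recover(network, 2)
```

In this sketch the replacement receives the replayed messages grouped by sender, so the original interleaving across senders is not reproduced. A real scheme would also need failure detection and a way to rebuild the communicator, neither of which is modelled here.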
Pages: 371-382
Related papers
50 in total
  • [21] MPI/FT™: Architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing
    Batchu, R
    Neelamegam, JP
    Cui, ZQ
    Beddhu, M
    Skjellum, A
    Dandass, Y
    Apte, M
    FIRST IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER COMPUTING AND THE GRID, PROCEEDINGS, 2001, : 26 - 33
  • [22] FAIL-MPI: How fault-tolerant is fault-tolerant MPI?
    Hoarau, William
    Lemarinier, Pierre
    Herault, Thomas
    Rodriguez, Eric
    Tixeuil, Sebastien
    Cappello, Franck
    2006 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING, VOLS 1 AND 2, 2006, : 133 - +
  • [23] Supporting Task-level Fault-Tolerance in HPC Workflows by Launching MPI Jobs inside MPI Jobs
    Dorier, Matthieu
    Wozniak, Justin M.
    Ross, Robert
    PROCEEDINGS OF WORKS 2017: 12TH WORKSHOP ON WORKFLOWS IN SUPPORT OF LARGE-SCALE SCIENCE, 2017,
  • [24] NR-MPI: a Non-stop and Fault Resilient MPI
    Suo, Guang
    Lu, Yutong
    Liao, Xiangke
    Xie, Min
    Cao, Hongjia
    2013 19TH IEEE INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS 2013), 2013, : 190 - 199
  • [25] Portable distributed priority queues with MPI
    Macquarie Univ, Sydney
    Concurrency Pract Exper, 3 (175-198):
  • [26] Portable distributed priority queues with MPI
    Mans, B
    CONCURRENCY-PRACTICE AND EXPERIENCE, 1998, 10 (03): : 175 - 198
  • [27] Run-Through Stabilization: An MPI Proposal for Process Fault Tolerance
    Hursey, Joshua
    Graham, Richard L.
    Bronevetsky, Greg
    Buntinas, Darius
    Pritchard, Howard
    Solt, David G.
    RECENT ADVANCES IN THE MESSAGE PASSING INTERFACE, 2011, 6960 : 329 - +
  • [28] Evaluating and extending user-level fault tolerance in MPI applications
    Laguna, Ignacio
    Richards, David F.
    Gamblin, Todd
    Schulz, Martin
    de Supinski, Bronis R.
    Mohror, Kathryn
    Pritchard, Howard
    INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2016, 30 (03): : 305 - 319
  • [29] Supporting User-directed Fault Tolerance Over Standard MPI
    Wu, Zhimin
    Wang, Rui
    Xu, Weizhi
    Chen, Mingyu
    Yao, Erlin
    PROCEEDINGS OF THE 2012 IEEE 18TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS 2012), 2012, : 696 - 697
  • [30] HARNESS and fault tolerant MPI
    Fagg, GE
    Bukovsky, A
    Dongarra, JJ
    PARALLEL COMPUTING, 2001, 27 (11) : 1479 - 1495