Fault tolerance in Message Passing Interface programs

Cited by: 54
Authors
Gropp, W [1]
Lusk, E [1]
Affiliations
[1] Argonne Natl Lab, Math & Comp Sci Div, Argonne, IL 60439 USA
Source
INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS | 2004 / Vol. 18 / No. 03
Keywords
MPI; fault tolerance; process management; parallel computing
DOI
10.1177/1094342004046045
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In this paper we examine the topic of writing fault-tolerant Message Passing Interface (MPI) applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude that, within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.
Pages: 363 - 372
Number of pages: 10
Related papers
50 in total
  • [1] Symbolic Verification of Message Passing Interface Programs
    Yu, Hengbiao
    Chen, Zhenbang
    Fu, Xianjin
    Wang, Ji
    Su, Zhendong
    Sun, Jun
    Huang, Chun
    Dong, Wei
    2020 ACM/IEEE 42ND INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2020), 2020: 1248 - 1260
  • [2] Checkpointing Message-Passing Interface (MPI) parallel programs
    Li, WJ
    Tsay, JJ
    PACIFIC RIM INTERNATIONAL SYMPOSIUM ON FAULT-TOLERANT SYSTEMS, PROCEEDINGS, 1997: 147 - 152
  • [3] Common mechanisms for supporting fault tolerance in DSM and message passing systems
    Badrinath, R
    Morin, C
    Concurrent Information Processing and Computing, 2005, 195 : 175 - 183
  • [4] Finding Bottlenecks in Message Passing Interface Programs by Scalable Critical Path Analysis
    Korkhov, Vladimir
    Gankevich, Ivan
    Gavrikov, Anton
    Mingazova, Maria
    Petriakov, Ivan
    Tereshchenko, Dmitrii
    Shatalin, Artem
    Slobodskoy, Vitaly
    ALGORITHMS, 2023, 16 (11)
  • [5] Design, implementation and performance of fault-tolerant message passing interface (MPI)
    Selvakumar, AD
    Sobha, PM
    Ravindra, GC
    Pitchiah, R
    PARALLEL AND DISTRIBUTED COMPUTING SYSTEMS, 2004: 145 - 150
  • [6] Design, implementation and performance of Fault-Tolerant message passing interface (MPI)
    Selvakumar, AD
    Sobha, PM
    Ravindra, GC
    Pitchiah, R
    SEVENTH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND GRID IN ASIA PACIFIC REGION, PROCEEDINGS, 2004: 120 - 129
  • [7] Message analysis for concurrent programs using message passing
    Carlsson, Richard
    Sagonas, Konstantinos
    Wilhelmsson, Jesper
    ACM TRANSACTIONS ON PROGRAMMING LANGUAGES AND SYSTEMS, 2006, 28 (04): 715 - 746
  • [8] Analyzing nondeterminacy of message passing programs
    Xiong, JX
    Wang, DX
    SECOND INTERNATIONAL SYMPOSIUM ON PARALLEL ARCHITECTURES, ALGORITHMS, AND NETWORKS (I-SPAN '96), PROCEEDINGS, 1996: 547 - 549
  • [9] Notes on nondeterminism in message passing programs
    Kranzlmüller, D
    Schulz, M
    RECENT ADVANCES IN PARALLEL VIRTUAL MACHINE AND MESSAGE PASSING INTERFACE, PROCEEDINGS, 2002, 2474 : 357 - 367
  • [10] VISUALIZATION OF MESSAGE PASSING PARALLEL PROGRAMS
    BEMMERL, T
    BRAUN, P
    LECTURE NOTES IN COMPUTER SCIENCE, 1992, 634 : 79 - 90