MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware

Cited by: 21
Authors
Rajanikanth Batchu
Yoginder S. Dandass
Anthony Skjellum
Murali Beddhu
Keywords
model-based fault tolerance; MPI; cluster computing; fault detection; group communication
DOI
10.1023/B:CLUS.0000039491.64560.8a
Abstract
Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing methods. This approach has also been extended to message-passing systems through user-transparent process checkpointing and message logging. Furthermore, studies of multiple types of rollback and recovery have been reported in the literature, ranging from communication-induced checkpointing to pessimistic and synchronous solutions. However, many of these solutions incur high overhead because they cannot exploit application-level information.
Pages: 303-315
Number of pages: 12