Supporting cost-effective fault tolerance in distributed message-passing applications with file operations

Authors
Performance Technology Center, Hewlett-Packard Company, Roseville, CA 95747, United States [1]
Unknown [2]
Affiliations
Source
J Supercomput, Vol. 3, Issue 207-232
Keywords
Algorithms - Computer system recovery - Computer systems programming - Data communication systems - Fault tolerant computer systems - File organization - Response time (computer systems) - Software engineering - Subroutines
DOI
Not available
Chinese Library Classification number
Subject classification code
Abstract
In this paper we present an approach to reliable distributed computing that incorporates fault tolerance into applications at low cost, in terms of both run-time performance and the programming effort required to construct reliable application software. In our model, fault tolerance is based on distributed consistent checkpointing and rollback recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance of message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on this model, a software library prototype called Libra has been implemented to support fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface, including message-passing and file I/O primitives, that hides from the application level the complexity of both fault-tolerant network communication and the checkpointing and recovery of user files. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.
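The abstract does not describe Libra's algorithms or its API, so the following is only a generic illustration of what "catching messages in transit" over a non-FIFO network can look like. It is a minimal sketch of a textbook-style technique in which every message carries the sender's checkpoint epoch; it is not Libra's protocol, and the names `Message`, `Process`, `take_checkpoint`, `channel_log`, and `recover` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    sender: int
    epoch: int    # sender's checkpoint epoch at send time (piggybacked tag)
    payload: str


@dataclass
class Process:
    """Hypothetical process model; not taken from the Libra library."""
    pid: int
    epoch: int = 0
    state: list = field(default_factory=list)        # delivered payloads
    checkpoints: dict = field(default_factory=dict)  # epoch -> saved state
    channel_log: dict = field(default_factory=dict)  # epoch -> in-transit payloads

    def take_checkpoint(self):
        """Start a new epoch and save a copy of the local state."""
        self.epoch += 1
        self.checkpoints[self.epoch] = list(self.state)
        self.channel_log[self.epoch] = []

    def send(self, payload):
        return Message(self.pid, self.epoch, payload)

    def receive(self, msg):
        # A message from a later epoch was sent after the sender's checkpoint.
        # Checkpoint before delivering it, so that no saved state contains a
        # message that the sender's checkpoint has "not yet sent".
        while self.epoch < msg.epoch:
            self.take_checkpoint()
        # A message from an earlier epoch was sent before the sender's
        # checkpoint but arrives after ours: it was in transit, so log it
        # with every local checkpoint it crossed.
        for k in range(msg.epoch + 1, self.epoch + 1):
            self.channel_log[k].append(msg.payload)
        self.state.append(msg.payload)

    def recover(self, epoch):
        """Roll back to a checkpoint and replay its logged in-transit messages."""
        self.state = list(self.checkpoints[epoch]) + list(self.channel_log[epoch])
        self.epoch = epoch


# Non-FIFO scenario: m1 is sent before p0's checkpoint and m2 after it,
# but the network delivers m2 first.
p0, p1 = Process(0), Process(1)
m1 = p0.send("m1")
p0.take_checkpoint()
m2 = p0.send("m2")
p1.receive(m2)   # forces p1 to checkpoint before delivery
p1.receive(m1)   # late arrival is recorded as in transit
p1.recover(1)    # restored state holds m1 only; m2 would be re-sent by p0
```

Because the decision is made from the epoch tag on each message rather than from a marker that must arrive in order, the sketch does not rely on FIFO delivery. A real system would also need to detect when all in-transit messages have arrived and to handle lost messages, which this sketch leaves out.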