Supporting cost-effective fault tolerance in distributed message-passing applications with file operations

Cited by: 0

Authors:

Performance Technology Center, Hewlett-Packard Company, Roseville, CA 95747, United States [1]

Unknown [2]

Affiliations:

Source:

J Supercomput, Vol. 3, pp. 207-232

Keywords:

Algorithms - Computer system recovery - Computer systems programming - Data communication systems - Fault tolerant computer systems - File organization - Response time (computer systems) - Software engineering - Subroutines

DOI:

Not available

Chinese Library Classification number:

Subject classification number:

Abstract:

In this paper we present an approach to reliable distributed computing that incorporates fault tolerance into applications at low cost, in terms of both run-time performance and the programming effort required to construct reliable application software. In our model, fault tolerance is based on distributed consistent checkpointing and rollback recovery, integrated with a user-level reliable transmission protocol. Through novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by three features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on this model, a software library prototype called Libra has been implemented to support fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface, including message-passing and file I/O primitives, that hides from the application level the complexity of both fault-tolerant network communication and the checkpointing and recovery of user files. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.
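To make the third feature concrete, the following is a minimal sketch of how persistent state could be rolled back together with a process checkpoint. It is not Libra's interface or algorithm, which the abstract does not describe; the class name `AppendOnlyFileCheckpoint`, its methods, and the `demo` function are hypothetical. It also assumes the simplest case, in which the application only appends to its files, so recording each file's length at checkpoint time is enough to undo later writes by truncation.

```python
import os
import tempfile


class AppendOnlyFileCheckpoint:
    """Hypothetical helper, not part of Libra.

    Assumes files are only ever appended to between checkpoints, so a
    file's state at checkpoint time is fully described by its length.
    """

    def __init__(self):
        self._sizes = {}  # path -> file length recorded at the last checkpoint

    def checkpoint(self, paths):
        """Record the current length of every tracked file."""
        self._sizes = {path: os.path.getsize(path) for path in paths}

    def rollback(self):
        """Discard everything written after the last checkpoint."""
        for path, size in self._sizes.items():
            with open(path, "r+b") as f:
                f.truncate(size)


def demo():
    """Write, checkpoint, write again, simulate a failure, then roll back."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "results.log")
        with open(path, "wb") as f:
            f.write(b"step 1\n")

        ckpt = AppendOnlyFileCheckpoint()
        ckpt.checkpoint([path])

        # Output produced after the checkpoint; lost if the process rolls back.
        with open(path, "ab") as f:
            f.write(b"step 2 (after checkpoint)\n")

        ckpt.rollback()
        with open(path, "rb") as f:
            return f.read()


if __name__ == "__main__":
    print(demo())
```

Recording only file lengths keeps the checkpoint small, but it does not cover in-place overwrites or deletions; handling those would require saving the overwritten data as well.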

