Supporting cost-effective fault tolerance in distributed message-passing applications with file operations

Cited by: 0
Authors
Performance Technology Center, Hewlett-Packard Company, Roseville, CA 95747, United States [1]
Unknown [2]
Affiliations
Source
J Supercomput / Vol. 3 / No. 207-232
Keywords
Algorithms - Computer system recovery - Computer systems programming - Data communication systems - Fault tolerant computer systems - File organization - Response time (computer systems) - Software engineering - Subroutines
DOI
Not available
Chinese Library Classification number
Subject classification code
Abstract
In this paper we present an approach to reliable distributed computing that incorporates fault tolerance into applications at low cost, in terms of both run-time performance and the programming effort required to construct reliable application software. In our model, fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented for supporting fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface, including message-passing and file I/O primitives, which hides the complexity of both fault-tolerant network communications and checkpointing and recovering user files from the application level. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.
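The abstract's third feature, checkpointing and recovery of persistent state in user files, can be illustrated with a minimal sketch. It is not Libra's interface or algorithm, which this record does not describe. It assumes an append-only file whose checkpointed state is just its length, so rollback reduces to truncation. The class `CheckpointedLog` and the function `demo` are hypothetical names introduced only for illustration.

```python
import os
import tempfile


class CheckpointedLog:
    """Hypothetical append-only file wrapper (an assumption, not Libra's API)."""

    def __init__(self, path):
        self.path = path
        # Create the file if it does not exist yet.
        open(path, "ab").close()
        self.saved_length = os.path.getsize(path)

    def append(self, data: bytes) -> None:
        with open(self.path, "ab") as f:
            f.write(data)

    def checkpoint(self) -> int:
        # For an append-only file, the persistent state is fully
        # described by the current file length.
        self.saved_length = os.path.getsize(self.path)
        return self.saved_length

    def rollback(self) -> None:
        # Discard every byte written after the last checkpoint.
        with open(self.path, "r+b") as f:
            f.truncate(self.saved_length)

    def read(self) -> bytes:
        with open(self.path, "rb") as f:
            return f.read()


def demo() -> bytes:
    with tempfile.TemporaryDirectory() as d:
        log = CheckpointedLog(os.path.join(d, "out.log"))
        log.append(b"step1;")
        log.checkpoint()
        log.append(b"step2;")  # work done after the checkpoint
        log.rollback()         # simulated failure followed by recovery
        return log.read()
```

Recording only a length keeps the checkpoint small, but it works only for files that are never overwritten in place; files updated at arbitrary offsets would need a different scheme, such as logging the overwritten bytes.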
Related papers
44 items in total
  • [21] CEFT: A cost-effective, fault-tolerant parallel virtual file system
    Zhu, YF
    Jiang, H
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2006, 66 (02) : 291 - 306
  • [22] HARDWARE REDUNDANCY - THE WAY TO GO FOR COST-EFFECTIVE FAST FAULT TOLERANCE
    FOSTER, WE
    ELECTRONIC DESIGN, 1984, 32 (14) : 62 - 62
  • [23] Exploring Winograd Convolution for Cost-Effective Neural Network Fault Tolerance
    Xue, Xinghua
    Liu, Cheng
    Liu, Bo
    Huang, Haitong
    Wang, Ying
    Luo, Tao
    Zhang, Lei
    Li, Huawei
    Li, Xiaowei
    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2023, 31 (11) : 1763 - 1773
  • [24] Joint Throughput and Fault Tolerance Requirement for Cost-Effective Dense WiFi
    Qiu, Shuwei
    Leung, Yiu-Wing
    2024 IEEE WIRELESS COMMUNICATIONS AND NETWORKING CONFERENCE, WCNC 2024, 2024,
  • [25] Cost-effective Safety and Fault Localization using Distributed Temporal Redundancy
    Meyer, Brett H.
    Calhoun, Benton H.
    Lach, John
    Skadron, Kevin
    PROCEEDINGS OF THE 14TH INTERNATIONAL CONFERENCE ON COMPILERS, ARCHITECTURES AND SYNTHESIS FOR EMBEDDED SYSTEMS (CASES '11), 2011, : 125 - 134
  • [26] A cost-effective fault management system for distribution systems with distributed generators
    Teng, Jen-Hao
    Luan, Shang-Wen
    Huang, Wei-Hao
    Lee, Dong-Jing
    Huang, Yung-Fu
    INTERNATIONAL JOURNAL OF ELECTRICAL POWER & ENERGY SYSTEMS, 2015, 65 : 357 - 366
  • [27] A methodology for cost-effective software fault tolerance for mission-critical systems
    Kreutzfeld, RJ
    Neese, RE
    15TH DASC - AIAA/IEEE DIGITAL AVIONICS SYSTEMS CONFERENCE, 1996, : 19 - 24
  • [28] Cost-effective multichip module manufacture using passive substrate fault tolerance
    Peacock, C
    Bolouri, H
    Habiger, C
    IEEE TRANSACTIONS ON COMPONENTS PACKAGING AND MANUFACTURING TECHNOLOGY PART B-ADVANCED PACKAGING, 1997, 20 (03) : 320 - 326
  • [29] Methodology for cost-effective software fault tolerance for mission-critical systems
    TASC, Fairborne, United States
    IEEE Aerosp Electron Syst Mag, 1600, 9 (25-30):
  • [30] Enhancing Sensor Fault Tolerance in Automotive Systems With Cost-Effective Cyber Redundancy
    Foshati, Amin
    Ejlali, Alireza
    IEEE TRANSACTIONS ON INTELLIGENT VEHICLES, 2024, 9 (04) : 4794 - 4803