Supporting cost-effective fault tolerance in distributed message-passing applications with file operations

Authors / Affiliations
[1] Performance Technology Center, Hewlett-Packard Company, Roseville, CA 95747, United States
[2] Unknown
Source
J Supercomput / 3 / 207-232
Keywords
Algorithms - Computer system recovery - Computer systems programming - Data communication systems - Fault tolerant computer systems - File organization - Response time (computer systems) - Software engineering - Subroutines
DOI
Not available
Abstract
In this paper we present an approach to reliable distributed computing that incorporates fault tolerance into applications at low cost, in terms of both run-time performance and the programming effort required to construct reliable application software. In our model, fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented to support fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface, including message-passing and file I/O primitives, that hides from the application level the complexity of both fault-tolerant network communication and the checkpointing and recovery of user files. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.