Supporting cost-effective fault tolerance in distributed message-passing applications with file operations

Cited by: 0
Authors
Jinsong Ouyang, Piyush Maheshwari
Affiliations
[1] Performance Technology Center, Hewlett-Packard Company, Roseville, CA 95747, United States
[2] Not specified
Source
J Supercomput, 1999, 14 (3): 207-232
Keywords
Algorithms - Computer system recovery - Computer systems programming - Data communication systems - Fault tolerant computer systems - File organization - Response time (computer systems) - Software engineering - Subroutines
DOI
Not available
Abstract
In this paper we present an approach to reliable distributed computing that incorporates fault tolerance into applications at low cost, in terms of both run-time performance and the programming effort required to construct reliable application software. In our model, fault tolerance is based on distributed consistent checkpointing and rollback recovery, integrated with a user-level reliable transmission protocol. Through novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by three features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance of message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented to support fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface, including message-passing and file I/O primitives, that hides from the application level the complexity of both fault-tolerant network communication and the checkpointing and recovery of user files. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.
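The abstract does not spell out Libra's interface or algorithms, so the following is only a minimal single-process sketch of the general idea behind checkpointing persistent state together with volatile state. It assumes an append-only user file, so a checkpoint can record the file length and a rollback can truncate back to it. The class and method names (`CheckpointedProcess`, `work`, `checkpoint`, `rollback`) are hypothetical and are not taken from Libra, and the sketch does not attempt distributed coordination or capture of in-transit messages.

```python
import copy
import os
import tempfile


class CheckpointedProcess:
    """Toy process whose in-memory state and one append-only file roll back together.

    Hypothetical illustration only; this is not Libra's API.
    """

    def __init__(self, path):
        self.path = path
        self.state = {"step": 0}
        self._saved = None  # (state copy, file length) from the last checkpoint
        open(path, "ab").close()  # make sure the file exists

    def work(self, record: bytes):
        # One unit of application work: update memory and append to the file.
        self.state["step"] += 1
        with open(self.path, "ab") as f:
            f.write(record)

    def checkpoint(self):
        # Save the volatile state plus the file length, not the file contents.
        self._saved = (copy.deepcopy(self.state), os.path.getsize(self.path))

    def rollback(self):
        # Restore the volatile state and truncate the file to its saved length.
        if self._saved is None:
            raise RuntimeError("no checkpoint taken")
        state, length = self._saved
        self.state = copy.deepcopy(state)
        os.truncate(self.path, length)


def demo():
    # Write, checkpoint, write again, then roll back after a simulated failure.
    with tempfile.TemporaryDirectory() as d:
        p = CheckpointedProcess(os.path.join(d, "out.log"))
        p.work(b"a\n")
        p.checkpoint()
        p.work(b"b\n")  # this work is undone by the rollback
        p.rollback()
        with open(p.path, "rb") as f:
            return p.state["step"], f.read()
```

Recording only the file length keeps the checkpoint small, but it works only because the file is append-only; files updated in place would need something like an undo log or a shadow copy, which this sketch leaves out.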
Related papers
44 in total
  • [1] Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations
    Jinsong Ouyang
    Piyush Maheshwari
    The Journal of Supercomputing, 1999, 14 : 207 - 232
  • [2] Supporting cost-effective fault tolerance in distributed message-passing applications with file operations
    Ouyang, JS
    Maheshwari, P
    JOURNAL OF SUPERCOMPUTING, 1999, 14 (03): : 207 - 232
  • [3] A DISTRIBUTED MESSAGE-PASSING COORDINATOR FOR OBLOG APPLICATIONS
    BONCHEV, B
    CAPITAO, M
    INFORMATION NETWORKS AND DATA COMMUNICATION, 1994, 23 : 71 - 87
  • [4] Active optimistic and distributed message logging for message-passing applications
    Ropars, Thomas
    Morin, Christine
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2011, 23 (17): : 2167 - 2178
  • [5] Dynamic Tree Switching for Distributed Message-Passing Applications
    Chakraborty, Suchetana
    Chakraborty, Sandip
    Karmakar, Sushanta
    Nandi, Sukumar
    JOURNAL OF NETWORK AND SYSTEMS MANAGEMENT, 2015, 23 (01) : 1 - 40
  • [6] Dynamic Tree Switching for Distributed Message-Passing Applications
    Suchetana Chakraborty
    Sandip Chakraborty
    Sushanta Karmakar
    Sukumar Nandi
    Journal of Network and Systems Management, 2015, 23 : 1 - 40
  • [7] SUPPORTING DISTRIBUTED OBJECTS IN FIFO-BASED MESSAGE-PASSING SYSTEMS
    CHANG, WT
    TSENG, CC
    JOURNAL OF OBJECT-ORIENTED PROGRAMMING, 1995, 7 (09): : 56 - &
  • [8] Unified fault-tolerance framework for hybrid task-parallel message-passing applications
    Subasi, Omer
    Martsinkevich, Tatiana
    Zyulkyarov, Ferad
    Unsal, Osman
    Labarta, Jesus
    Cappello, Franck
    INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2018, 32 (05): : 641 - 657
  • [9] A multithreaded message-passing system for high performance distributed computing applications
    Park, SY
    Lee, J
    Hariri, S
    18TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS, PROCEEDINGS, 1998, : 258 - 265
  • [10] Common mechanisms for supporting fault tolerance in DSM and message passing systems
    Badrinath, R
    Morin, C
    Concurrent Information Processing and Computing, 2005, 195 : 175 - 183