Fault recovery for distributed shared memory systems

Authors
Dieter, WR
Lumpp, JE
Chinese Library Classification
V [Aeronautics, Astronautics]
Discipline classification code
08; 0825
Abstract
Distributed Shared Memory (DSM) offers programmers a shared memory abstraction on top of an underlying network of distributed memory machines. Advances in network technology and in the price/performance of workstations suggest that DSM will be the dominant paradigm for future high-performance computing. However, as long-running DSM applications scale to hundreds or even thousands of machines, the probability of a node or network link failing increases. Fault tolerance is typically achieved via "checkpointing" techniques that allow applications to "roll back" to a recent checkpoint rather than restarting. High-performance DSM systems using relaxed memory consistency are significantly more difficult to checkpoint than uniprocessor or message-passing architectures. This paper describes previous approaches to checkpointing message-passing parallel programs along with extensions to DSM systems.
Pages: 525-540
Page count: 16