Fault recovery for distributed shared memory systems

Authors
Dieter, WR
Lumpp, JE
DOI
Not available
Chinese Library Classification
V [Aeronautics, Astronautics];
Discipline classification code
08 ; 0825 ;
Abstract
Distributed Shared Memory (DSM) offers programmers a shared-memory abstraction on top of an underlying network of distributed-memory machines. Advances in network technology and in the price/performance of workstations suggest that DSM will be the dominant paradigm for future high-performance computing. However, as long-running DSM applications scale to hundreds or even thousands of machines, the probability of a node or network link failing increases. Fault tolerance is typically achieved via "checkpointing" techniques that allow applications to "roll back" to a recent checkpoint rather than restart from the beginning. High-performance DSM systems using relaxed memory consistency are significantly more difficult to checkpoint than uniprocessor or message-passing architectures. This paper describes previous approaches to checkpointing message-passing parallel programs, along with extensions to DSM systems.
Pages: 525-540
Page count: 16