Fault recovery for distributed shared memory systems

Authors
Dieter, WR
Lumpp, JE
DOI
Not available
Chinese Library Classification number
V [Aeronautics, Astronautics];
Discipline classification code
08 ; 0825 ;
Abstract
Distributed Shared Memory (DSM) offers programmers a shared-memory abstraction on top of an underlying network of distributed-memory machines. Advances in network technology and in the price/performance of workstations suggest that DSM will be the dominant paradigm for future high-performance computing. However, as long-running DSM applications scale to hundreds or even thousands of machines, the probability of a node or network link failing increases. Fault tolerance is typically achieved via "checkpointing" techniques that allow applications to "roll back" to a recent checkpoint rather than restarting. High-performance DSM systems using relaxed memory consistency are significantly more difficult to checkpoint than uniprocessor or message-passing architectures. This paper describes previous approaches to checkpointing message-passing parallel programs along with extensions to DSM systems.
Pages: 525-540
Page count: 16