Fault recovery for distributed shared memory systems

Authors
Dieter, WR
Lumpp, JE
DOI
Not available
Chinese Library Classification
V [Aviation, Aerospace]
Discipline classification code
08; 0825
Abstract
Distributed Shared Memory (DSM) offers programmers a shared memory abstraction on top of an underlying network of distributed memory machines. Advances in network technology and in the price/performance of workstations suggest that DSM will be the dominant paradigm for future high-performance computing. However, as long-running DSM applications scale to hundreds or even thousands of machines, the probability of a node or network link failing increases. Fault tolerance is typically achieved via "checkpointing" techniques that allow applications to "roll back" to a recent checkpoint rather than restarting. High-performance DSM systems using relaxed memory consistency are significantly more difficult to checkpoint than uniprocessor or message-passing architectures. This paper describes previous approaches to checkpointing message-passing parallel programs, along with extensions to DSM systems.
Pages: 525-540
Number of pages: 16