Fault recovery for distributed shared memory systems

Cited by: 0

Authors:

Dieter, WR

Lumpp, JE

Institution:

Source:

1997 IEEE AEROSPACE CONFERENCE PROCEEDINGS, VOL 2 | 1997

Keywords:

DOI:

Not available

Chinese Library Classification number:

V [Aeronautics, Astronautics];

Discipline classification code:

08 ; 0825 ;

Abstract:

Distributed Shared Memory (DSM) offers programmers a shared memory abstraction on top of an underlying network of distributed memory machines. Advances in network technology and price/performance of workstations suggest that DSM will be the dominant paradigm for future high-performance computing. However, as long-running DSM applications scale to hundreds or even thousands of machines, the probability of a node or network link failing increases. Fault tolerance is typically achieved via "checkpointing" techniques that allow applications to "roll back" to a recent checkpoint rather than restarting. High-performance DSM systems using relaxed memory consistency are significantly more difficult to checkpoint than uniprocessor or message-passing architectures. This paper describes previous approaches to checkpointing message-passing parallel programs along with extensions to DSM systems.

Citation

Pages: 525-540

Page count: 16
