Fault recovery for distributed shared memory systems

Cited by: 0

Authors:

Dieter, WR

Lumpp, JE

Source:

1997 IEEE AEROSPACE CONFERENCE PROCEEDINGS, VOL 2 | 1997

DOI:

Not available

Chinese Library Classification (CLC):

V [Aeronautics, Astronautics];

Discipline classification code:

08 ; 0825 ;

Abstract:

Distributed Shared Memory (DSM) offers programmers a shared memory abstraction on top of an underlying network of distributed memory machines. Advances in network technology and price/performance of workstations suggest that DSM will be the dominant paradigm for future high-performance computing. However, as long-running DSM applications scale to hundreds or even thousands of machines, the probability of a node or network link failing increases. Fault tolerance is typically achieved via "checkpointing" techniques that allow applications to "roll back" to a recent checkpoint rather than restarting. High-performance DSM systems using relaxed memory consistency are significantly more difficult to checkpoint than uniprocessor or message-passing architectures. This paper describes previous approaches to checkpointing message-passing parallel programs along with extensions to DSM systems.

Pages: 525-540

Page count: 16
