Reliability-aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments

Cited by: 0

Authors:

Naksinehaboon, Nichamon [1]

Liu, Yudan [1]

Leangsuksun, Chokchai [1]

Nassar, Raja [1]

Paun, Mihaela [1]

Scott, Stephen L. [2]

Affiliations:

[1] Louisiana Tech Univ, Coll Eng & Sci, Ruston, LA 71270 USA

[2] Oak Ridge Natl Lab, Comp Sci & Mathmat Div, Oak Ridge, TN 37831 USA

Source:

CCGRID 2008: EIGHTH IEEE INTERNATIONAL SYMPOSIUM ON CLUSTER COMPUTING AND THE GRID, VOLS 1 AND 2, PROCEEDINGS | 2008

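Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract: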

For a full checkpoint on a large-scale HPC system, a potentially huge memory context must be transferred through the network and saved to reliable storage. The time taken to checkpoint therefore becomes a critical issue that directly affects total execution time. Incremental checkpointing, a less intrusive method that reduces this waste time, has consequently been gaining significant attention in the HPC community. In this paper, we build a model that aims to reduce full checkpoint overhead by performing a set of incremental checkpoints between two consecutive full checkpoints, and we give a method for finding the number of those incremental checkpoints. Most of the comparisons between the incremental checkpoint model and the full checkpoint model [19] on the same failure data set show that the total waste time of the incremental checkpoint model is significantly smaller than that of the full checkpoint model.

Citation

Pages: 783 / +

Page count: 2
