Reliability-aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments

Cited by: 0
Authors
Naksinehaboon, Nichamon [1]
Liu, Yudan [1]
Leangsuksun, Chokchai [1]
Nassar, Raja [1]
Paun, Mihaela [1]
Scott, Stephen L. [2]
Affiliations
[1] Louisiana Tech Univ, Coll Eng & Sci, Ruston, LA 71270 USA
[2] Oak Ridge Natl Lab, Comp Sci & Mathmat Div, Oak Ridge, TN 37831 USA
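Source
CCGRID 2008: EIGHTH IEEE INTERNATIONAL SYMPOSIUM ON CLUSTER COMPUTING AND THE GRID, VOLS 1 AND 2, PROCEEDINGS | 2008
Keywords
DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract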
For a full checkpoint on a large-scale HPC system, large memory contexts may have to be transferred over the network and saved to reliable storage. The time taken to checkpoint therefore becomes a critical issue that directly affects total execution time. Incremental checkpointing, a less intrusive method that reduces wasted time, has accordingly been gaining significant attention in the HPC community. In this paper, we build a model that aims to reduce full checkpoint overhead by performing a set of incremental checkpoints between two consecutive full checkpoints, and we give a method for finding the number of those incremental checkpoints. In most comparisons between the incremental checkpoint model and the full checkpoint model [19] on the same failure data set, the total wasted time in the incremental checkpoint model is significantly smaller than in the full checkpoint model.
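The trade-off described in the abstract can be illustrated with a small sketch. This is not the paper's model: the function names, the fixed per-checkpoint costs, and the worst-case recovery formula are all assumptions made for illustration. It only shows why placing m incremental checkpoints after each full checkpoint can lower checkpoint overhead while raising restore cost.

```python
# Illustrative sketch only; NOT the model from the paper.
# Assumptions: every full checkpoint costs `full_cost`, every incremental
# checkpoint costs `incr_cost`, and both costs are constant over the run.

def checkpoint_overhead(n_checkpoints, full_cost, incr_cost, m):
    """Total time spent writing checkpoints when each full checkpoint is
    followed by m incremental ones (m = 0 means full checkpoints only)."""
    total = 0.0
    for k in range(n_checkpoints):
        # Position 0 in each cycle of (1 full + m incremental) is the full one.
        is_full = (k % (m + 1) == 0)
        total += full_cost if is_full else incr_cost
    return total


def worst_case_recovery(full_restore, incr_restore, m):
    """Assumed worst case: restore the full checkpoint, then replay all m
    incremental checkpoints taken after it."""
    return full_restore + m * incr_restore


if __name__ == "__main__":
    # Hypothetical costs in seconds, chosen only to make the arithmetic visible.
    print(checkpoint_overhead(12, 60.0, 10.0, 0))  # full checkpoints only
    print(checkpoint_overhead(12, 60.0, 10.0, 3))  # 3 incrementals per full
    print(worst_case_recovery(30.0, 5.0, 3))
```

Under these assumed numbers, a larger m reduces the time spent writing checkpoints but lengthens the worst-case restore, which is why the number of incremental checkpoints has to be chosen rather than simply maximized.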
Pages: 783 / +
Page count: 2