To Checkpoint or Not to Checkpoint: Understanding Energy-Performance-I/O Tradeoffs in HPC Checkpointing

Cited by: 0
Authors
El-Sayed, Nosayba [1]
Schroeder, Bianca [1]
Affiliations
[1] Univ Toronto, Dept Comp Sci, Toronto, ON, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC);
Keywords
High-performance computing; Fault tolerance; Checkpoint/Restart; Energy-efficiency; Performance; INTERVAL;
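DOI
Not available
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract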
As the scale of high-performance computing (HPC) clusters continues to grow, their increasing failure rates and energy consumption levels are emerging as two serious design concerns that are expected to become more challenging in future Exascale systems. Therefore, efficiently running systems at such large scales requires an in-depth understanding of the performance and energy costs associated with different fault tolerance techniques. The most commonly used fault tolerance method is checkpoint/restart. Checkpoint scheduling policies have traditionally been optimized and analyzed from a performance perspective. The energy profile of these policies, and how to optimize them for energy savings rather than performance, remain poorly understood. In this paper, we provide an extensive analysis of the energy/performance tradeoffs associated with an array of checkpoint scheduling policies, including policies that we propose as well as a few existing ones from the literature. We estimate the energy overhead for a given checkpointing policy, and provide simple formulas to optimize checkpoint scheduling for energy savings, with or without a bound on runtime. We then evaluate and compare the runtime-optimized and energy-optimized versions of the different methods using trace-driven simulations based on failure logs from 10 production HPC clusters. Our results show ample room for achieving high energy savings with a low runtime overhead when using non-constant (adaptive) checkpointing methods that exploit characteristics of HPC failures. We also analyze the impact of energy-optimized checkpointing on the storage subsystem, identify policies that are better suited to I/O savings, and study how to optimize for energy with a bound on I/O time.
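The abstract does not reproduce the paper's own formulas. As a point of reference only, the sketch below shows the classic runtime-oriented baseline that work on checkpoint scheduling usually starts from: Young's first-order approximation, in which the checkpoint interval is sqrt(2 * C * MTBF) for checkpoint cost C and mean time between failures MTBF. The function names and the sample values (a 60-second checkpoint, a 24-hour MTBF) are illustrative assumptions, and nothing here should be read as the energy-optimized policies the authors propose.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order checkpoint interval, in seconds.

    Assumes exponentially distributed failures and a checkpoint cost that is
    small relative to the MTBF. This is a runtime-oriented baseline, not the
    paper's energy-optimized formula.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_overhead(interval_s: float, checkpoint_cost_s: float,
                         mtbf_s: float) -> float:
    """Approximate fraction of time lost to checkpointing and rework.

    First term: time spent writing one checkpoint per interval.
    Second term: expected recomputation, about half an interval per failure.
    Restart cost is ignored in this simplified model.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    cost_s = 60.0          # time to write one checkpoint
    mtbf_s = 24 * 3600.0   # mean time between failures
    interval = young_interval(cost_s, mtbf_s)
    overhead = first_order_overhead(interval, cost_s, mtbf_s)
    print(f"interval: {interval:.0f} s, overhead: {overhead:.2%}")
```

In this simplified model, checkpointing more often than the computed interval raises the first overhead term and checkpointing less often raises the second, which is the performance side of the tradeoff that the abstract extends to energy and I/O.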
Pages: 93 - 102
Number of pages: 10
Related papers
50 in total
  • [1] Understanding Practical Tradeoffs in HPC Checkpoint-Scheduling Policies
    El-Sayed, Nosayba
    Schroeder, Bianca
    IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2018, 15 (02) : 336 - 350
  • [2] POSTER: Energy-Performance Tradeoffs in Multilevel Checkpoint Strategies
    Gomez, Leonardo A. Bautista
    Balaprakash, Prasanna
    Bouguerra, Mohamed-Slim
    Wild, Stefan M.
    Cappello, Franck
    Hovland, Paul D.
    2014 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2014, : 278 - 279
  • [3] Prediction of Energy Consumption by Checkpoint/Restart in HPC
    Moran, M.
    Balladini, I
    Rexachs, D.
    Luque, E.
    IEEE ACCESS, 2019, 7 : 71791 - 71803
  • [4] Analysis of Checkpoint I/O Behavior
    Leon, Betzabeth
    Gomez-Sanchez, Pilar
    Franco, Daniel
    Rexachs, Dolores
    Luque, Emilio
    COMPUTATIONAL SCIENCE - ICCS 2020, PT I, 2020, 12137 : 191 - 205
  • [5] A Checkpoint of Research on Parallel I/O for High-Performance Computing
    Boito, Francieli Zanon
    Inacio, Eduardo C.
    Bez, Jean Luca
    Navaux, Philippe O. A.
    Dantas, Mario A. R.
    Denneulin, Yves
    ACM COMPUTING SURVEYS, 2018, 51 (02)
  • [6] Modeling and Analysis of Checkpoint I/O Operations
    Arunagiri, Sarala
    Daly, John T.
    Teller, Patricia J.
    ANALYTICAL AND STOCHASTIC MODELING TECHNIQUES AND APPLICATIONS, PROCEEDINGS, 2009, 5513 : 386 - +
  • [7] Energy-Performance Tradeoffs for HPC Applications on Low Power Processors
    Calore, Enrico
    Schifano, Sebastiano Fabio
    Tripiccione, Raffaele
    EURO-PAR 2015: PARALLEL PROCESSING WORKSHOPS, 2015, 9523 : 737 - 748
  • [8] Toward Understanding I/O Behavior in HPC Workflows
    Luettgau, Jakob
    Snyder, Shane
    Carns, Philip
    Wozniak, Justin M.
    Kunkel, Julian
    Ludwig, Thomas
    PROCEEDINGS OF 2018 IEEE/ACM 3RD JOINT INTERNATIONAL WORKSHOP ON PARALLEL DATA STORAGE & DATA INTENSIVE SCALABLE COMPUTING SYSTEMS (PDSW-DISCS), 2018, : 64 - 75
  • [9] A model of checkpoint behavior for applications that have I/O
    Leon, Betzabeth
    Mendez, Sandra
    Franco, Daniel
    Rexachs, Dolores
    Luque, Emilio
    JOURNAL OF SUPERCOMPUTING, 2022, 78 (13) : 15404 - 15436
  • [10] Energy-aware I/O Optimization for Checkpoint and Restart on a NAND Flash Memory System
    Saito, Takafumi
    Sato, Kento
    Sato, Hitoshi
    Matsuoka, Satoshi
    FTXS'13: PROCEEDINGS OF THE 3RD ACM WORKSHOP ON FAULT-TOLERANCE FOR HPC AT EXTREME SCALE, 2013, : 41 - 47