To Checkpoint or Not to Checkpoint: Understanding Energy-Performance-I/O Tradeoffs in HPC Checkpointing

Cited by: 0
Authors
El-Sayed, Nosayba [1]
Schroeder, Bianca [1]
Affiliations
[1] Univ Toronto, Dept Comp Sci, Toronto, ON, Canada
Source
2014 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER) | 2014
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
High-performance computing; Fault tolerance; Checkpoint/Restart; Energy-efficiency; Performance; INTERVAL;
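DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract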
As the scale of high-performance computing (HPC) clusters continues to grow, their increasing failure rates and energy consumption levels are emerging as two serious design concerns that are expected to become more challenging in future Exascale systems. Therefore, efficiently running systems at such large scales requires an in-depth understanding of the performance and energy costs associated with different fault tolerance techniques. The most commonly used fault tolerance method is checkpoint/restart. Over the years, checkpoint scheduling policies have traditionally been optimized and analyzed from a performance perspective. The energy profile of these policies, and how to optimize them for energy savings rather than performance, remains poorly understood. In this paper, we provide an extensive analysis of the energy/performance tradeoffs associated with an array of checkpoint scheduling policies, including policies that we propose as well as a few existing ones from the literature. We estimate the energy overhead for a given checkpointing policy, and provide simple formulas to optimize checkpoint scheduling for energy savings, with or without a bound on runtime. We then evaluate and compare the runtime-optimized and energy-optimized versions of the different methods using trace-driven simulations based on failure logs from 10 production HPC clusters. Our results show ample room for achieving high energy savings with a low runtime overhead when using non-constant (adaptive) checkpointing methods that exploit characteristics of HPC failures. We also analyze the impact of energy-optimized checkpointing on the storage subsystem, identify policies that are better suited to I/O savings, and study how to optimize for energy with a bound on I/O time.
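For readers unfamiliar with checkpoint scheduling, the sketch below illustrates the kind of runtime-optimized interval the abstract refers to. It is not taken from this paper: it uses the classical first-order approximation attributed to Young, in which the checkpoint interval is sqrt(2 × checkpoint cost × mean time between failures). The function names, the overhead model, and the example numbers are assumptions chosen for illustration only; the paper's own energy-optimized formulas are not reproduced in this record.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Classical first-order checkpoint interval (Young's approximation).

    Illustrative only: this is NOT the energy-optimized formula from the paper.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order runtime overhead per unit of useful work (assumed model).

    Sum of the time spent writing checkpoints and the expected rework after a
    failure, which on average loses half an interval of computation.
    """
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    cost = 300.0    # hypothetical: 5 minutes to write one checkpoint
    mtbf = 86400.0  # hypothetical: one failure per day on average
    tau = young_interval(cost, mtbf)
    print(f"interval: {tau:.0f} s")
    print(f"overhead: {overhead_fraction(tau, cost, mtbf):.4f}")
```

Under this simple model, checkpointing more often than the computed interval wastes time writing checkpoints, while checkpointing less often wastes time recomputing lost work. An energy-oriented variant would weight the two terms by their respective power draws, which is the kind of tradeoff the abstract describes.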
Pages: 93-102
Number of pages: 10
Related papers
50 in total
  • [21] Understanding I/O Performance Using I/O Skeletal Applications
    Logan, Jeremy
    Klasky, Scott
    Abbasi, Hasan
    Liu, Qing
    Ostrouchov, George
    Parashar, Manish
    Podhorszki, Norbert
    Tian, Yuan
    Wolf, Matthew
    EURO-PAR 2012 PARALLEL PROCESSING, 2012, 7484 : 77 - 88
  • [22] Understanding Parallel I/O Performance and Tuning
    Byna, Suren
    PROCEEDINGS OF THE FIFTH INTERNATIONAL WORKSHOP ON SYSTEMS AND NETWORK TELEMETRY AND ANALYTICS, SNTA 2022, 2022, : 1 - 2
  • [23] Parallel Performance and I/O Profiling of HPC RNA-Seq Applications
    Cruz, Lucas
    Coelho, Micaella
    Galheigo, Marcelo
    Carneiro, Andre
    Carvalho, Diego
    Gadelha, Luiz
    Boito, Francieli
    Navaux, Philippe
    Osthoff, Carla
    Ocana, Kary
    COMPUTACION Y SISTEMAS, 2022, 26 (04): : 1625 - 1633
  • [24] Hopes and Facts in Evaluating the Performance of HPC-I/O on a Cloud Environment
    Gomez-Sanchez, Pilar
    Mendez, Sandra
    Rexachs, Dolores
    Luque, Emilio
    JOURNAL OF COMPUTER SCIENCE & TECHNOLOGY, 2015, 15 (01): : 23 - 29
  • [25] Performance Evaluation and Modeling of HPC I/O on Non-Volatile Memory
    Liu, Wei
    Wu, Kai
    Liu, Jialin
    Chen, Feng
    Li, Dong
    2017 INTERNATIONAL CONFERENCE ON NETWORKING, ARCHITECTURE, AND STORAGE (NAS), 2017, : 41 - 50
  • [26] Does Varying BeeGFS Configuration Affect the I/O Performance of HPC Workloads?
    Borkar, Arnav
    Tony, Joel
    Vamsi, Hari K. N.
    Barman, Tushar
    Bhisikar, Yash
    Sreenath, T. M.
    Paul, Arnab K.
    2023 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING WORKSHOPS, CLUSTER WORKSHOPS, 2023, : 5 - 7
  • [27] AdapCK: Optimizing I/O for Checkpointing on Large-Scale High Performance Computing Systems
    Jia, Jie
    Liu, Yi
    Liu, Yanke
    Chen, Yifan
    Lin, Fang
    EURO-PAR 2024: PARALLEL PROCESSING, PT III, EURO-PAR 2024, 2024, 14803 : 342 - 355
  • [28] Parallel I/O Performance for Application-Level Checkpointing on the Blue Gene/P System
    Fu, Jing
    Min, Misun
    Latham, Robert
    Carothers, Christopher D.
    2011 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2011, : 465 - 473
  • [29] Optimization of checkpointing-related I/O for high-performance parallel and distributed computing
    Subramaniyan, Rajagopal
    Grobelny, Eric
    Studham, Scott
    George, Alan D.
    JOURNAL OF SUPERCOMPUTING, 2008, 46 (02): : 150 - 180