Using group replication for resilience on exascale systems

Cited by: 4
Authors
Bougeret, Marin [1]
Casanova, Henri [2]
Robert, Yves [3, 6]
Vivien, Frederic [4, 7]
Zaidouni, Dounia [5, 7]
Affiliations
[1] LIRMM Montpellier, Montpellier, France
[2] Univ Hawaii Manoa, Informat & Comp Sci Dept, Honolulu, HI 96822 USA
[3] Ecole Normale Super Lyon, Comp Sci Lab LIP, F-69364 Lyon 07, France
[4] Ecole Normale Super Lyon, INRIA, F-69364 Lyon 07, France
[5] Ecole Normale Super Lyon, Dept Comp Sci, F-69364 Lyon 07, France
[6] Univ Tennessee, Knoxville, TN USA
[7] INRIA, Paris, France
Keywords
Checkpointing; replication; exascale platforms; resilience
DOI
10.1177/1094342013505348
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
High performance computing applications must be resilient to faults. The traditional fault-tolerance solution is checkpoint-recovery, by which application state is saved to and recovered from secondary storage throughout execution. It has been shown that, even when using an optimal checkpointing strategy, the checkpointing overhead precludes high parallel efficiency at large scale. Additional fault-tolerance mechanisms must thus be used. Such a mechanism is replication, that is, multiple processors performing the same computation so that a processor failure does not necessarily imply an application failure. In spite of resource waste, replication can lead to higher parallel efficiency when compared to using only checkpoint-recovery at large scale. We propose to execute and checkpoint multiple application instances concurrently, an approach we term group replication. For exponential failures we give an upper bound on the expected application execution time. This bound corresponds to a particular checkpointing period that we derive. For general failures, we propose a dynamic programming algorithm to determine non-periodic checkpoint dates as well as an empirical periodic checkpointing solution whose period is found via a numerical search. Using simulation we evaluate our proposed approaches, including comparison to the non-replication case, for both exponential and Weibull failure distributions. Our broad finding is that group replication is useful in a range of realistic application and checkpointing overhead scenarios for future exascale platforms.
Pages: 210-224
Page count: 15
Related papers
50 in total
  • [31] Perspectives on Anomaly and Event Detection in Exascale Systems
    Iuhasz, Gabriel
    Petcu, Dana
    2019 IEEE 5TH INTL CONFERENCE ON BIG DATA SECURITY ON CLOUD (BIGDATASECURITY) / IEEE INTL CONFERENCE ON HIGH PERFORMANCE AND SMART COMPUTING (HPSC) / IEEE INTL CONFERENCE ON INTELLIGENT DATA AND SECURITY (IDS), 2019, : 225 - 229
  • [32] Updating the Energy Model for Future Exascale Systems
    Kogge, Peter M.
    HIGH PERFORMANCE COMPUTING, ISC HIGH PERFORMANCE 2015, 2015, 9137 : 323 - 339
  • [33] Energy Efficient Runtime Framework for Exascale Systems
    Mhedheb, Yousri
    Streit, Achim
    HIGH PERFORMANCE COMPUTING, ISC HIGH PERFORMANCE 2016 INTERNATIONAL WORKSHOPS, 2016, 9945 : 32 - 44
  • [34] HIPLZ: Enabling performance portability for exascale systems
    Zhao, Jisheng
    Bertoni, Colleen
    Young, Jeffrey
    Harms, Kevin
    Sarkar, Vivek
    Videau, Brice
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2023, 35 (25):
  • [35] Dynamic load balancing in distributed exascale computing systems
    Mirtaheri, Seyedeh Leili
    Grandinetti, Lucio
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2017, 20 (04) : 3677 - 3689
  • [36] Networking and communication challenges for post-exascale systems
    Panda, Dhabaleswar
    Lu, Xiao-Yi
    Subramoni, Hari
    Frontiers of Information Technology & Electronic Engineering, 2018, 19 : 1230 - 1235
  • [37] Networking and communication challenges for post-exascale systems
    Panda, Dhabaleswar
    Lu, Xiao-Yi
    Subramoni, Hari
    FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2018, 19 (10) : 1230 - 1235
  • [38] Revisiting Co-Scheduling for Upcoming ExaScale Systems
    Lankes, Stefan
    PROCEEDINGS OF THE 2015 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS 2015), 2015, : 669 - 670
  • [39] Rethinking Hardware-Software Codesign for Exascale Systems
    Shalf, John
    Quinlan, Dan
    Janssen, Curtis
    COMPUTER, 2011, 44 (11) : 22 - 30
  • [40] On the Use of Commodity Ethernet Technology in Exascale HPC Systems
    Benito, Mariano
    Vallejo, Enrique
    Beivide, Ramon
    2015 IEEE 22ND INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING (HIPC), 2015, : 254 - 263