Using group replication for resilience on exascale systems

Cited by: 4
Authors
Bougeret, Marin [1]
Casanova, Henri [2]
Robert, Yves [3, 6]
Vivien, Frederic [4, 7]
Zaidouni, Dounia [5, 7]
Affiliations
[1] LIRMM Montpellier, Montpellier, France
[2] Univ Hawaii Manoa, Informat & Comp Sci Dept, Honolulu, HI 96822 USA
[3] Ecole Normale Super Lyon, Comp Sci Lab LIP, F-69364 Lyon 07, France
[4] Ecole Normale Super Lyon, INRIA, F-69364 Lyon 07, France
[5] Ecole Normale Super Lyon, Dept Comp Sci, F-69364 Lyon 07, France
[6] Univ Tennessee, Knoxville, TN USA
[7] INRIA, Paris, France
Keywords
Checkpointing; replication; exascale platforms; resilience
DOI
10.1177/1094342013505348
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
High performance computing applications must be resilient to faults. The traditional fault-tolerance solution is checkpoint-recovery, by which application state is saved to and recovered from secondary storage throughout execution. It has been shown that, even when using an optimal checkpointing strategy, the checkpointing overhead precludes high parallel efficiency at large scale. Additional fault-tolerance mechanisms must thus be used. Such a mechanism is replication, that is, multiple processors performing the same computation so that a processor failure does not necessarily imply an application failure. In spite of resource waste, replication can lead to higher parallel efficiency when compared to using only checkpoint-recovery at large scale. We propose to execute and checkpoint multiple application instances concurrently, an approach we term group replication. For exponential failures we give an upper bound on the expected application execution time. This bound corresponds to a particular checkpointing period that we derive. For general failures, we propose a dynamic programming algorithm to determine non-periodic checkpoint dates as well as an empirical periodic checkpointing solution whose period is found via a numerical search. Using simulation we evaluate our proposed approaches, including comparison to the non-replication case, for both exponential and Weibull failure distributions. Our broad finding is that group replication is useful in a range of realistic application and checkpointing overhead scenarios for future exascale platforms.
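The abstract refers to a checkpointing period and to the benefit of replication without stating formulas. As a rough illustration only, the sketch below uses Young's classical first-order approximation for the period (this is not the period derived in the paper) together with a simplified model in which a chunk of work succeeds if at least one of several independent groups completes it without a failure. All function names and numeric values are hypothetical.

```python
import math


def young_period(checkpoint_cost, platform_mtbf):
    # Young's first-order approximation: sqrt(2 * C * MTBF).
    # This is the classical formula, NOT the period derived in the paper.
    return math.sqrt(2.0 * checkpoint_cost * platform_mtbf)


def chunk_success_probability(chunk_length, group_mtbf, num_groups):
    # Assumption: each group fails independently with exponentially
    # distributed inter-failure times of mean group_mtbf, and a chunk
    # succeeds if at least one group runs it without failure.
    p_one_group = math.exp(-chunk_length / group_mtbf)
    return 1.0 - (1.0 - p_one_group) ** num_groups


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    processor_mtbf = 10 * 365 * 24 * 3600.0  # 10 years, in seconds
    num_processors = 2 ** 20
    checkpoint_cost = 600.0                  # seconds

    for num_groups in (1, 2, 3):
        # Each group uses num_processors / num_groups processors,
        # so its MTBF grows linearly with the number of groups.
        group_mtbf = processor_mtbf * num_groups / num_processors
        period = young_period(checkpoint_cost, group_mtbf)
        p_ok = chunk_success_probability(
            period + checkpoint_cost, group_mtbf, num_groups
        )
        print(num_groups, round(period), round(p_ok, 3))
```

This sketch ignores the fact that smaller groups compute each chunk more slowly, which is the resource waste the abstract mentions; it only shows why a chunk is more likely to survive when several groups attempt it.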
Pages: 210-224
Page count: 15
Related papers
50 in total
  • [21] Exascale Storage Systems the SIRIUS Way
    Klasky, S. A.
    Abbasi, H.
    Ainsworth, M.
    Choi, J.
    Curry, M.
    Kurc, T.
    Liu, Q.
    Lofstead, J.
    Maltzahn, C.
    Parashar, M.
    Podhorszki, N.
    Suchyta, E.
    Wang, F.
    Wolf, M.
    Chang, C. S.
    Churchill, M.
    Ethier, S.
    XXVII IUPAP CONFERENCE ON COMPUTATIONAL PHYSICS (CCP2015), 2016, 759
  • [22] Improving Network Services' Resilience using Independent Configuration Replication
    Lopes, Miguel
    Costa, Antonio
    Dias, Bruno
    2013 IFIP/IEEE INTERNATIONAL SYMPOSIUM ON INTEGRATED NETWORK MANAGEMENT (IM 2013), 2013, : 1389 - 1392
  • [23] Matrices Over Runtime Systems @ Exascale
    Agullo, Emmanuel
    Bosilca, George
    Bramas, Berenger
    Castagnede, Cedric
    Coulaud, Olivier
    Darve, Eric
    Dongarra, Jack
    Faverge, Mathieu
    Furmento, Nathalie
    Giraud, Luc
    Lacoste, Xavier
    Langou, Julien
    Ltaief, Hatem
    Messner, Matthias
    Namyst, Raymond
    Ramet, Pierre
    Takahashi, Toru
    Thibault, Samuel
    Tomov, Stanimire
    Yamazaki, Ichitaro
    2012 SC COMPANION: HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SCC), 2012, : 1330 - +
  • [24] Using Adaptive Message Replication on Improving Control Resilience of SDN
    Tsai, Pang-Wei
    Fong, Wai-Hong
    Chang, Wu-Hsien
    Yang, Chu-Sing
    JOURNAL OF INTERNET TECHNOLOGY, 2018, 19 (07): : 2162 - 2174
  • [25] A Case for Criticality Models in Exascale Systems
    Kocoloski, Brian
    Piga, Leonardo
    Huang, Wei
    Paul, Indrani
    Lange, John
    2016 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2016, : 213 - 216
  • [26] Group Decision Support Systems for Emergency Management and Resilience: CoastalProtectSIM
    Zhao, Xiaoyi
    Chen, Yumei
    Ku, Mingyoung
    Rich, Eliot
    Deegan, Michael
    Luna-Reyes, Luis F.
    PROCEEDINGS OF THE 50TH ANNUAL HAWAII INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES, 2017, : 2489 - 2497
  • [27] Exascale Computational Fluid Dynamics in Heterogeneous Systems
    Trebotich, David
    JOURNAL OF FLUIDS ENGINEERING-TRANSACTIONS OF THE ASME, 2024, 146 (04):
  • [28] Exploring Reliability of Exascale Systems through Simulations
    Zhao, Dongfang
    Zhang, Da
    Wang, Ke
    Raicu, Ioan
    HIGH PERFORMANCE COMPUTING SYMPOSIUM 2013 (HPC 2013) - 2013 SPRING SIMULATION MULTI-CONFERENCE (SPRINGSIM'13), 2013, 45 (06): : 1 - 9
  • [29] The Role of Photonics in Future Exascale Data Systems
    Ben Yoo, S. J.
    2016 21ST OPTOELECTRONICS AND COMMUNICATIONS CONFERENCE (OECC) HELD JOINTLY WITH 2016 INTERNATIONAL CONFERENCE ON PHOTONICS IN SWITCHING (PS), 2016,
  • [30] FusedOS: A Hybrid Approach to Exascale Operating Systems
    Park, Yoonho
    Van Hensbergen, Eric
    Hillenbrand, Marius
    Inglett, Todd
    Rosenburg, Bryan
    Ryu, Kyung Dong
    Wisniewski, Robert
    2012 SC COMPANION: HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SCC), 2012, : 1416 - 1416