Using group replication for resilience on exascale systems

Cited by: 4
Authors
Bougeret, Marin [1]
Casanova, Henri [2]
Robert, Yves [3, 6]
Vivien, Frederic [4, 7]
Zaidouni, Dounia [5, 7]
Affiliations
[1] LIRMM Montpellier, Montpellier, France
[2] Univ Hawaii Manoa, Informat & Comp Sci Dept, Honolulu, HI 96822 USA
[3] Ecole Normale Super Lyon, Comp Sci Lab LIP, F-69364 Lyon 07, France
[4] Ecole Normale Super Lyon, INRIA, F-69364 Lyon 07, France
[5] Ecole Normale Super Lyon, Dept Comp Sci, F-69364 Lyon 07, France
[6] Univ Tennessee, Knoxville, TN USA
[7] INRIA, Paris, France
Keywords
Checkpointing; replication; exascale platforms; resilience
DOI
10.1177/1094342013505348
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
High-performance computing applications must be resilient to faults. The traditional fault-tolerance solution is checkpoint-recovery, by which application state is saved to and recovered from secondary storage throughout execution. It has been shown that, even when using an optimal checkpointing strategy, the checkpointing overhead precludes high parallel efficiency at large scale. Additional fault-tolerance mechanisms must thus be used. One such mechanism is replication, that is, multiple processors performing the same computation so that a processor failure does not necessarily imply an application failure. Despite the resources it wastes, replication can lead to higher parallel efficiency at large scale than checkpoint-recovery alone. We propose to execute and checkpoint multiple application instances concurrently, an approach we term group replication. For exponential failures, we give an upper bound on the expected application execution time. This bound corresponds to a particular checkpointing period that we derive. For general failures, we propose a dynamic programming algorithm to determine non-periodic checkpoint dates, as well as an empirical periodic checkpointing solution whose period is found via a numerical search. Using simulation, we evaluate our proposed approaches, including comparison to the non-replication case, for both exponential and Weibull failure distributions. Our broad finding is that group replication is useful in a range of realistic application and checkpointing overhead scenarios for future exascale platforms.
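The intuition behind combining periodic checkpointing with replicated groups can be illustrated with a minimal Python sketch. It is not the paper's derivation: the period below is Young's classic first-order approximation, not the period the authors derive, and the success probability assumes that groups fail independently with exponentially distributed failures. The function names and the parameters `checkpoint_cost`, `mtbf`, `group_mtbf`, and `groups` are hypothetical, chosen only for this illustration.

```python
import math


def young_period(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the checkpointing period.

    Assumption: this is the classic sqrt(2 * C * MTBF) rule of thumb,
    used here as a stand-in; it is NOT the period derived in the paper.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def chunk_success_probability(work: float, checkpoint_cost: float,
                              group_mtbf: float, groups: int) -> float:
    """Probability that at least one group completes a chunk without failing.

    Assumptions (illustrative only): each group suffers exponential
    failures with mean time between failures `group_mtbf`, groups fail
    independently, and a chunk lasts `work + checkpoint_cost` time units.
    """
    # Probability that a single group survives the whole chunk.
    p_one = math.exp(-(work + checkpoint_cost) / group_mtbf)
    # The chunk is lost only if every group fails during it.
    return 1.0 - (1.0 - p_one) ** groups
```

Under these assumptions, adding groups raises the chance that a chunk is completed and checkpointed by at least one instance, at the cost of the processors spent on redundant work. That is the trade-off the abstract describes between resource waste and parallel efficiency.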
Pages: 210-224
Page count: 15