Using group replication for resilience on exascale systems

Cited by: 4
Authors
Bougeret, Marin [1]
Casanova, Henri [2]
Robert, Yves [3, 6]
Vivien, Frederic [4, 7]
Zaidouni, Dounia [5, 7]
Affiliations
[1] LIRMM Montpellier, Montpellier, France
[2] Univ Hawaii Manoa, Informat & Comp Sci Dept, Honolulu, HI 96822 USA
[3] Ecole Normale Super Lyon, Comp Sci Lab LIP, F-69364 Lyon 07, France
[4] Ecole Normale Super Lyon, INRIA, F-69364 Lyon 07, France
[5] Ecole Normale Super Lyon, Dept Comp Sci, F-69364 Lyon 07, France
[6] Univ Tennessee, Knoxville, TN USA
[7] INRIA, Paris, France
Keywords
Checkpointing; replication; exascale platforms; resilience
DOI
10.1177/1094342013505348
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
High performance computing applications must be resilient to faults. The traditional fault-tolerance solution is checkpoint-recovery, by which application state is saved to and recovered from secondary storage throughout execution. It has been shown that, even when using an optimal checkpointing strategy, the checkpointing overhead precludes high parallel efficiency at large scale. Additional fault-tolerance mechanisms must thus be used. Such a mechanism is replication, that is, multiple processors performing the same computation so that a processor failure does not necessarily imply an application failure. In spite of resource waste, replication can lead to higher parallel efficiency when compared to using only checkpoint-recovery at large scale. We propose to execute and checkpoint multiple application instances concurrently, an approach we term group replication. For exponential failures we give an upper bound on the expected application execution time. This bound corresponds to a particular checkpointing period that we derive. For general failures, we propose a dynamic programming algorithm to determine non-periodic checkpoint dates as well as an empirical periodic checkpointing solution whose period is found via a numerical search. Using simulation we evaluate our proposed approaches, including comparison to the non-replication case, for both exponential and Weibull failure distributions. Our broad finding is that group replication is useful in a range of realistic application and checkpointing overhead scenarios for future exascale platforms.
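The trade-off described in the abstract can be explored with a small Monte Carlo sketch. This is not the authors' model or their derived checkpointing period: it assumes exponential failures, uses Young's classical first-order period sqrt(2 * C * MTBF) as a stand-in, and treats group replication in a deliberately simplified way (all groups run the same chunk in lockstep, and the chunk succeeds if at least one group survives it). The names `young_period`, `simulate_makespan`, and all parameters are hypothetical.

```python
import math
import random


def young_period(checkpoint_cost, mtbf):
    """Young's first-order approximation of the checkpointing period.

    Used here as a stand-in; it is NOT the period derived in the paper.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def simulate_makespan(work, period, checkpoint_cost, recovery_cost,
                      group_mtbf, n_groups=1, rng=None):
    """Simulate one execution and return its total wall-clock time.

    Simplifying assumptions (not taken from the paper):
    - every group executes the same chunk in lockstep;
    - a chunk succeeds if at least one group survives the chunk
      plus its checkpoint;
    - if all groups fail, the chunk is retried after the last failure
      plus one recovery; failures during recovery are ignored.
    """
    if rng is None:
        rng = random.Random()
    elapsed = 0.0
    remaining = work
    while remaining > 0:
        chunk = min(period, remaining)
        attempt = chunk + checkpoint_cost
        # Exponential failures are memoryless, so failure times can be
        # redrawn independently at the start of every attempt.
        failures = [rng.expovariate(1.0 / group_mtbf) for _ in range(n_groups)]
        if max(failures) >= attempt:
            # At least one group finished the chunk and checkpointed.
            elapsed += attempt
            remaining -= chunk
        else:
            # Every group failed: wait for the last failure, then recover.
            elapsed += max(failures) + recovery_cost
    return elapsed
```

Averaging `simulate_makespan` over many runs for `n_groups=1` and `n_groups=2` gives a rough feel for when replication pays off. In a fairer comparison each group would own only a fraction of the processors, so the caller would have to lengthen `work` and raise `group_mtbf` accordingly; that scaling is left out of this sketch.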
Pages: 210-224
Page count: 15