Using group replication for resilience on exascale systems

Cited by: 4

Authors:

Bougeret, Marin ^{[1]}

Casanova, Henri ^{[2]}

Robert, Yves ^{[3,6]}

Vivien, Frederic ^{[4,7]}

Zaidouni, Dounia ^{[5,7]}

Affiliations:

[1] LIRMM Montpellier, Montpellier, France

[2] Univ Hawaii Manoa, Informat & Comp Sci Dept, Honolulu, HI 96822 USA

[3] Ecole Normale Super Lyon, Comp Sci Lab LIP, F-69364 Lyon 07, France

[4] Ecole Normale Super Lyon, INRIA, F-69364 Lyon 07, France

[5] Ecole Normale Super Lyon, Dept Comp Sci, F-69364 Lyon 07, France

[6] Univ Tennessee, Knoxville, TN USA

[7] INRIA, Paris, France

Source:

INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS | 2014 / Vol. 28 / Issue 02

Keywords:

Checkpointing; replication; exascale platforms; resilience

DOI:

10.1177/1094342013505348

Chinese Library Classification:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

High performance computing applications must be resilient to faults. The traditional fault-tolerance solution is checkpoint-recovery, by which application state is saved to and recovered from secondary storage throughout execution. It has been shown that, even when using an optimal checkpointing strategy, the checkpointing overhead precludes high parallel efficiency at large scale. Additional fault-tolerance mechanisms must thus be used. Such a mechanism is replication, that is, multiple processors performing the same computation so that a processor failure does not necessarily imply an application failure. In spite of resource waste, replication can lead to higher parallel efficiency when compared to using only checkpoint-recovery at large scale. We propose to execute and checkpoint multiple application instances concurrently, an approach we term group replication. For exponential failures we give an upper bound on the expected application execution time. This bound corresponds to a particular checkpointing period that we derive. For general failures, we propose a dynamic programming algorithm to determine non-periodic checkpoint dates as well as an empirical periodic checkpointing solution whose period is found via a numerical search. Using simulation we evaluate our proposed approaches, including comparison to the non-replication case, for both exponential and Weibull failure distributions. Our broad finding is that group replication is useful in a range of realistic application and checkpointing overhead scenarios for future exascale platforms.


Pages: 210-224

Page count: 15
