Using group replication for resilience on exascale systems

Cited by: 4

Authors:

Bougeret, Marin ^{[1]}

Casanova, Henri ^{[2]}

Robert, Yves ^{[3,6]}

Vivien, Frederic ^{[4,7]}

Zaidouni, Dounia ^{[5,7]}

Affiliations:

[1] LIRMM Montpellier, Montpellier, France

[2] Univ Hawaii Manoa, Informat & Comp Sci Dept, Honolulu, HI 96822 USA

[3] Ecole Normale Super Lyon, Comp Sci Lab LIP, F-69364 Lyon 07, France

[4] Ecole Normale Super Lyon, INRIA, F-69364 Lyon 07, France

[5] Ecole Normale Super Lyon, Dept Comp Sci, F-69364 Lyon 07, France

[6] Univ Tennessee, Knoxville, TN USA

[7] INRIA, Paris, France

Source:

INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS | 2014 / Vol. 28 / No. 2

Keywords:

Checkpointing; replication; exascale platforms; resilience

DOI:

10.1177/1094342013505348

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

High performance computing applications must be resilient to faults. The traditional fault-tolerance solution is checkpoint-recovery, by which application state is saved to and recovered from secondary storage throughout execution. It has been shown that, even when using an optimal checkpointing strategy, the checkpointing overhead precludes high parallel efficiency at large scale. Additional fault-tolerance mechanisms must thus be used. Such a mechanism is replication, that is, multiple processors performing the same computation so that a processor failure does not necessarily imply an application failure. In spite of resource waste, replication can lead to higher parallel efficiency when compared to using only checkpoint-recovery at large scale. We propose to execute and checkpoint multiple application instances concurrently, an approach we term group replication. For exponential failures we give an upper bound on the expected application execution time. This bound corresponds to a particular checkpointing period that we derive. For general failures, we propose a dynamic programming algorithm to determine non-periodic checkpoint dates as well as an empirical periodic checkpointing solution whose period is found via a numerical search. Using simulation we evaluate our proposed approaches, including comparison to the non-replication case, for both exponential and Weibull failure distributions. Our broad finding is that group replication is useful in a range of realistic application and checkpointing overhead scenarios for future exascale platforms.


Pages: 210-224

Number of pages: 15

50 items in total

[31] Perspectives on Anomaly and Event Detection in Exascale Systems
Iuhasz, Gabriel
Petcu, Dana
2019 IEEE 5TH INTL CONFERENCE ON BIG DATA SECURITY ON CLOUD (BIGDATASECURITY) / IEEE INTL CONFERENCE ON HIGH PERFORMANCE AND SMART COMPUTING (HPSC) / IEEE INTL CONFERENCE ON INTELLIGENT DATA AND SECURITY (IDS), 2019, : 225 - 229
[32] Updating the Energy Model for Future Exascale Systems
Kogge, Peter M.
HIGH PERFORMANCE COMPUTING, ISC HIGH PERFORMANCE 2015, 2015, 9137 : 323 - 339
[33] Energy Efficient Runtime Framework for Exascale Systems
Mhedheb, Yousri
Streit, Achim
HIGH PERFORMANCE COMPUTING, ISC HIGH PERFORMANCE 2016 INTERNATIONAL WORKSHOPS, 2016, 9945 : 32 - 44
[34] HIPLZ: Enabling performance portability for exascale systems
Zhao, Jisheng
Bertoni, Colleen
Young, Jeffrey
Harms, Kevin
Sarkar, Vivek
Videau, Brice
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2023, 35 (25):
[35] Dynamic load balancing in distributed exascale computing systems
Mirtaheri, Seyedeh Leili
Grandinetti, Lucio
CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2017, 20 (04) : 3677 - 3689
[36] Networking and communication challenges for post-exascale systems
Dhabaleswar Panda
Xiao-Yi Lu
Hari Subramoni
Frontiers of Information Technology & Electronic Engineering, 2018, 19 : 1230 - 1235
[37] Networking and communication challenges for post-exascale systems
Panda, Dhabaleswar
Lu, Xiao-Yi
Subramoni, Hari
FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING, 2018, 19 (10) : 1230 - 1235
[38] Revisiting Co-Scheduling for Upcoming ExaScale Systems
Lankes, Stefan
PROCEEDINGS OF THE 2015 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS 2015), 2015, : 669 - 670
[39] Rethinking Hardware-Software Codesign for Exascale Systems
Shalf, John
Quinlan, Dan
Janssen, Curtis
COMPUTER, 2011, 44 (11) : 22 - 30
[40] On the Use of Commodity Ethernet Technology in Exascale HPC Systems
Benito, Mariano
Vallejo, Enrique
Beivide, Ramon
2015 IEEE 22ND INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING (HIPC), 2015, : 254 - 263
