A Comparison of Application-Level Fault Tolerance Schemes for Task Pools

Cited by: 8

Authors:

Posner, Jonas [1]

Reitz, Lukas [1]

Fohry, Claudia [1]

Affiliations:

[1] Univ Kassel, Res Grp Programming Languages Methodol, Kassel, Germany

Source:

FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2020 / Volume 105

Keywords:

HPC programming languages; Libraries and tools; DESIGN;

DOI:

10.1016/j.future.2019.11.031

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Subject classification code:

081202 ;

Abstract:

Fault tolerance is an important requirement for successful program execution on exascale systems. The common approach, checkpointing, regularly saves a program's state, such that the execution can be restarted after permanent node failures. Checkpointing is often performed on system level, but its deployment on application level can reduce the running time overhead. The drawback of application-level checkpointing is a higher programming expense. It pays off if the checkpointing is applied to reusable patterns. We consider task pools, which exist in many variants. The paper supposes that tasks are generated dynamically and are free of side effects. Further, the final result must be computed from individual task results by reduction. Moreover, the pools must be distributed with private queues, and adopt work stealing. The paper describes and evaluates three application-level fault tolerance schemes for task pools. All use uncoordinated checkpointing and regularly save information in a resilient store. The first scheme (called AllFT) saves descriptors of all open tasks; the second scheme (called IncFT) selectively and incrementally saves only part of them; and the third scheme (called LogFT) logs stealing events and writes checkpoints in parallel to task processing. All schemes have been implemented by extending the Global Load Balancing (GLB) library of the "APGAS for Java" programming system. In experiments with the UTS, NQueens, and BC benchmarks with up to 672 workers, the running time overhead during failure-free execution, compared to a non-resilient version of GLB, was typically below 6%. The recovery cost was negligible, and there was no clear winner among the three schemes. A more detailed performance analysis with synthetic benchmarks revealed that IncFT and LogFT are superior in scenarios with large task descriptors. (C) 2019 Elsevier B.V. All rights reserved.


Pages: 119-134

Page count: 16

