A Comparison of Application-Level Fault Tolerance Schemes for Task Pools

Cited by: 8
Authors
Posner, Jonas [1 ]
Reitz, Lukas [1 ]
Fohry, Claudia [1 ]
Affiliations
[1] Univ Kassel, Res Grp Programming Languages Methodol, Kassel, Germany
Keywords
HPC programming languages; Libraries and tools; DESIGN;
DOI
10.1016/j.future.2019.11.031
Chinese Library Classification number
TP301 [Theory, Methods];
Subject classification code
081202 ;
Abstract
Fault tolerance is an important requirement for successful program execution on exascale systems. The common approach, checkpointing, regularly saves a program's state, such that the execution can be restarted after permanent node failures. Checkpointing is often performed on system level, but its deployment on application level can reduce the running time overhead. The drawback of application-level checkpointing is a higher programming expense. It pays off if the checkpointing is applied to reusable patterns. We consider task pools, which exist in many variants. The paper supposes that tasks are generated dynamically and are free of side effects. Further, the final result must be computed from individual task results by reduction. Moreover, the pools must be distributed with private queues, and adopt work stealing. The paper describes and evaluates three application-level fault tolerance schemes for task pools. All use uncoordinated checkpointing and regularly save information in a resilient store. The first scheme (called AllFT) saves descriptors of all open tasks; the second scheme (called IncFT) selectively and incrementally saves only part of them; and the third scheme (called LogFT) logs stealing events and writes checkpoints in parallel to task processing. All schemes have been implemented by extending the Global Load Balancing (GLB) library of the "APGAS for Java" programming system. In experiments with the UTS, NQueens, and BC benchmarks with up to 672 workers, the running time overhead during failure-free execution, compared to a non-resilient version of GLB, was typically below 6%. The recovery cost was negligible, and there was no clear winner among the three schemes. A more detailed performance analysis with synthetic benchmarks revealed that IncFT and LogFT are superior in scenarios with large task descriptors. (C) 2019 Elsevier B.V. All rights reserved.
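The checkpoint-and-restart idea behind the first scheme can be illustrated with a minimal, single-worker sketch. It is not the GLB or "APGAS for Java" API: the class names (`AllFtSketch`, `ResilientStore`, `Checkpoint`, `Worker`), the in-memory store, the synthetic binary-tree workload, and the simulated failure are all assumptions made for illustration, and work stealing between workers is left out. The sketch assumes, as the abstract does, that tasks are generated dynamically, are free of side effects, and contribute to the final result by reduction (here, a sum).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class AllFtSketch {
    /** Snapshot of a worker: descriptors of all open tasks plus its partial result. */
    static final class Checkpoint {
        final List<Integer> openTasks;
        final long partialResult;
        Checkpoint(List<Integer> openTasks, long partialResult) {
            this.openTasks = openTasks;
            this.partialResult = partialResult;
        }
    }

    /** Hypothetical stand-in for a resilient store; here just an in-memory map. */
    static final class ResilientStore {
        private final Map<Integer, Checkpoint> data = new ConcurrentHashMap<>();
        void save(int workerId, Checkpoint c) { data.put(workerId, c); }
        Checkpoint load(int workerId) { return data.get(workerId); }
    }

    /** A worker with a private queue; a task descriptor is a node depth in a binary tree. */
    static final class Worker {
        final Deque<Integer> queue = new ArrayDeque<>();
        final int maxDepth;
        long result; // partial result, combined by summation

        Worker(int maxDepth) { this.maxDepth = maxDepth; }

        Checkpoint snapshot() {
            // Copy the queue so that later processing cannot alter the saved state.
            return new Checkpoint(new ArrayList<>(queue), result);
        }

        /** Returns false if the simulated failure hits after failAfter processed tasks. */
        boolean process(int id, ResilientStore store, int interval, long failAfter) {
            long done = 0;
            while (!queue.isEmpty()) {
                if (done == failAfter) return false;   // simulated permanent failure
                int depth = queue.pollLast();
                result += 1;                           // each task contributes 1 (node count)
                if (depth < maxDepth) {                // tasks are generated dynamically
                    queue.addLast(depth + 1);
                    queue.addLast(depth + 1);
                }
                done++;
                if (done % interval == 0) store.save(id, snapshot());
            }
            return true;
        }
    }

    /** Counts the nodes of a complete binary tree, restarting once from a checkpoint on failure. */
    static long countNodes(int maxDepth, int interval, long failAfter) {
        ResilientStore store = new ResilientStore();
        Worker w = new Worker(maxDepth);
        w.queue.addLast(0);
        store.save(0, w.snapshot());                   // initial checkpoint
        if (!w.process(0, store, interval, failAfter)) {
            Checkpoint c = store.load(0);              // state of the lost worker
            w = new Worker(maxDepth);
            w.queue.addAll(c.openTasks);
            w.result = c.partialResult;
            w.process(0, store, interval, Long.MAX_VALUE);
        }
        return w.result;
    }
}
```

Because the open tasks and the partial result are saved together, work done after the last checkpoint is simply repeated after a restart, and the final sum is the same with or without the simulated failure. Saving only part of the open tasks incrementally, or logging steal events, as the second and third schemes do, would change what `snapshot()` stores; that is not modelled here.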
Pages: 119 - 134
Number of pages: 16