A different re-execution speed can help

Cited by: 3
Authors
Benoit, Anne [1 ,2 ]
Cavelan, Aurelien [1 ,2 ]
Le Fevre, Valentin [1 ,2 ]
Robert, Yves [1 ,2 ,3 ]
Sun, Hongyang [1 ,2 ]
Affiliations
[1] Ecole Normale Super Lyon, Lyon, France
[2] Inria, Rennes, France
[3] Univ Tennessee, Knoxville, TN 37996 USA
Keywords
resilience; silent errors; speeds; re-execution; checkpointing; verification
DOI
10.1109/ICPPW.2016.45
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
We consider divisible load scientific applications executing on large-scale platforms subject to silent errors. While the goal is usually to complete the execution as fast as possible in expectation, another major concern is energy consumption. The use of dynamic voltage and frequency scaling (DVFS) can help save energy, but at the price of performance degradation. Consider the execution model where a set of K different speeds is given, and whenever a failure occurs, a different re-execution speed may be used. Can this help? We address the following bi-criteria problem: how to compute the optimal checkpointing period to minimize energy consumption while bounding the degradation in performance. We solve this bi-criteria problem by providing a closed-form solution for the checkpointing period, and demonstrate via a comprehensive set of simulations that a different re-execution speed can indeed help.
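The bi-criteria question in the abstract can be explored numerically with a small sketch. This is not the paper's closed-form solution. It is a brute-force search under assumptions that are all ours: silent errors arrive as a Poisson process with rate `lam` during computation only, a verification of cost `V` (at unit speed) ends every attempt, checkpoint and recovery take fixed times `C` and `R`, and power is modeled as `p_idle + kappa * speed**3` while computing and `p_idle + p_io` during I/O. The function names and all parameter values are hypothetical.

```python
import math


def pattern_costs(W, s1, s2, *, lam, V, C, R, p_idle, kappa, p_io):
    """Expected (time, energy) of one pattern: W units of work, a verification,
    and a checkpoint. The first attempt runs at speed s1; every re-execution
    after a detected silent error runs at speed s2. Assumed model, see lead-in."""
    def power(s):
        return p_idle + kappa * s ** 3       # assumed cubic dynamic power

    io_power = p_idle + p_io                 # assumed power during checkpoint/recovery
    a1 = (W + V) / s1                        # duration of the first attempt
    a2 = (W + V) / s2                        # duration of each re-execution
    p1 = 1.0 - math.exp(-lam * W / s1)       # error probability, first attempt
    p2 = 1.0 - math.exp(-lam * W / s2)       # error probability, each re-execution

    # Re-execution loop: T2 = a2 + p2 * (R + T2), solved for T2 (same for energy).
    t2 = (a2 + p2 * R) / (1.0 - p2)
    e2 = (a2 * power(s2) + p2 * R * io_power) / (1.0 - p2)

    time = a1 + p1 * (R + t2) + C
    energy = a1 * power(s1) + p1 * (R * io_power + e2) + C * io_power
    return time, energy


def best_period(s1, s2, rho, **params):
    """Grid search over W: minimize expected energy per unit of work subject to
    expected time per unit of work <= rho. Returns (energy_per_work, W) or None."""
    best = None
    for i in range(1, 2001):
        W = 10.0 * i
        t, e = pattern_costs(W, s1, s2, **params)
        if t / W <= rho and (best is None or e / W < best[0]):
            best = (e / W, W)
    return best


def best_speed_pair(speeds, rho, same_speed_only=False, **params):
    """Try every (s1, s2) pair; returns (energy_per_work, s1, s2, W) or None."""
    best = None
    for s1 in speeds:
        for s2 in speeds:
            if same_speed_only and s1 != s2:
                continue
            cand = best_period(s1, s2, rho, **params)
            if cand is not None and (best is None or cand[0] < best[0]):
                best = (cand[0], s1, s2, cand[1])
    return best


# Hypothetical parameter values, chosen only for illustration.
params = dict(lam=1e-4, V=10.0, C=60.0, R=60.0, p_idle=60.0, kappa=100.0, p_io=20.0)
speeds = [0.4, 0.6, 0.8, 1.0]

print("one speed :", best_speed_pair(speeds, 1.5, same_speed_only=True, **params))
print("two speeds:", best_speed_pair(speeds, 1.5, **params))
```

Because the two-speed search space contains every single-speed configuration, its result can never be worse. Whether it is strictly better depends on the parameters, which is the question the paper addresses analytically and through simulations.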
Pages: 250-257
Number of pages: 8
Related papers
  • [1] INCREMENTAL RE-EXECUTION OF PROGRAMS
    KARINTHI, RR
    WEISER, M
    SIGPLAN NOTICES, 1987, 22 (07): : 38 - 44
  • [2] Scalable selective re-execution for EDGE architectures
    Desikan, R
    Sethumadhavan, S
    Burger, D
    Keckler, SW
    ACM SIGPLAN NOTICES, 2004, 39 (11) : 120 - 132
  • [3] Morty: Scaling Concurrency Control with Re-Execution
    Burke, Matthew
    Suri-Payer, Florian
    Helt, Jeffrey
    Alvisi, Lorenzo
    Crooks, Natacha
    PROCEEDINGS OF THE EIGHTEENTH EUROPEAN CONFERENCE ON COMPUTER SYSTEMS, EUROSYS 2023, 2023, : 687 - 702
  • [4] Fault tolerance through re-execution in multiscalar architecture
    Rashid, F
    Saluja, KK
    Ramanathan, P
    DSN 2000: INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS, PROCEEDINGS, 2000, : 482 - 491
  • [5] Analysis of Local Re-execution in Mobile Offloading System
    Wang, Qiushi
    Jorba, Marti Griera
    Ripoll, Joan Martinez
    Wolter, Katinka
    2013 IEEE 24TH INTERNATIONAL SYMPOSIUM ON SOFTWARE RELIABILITY ENGINEERING (ISSRE), 2013, : 31 - 40
  • [6] Backtracking and re-execution in the automatic debugging of parallelized programs
    Matthews, G
    Hood, R
    Johnson, S
    Leggett, P
    11TH IEEE INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE DISTRIBUTED COMPUTING, PROCEEDINGS, 2002, : 150 - 160
  • [7] Impacts of Task Re-Execution Policy on MapReduce Jobs
    Lin, Jia-Chun
    Leu, Fang-Yie
    Chen, Ying-ping
    COMPUTER JOURNAL, 2016, 59 (05): : 701 - 714
  • [8] REPAIR: Hard-Error Recovery via Re-Execution
    Soman, Jyothish
    Miralaei, Negar
    Mycroft, Alan
    Jones, Timothy M.
    PROCEEDINGS OF THE 2015 IEEE INTERNATIONAL SYMPOSIUM ON DEFECT AND FAULT TOLERANCE IN VLSI AND NANOTECHNOLOGY SYSTEMS (DFTS), 2015, : 76 - 79
  • [9] The Efficient Server Audit Problem, Deduplicated Re-execution, and the Web
    Tan, Cheng
    Yu, Lingfan
    Leners, Joshua B.
    Walfish, Michael
    PROCEEDINGS OF THE TWENTY-SIXTH ACM SYMPOSIUM ON OPERATING SYSTEMS PRINCIPLES (SOSP '17), 2017, : 546 - 564
  • [10] Finding missing synchronization in a distributed computation using controlled re-execution
    Neeraj Mittal
    Vijay K. Garg
    Distributed Computing, 2004, 17 : 107 - 130