Algorithm-Based Fault Tolerance for Fail-Stop Failures

Cited by: 61
Authors
Chen, Zizhong [1]
Dongarra, Jack [2]
Affiliations
[1] Colorado Sch Mines, Dept Math & Comp Sci, Golden, CO 80401 USA
[2] Univ Tennessee, Dept Elect Engn & Comp Sci, Knoxville, TN 37996 USA
Keywords
Algorithm-based fault tolerance; checkpointing; fail-stop failures; parallel matrix-matrix multiplication; ScaLAPACK
DOI
10.1109/TPDS.2008.58
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in the ScaLAPACK matrix-matrix multiplication kernel can be tolerated without checkpointing or message logging. Previous algorithm-based fault tolerance research has proved that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no matter which algorithm is chosen. From this checksum relationship in the final computation results, processor miscalculations can be detected, located, and corrected at the end of the computation. However, whether this checksum relationship can be maintained in the middle of the computation has remained an open question. In this paper, we first demonstrate that, for many matrix-matrix multiplication algorithms, the checksum relationship in the input checksum matrices is not maintained in the middle of the computation. We then prove that, for the outer-product version of the matrix-matrix multiplication algorithm, the checksum relationship can be maintained in the middle of the computation. Based on this checksum relationship maintained in the middle of the computation, we demonstrate that fail-stop process failures in ScaLAPACK matrix-matrix multiplication can be tolerated without checkpointing or message logging. Because no periodic checkpointing is involved, the fault tolerance overhead of this approach is surprisingly low.
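The checksum property described in the abstract can be illustrated with a small single-process sketch. This is not the authors' ScaLAPACK implementation: it is plain Python on integer lists, and every function name here (`column_checksum`, `row_checksum`, `is_full_checksum`, `outer_product_steps`, `recover_row`) is hypothetical. It assumes the usual algorithm-based fault tolerance encoding, in which A gains a row of column sums and B gains a column of row sums, and it models a fail-stop failure as the loss of one data row of the partial result.

```python
def column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row[:] + [sum(row)] for row in b]


def is_full_checksum(c):
    """True if the last column holds row sums and the last row holds column sums."""
    body = c[:-1]
    rows_ok = all(row[-1] == sum(row[:-1]) for row in c)
    cols_ok = all(c[-1][j] == sum(r[j] for r in body) for j in range(len(c[0])))
    return rows_ok and cols_ok


def outer_product_steps(ac, br):
    """Yield a copy of the partial product after each rank-1 (outer product) update."""
    rows, cols, inner = len(ac), len(br[0]), len(br)
    c = [[0] * cols for _ in range(rows)]
    for k in range(inner):
        for i in range(rows):
            for j in range(cols):
                c[i][j] += ac[i][k] * br[k][j]
        yield [row[:] for row in c]


def recover_row(c, lost):
    """Rebuild one lost data row from the checksum row and the surviving data rows."""
    survivors = [r for i, r in enumerate(c[:-1]) if i != lost]
    return [c[-1][j] - sum(r[j] for r in survivors) for j in range(len(c[0]))]


A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8, 9], [10, 11, 12]]

partials = list(outer_product_steps(column_checksum(A), row_checksum(B)))

# The checksum relationship holds after every outer-product step, not only at the end.
all_hold = all(is_full_checksum(p) for p in partials)

# Simulate losing data row 1 in the middle of the computation (after step 1).
mid = partials[0]
damaged = [row[:] for row in mid]
damaged[1] = [None] * len(mid[0])
recovered = recover_row(damaged, lost=1)

print("checksum holds at every step:", all_hold)
print("lost row recovered correctly:", recovered == mid[1])
```

Because each rank-1 update multiplies a checksum-encoded column by a checksum-encoded row, every update is itself a full checksum matrix, and so is their running sum. That is why the lost row can be rebuilt from the surviving rows at any step without a checkpoint.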
Pages: 1628-1641
Number of pages: 14