Algorithm-Based Fault Tolerance for Fail-Stop Failures

Cited by: 61
Authors
Chen, Zizhong [1 ]
Dongarra, Jack [2 ]
Affiliations
[1] Colorado Sch Mines, Dept Math & Comp Sci, Golden, CO 80401 USA
[2] Univ Tennessee, Dept Elect Engn & Comp Sci, Knoxville, TN 37996 USA
Keywords
Algorithm-based fault tolerance; checkpointing; fail-stop failures; parallel matrix-matrix multiplication; ScaLAPACK;
DOI
10.1109/TPDS.2008.58
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in the ScaLAPACK matrix-matrix multiplication kernel can be tolerated without checkpointing or message logging. Previous algorithm-based fault tolerance research has proved that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no matter which algorithm is chosen. From this checksum relationship in the final results, processor miscalculations can be detected, located, and corrected at the end of the computation. However, whether this checksum relationship can be maintained in the middle of the computation has remained open. In this paper, we first demonstrate that, for many matrix-matrix multiplication algorithms, the checksum relationship in the input checksum matrices is not maintained in the middle of the computation. We then prove that, for the outer-product version of matrix-matrix multiplication, the checksum relationship can be maintained in the middle of the computation. Based on this relationship, we demonstrate that fail-stop process failures in ScaLAPACK matrix-matrix multiplication can be tolerated without checkpointing or message logging. Because no periodic checkpointing is involved, the fault tolerance overhead of this approach is surprisingly low.
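The mid-computation checksum property described in the abstract can be illustrated with a small serial sketch. This is not the authors' ScaLAPACK implementation: it assumes a column-checksum form of A (an extra row of column sums) and a row-checksum form of B (an extra column of row sums), and every function name below is hypothetical. It checks that each rank-1 (outer-product) update leaves the partial product a full checksum matrix, and that a lost data row can be rebuilt from the checksum row.

```python
# Illustrative sketch only: helper names are hypothetical, not from the paper or ScaLAPACK.

def column_checksum(a):
    """Return a copy of a with an extra row holding the sum of each column."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]

def row_checksum(b):
    """Return a copy of b where each row is extended by its own sum."""
    return [row[:] + [sum(row)] for row in b]

def is_full_checksum(c):
    """True if the last column holds row sums and the last row holds column sums."""
    rows_ok = all(row[-1] == sum(row[:-1]) for row in c)
    cols_ok = all(c[-1][j] == sum(row[j] for row in c[:-1])
                  for j in range(len(c[0])))
    return rows_ok and cols_ok

def outer_product_steps(ac, br):
    """Yield a snapshot of C after each rank-1 update C += ac[:, k] * br[k, :]."""
    m, n = len(ac), len(br[0])
    c = [[0] * n for _ in range(m)]
    for k in range(len(br)):
        for i in range(m):
            for j in range(n):
                c[i][j] += ac[i][k] * br[k][j]
        yield [row[:] for row in c]

def recover_row(c, lost):
    """Rebuild data row `lost` from the checksum row and the surviving data rows."""
    survivors = [row for i, row in enumerate(c[:-1]) if i != lost]
    return [c[-1][j] - sum(row[j] for row in survivors)
            for j in range(len(c[0]))]

# Small integer matrices so that equality checks are exact.
A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8, 9], [1, 0, 2]]

partials = list(outer_product_steps(column_checksum(A), row_checksum(B)))
checks = [is_full_checksum(p) for p in partials]  # one flag per update step

mid = partials[0]              # state in the middle of the computation
rebuilt = recover_row(mid, 1)  # pretend data row 1 was lost at this point
```

With floating-point data, the equality tests in `is_full_checksum` would need a tolerance. The sketch handles only a single lost row in a serial setting; distributing the checksum blocks across processes is outside its scope.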
Pages: 1628-1641
Page count: 14
Related papers
50 in total
  • [31] A new and efficient fail-stop signature scheme
    Susilo, W
    Safavi-Naini, R
    Gysin, M
    Seberry, J
    COMPUTER JOURNAL, 2000, 43 (05) : 430 - 437
  • [32] Algorithm-Based Fault Tolerance for Parallel Stencil Computations
    Cavelan, Aurelien
    Ciorba, Florina M.
    2019 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2019, : 12 - 22
  • [33] Algorithm-based Fault Tolerance for Dense Matrix Factorizations
    Du, Peng
    Bouteiller, Aurelien
    Bosilca, George
    Herault, Thomas
    Dongarra, Jack
    ACM SIGPLAN NOTICES, 2012, 47 (08) : 225 - 234
  • [34] IMPROVED BOUNDS FOR ALGORITHM-BASED FAULT-TOLERANCE
    ROSENKRANTZ, DJ
    RAVI, SS
    IEEE TRANSACTIONS ON COMPUTERS, 1993, 42 (05) : 630 - 635
  • [35] A LINEAR ALGEBRAIC MODEL OF ALGORITHM-BASED FAULT TOLERANCE
    ANFINSON, CJ
    LUK, FT
    IEEE TRANSACTIONS ON COMPUTERS, 1988, 37 (12) : 1599 - 1604
  • [36] Wavelet Codes for Algorithm-Based Fault Tolerance Applications
    Redinbo, G. Robert
    IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2010, 7 (03) : 315 - 328
  • [37] ALGORITHM-BASED FAULT-TOLERANCE FOR FFT NETWORKS
    WANG, SJ
    JHA, NK
    IEEE TRANSACTIONS ON COMPUTERS, 1994, 43 (07) : 849 - 854
  • [38] Comment Fail-Stop Blind Signature Scheme Design Based on Pairings
    HU Xiaoming
    Wuhan University Journal of Natural Sciences, 2006, (06) : 1545 - 1548
  • [39] Threshold fail-stop signature schemes based on discrete logarithm and factorization
    Safavi-Naini, R
    Susilo, W
    INFORMATION SECURITY, PROCEEDINGS, 2001, 1975 : 292 - 307
  • [40] AN EFFICIENT WRITE-ALL ALGORITHM FOR FAIL-STOP PRAM WITHOUT INITIALIZED MEMORY
    SHVARTSMAN, AA
    INFORMATION PROCESSING LETTERS, 1992, 44 (04) : 223 - 231