Algorithm-Based Fault Tolerance for Fail-Stop Failures

Cited by: 61
Authors
Chen, Zizhong [1 ]
Dongarra, Jack [2 ]
Affiliations
[1] Colorado Sch Mines, Dept Math & Comp Sci, Golden, CO 80401 USA
[2] Univ Tennessee, Dept Elect Engn & Comp Sci, Knoxville, TN 37996 USA
Keywords
Algorithm-based fault tolerance; checkpointing; fail-stop failures; parallel matrix-matrix multiplication; ScaLAPACK;
DOI
10.1109/TPDS.2008.58
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in the ScaLAPACK matrix-matrix multiplication kernel can be tolerated without checkpointing or message logging. Previous algorithm-based fault tolerance research has proved that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no matter which algorithm is chosen. From this checksum relationship in the final computation results, processor miscalculations can be detected, located, and corrected at the end of the computation. However, whether this checksum relationship can be maintained in the middle of the computation has remained open. We first demonstrate that, for many matrix-matrix multiplication algorithms, the checksum relationship in the input checksum matrices is not maintained in the middle of the computation. We then prove that, for the outer-product version of the matrix-matrix multiplication algorithm, it can be maintained in the middle of the computation. Based on this checksum relationship maintained in the middle of the computation, we demonstrate that fail-stop process failures in ScaLAPACK matrix-matrix multiplication can be tolerated without checkpointing or message logging. Because no periodic checkpointing is involved, the fault tolerance overhead of this approach is surprisingly low.
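The central claim of the abstract can be illustrated with a small single-process sketch. It is not the paper's ScaLAPACK implementation: the helper names (`with_column_checksum`, `outer_product_steps`, `recover_row`, and so on) are hypothetical, the matrices are tiny, and integers are used so that checksum comparisons are exact. The sketch appends a column-checksum row to A and a row-checksum column to B, then checks whether the partial result is still a full checksum matrix after each step of two different algorithms.

```python
def with_column_checksum(a):
    """Append a row holding each column's sum (column checksum matrix)."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row its own sum (row checksum matrix)."""
    return [row[:] + [sum(row)] for row in b]


def is_full_checksum(c):
    """True if the last row holds the column sums of the rows above it
    and the last column holds the row sums of the columns before it."""
    body = c[:-1]
    cols_ok = all(sum(r[j] for r in body) == c[-1][j] for j in range(len(c[0])))
    rows_ok = all(sum(r[:-1]) == r[-1] for r in c)
    return cols_ok and rows_ok


def outer_product_steps(ac, br):
    """Yield the partial result after each rank-1 (outer product) update."""
    c = [[0] * len(br[0]) for _ in ac]
    for s in range(len(br)):
        for i in range(len(ac)):
            for j in range(len(br[0])):
                c[i][j] += ac[i][s] * br[s][j]
        yield [row[:] for row in c]


def row_by_row_steps(ac, br):
    """Yield the partial result after each output row is completed
    (an inner-product ordering, used here as a contrasting example)."""
    c = [[0] * len(br[0]) for _ in ac]
    for i in range(len(ac)):
        for j in range(len(br[0])):
            c[i][j] = sum(ac[i][s] * br[s][j] for s in range(len(br)))
        yield [row[:] for row in c]


def recover_row(c, lost):
    """Rebuild one lost data row from the checksum row and the surviving rows."""
    survivors = [i for i in range(len(c) - 1) if i != lost]
    return [c[-1][j] - sum(c[i][j] for i in survivors) for j in range(len(c[0]))]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
Ac = with_column_checksum(A)
Br = with_row_checksum(B)

outer_ok = [is_full_checksum(c) for c in outer_product_steps(Ac, Br)]
rowwise_ok = [is_full_checksum(c) for c in row_by_row_steps(Ac, Br)]

# Simulate losing data row 0 after the first outer-product step.
first_partial = next(outer_product_steps(Ac, Br))
recovered = recover_row(first_partial, lost=0)
```

In this sketch `outer_ok` is true at every step, while `rowwise_ok` is true only once the whole product is finished, which is the distinction the abstract draws. Because the relationship holds mid-computation in the outer-product ordering, a row lost partway through can be rebuilt from the checksum row without rolling back to a checkpoint.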
Pages: 1628-1641
Page count: 14
Related papers
50 in total
  • [41] How to construct fail-stop confirmer signature schemes
    Safavi-Naini, R
    Susilo, W
    Wang, HX
    INFORMATION SECURITY AND PRIVACY, PROCEEDINGS, 2001, 2119 : 435 - 444
  • [42] GENERALIZED AGREEMENT BETWEEN CONCURRENT FAIL-STOP PROCESSES
    BURNS, JE
    CRUZ, RI
    LOUI, MC
    DISTRIBUTED ALGORITHMS, 1993, 725 : 84 - 98
  • [43] BYZANTINE GENERALS IN ACTION - IMPLEMENTING FAIL-STOP PROCESSORS
    SCHNEIDER, FB
    ACM TRANSACTIONS ON COMPUTER SYSTEMS, 1984, 2 (02): : 145 - 154
  • [44] A proxy fail-stop signature scheme with proxy revocation
    Kim, Young-Seol
    Chang, Jik Hyun
    JOURNAL OF DISCRETE MATHEMATICAL SCIENCES & CRYPTOGRAPHY, 2008, 11 (03): : 281 - 295
  • [45] A High-dimensional Algorithm-Based Fault Tolerance Scheme
    Fu, Xiang
    Tang, Hao
    Liao, Huimin
    Huang, Xin
    Xu, Wubiao
    Meng, Shiman
    Zhang, Weiping
    Guo, Luanzheng
    Sato, Kento
    2023 IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS, IPDPSW, 2023, : 326 - 330
  • [46] Automatic Algorithm-Based Fault Tolerance (AABFT) of Stencil Computations
    Narmour, Louis
    Derrien, Steven
    Rajopadhye, Sanjay
    2023 32ND INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES, PACT, 2023, : 187 - 198
  • [47] Matrix Control-Flow Algorithm-Based Fault Tolerance
    Ferreira, Ronaldo Rodrigues
    Moreira, Alvaro Freitas
    Carro, Luigi
    2011 IEEE 17TH INTERNATIONAL ON-LINE TESTING SYMPOSIUM (IOLTS), 2011,
  • [48] Parallel Reduction to Hessenberg Form with Algorithm-Based Fault Tolerance
    Jia, Yulu
    Bosilca, George
    Dongarra, Jack J.
    2013 INTERNATIONAL CONFERENCE FOR HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SC), 2013,
  • [49] Algorithm-based fault tolerance for spaceborne computing: Basis and implementations
    Turmon, M
    Granat, R
    2000 IEEE AEROSPACE CONFERENCE PROCEEDINGS, VOL 4, 2000, : 411 - 420
  • [50] An efficient construction for fail-stop signature for long messages
    Safavi-Naini, R
    Susilo, W
    Wang, HX
    JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2001, 17 (06) : 879 - 897