Algorithm-Based Fault Tolerance for Fail-Stop Failures

Cited by: 61
Authors
Chen, Zizhong [1]
Dongarra, Jack [2]
Affiliations
[1] Colorado Sch Mines, Dept Math & Comp Sci, Golden, CO 80401 USA
[2] Univ Tennessee, Dept Elect Engn & Comp Sci, Knoxville, TN 37996 USA
Keywords
Algorithm-based fault tolerance; checkpointing; fail-stop failures; parallel matrix-matrix multiplication; ScaLAPACK;
DOI
10.1109/TPDS.2008.58
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in the ScaLAPACK matrix-matrix multiplication kernel can be tolerated without checkpointing or message logging. Previous algorithm-based fault tolerance research has proved that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no matter which algorithm is chosen. From this checksum relationship in the final computation results, processor miscalculations can be detected, located, and corrected at the end of the computation. However, whether this checksum relationship can be maintained in the middle of the computation remains open. In this paper, we first demonstrate that, for many matrix-matrix multiplication algorithms, the checksum relationship in the input checksum matrices is not maintained in the middle of the computation. We then prove that, for the outer-product version of the matrix-matrix multiplication algorithm, however, the checksum relationship can be maintained in the middle of the computation. Based on this checksum relationship maintained in the middle of the computation, we demonstrate that fail-stop process failures in ScaLAPACK matrix-matrix multiplication can be tolerated without checkpointing or message logging. Because no periodic checkpointing is involved, the fault tolerance overhead for this approach is surprisingly low.
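The checksum property described in the abstract can be illustrated with a small serial sketch. This is not the paper's ScaLAPACK implementation: the function names, the 2 x 2 integer inputs, and the choice of a row-by-row ordering as the contrasting algorithm are all assumptions made for illustration. The sketch appends a checksum row to A and a checksum column to B, then tests after every step whether the partial result is still a valid full checksum matrix.

```python
# Illustrative sketch only. All names and the tiny integer inputs are
# assumptions for demonstration; this is not the paper's parallel code.

def column_checksum(a):
    """Return a with an extra last row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]

def row_checksum(b):
    """Return b with an extra last column holding the sum of each row."""
    return [row + [sum(row)] for row in b]

def is_full_checksum(c):
    """True if the last row holds the column sums and the last column the row sums."""
    body = c[:-1]
    cols_ok = all(sum(r[j] for r in body) == c[-1][j] for j in range(len(c[0])))
    rows_ok = all(sum(r[:-1]) == r[-1] for r in c)
    return cols_ok and rows_ok

def outer_product_steps(ac, br):
    """Yield the partial result after each rank-1 update C += ac[:, s] * br[s, :]."""
    c = [[0] * len(br[0]) for _ in ac]
    for s in range(len(br)):
        for i, a_row in enumerate(ac):
            for j in range(len(br[0])):
                c[i][j] += a_row[s] * br[s][j]
        yield [row[:] for row in c]

def row_by_row_steps(ac, br):
    """Yield the partial result after each row of C is completed (assumed contrast case)."""
    n_cols, k = len(br[0]), len(br)
    c = [[0] * n_cols for _ in ac]
    for i, a_row in enumerate(ac):
        c[i] = [sum(a_row[s] * br[s][j] for s in range(k)) for j in range(n_cols)]
        yield [row[:] for row in c]

def recover_row(c, lost):
    """Rebuild data row `lost` as the checksum row minus the surviving data rows."""
    survivors = [r for i, r in enumerate(c[:-1]) if i != lost]
    return [c[-1][j] - sum(r[j] for r in survivors) for j in range(len(c[0]))]

if __name__ == "__main__":
    ac = column_checksum([[1, 2], [3, 4]])
    br = row_checksum([[5, 6], [7, 8]])
    print([is_full_checksum(c) for c in outer_product_steps(ac, br)])
    print([is_full_checksum(c) for c in row_by_row_steps(ac, br)])
```

In this toy setting, every partial result of the outer-product ordering passes the check, so a lost row can be rebuilt with `recover_row` at any step. The row-by-row ordering passes only once the checksum row itself has been computed.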
Pages: 1628-1641
Number of pages: 14
Related papers
50 in total
  • [1] Fail-Stop Failure Algorithm-Based Fault Tolerance for Cholesky Decomposition
    Hakkarinen, Doug
    Wu, Panruo
    Chen, Zizhong
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2015, 26 (05) : 1323 - 1335
  • [2] Extending algorithm-based fault tolerance to tolerate fail-stop failures in high performance distributed environments
    Chen, Zizhong
    2008 IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL & DISTRIBUTED PROCESSING, VOLS 1-8, 2008, : 336 - +
  • [3] NVMe SSD Failures in the Field: the Fail-Stop and the Fail-Slow
    Lu, Ruiming
    Xu, Erci
    Zhang, Yiming
    Zhu, Zhaosheng
    Wang, Mengtian
    Zhu, Zongpeng
    Xue, Guangtao
    Li, Minglu
    Wu, Jiesheng
    PROCEEDINGS OF THE 2022 USENIX ANNUAL TECHNICAL CONFERENCE, 2022, : 1005 - 1019
  • [4] Fail-stop signatures
    Pedersen, TP
    Pfitzmann, B
    SIAM JOURNAL ON COMPUTING, 1997, 26 (02) : 291 - 330
  • [5] A fault-tolerant protocol for election in chordal-ring networks with fail-stop processor failures
    Pan, Y
    Singh, G
    IEEE TRANSACTIONS ON RELIABILITY, 1997, 46 (01) : 11 - 17
  • [6] Algorithm-based fault tolerance for dense matrix factorizations, multiple failures and accuracy
    Bouteiller, Aurelien
    Herault, Thomas
    Bosilca, George
    Du, Peng
    Dongarra, Jack
    ACM Transactions on Parallel Computing, 2015, 1 (02)
  • [7] Algorithm-based fault tolerance: a review
    Vijay, M
    Mittal, R
    MICROPROCESSORS AND MICROSYSTEMS, 1997, 21 (03) : 151 - 161
  • [8] Reliability of task graph schedules with transient and fail-stop failures: complexity and algorithms
    Anne Benoit
    Louis-Claude Canon
    Emmanuel Jeannot
    Yves Robert
    Journal of Scheduling, 2012, 15 : 615 - 627
  • [9] Reliability of task graph schedules with transient and fail-stop failures: complexity and algorithms
    Benoit, Anne
    Canon, Louis-Claude
    Jeannot, Emmanuel
    Robert, Yves
    JOURNAL OF SCHEDULING, 2012, 15 (05) : 615 - 627
  • [10] Mitigation Of Fail-Stop Failures In Integer Matrix Products Via Numerical Packing
    Anarado, Ijeoma
    Andreopoulos, Yiannis
    2015 IEEE 21ST INTERNATIONAL ON-LINE TESTING SYMPOSIUM (IOLTS), 2015, : 101 - 107