FAULT-TOLERANT PARALLEL MULTIGRID METHOD ON UNSTRUCTURED ADAPTIVE MESH

Authors
Fung, Frederick [1, 2]
Stals, Linda [2]
Deng, Quanling [3]
Affiliations
[1] Australian Natl Univ, Math Sci Inst, Canberra, ACT 2601, Australia
[2] Australian Natl Univ, Natl Computat Infrastruct, Canberra, ACT 2601, Australia
[3] Australian Natl Univ, Sch Comp, Canberra, ACT 2601, Australia
Source
SIAM JOURNAL ON SCIENTIFIC COMPUTING | 2024 / Vol. 46 / No. 05
Keywords
algorithmic-based fault tolerance; unstructured adaptive meshes; geometric multigrid; DAVIDSON METHOD; RECOVERY;
DOI
10.1137/23M1582904
Chinese Library Classification number
O29 [Applied Mathematics];
Discipline classification code
070104;
Abstract
As the generation of exascale high-performance clusters begins, it has become evident that numerical algorithms will greatly benefit from built-in resilience features that can handle system faults. Prior studies of fault-tolerant multigrid methods have focused on structured grids. In this work, however, we study the resilience of multigrid solvers on unstructured grids with adaptive refinement. The challenge lies in the fact that unstructured grids distributed across multiple processors may manifest as local hierarchical grids with unaligned boundaries. Our numerical experiments highlight that this disparity can result in divergence when employing standard local multigrid for fault recovery. We analyze this phenomenon by using an energy control condition. To tackle the divergence issue, we propose a simple variation of the multigrid V-cycle that scales the coarse problem. We present a convergence proof for the new algorithm. By implementing this new method for local recovery, our numerical experiments confirm that convergence can be recovered on unstructured grids while the algorithm agrees with the standard multigrid V-cycle on grids with aligned boundaries. More importantly, the impact of a fault can be mitigated and delays in the global multigrid iterations can be reduced. Finally, we investigate how local regions within the adaptive mesh, associated with different faulty processors, affect the effectiveness of fault recovery.
Pages: S145 / S169
Page count: 25