FAULT-TOLERANT PARALLEL MULTIGRID METHOD ON UNSTRUCTURED ADAPTIVE MESH

Cited by: 0
Authors
Fung, Frederick [1, 2]
Stals, Linda [2 ]
Deng, Quanling [3 ]
Affiliations
[1] Australian Natl Univ, Math Sci Inst, Canberra, ACT 2601, Australia
[2] Australian Natl Univ, Natl Computat Infrastruct, Canberra, ACT 2601, Australia
[3] Australian Natl Univ, Sch Comp, Canberra, ACT 2601, Australia
Source
SIAM JOURNAL ON SCIENTIFIC COMPUTING | 2024 / Vol. 46 / No. 5
Keywords
algorithmic-based fault tolerance; unstructured adaptive meshes; geometric multigrid; DAVIDSON METHOD; RECOVERY;
DOI
10.1137/23M1582904
Chinese Library Classification
O29 [Applied Mathematics];
Subject classification code
070104;
Abstract
As the generation of exascale high-performance clusters begins, it has become evident that numerical algorithms will greatly benefit from built-in resilience features that can handle system faults. Prior studies of fault-tolerant multigrid methods have focused on structured grids. In this work, however, we study the resilience of multigrid solvers on unstructured grids with adaptive refinement. The challenge lies in the fact that unstructured grids distributed across multiple processors may manifest as local hierarchical grids with unaligned boundaries. Our numerical experiments highlight that this disparity can result in divergence when employing standard local multigrid for fault recovery. We analyze this phenomenon by using an energy control condition. To tackle the divergence issue, we propose a simple variation of the multigrid V-cycle that scales the coarse problem. We present a convergence proof for the new algorithm. By implementing this new method for local recovery, our numerical experiments confirm that convergence can be recovered on unstructured grids while the algorithm agrees with the standard multigrid V-cycle on grids with aligned boundaries. More importantly, the impact of a fault can be mitigated and delays in the global multigrid iterations can be reduced. Finally, we investigate how local regions within the adaptive mesh, associated with different faulty processors, affect the effectiveness of fault recovery.
Pages: S145-S169
Page count: 25