FAULT-TOLERANT PARALLEL MULTIGRID METHOD ON UNSTRUCTURED ADAPTIVE MESH

Authors:

Fung, Frederick [1,2]

Stals, Linda [2]

Deng, Quanling [3]

Affiliations:

[1] Australian Natl Univ, Math Sci Inst, Canberra, ACT 2601, Australia

[2] Australian Natl Univ, Natl Computat Infrastruct, Canberra, ACT 2601, Australia

[3] Australian Natl Univ, Sch Comp, Canberra, ACT 2601, Australia

Source:

SIAM JOURNAL ON SCIENTIFIC COMPUTING | 2024 / Vol. 46 / No. 5

Keywords:

algorithmic-based fault tolerance; unstructured adaptive meshes; geometric multigrid; DAVIDSON METHOD; RECOVERY;

DOI:

10.1137/23M1582904

Chinese Library Classification (CLC) number:

O29 [Applied Mathematics]

Discipline classification code:

070104

Abstract:

As the generation of exascale high-performance clusters begins, it has become evident that numerical algorithms will greatly benefit from built-in resilience features that can handle system faults. Prior studies of fault-tolerant multigrid methods have focused on structured grids. In this work, however, we study the resilience of multigrid solvers on unstructured grids with adaptive refinement. The challenge lies in the fact that unstructured grids distributed across multiple processors may manifest as local hierarchical grids with unaligned boundaries. Our numerical experiments highlight that this disparity can result in divergence when employing standard local multigrid for fault recovery. We analyze this phenomenon by using an energy control condition. To tackle the divergence issue, we propose a simple variation of the multigrid V-cycle that scales the coarse problem. We present a convergence proof for the new algorithm. By implementing this new method for local recovery, our numerical experiments confirm that convergence can be recovered on unstructured grids while the algorithm agrees with the standard multigrid V-cycle on grids with aligned boundaries. More importantly, the impact of a fault can be mitigated and delays in the global multigrid iterations can be reduced. Finally, we investigate how local regions within the adaptive mesh, associated with different faulty processors, affect the effectiveness of fault recovery.
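
The abstract does not spell out how the coarse problem is scaled, so the following is only an illustrative sketch and not the authors' algorithm. It shows a standard recursive V-cycle for the 1D Poisson problem on a structured grid, with a hypothetical parameter `alpha` that scales the coarse-grid correction. All function names, the model problem, the smoother, and the placement of the scaling are assumptions made for illustration; `alpha = 1.0` gives the unmodified V-cycle.

```python
# Illustrative only: recursive V-cycle for -u'' = f on (0, 1), u(0) = u(1) = 0.
# `alpha` is a hypothetical stand-in for "scaling the coarse problem";
# it is NOT taken from the paper. alpha = 1.0 is the standard V-cycle.
# Grids must have n = 2**k - 1 interior points.

def residual(u, f, h):
    """Return r = f - A u for A = tridiag(-1, 2, -1) / h^2."""
    n = len(u)
    r = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r.append(f[i] - (2.0 * u[i] - left - right) / (h * h))
    return r

def smooth(u, f, h, sweeps, omega=2.0 / 3.0):
    """Weighted Jacobi sweeps; the diagonal of A is 2 / h^2."""
    for _ in range(sweeps):
        r = residual(u, f, h)
        u = [ui + omega * 0.5 * h * h * ri for ui, ri in zip(u, r)]
    return u

def restrict(r):
    """Full-weighting restriction to the (n - 1) // 2 coarse points."""
    nc = (len(r) - 1) // 2
    return [0.25 * r[2 * j] + 0.5 * r[2 * j + 1] + 0.25 * r[2 * j + 2]
            for j in range(nc)]

def prolong(ec, n):
    """Linear interpolation from the coarse grid to n fine points."""
    nc = len(ec)
    e = [0.0] * n
    for j in range(nc):
        e[2 * j + 1] = ec[j]          # coarse points coincide with odd fine points
    for j in range(nc + 1):
        left = ec[j - 1] if j > 0 else 0.0
        right = ec[j] if j < nc else 0.0
        e[2 * j] = 0.5 * (left + right)
    return e

def v_cycle(u, f, h, alpha=1.0, sweeps=2):
    """One V-cycle; the coarse-grid correction is multiplied by alpha."""
    n = len(u)
    if n == 1:
        return [f[0] * h * h / 2.0]   # exact solve on the coarsest grid
    u = smooth(u, f, h, sweeps)
    rc = restrict(residual(u, f, h))
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h, alpha, sweeps)
    e = prolong(ec, n)
    u = [ui + alpha * ei for ui, ei in zip(u, e)]
    return smooth(u, f, h, sweeps)

def max_norm(v):
    return max(abs(x) for x in v)

def reduction(cycles, alpha=1.0, n=63):
    """Ratio of final to initial residual max-norm for f = 1, u0 = 0."""
    h = 1.0 / (n + 1)
    f = [1.0] * n
    u = [0.0] * n
    r0 = max_norm(residual(u, f, h))
    for _ in range(cycles):
        u = v_cycle(u, f, h, alpha)
    return max_norm(residual(u, f, h)) / r0
```

Calling `reduction(15)` reports how far the residual has dropped after 15 standard cycles, and passing a different `alpha` lets one see how a scaled correction changes the convergence rate on this aligned model grid. The unstructured, unaligned local hierarchies studied in the paper are not reproduced here.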

Pages: S145-S169

Number of pages: 25
