A Block Gauss-Seidel/Jacobi Preconditioner for Heterogeneous Many-Core Architecture

Cited by: 0

Authors:

Wu L.-L. [1]

Chen R.-L. [2]

Luo L. [2]

Yan Z.-Z. [2]

Liao Z.-J. [2]

Chi L.-H. [3]

Liu J. [1]

Affiliations:

[1] Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha

[2] Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, Guangdong

[3] Institute of Advanced Science and Technology, Hunan Institute of Traffic Engineering, Xiangyin, 414600, Hunan

Source:

Jisuanji Xuebao/Chinese Journal of Computers | 2019, Vol. 42, No. 11

Funding:

National Natural Science Foundation of China

Keywords:

Block Gauss-Seidel/Jacobi method; Domain decomposition method; Heterogeneous many-core architecture; Sunway TaihuLight supercomputer; Unstructured mesh;

DOI:

10.11897/SP.J.1016.2019.02447

Abstract:

The Gauss-Seidel algorithm is widely used in parallel computing as an iterative solver for linear systems. However, exploiting its fine-grained parallelism on emerging heterogeneous many-core architectures is challenging. For unstructured mesh problems, the matrices carry no geometric information, so the classical geometry-based parallel strategies fail. Instead, a block Gauss-Seidel/Jacobi algorithm for heterogeneous many-core architectures is proposed, based on an algebraic parallel strategy and used as the subdomain solver (preconditioner) in the domain decomposition method (DDM). To balance parallel and numerical efficiency, the new algorithm combines the inherently high parallelism of the Jacobi method with the good numerical convergence rate of the Gauss-Seidel method. The block Gauss-Seidel/Jacobi algorithm can be regarded as small Gauss-Seidel iterations at the intra-thread level and a global Jacobi iteration at the inter-thread level. The proposed algorithm offers scalable parallelism with minimal inter-thread communication. Moreover, the unknowns are ordered node by node, meaning that all N unknowns belonging to a mesh node are coupled to form an N×N block. Such an ordering of unknowns does not affect the numerical performance but improves the convergence and scalability of our solver. For the incompressible Navier-Stokes equations, the subdomain solver takes a 4×4 node block as its minimum unit. The Sunway TaihuLight supercomputer, consisting of 40960 homegrown SW26010 many-core processors with over 10 million cores in total, is based on exactly such a heterogeneous many-core architecture. In this paper, the block Gauss-Seidel/Jacobi algorithm is implemented and optimized for the SW26010 many-core architecture. Communication is known to be the bottleneck of heterogeneous many-core architectures. Therefore, a series of low-communication-complexity optimization strategies is designed to achieve better numerical performance. To reduce the communication cost, we design implementation-level optimization strategies such as packed copying of multiple lines and computation-communication overlapping, using the small but fast LDM (Local Data Memory) on the many-core processors. Moreover, a copy-reduced variant of the block Gauss-Seidel/Jacobi algorithm is proposed to alleviate the memory bandwidth bottleneck. Unlike the optimization strategies based on exact computation, the copy-reduced version yields an inexact preconditioner by neglecting a portion of the data that would otherwise need to be communicated and synchronized. Omitting the less important data does not noticeably affect the numerical efficiency but achieves much better parallel efficiency. In this work, we adopt the RAS (Restricted Additive Schwarz) method for inter-node parallelism and the block Gauss-Seidel/Jacobi method for inter- and intra-thread parallelism. Aerodynamic simulations of a high-speed train model and a car model on unstructured meshes were tested on the Sunway TaihuLight. Numerical results show that the proposed block Gauss-Seidel/Jacobi algorithm delivers up to a 4.16x speedup over the sequential version. Our algorithm achieves a parallel efficiency of 61% as the number of processors increases from 1040 to 33280. Moreover, the proposed algorithms are not limited to use as the subdomain solver in DDM; they can also be adopted as iterative solvers for linear systems and as smoothers in the multigrid method. © 2019, Science Press. All rights reserved.
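The two-level idea in the abstract (Gauss-Seidel within a thread, Jacobi between threads) can be illustrated with a minimal serial sketch. This is not the authors' SW26010 implementation: the function names, the dense NumPy matrix, the contiguous row partition, and the pointwise (rather than 4×4 node-block) updates are all assumptions made for illustration, and the "threads" are simulated by a plain loop.

```python
import numpy as np

def hybrid_gs_jacobi_sweep(A, b, x, n_threads):
    """One sweep: Gauss-Seidel inside each chunk, Jacobi between chunks.

    Hypothetical helper for illustration; chunks stand in for threads.
    """
    n = len(b)
    x_old = x.copy()   # previous iterate, shared by all chunks (Jacobi level)
    x_new = x.copy()
    # Contiguous row ranges, one per simulated thread (an assumed partition).
    bounds = np.linspace(0, n, n_threads + 1).astype(int)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        local = x_old.copy()  # off-chunk entries stay at the previous iterate
        for i in range(lo, hi):
            # Rows already visited in [lo, i) contribute updated values
            # (Gauss-Seidel); all other entries are still the old iterate.
            s = A[i] @ local - A[i, i] * local[i]
            local[i] = (b[i] - s) / A[i, i]
        x_new[lo:hi] = local[lo:hi]
    return x_new

def solve(A, b, n_threads=4, tol=1e-10, max_iter=500):
    """Repeat sweeps until the relative residual drops below tol."""
    x = np.zeros_like(b, dtype=float)
    for k in range(1, max_iter + 1):
        x = hybrid_gs_jacobi_sweep(A, b, x, n_threads)
        if np.linalg.norm(b - A @ x) <= tol * np.linalg.norm(b):
            return x, k
    return x, max_iter
```

In this sketch, `n_threads=1` reduces to a plain Gauss-Seidel sweep and `n_threads=len(b)` reduces to a plain Jacobi sweep, which makes the trade-off between convergence rate and parallelism described in the abstract easy to experiment with on small test matrices.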

References

Pages: 2447-2460

Page count: 13

17 entries in total

[1] Chen R.L., Cai X.C., A scalable domain decomposition method and applications in simulation and optimization of fluids, Scientia Sinica Math, 46, pp. 915-928, (2016)
[2] Cai X.C., Keyes D.E., Venkatakrishnan V., Newton-Krylov-Schwarz: An implicit solver for CFD, (1995)
[3] Cai X.C., Sarkis M., A restricted additive Schwarz preconditioner for general sparse linear systems, SIAM Journal on Scientific Computing, 21, 2, pp. 792-797, (1999)
[4] Fu H., Liao J., Yang J., et al., The Sunway TaihuLight supercomputer: System and applications, Science China Information Sciences, 59, 7, (2016)
[5] Grote M.J., Huckle T., Parallel preconditioning with sparse approximate inverses, SIAM Journal on Scientific Computing, 18, 3, pp. 838-853, (1997)
[6] Chow E., Patel A., Fine-grained parallel incomplete LU factorization, SIAM Journal on Scientific Computing, 37, 2, pp. C169-C193, (2015)
[7] Yang C., Xue W., Fu H., et al., 10M-core scalable fully-implicit solver for nonhydrostatic atmospheric dynamics, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 57-68, (2016)
[8] Heuveline V., Lukarski D., Weiss J.P., Enhanced parallel ILU(p)-based preconditioners for multi-core CPUs and GPUs: The power(q)-pattern method, Preprint Series of the Engineering Mathematics and Computing Lab, 8, pp. 1-36, (2011)
[9] Heuveline V., Lukarski D., Weiss J.P., Scalable multi-coloring preconditioning for multi-core CPUs and GPUs, Proceedings of the European Conference on Parallel Processing, pp. 389-397, (2010)
[10] Cotronis Y., Konstantinidis E., Louka M.A., et al., Parallel SOR for solving the convection diffusion equation using GPUs with CUDA, Proceedings of the European Conference on Parallel Processing, pp. 575-586, (2012)
