A Jacobi_PCG solver for sparse linear systems on multi-GPU cluster

Cited by: 17

Authors:

Lin, Shaozhong [1,2]

Xie, Zhiqiang [1,2]

Affiliations:

[1] Changjiang River Sci Res Inst, Wuhan 430010, Peoples R China

[2] Res Ctr Water Engn Safety & Disaster Prevent MWR, Wuhan 430010, Peoples R China

Source:

JOURNAL OF SUPERCOMPUTING | 2017 / Vol. 73 / Issue 01

Funding:

National Natural Science Foundation of China

Keywords:

JPCG; Sparse linear systems; Multi-GPU cluster; Communication reduction; Node reordering; Counting sort; Computation/communication overlapping;

DOI:

10.1007/s11227-016-1887-4

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812 (Computer Science and Technology)

Abstract:

The General Purpose Graphics Processing Unit (GPGPU, or GPU) has powerful floating-point computation capability and is well suited to compute-intensive tasks such as solving large linear systems. The Jacobi Preconditioned Conjugate Gradient method (Jacobi_PCG or JPCG), a preconditioned iterative method for the numerical solution of large sparse linear systems, offers a high degree of parallelism and is especially well suited to implementation on GPUs. On a multi-GPU cluster, the matrix-vector multiplication in each PCG iteration needs vector entries computed both by the current GPU and by other GPUs, so communication between GPUs becomes a major performance bottleneck. In this paper, we study the implementation of JPCG on a multi-GPU cluster. Considering the coarse-grained parallelism between GPUs and the sparsity of matrices arising from the finite element method (FEM), we present a simple and fast node reordering method that reduces the bandwidth of the sparse matrices and thereby reduces communication between GPUs. This novel reordering method is based on integerized nodal coordinates of the FEM mesh and the counting sort algorithm. In addition, computation and communication are overlapped using CUDA asynchronous memory transfers and MPI_sendrecv communication to further reduce the communication cost. A JPCG solver for multi-GPU clusters is developed using CUDA Fortran. Tests show that the solver achieves high efficiency and strong scalability.
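The abstract names two algorithmic ingredients: the JPCG iteration itself and a bandwidth-reducing node reordering built on integerized nodal coordinates and counting sort. The single-process Python sketch below illustrates both under stated assumptions; it is not the authors' CUDA Fortran solver and contains no GPU, MPI, or computation/communication overlap. The integer key floor((x - x_min) / h) along one axis, the cell size h, and every function name (counting_sort_order, bandwidth, build_csr, csr_matvec, jacobi_pcg) are hypothetical choices made for illustration. The paper's actual key construction may differ.

```python
from math import sqrt


def counting_sort_order(coords, h):
    """Stable counting sort of nodes by an integerized coordinate.

    Hypothetical key: floor((x - x_min) / h) on the first axis only.
    Returns order, with order[new_index] = old_node_id.
    """
    x_min = min(c[0] for c in coords)
    keys = [int((c[0] - x_min) // h) for c in coords]
    counts = [0] * (max(keys) + 1)
    for k in keys:                         # histogram of integer keys
        counts[k] += 1
    start = 0
    for k in range(len(counts)):           # exclusive prefix sum: first slot per key
        counts[k], start = start, start + counts[k]
    order = [0] * len(keys)
    for node, k in enumerate(keys):        # scatter; equal keys keep input order
        order[counts[k]] = node
        counts[k] += 1
    return order


def bandwidth(edges, new_of_old):
    """Largest |row - col| over mesh edges under a given node numbering."""
    return max(abs(new_of_old[i] - new_of_old[j]) for i, j in edges)


def build_csr(n, entries):
    """Convert a {(row, col): value} dict to CSR arrays (indptr, indices, data)."""
    indptr, indices, data = [0], [], []
    for i in range(n):
        row = sorted((c, v) for (r, c), v in entries.items() if r == i)
        indices += [c for c, _ in row]
        data += [v for _, v in row]
        indptr.append(len(indices))
    return indptr, indices, data


def csr_matvec(A, x):
    indptr, indices, data = A
    return [sum(data[k] * x[indices[k]] for k in range(indptr[i], indptr[i + 1]))
            for i in range(len(indptr) - 1)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def jacobi_pcg(A, b, tol=1e-10, max_iter=1000):
    """Conjugate gradient preconditioned with M = diag(A) (Jacobi)."""
    indptr, indices, data = A
    n = len(b)
    inv_diag = [0.0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            if indices[k] == i:
                inv_diag[i] = 1.0 / data[k]
    x = [0.0] * n
    r = list(b)                                   # r0 = b - A x0 with x0 = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]    # z = M^{-1} r
    p = list(z)
    rz = dot(r, z)
    b_norm = sqrt(dot(b, b))
    for it in range(max_iter):
        if sqrt(dot(r, r)) <= tol * b_norm:
            return x, it
        Ap = csr_matvec(A, p)  # on a multi-GPU cluster this needs entries of p owned by other GPUs
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Demo: six nodes on a line, numbered out of spatial order.
coords = [(5.0,), (1.0,), (3.0,), (0.0,), (4.0,), (2.0,)]
edges = [(3, 1), (1, 5), (5, 2), (2, 4), (4, 0)]   # spatial neighbours
order = counting_sort_order(coords, h=1.0)
new_of_old = [0] * len(order)
for new, old in enumerate(order):
    new_of_old[old] = new
print(bandwidth(edges, list(range(6))), bandwidth(edges, new_of_old))  # prints: 4 1

# Assemble a symmetric, strictly diagonally dominant matrix in the new numbering and solve.
entries = {(i, i): 2.0 + i for i in range(6)}
for i, j in edges:
    a, c = new_of_old[i], new_of_old[j]
    entries[(a, c)] = entries[(c, a)] = -1.0
A = build_csr(6, entries)
x_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x, iters = jacobi_pcg(A, csr_matvec(A, x_true))
```

After reordering, mesh neighbours receive nearby numbers, so a contiguous block of rows assigned to one GPU references vector entries owned mostly by itself or by adjacent GPUs. That is the intuition behind the communication reduction described in the abstract; the toy example only shows it as a bandwidth drop from 4 to 1.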


Pages: 433-454

Page count: 22
