A Jacobi_PCG solver for sparse linear systems on multi-GPU cluster

Cited by: 17
Authors
Lin, Shaozhong [1, 2]
Xie, Zhiqiang [1, 2]
Affiliations
[1] Changjiang River Sci Res Inst, Wuhan 430010, Peoples R China
[2] Res Ctr Water Engn Safety & Disaster Prevent MWR, Wuhan 430010, Peoples R China
Source
JOURNAL OF SUPERCOMPUTING | 2017 / Vol. 73 / Issue 01
Funding
National Natural Science Foundation of China;
Keywords
JPCG; Sparse linear systems; Multi-GPU cluster; Communication reduction; Node reordering; Counting sort; Computation/communication overlapping;
DOI
10.1007/s11227-016-1887-4
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
The General Purpose Graphics Processing Unit (GPGPU or GPU) has powerful floating-point computation ability and is suitable for intensive computing, such as solving large linear systems. The Jacobi Preconditioned Conjugate Gradient method (Jacobi_PCG or JPCG), a preconditioned iterative method for the numerical solution of large sparse linear systems, is highly parallel and is especially appropriate for implementation on GPUs. On a multi-GPU cluster, the matrix-vector multiplication in each PCG iteration needs vector entries generated by the current GPU and by other GPUs, so communication between GPUs becomes a major performance bottleneck. In this paper, we study the implementation of JPCG on a multi-GPU cluster. Considering the coarse-grained parallelism between GPUs and the sparsity of matrices arising from the finite element method (FEM), a simple and fast node reordering method is presented to reduce the bandwidth of the sparse matrices, which in turn reduces the communication between GPUs. This novel reordering method is based on integerized nodal coordinates of the FEM mesh and the counting sort algorithm. Additionally, computation and communication are overlapped using CUDA asynchronous memory transfer and MPI_sendrecv communication to further reduce the communication cost. A JPCG solver for multi-GPU clusters is developed using CUDA Fortran. Tests show that this solver has high efficiency and strong scalability.
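For orientation, the JPCG iteration named in the abstract can be sketched on a single CPU as below. This is a minimal illustration of the textbook algorithm, not the authors' CUDA Fortran multi-GPU solver: the CSR layout, the function names (`csr_matvec`, `dot`, `jpcg`), and the `tol` and `max_iter` parameters are all assumptions made for this example. The matrix-vector product inside the loop is the step that, on a multi-GPU cluster, would need vector entries held by other GPUs.

```python
from math import sqrt


def csr_matvec(indptr, indices, data, x):
    """Return y = A x for a matrix A stored in CSR form (hypothetical helper)."""
    y = [0.0] * (len(indptr) - 1)
    for i in range(len(y)):
        s = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            s += data[k] * x[indices[k]]
        y[i] = s
    return y


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def jpcg(indptr, indices, data, b, tol=1e-10, max_iter=1000):
    """Jacobi-preconditioned CG for a symmetric positive definite CSR matrix.

    Returns (x, iterations). Assumes every diagonal entry is stored and nonzero.
    """
    n = len(b)
    # Jacobi preconditioner: M^{-1} is the inverse of the diagonal of A.
    inv_diag = [0.0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            if indices[k] == i:
                inv_diag[i] = 1.0 / data[k]

    x = [0.0] * n
    r = list(b)  # residual for the zero initial guess: r = b - A*0
    b_norm = sqrt(dot(b, b))
    if b_norm == 0.0:
        return x, 0  # b = 0 has the trivial solution x = 0

    z = [inv_diag[i] * r[i] for i in range(n)]  # z = M^{-1} r
    p = list(z)
    rz = dot(r, z)
    for it in range(1, max_iter + 1):
        q = csr_matvec(indptr, indices, data, p)  # the communication-heavy step
        alpha = rz / dot(p, q)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * q[i] for i in range(n)]
        if sqrt(dot(r, r)) <= tol * b_norm:
            return x, it
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [z[i] + beta * p[i] for i in range(n)]
        rz = rz_new
    return x, max_iter
```

Because the preconditioner is only a diagonal scaling, applying it is an element-wise multiply with no data dependencies between entries, which is the high parallelism the abstract refers to.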
Pages: 433 - 454
Page count: 22
Related papers
50 in total
  • [31] DeltaSPARSE: High-Performance Sparse General Matrix-Matrix Multiplication on Multi-GPU Systems
    Yang, Shuai
    Zhang, Changyou
    Ma, Ji
    2023 IEEE 30TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING, DATA, AND ANALYTICS, HIPC 2023, 2023, : 194 - 202
  • [32] Accelerated MR Physics Simulations on multi-GPU systems
    Xanthis, Christos G.
    Venetis, Ioannis E.
    Aletras, Anthony H.
    2013 IEEE 13TH INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOENGINEERING (BIBE), 2013,
  • [33] Performance Optimization of Allreduce Operation for Multi-GPU Systems
    Nukada, Akira
    2021 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2021, : 3107 - 3112
  • [34] Efficient breadth first search on multi-GPU systems
    Mastrostefano, Enrico
    Bernaschi, Massimo
    JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2013, 73 (09) : 1292 - 1305
  • [35] Autonomous Execution for Multi-GPU Systems: Compiler Support
    Koç University, Istanbul, Turkey
    Unknown
    CA, United States
    Proc. SC -W: Workshops Int. Conf. High Perform. Comput., Netw., Storage Anal., (1129-1140):
  • [36] Dynamic load balancing on heterogeneous multi-GPU systems
    Acosta, Alejandro
    Blanco, Vicente
    Almeida, Francisco
    COMPUTERS & ELECTRICAL ENGINEERING, 2013, 39 (08) : 2591 - 2602
  • [37] Tensor Movement Orchestration in Multi-GPU Training Systems
    Lin, Shao-Fu
    Chen, Yi-Jung
    Cheng, Hsiang-Yun
    Yang, Chia-Lin
    2023 IEEE INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE, HPCA, 2023, : 1140 - 1152
  • [38] Gossip: Efficient Communication Primitives for Multi-GPU Systems
    Kobus, Robin
    Juenger, Daniel
    Hundt, Christian
    Schmidt, Bertil
    PROCEEDINGS OF THE 48TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP 2019), 2019,
  • [39] Solving Multiple Tridiagonal Systems on a Multi-GPU Platform
    Dieguez, Adrian P.
    Amor, Margarita
    Doallo, Ramon
    2018 26TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, AND NETWORK-BASED PROCESSING (PDP 2018), 2018, : 759 - 763
  • [40] Concurrent number cruncher: a GPU implementation of a general sparse linear solver
    Buatois, Luc
    Caumon, Guillaume
    Levy, Bruno
    INTERNATIONAL JOURNAL OF PARALLEL EMERGENT AND DISTRIBUTED SYSTEMS, 2009, 24 (03) : 205 - 223