Rgs-SpMM: Accelerate Sparse Matrix-Matrix Multiplication by Row Group Splitting Strategy on the GPU

Cited by: 0

Authors:

Guo, Mingfeng [1]

Wang, Yaobin [1]

Huang, Jun [1]

Wang, Qingfeng [1]

Zhang, Yaqing [1]

Xu, Mu [2]

Lu, Fang [2]

Affiliations:

[1] Southwest Univ Sci & Technol, Sch Comp Sci & Technol, Minist Educ, Key Lab Testing Technol Mfg Proc, Mianyang 621010, Sichuan, Peoples R China

[2] Alibaba Grp, Hangzhou, Peoples R China

Source:

NETWORK AND PARALLEL COMPUTING, NPC 2022 | 2022, vol. 13615

Funding:

National Natural Science Foundation of China

Keywords:

Sparse Matrix-Matrix Multiplication; GPU; Row group splitting

DOI:

10.1007/978-3-031-21395-3_6

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

The Sparse Matrix-Matrix Multiplication (SpMM) operation is widely used in many fields, notably in the recently popular GNN frameworks. Researchers have designed many GPU kernels to accelerate SpMM. Existing methods mostly adopt a row splitting strategy to obtain better parallelism and memory access efficiency. However, because sparse matrices are irregular (for example, they contain short rows with few non-zero elements), current methods under-utilize the GPU's thread resources. In this paper, we rearrange the distribution of non-zero elements in the sparse matrix and design an SpMM kernel based on a row group splitting strategy. In contrast to previous methods, which assign a single "row" as the task unit for a warp to process, we combine short rows of the sparse matrix into "row groups" and use each group as a task unit, which assigns a more suitable number of non-zero elements to the GPU resources. This method reduces thread divergence within a warp and improves load balancing among warps. Our experimental data comes from the SNAP Matrix Collection. The results show that our kernel is faster than cuSPARSE and GE-SpMM, with average speedups of 1.61x and 1.42x, respectively.
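The row-grouping idea in the abstract can be illustrated with a small CPU-side sketch. This is not the authors' kernel: the greedy packing rule, the per-group budget `WARP_SIZE = 32`, and the function names `build_row_groups` and `spmm_by_groups` are all assumptions made for illustration, since the abstract does not say how groups are formed. The sketch takes a sparse matrix in CSR form, packs consecutive short rows into groups whose non-zero count stays within the budget, and then multiplies group by group, with each group standing in for the work one warp would take.

```python
from typing import List, Tuple

WARP_SIZE = 32  # assumed non-zero budget per task unit; not stated in the abstract


def build_row_groups(row_ptr: List[int], budget: int = WARP_SIZE) -> List[Tuple[int, int]]:
    """Greedily pack consecutive CSR rows into groups of at most `budget` non-zeros.

    Returns half-open row ranges (start, end). A row with more than `budget`
    non-zeros forms a group on its own.
    """
    groups: List[Tuple[int, int]] = []
    n_rows = len(row_ptr) - 1
    start = 0
    nnz_in_group = 0
    for r in range(n_rows):
        row_nnz = row_ptr[r + 1] - row_ptr[r]
        # Close the current group if adding this row would exceed the budget.
        if r > start and nnz_in_group + row_nnz > budget:
            groups.append((start, r))
            start = r
            nnz_in_group = 0
        nnz_in_group += row_nnz
    if n_rows > 0:
        groups.append((start, n_rows))
    return groups


def spmm_by_groups(row_ptr, col_idx, vals, B, budget: int = WARP_SIZE):
    """Compute C = A @ B with A in CSR form, iterating over row groups."""
    n_rows = len(row_ptr) - 1
    n_cols = len(B[0]) if B else 0
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for start, end in build_row_groups(row_ptr, budget):
        # On a GPU, one group would be the task unit handed to a warp;
        # here the group is simply processed sequentially.
        for r in range(start, end):
            for k in range(row_ptr[r], row_ptr[r + 1]):
                a, j = vals[k], col_idx[k]
                for c in range(n_cols):
                    C[r][c] += a * B[j][c]
    return C


# A is 4x3 with rows [1,0,2], [0,3,0], [0,0,0], [4,5,6].
row_ptr = [0, 2, 3, 3, 6]
col_idx = [0, 2, 1, 0, 1, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

groups = build_row_groups(row_ptr, budget=3)
C = spmm_by_groups(row_ptr, col_idx, vals, B, budget=3)
```

With a budget of 3, the three short rows (2, 1, and 0 non-zeros) are packed into one group and the 3-element row forms its own group, so both task units carry the same number of non-zeros. Under a plain row splitting scheme the same matrix would yield four task units with 2, 1, 0, and 3 non-zeros.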

Citation

Pages: 61-66

Page count: 6
