Performance Analysis and Optimization for SpMV on GPU Using Probabilistic Modeling

Cited by: 192
Authors
Li, Kenli [1 ,2 ]
Yang, Wangdong [1 ,2 ]
Li, Keqin [1 ,3 ]
Affiliations
[1] Hunan Univ, Coll Informat Sci & Engn, Changsha 410082, Hunan, Peoples R China
[2] Natl Supercomp Ctr Changsha, Changsha 410082, Hunan, Peoples R China
[3] SUNY Coll New Paltz, Dept Comp Sci, New Paltz, NY 12561 USA
Funding
National Natural Science Foundation of China;
Keywords
GPU; performance modeling; probability mass function; sparse matrix-vector multiplication; MATRIX-VECTOR MULTIPLICATION; TOOL;
DOI
10.1109/TPDS.2014.2308221
Chinese Library Classification
TP301 [Theory and Methods];
Discipline Classification Code
081202;
Abstract
This paper presents a method of performance analysis and optimization for sparse matrix-vector multiplication (SpMV) on GPUs. Unlike existing methods, which are tailored to particular classes of sparse matrices, our method adapts to a wide range of matrix types. In addition, it does not require extra benchmark runs to obtain optimized parameters; these are calculated directly from a probability mass function (PMF). We make the following contributions. (1) We present a PMF that precisely characterizes the distribution of non-zero elements in a sparse matrix and provides a theoretical basis for compressing it. (2) The compression efficiency of the COO, CSR, ELL, and HYB formats can be analyzed precisely through the PMF, and, combined with the GPU's hardware parameters, the SpMV performance of each format can be estimated. The most appropriate format for SpMV can then be selected according to the estimated performance. Experiments show that the theoretical estimates and the measured values are highly consistent. (3) For HYB, the optimal segmentation threshold can be determined through the PMF to achieve the best SpMV performance. Our performance modeling and analysis are accurate: for each of the ten tested sparse matrices, the estimated and measured speedups of the COO, CSR, and ELL formats agree in order of magnitude, and the relative difference between estimated and measured values is below 20 percent in over 80 percent of cases. The optimization is also effective: the optimal HYB threshold improves performance by over 15 percent on average compared with the automatic setting provided by the CUSPARSE library.
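As a minimal illustration of the idea (not the authors' actual model), the sketch below builds the row-length PMF of a sparse matrix and uses it to compare rough storage costs of COO, CSR, ELL, and HYB, including a simple scan for the HYB row-length threshold. The function names (row_length_pmf, storage_entries), the use of SciPy, and the plain entry-count formulas are illustrative assumptions; the paper's model additionally folds in GPU hardware parameters when estimating performance.

```python
# Illustrative sketch: row-length PMF of a sparse matrix and rough
# storage costs of COO, CSR, ELL, and HYB derived from it.
import numpy as np
import scipy.sparse as sp

def row_length_pmf(A):
    """Return (lengths, pmf) where pmf[k] = P(a row has k non-zeros)."""
    A = A.tocsr()
    row_nnz = np.diff(A.indptr)          # non-zeros per row
    counts = np.bincount(row_nnz)        # histogram over row lengths
    return np.arange(len(counts)), counts / A.shape[0]

def storage_entries(A, pmf_k, pmf_p):
    """Approximate number of stored entries (values + indices) per format."""
    A = A.tocsr()
    m, nnz = A.shape[0], A.nnz
    max_len = pmf_k[-1]
    coo = 3 * nnz                        # row, col, val per non-zero
    csr = 2 * nnz + (m + 1)              # col, val per non-zero + row pointers
    ell = 2 * m * max_len                # every row padded to the longest row
    # HYB: rows are truncated at threshold K for the ELL part,
    # and the excess non-zeros spill into a COO part.
    best_k, best_cost = None, np.inf
    for K in pmf_k[1:]:
        spill = sum(p * (k - K) for k, p in zip(pmf_k, pmf_p) if k > K) * m
        cost = 2 * m * K + 3 * spill
        if cost < best_cost:
            best_k, best_cost = K, cost
    return {"COO": coo, "CSR": csr, "ELL": ell,
            "HYB": best_cost, "HYB_threshold": best_k}

if __name__ == "__main__":
    A = sp.random(10000, 10000, density=1e-3, format="csr", random_state=0)
    k, p = row_length_pmf(A)
    print(storage_entries(A, k, p))
```

For a matrix with a skewed row-length distribution, the scan typically picks a threshold well below the maximum row length, which is the intuition behind preferring HYB over pure ELL for irregular matrices.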
Pages: 196-205
Number of pages: 10