Efficient multithreaded untransposed, transposed or symmetric sparse matrix-vector multiplication with the Recursive Sparse Blocks format

Cited by: 20
Author
Martone, Michele [1]
Affiliation
[1] Max Planck Inst Plasma Phys, D-85748 Garching, Germany
Keywords
Sparse matrix-vector multiply; Symmetric matrix-vector multiply; Transpose matrix-vector multiply; Shared memory parallel; Cache blocking; Sparse matrix assembly
DOI
10.1016/j.parco.2014.03.008
Chinese Library Classification (CLC)
TP301 [Theory, Methods]
Discipline code
081202
Abstract
In earlier work we introduced the "Recursive Sparse Blocks" (RSB) sparse matrix storage scheme, oriented towards cache-efficient matrix-vector multiplication (SpMV) and triangular solution (SpSV) on cache-based shared memory parallel computers. Both the transposed (SpMV_T) and symmetric (SymSpMV) matrix-vector multiply variants are supported. RSB is a meta-format: it recursively partitions a rectangular sparse matrix into quadrants, and leaf submatrices are stored in an appropriate traditional format, either Compressed Sparse Rows (CSR) or Coordinate (COO). In this work we compare the performance of our RSB implementation of SpMV, SpMV_T and SymSpMV to that of the state-of-the-art Intel Math Kernel Library (MKL) CSR implementation on the recent Intel Sandy Bridge processor. Our results on a few dozen large real-world matrices suggest the efficiency of the approach: in all cases, RSB's SymSpMV (and in most cases SpMV_T as well) took less than half of MKL CSR's time; SpMV's advantage was smaller. Furthermore, RSB's SpMV_T is more scalable than MKL's CSR, in that it performs almost as well as SpMV. Additionally, we include comparisons to the state-of-the-art Compressed Sparse Blocks (CSB) format implementation. We observed RSB to be slightly superior to CSB in SpMV_T, slightly inferior in SpMV, and better (in most cases by a factor of two or more) in SymSpMV. Although RSB is a non-traditional storage format and thus needs a special constructor, it can be assembled from CSR or any other similar row-ordered representation arrays in the time of a few dozen matrix-vector multiply executions. Thanks to RSB's significant advantage over MKL's CSR routines for symmetric or transposed matrix-vector multiplication, in most of the observed cases the assembly cost amortizes within fewer than fifty iterations. (C) 2014 Elsevier B.V. All rights reserved.
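To make the recursive layout described above concrete, the following is a minimal illustrative sketch in C of quadrant partitioning of a COO triplet array, with a per-leaf choice between COO and CSR. The LEAF_NNZ cutoff, the COO-vs-CSR rule and all identifiers are assumptions made for this example only; they are not taken from librsb, whose actual subdivision heuristics (tied to cache blocking and multithreading, per the keywords) are described in the paper itself.

/*
 * Illustrative sketch only (not librsb source code): recursively split a
 * sparse matrix, given as COO triplets, into quadrants until a leaf holds
 * few enough nonzeroes, then pick a traditional leaf format (COO or CSR).
 * LEAF_NNZ and the COO-vs-CSR rule below are assumptions for this example.
 */
#include <stdio.h>
#include <stdlib.h>

struct coo { int i, j; double v; };   /* one nonzero: row, column, value */

#define LEAF_NNZ 4  /* assumed cutoff; the real format blocks for cache size */

static void partition(const struct coo *a, int nnz,
                      int r0, int r1, int c0, int c1, int depth)
{
    if (nnz == 0)
        return;                        /* empty quadrants are not stored */
    if (nnz <= LEAF_NNZ || r1 - r0 <= 1 || c1 - c0 <= 1) {
        /* Assumed rule of thumb: CSR pays off once rows average >1 nonzero. */
        const char *fmt = (nnz > r1 - r0) ? "CSR" : "COO";
        printf("%*sleaf rows [%d,%d) cols [%d,%d): %d nnz, stored as %s\n",
               2 * depth, "", r0, r1, c0, c1, nnz, fmt);
        return;
    }
    int rm = (r0 + r1) / 2, cm = (c0 + c1) / 2;   /* quadrant split point */

    /* Scatter the triplets into the four quadrants. */
    struct coo *q[4];
    int qn[4] = {0, 0, 0, 0};
    for (int k = 0; k < 4; k++)
        q[k] = malloc((size_t)nnz * sizeof *q[k]);
    for (int k = 0; k < nnz; k++) {
        int which = (a[k].i >= rm) * 2 + (a[k].j >= cm);
        q[which][qn[which]++] = a[k];
    }
    partition(q[0], qn[0], r0, rm, c0, cm, depth + 1);
    partition(q[1], qn[1], r0, rm, cm, c1, depth + 1);
    partition(q[2], qn[2], rm, r1, c0, cm, depth + 1);
    partition(q[3], qn[3], rm, r1, cm, c1, depth + 1);
    for (int k = 0; k < 4; k++)
        free(q[k]);
}

int main(void)
{
    /* A small 8x8 example matrix in COO form. */
    const struct coo a[] = {
        {0, 0, 1.0}, {0, 1, 2.0}, {1, 1, 3.0}, {2, 5, 4.0}, {3, 3, 5.0},
        {4, 4, 6.0}, {5, 2, 7.0}, {6, 6, 8.0}, {7, 0, 9.0}, {7, 7, 1.5},
    };
    partition(a, (int)(sizeof a / sizeof a[0]), 0, 8, 0, 8, 0);
    return 0;
}

Compiling and running this prints the leaf layout for a toy 8x8 matrix; the printed COO/CSR choice is only meant to mirror the abstract's statement that leaves are stored in one of those two traditional formats.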
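The sub-fifty-iteration amortization claim can also be read as a simple break-even estimate. The notation below is hypothetical and introduced only for this illustration: with T_asm the time to assemble the RSB structure from CSR arrays, and t_CSR, t_RSB the per-multiply times of MKL CSR and RSB respectively, the assembly cost is recovered once the iteration count n satisfies

\[
  n \;>\; \frac{T_{\mathrm{asm}}}{t_{\mathrm{CSR}} - t_{\mathrm{RSB}}},
  \qquad t_{\mathrm{CSR}} > t_{\mathrm{RSB}} .
\]

If assembly costs a few dozen multiplies (say T_asm of roughly 40 t_RSB) and RSB's SymSpMV or SpMV_T runs in under half of MKL CSR's time (t_CSR > 2 t_RSB), then the break-even count 40 t_RSB / (t_CSR - t_RSB) is below 40, which is consistent with the fewer-than-fifty-iterations figure reported in the abstract.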
Pages: 251-270
Page count: 20