Efficient multithreaded untransposed, transposed or symmetric sparse matrix-vector multiplication with the Recursive Sparse Blocks format

Cited by: 20
Author
Martone, Michele [1]
Affiliation
[1] Max Planck Inst Plasma Phys, D-85748 Garching, Germany
Keywords
Sparse matrix-vector multiply; Symmetric matrix-vector multiply; Transpose matrix-vector multiply; Shared memory parallel; Cache blocking; Sparse matrix assembly
DOI
10.1016/j.parco.2014.03.008
Chinese Library Classification (CLC)
TP301 [Theory, Methods]
Discipline code
081202
Abstract
In earlier work we introduced the "Recursive Sparse Blocks" (RSB) sparse matrix storage scheme, oriented towards cache-efficient matrix-vector multiplication (SpMV) and triangular solution (SpSV) on cache-based shared memory parallel computers. Both the transposed (SpMV_T) and symmetric (SymSpMV) matrix-vector multiply variants are supported. RSB is a meta-format: it recursively partitions a rectangular sparse matrix into quadrants, and leaf submatrices are stored in an appropriate traditional format, either Compressed Sparse Rows (CSR) or Coordinate (COO). In this work we compare the performance of our RSB implementation of SpMV, SpMV_T and SymSpMV to that of the state-of-the-art Intel Math Kernel Library (MKL) CSR implementation on the recent Intel Sandy Bridge processor. Our results on a few dozen large real-world matrices suggest the efficiency of the approach: in all cases, RSB's SymSpMV (and in most cases SpMV_T as well) took less than half of MKL CSR's time; SpMV's advantage was smaller. Furthermore, RSB's SpMV_T is more scalable than MKL's CSR, in that it performs almost as well as SpMV. Additionally, we include comparisons to the state-of-the-art Compressed Sparse Blocks (CSB) format implementation. We observed RSB to be slightly superior to CSB in SpMV_T, slightly inferior in SpMV, and better (in most cases by a factor of two or more) in SymSpMV. Although RSB is a non-traditional storage format and thus needs a special constructor, it can be assembled from CSR or any other similar row-ordered representation arrays in the time of a few dozen matrix-vector multiply executions. Thanks to RSB's significant advantage over MKL's CSR routines for symmetric or transposed matrix-vector multiplication, in most of the observed cases the assembly cost amortizes within fewer than fifty iterations. (C) 2014 Elsevier B.V. All rights reserved.
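To make the recursive layout described above concrete, the following is a minimal illustrative sketch in C of quadrant partitioning of a COO triplet array, with a per-leaf choice between COO and CSR. The LEAF_NNZ cutoff, the COO-vs-CSR rule and all identifiers are assumptions made for this example only; they are not taken from librsb, whose actual subdivision heuristics (tied to cache blocking and multithreading, per the keywords) are described in the paper itself.

/*
 * Illustrative sketch only (not librsb source code): recursively split a
 * sparse matrix, given as COO triplets, into quadrants until a leaf holds
 * few enough nonzeroes, then pick a traditional leaf format (COO or CSR).
 * LEAF_NNZ and the COO-vs-CSR rule below are assumptions for this example.
 */
#include <stdio.h>
#include <stdlib.h>

struct coo { int i, j; double v; };   /* one nonzero: row, column, value */

#define LEAF_NNZ 4  /* assumed cutoff; the real format blocks for cache size */

static void partition(const struct coo *a, int nnz,
                      int r0, int r1, int c0, int c1, int depth)
{
    if (nnz == 0)
        return;                        /* empty quadrants are not stored */
    if (nnz <= LEAF_NNZ || r1 - r0 <= 1 || c1 - c0 <= 1) {
        /* Assumed rule of thumb: CSR pays off once rows average >1 nonzero. */
        const char *fmt = (nnz > r1 - r0) ? "CSR" : "COO";
        printf("%*sleaf rows [%d,%d) cols [%d,%d): %d nnz, stored as %s\n",
               2 * depth, "", r0, r1, c0, c1, nnz, fmt);
        return;
    }
    int rm = (r0 + r1) / 2, cm = (c0 + c1) / 2;   /* quadrant split point */

    /* Scatter the triplets into the four quadrants. */
    struct coo *q[4];
    int qn[4] = {0, 0, 0, 0};
    for (int k = 0; k < 4; k++)
        q[k] = malloc((size_t)nnz * sizeof *q[k]);
    for (int k = 0; k < nnz; k++) {
        int which = (a[k].i >= rm) * 2 + (a[k].j >= cm);
        q[which][qn[which]++] = a[k];
    }
    partition(q[0], qn[0], r0, rm, c0, cm, depth + 1);
    partition(q[1], qn[1], r0, rm, cm, c1, depth + 1);
    partition(q[2], qn[2], rm, r1, c0, cm, depth + 1);
    partition(q[3], qn[3], rm, r1, cm, c1, depth + 1);
    for (int k = 0; k < 4; k++)
        free(q[k]);
}

int main(void)
{
    /* A small 8x8 example matrix in COO form. */
    const struct coo a[] = {
        {0, 0, 1.0}, {0, 1, 2.0}, {1, 1, 3.0}, {2, 5, 4.0}, {3, 3, 5.0},
        {4, 4, 6.0}, {5, 2, 7.0}, {6, 6, 8.0}, {7, 0, 9.0}, {7, 7, 1.5},
    };
    partition(a, (int)(sizeof a / sizeof a[0]), 0, 8, 0, 8, 0);
    return 0;
}

Compiling and running this prints the leaf layout for a toy 8x8 matrix; the printed COO/CSR choice is only meant to mirror the abstract's statement that leaves are stored in one of those two traditional formats.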
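The sub-fifty-iteration amortization claim can also be read as a simple break-even estimate. The notation below is hypothetical and introduced only for this illustration: with T_asm the time to assemble the RSB structure from CSR arrays, and t_CSR, t_RSB the per-multiply times of MKL CSR and RSB respectively, the assembly cost is recovered once the iteration count n satisfies

\[
  n \;>\; \frac{T_{\mathrm{asm}}}{t_{\mathrm{CSR}} - t_{\mathrm{RSB}}},
  \qquad t_{\mathrm{CSR}} > t_{\mathrm{RSB}} .
\]

If assembly costs a few dozen multiplies (say T_asm of roughly 40 t_RSB) and RSB's SymSpMV or SpMV_T runs in under half of MKL CSR's time (t_CSR > 2 t_RSB), then the break-even count 40 t_RSB / (t_CSR - t_RSB) is below 40, which is consistent with the fewer-than-fifty-iterations figure reported in the abstract.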
Pages: 251-270
Page count: 20