
Hypergraph Partitioning Based Models and Methods for Exploiting Cache Locality in Sparse Matrix-Vector Multiplication

Published: 06 June 2013

Abstract

Sparse matrix-vector multiplication (SpMxV) is a kernel operation widely used in iterative linear solvers, where the same sparse matrix is multiplied by a dense vector repeatedly. Matrices with irregular sparsity patterns make it difficult to exploit cache locality effectively in SpMxV computations. In this work, we investigate single- and multiple-SpMxV frameworks for exploiting cache locality in SpMxV computations. For the single-SpMxV framework, we propose two cache-size-aware row/column reordering methods based on one-dimensional (1D) and two-dimensional (2D) top-down sparse matrix partitioning. We utilize the column-net hypergraph model for the 1D method and enhance the row-column-net hypergraph model for the 2D method. The primary aim of both proposed methods is to maximize the exploitation of temporal locality in accessing input-vector entries. The multiple-SpMxV framework depends on splitting a given matrix into a sum of multiple nonzero-disjoint matrices. We propose a cache-size-aware splitting method based on 2D top-down sparse matrix partitioning that utilizes the row-column-net hypergraph model. The aim of this method is to maximize the exploitation of temporal locality in accessing both input- and output-vector entries. We evaluate the validity of our models and methods on a wide range of sparse matrices using both cache-miss simulations and actual runs using OSKI. Experimental results show that the proposed methods and models outperform state-of-the-art schemes.
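As background for the frameworks described above, the following is a minimal sketch (in Python, with illustrative names; this is not the authors' implementation) of SpMxV over a CSR-stored matrix, and of the multiple-SpMxV idea of splitting A into nonzero-disjoint parts so that y = A1·x + A2·x. In the paper the split is chosen by cache-size-aware hypergraph partitioning; here it is a fixed toy split shown only to illustrate the accumulation.

```python
def spmxv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x for A stored in CSR format.

    The inner loop gathers x[col_idx[k]] in an irregular pattern dictated
    by the sparsity structure -- this is the access whose temporal locality
    the row/column reordering methods aim to improve.
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[k] * x[col_idx[k]]
        y[i] = s
    return y


def spmxv_split(parts, x):
    """Multiple-SpMxV sketch: y = sum_k (A_k @ x) over nonzero-disjoint
    CSR parts (row_ptr, col_idx, vals), each covering the same dimensions.
    In the paper each A_k is chosen so its working set fits in cache."""
    n = len(parts[0][0]) - 1
    y = [0.0] * n
    for row_ptr, col_idx, vals in parts:
        yk = spmxv_csr(row_ptr, col_idx, vals, x)
        y = [a + b for a, b in zip(y, yk)]
    return y
```

For example, for A = [[1,0,2],[0,3,0],[4,0,5]] and x = [1,1,1], the CSR arrays are row_ptr = [0,2,3,5], col_idx = [0,2,1,0,2], vals = [1,2,3,4,5], and both routines return y = [3, 3, 9] (splitting A into its first two rows and its last row gives nonzero-disjoint parts).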




Published In

SIAM Journal on Scientific Computing, Volume 35, Issue 3
2013
922 pages
ISSN: 1064-8275
DOI: 10.1137/sjoce3.35.3

Publisher

Society for Industrial and Applied Mathematics

United States


Author Tags

  1. cache locality
  2. sparse matrix
  3. matrix-vector multiplication
  4. matrix reordering
  5. computational hypergraph model
  6. hypergraph partitioning
  7. traveling salesman problem

Author Tags (MSC codes)

  1. 65F10
  2. 65F50
  3. 65Y20

Qualifiers

  • Research-article


Citations

Cited By

  • (2024) A Sparsity-Aware Distributed-Memory Algorithm for Sparse-Sparse Matrix Multiplication, Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pp. 1-14, DOI: 10.1109/SC41406.2024.00053. Online publication date: 17-Nov-2024.
  • (2023) Efficient FPGA-Based Sparse Matrix–Vector Multiplication With Data Reuse-Aware Compression, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 42(12), pp. 4606-4617, DOI: 10.1109/TCAD.2023.3281715. Online publication date: 1-Dec-2023.
  • (2022) Efficient Task-Mapping of Parallel Applications Using a Space-Filling Curve, Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pp. 384-397, DOI: 10.1145/3559009.3569657. Online publication date: 8-Oct-2022.
  • (2021) Partitioning sparse deep neural networks for scalable training and inference, Proceedings of the 35th ACM International Conference on Supercomputing, pp. 254-265, DOI: 10.1145/3447818.3460372. Online publication date: 3-Jun-2021.
  • (2021) A Deep Reinforcement Learning Method for Solving Task Mapping Problems with Dynamic Traffic on Parallel Systems, The International Conference on High Performance Computing in Asia-Pacific Region, pp. 1-10, DOI: 10.1145/3432261.3432262. Online publication date: 20-Jan-2021.
