More Web Proxy on the site http://driver.im/

research-article

Open access

A Recursive Algebraic Coloring Technique for Hardware-efficient Symmetric Sparse Matrix-vector Multiplication

Authors:

Christie Alappat,

Achim Basermann,

Alan R. Bishop,

Gerhard WelleinAuthors Info & Claims

ACM Transactions on Parallel Computing (TOPC), Volume 7, Issue 3

Article No.: 19, Pages 1 - 37

https://doi.org/10.1145/3399732

Published: 29 June 2020 Publication History

All formats PDF

Abstract

The symmetric sparse matrix-vector multiplication (SymmSpMV) is an important building block for many numerical linear algebra kernel operations or graph traversal applications. Parallelizing SymmSpMV on today’s multicore platforms with up to 100 cores is difficult due to the need to manage conflicting updates on the result vector. Coloring approaches can be used to solve this problem without data duplication, but existing coloring algorithms do not take load balancing and deep memory hierarchies into account, hampering scalability and full-chip performance. In this work, we propose the recursive algebraic coloring engine (RACE), a novel coloring algorithm and open-source library implementation that eliminates the shortcomings of previous coloring methods in terms of hardware efficiency and parallelization overhead. We describe the level construction, distance-k coloring, and load balancing steps in RACE, use it to parallelize SymmSpMV, and compare its performance on 31 sparse matrices with other state-of-the-art coloring techniques and Intel MKL on two modern multicore processors. RACE outperforms all other approaches substantially. By means of a parameterized roofline model, we analyze the SymmSpMV performance in detail and discuss outliers. While we focus on SymmSpMV in this article, our algorithm and software are applicable to any sparse matrix operation with data dependencies that can be resolved by distance-k coloring.

References

[1]

Andreas Alvermann. 2019. ScaMaC: The Scalable Matrix Collection. Retrieved from https://bitbucket.org/essex/matrixcollection/.

[2]

Doruk Bozdağ, Ümit Çatalyürek, Assefaw H. Gebremedhin, Fredrik Manne, Erik G. Boman, and Füsun Özgüner. 2010. Distributed-memory parallel algorithms for distance-2 coloring and related problems in derivative computation. SIAM J. Sci. Comput. 32, 4 (2010), 2418--2446.

Digital Library

[3]

Doruk Bozdağ, Assefaw H. Gebremedhin, Fredrik Manne, Erik G. Boman, and Umit V. Catalyurek. 2008. A framework for scalable greedy coloring on distributed-memory parallel computers. J. Parallel Distrib. Comput. 68, 4 (2008), 515--535.

Digital Library

[4]

Aydin Buluç, Jeremy T. Fineman, Matteo Frigo, John R. Gilbert, and Charles E. Leiserson. 2009. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the 21st Symposium on Parallelism in Algorithms and Architectures (SPAA’09). ACM, New York, NY, 233--244.

[5]

Aydin Buluç, Samuel Williams, Leonid Oliker, and James Demmel. 2011. Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication. In Proceedings of the IEEE International Parallel 8 Distributed Processing Symposium (IPDPS’11). IEEE Computer Society, Washington, DC, 721--733.

[6]

Ümit V. Catalyurek and Cevdet Aykanat. 1999. Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Trans. Parallel Distrib. Syst. 10, 7 (1999), 673--693.

Digital Library

[7]

Elizabeth Cuthill. 1972. Several Strategies for Reducing the Bandwidth of Matrices. Springer US, Boston, MA, 157--166.

[8]

Timothy A. Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38, 1, Article 1 (Dec. 2011), 25 pages.

Digital Library

[9]

Josep Díaz, Jordi Petit, and Maria Serna. 2002. A survey of graph layout problems. ACM Comput. Surv. 34, 3 (Sept. 2002), 313--356.

Digital Library

[10]

Athena Elafrou, Vasileios Karakasis, Theodoros Gkountouvas, Kornilios Kourtis, Georgios Goumas, and Nectarios Koziris. 2018. SparseX: A library for high-performance sparse matrix-vector multiplication on multicore platforms. ACM Trans. Math. Softw. 44, 3, Article 26 (Jan. 2018), 32 pages.

Digital Library

[11]

David J. Evans. 1984. Parallel S.O.R. iterative methods. Parallel Comput. 1, 1 (Aug. 1984), 3--18.

Digital Library

[12]

Fujitsu. 2019. Retrieved from https://www.fujitsu.com/global/Images/post-k_supercomputer_with_fujitsu%27s_original_cpu_a64fx_powered_by_arm_isa.pdf.

[13]

Martin Galgon, Lukas Krämer, Jonas Thies, Achim Basermann, and Bruno Lang. 2015. On the parallel iterative solution of linear systems arising in the FEAST algorithm for computing inner eigenvalues. Parallel Comput. 49, C (Nov. 2015), 153--163.

[14]

Assefaw H. Gebremedhin and Fredrik Manne. 2000. Scalable parallel graph coloring algorithms. Concurr.: Pract. Exper. 12, 12 (2000), 1131--1146.

[15]

Assefaw Hadish Gebremedhin, Fredrik Manne, and Alex Pothen. 2002. Parallel distance-k coloring algorithms for numerical optimization. In Proceedings of the 8th International Euro-Par Conference on Parallel Processing (Euro-Par’02). Springer-Verlag, London, UK, 912--921. Retrieved from http://dl.acm.org/citation.cfm?id=646667.699892.

Digital Library

[16]

Assefaw H. Gebremedhin, Duc Nguyen, Md. Mostofa Ali Patwary, and Alex Pothen. 2013. ColPack: Software for graph coloring and related problems in scientific computing. ACM Trans. Math. Softw. 40, 1, Article 1 (Oct. 2013), 31 pages.

Digital Library

[17]

Theodoros Gkountouvas, Vasileios Karakasis, Kornilios Kourtis, Georgios Goumas, and Nectarios Koziris. 2013. Improving the performance of the symmetric sparse matrix-vector multiplication in multicore. In Proceedings of the IEEE 27th International Symposium on Parallel and Distributed Processing. 273--283.

Digital Library

[18]

Willam D. Gropp, Dinesh K. Kaushik, David E. Keyes, and Barry F. Smith. 2000. Towards realistic performance bounds for implicit CFD codes. In Parallel Computational Fluid Dynamics 1999, D. Keyes, J. Periaux, A. Ecer, N. Satofuka, and P. Fox (Eds.). Elsevier, 241--248.

[19]

Eun-Jin Im, Katherine Yelick, and Richard Vuduc. 2004. Sparsity: Optimization framework for sparse matrix kernels. Int. J. High Perf. Comput. Applic. 18, 1 (2004), 135--158.

Digital Library

[20]

Intel. 2019. Intel Math Kernel Library. Retrieved from https://software.intel.com/en-us/mkl.

[21]

Takeshi Iwashita, Hiroshi Nakashima, and Yasuhito Takahashi. 2012. Algebraic block multi-color ordering method for parallel multi-threaded sparse triangular solver in ICCG method. In Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium (IPDPS’12). IEEE Computer Society, Washington, DC, 474--483.

Digital Library

[22]

Mark T. Jones and Paul E. Plassmann. 1994. Scalable iterative solution of sparse linear systems. Parallel Comput. 20, 5 (May 1994), 753--773.

Digital Library

[23]

George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20, 1 (1998), 359--392.

Digital Library

[24]

Kokkos. 2018. Kokkos C++ Performance Portability Programming EcoSystem: The Programming Model—Parallel Execution and Memory Abstraction. Retrieved from http://trilinos.sandia.gov/packages/kokkos.

[25]

Moritz Kreutzer, Dominik Ernst, Alan R. Bishop, Holger Fehske, Georg Hager, Kengo Nakajima, and Gerhard Wellein. 2018. Chebyshev filter diagonalization on modern manycore processors and GPGPUs. In High Performance Computing, Rio Yokota, Michèle Weiland, David Keyes, and Carsten Trinitis (Eds.). Springer International Publishing, Cham, 329--349.

[26]

Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, and Alan R. Bishop. 2014. A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM J. Sci. Comput. 36, 5 (2014), C401--C423.

Digital Library

[27]

Marcin Krotkiewski and Marcin Dabrowski. 2010. Parallel symmetric sparse matrix-vector product on scalar multi-core CPUs. Parallel Comput. 36, 4 (Apr. 2010), 181--198.

Digital Library

[28]

C. Y. Lee. 1961. An algorithm for path connections and its applications. IRE Trans. Electron. Comput. EC-10, 3 (Sept. 1961), 346--365.

[29]

Ang Li, Weifeng Liu, Mads R. B. Kristensen, Brian Vinter, Hao Wang, Kaixi Hou, Andres Marquez, and Shuaiwen Leon Song. 2017. Exploring and analyzing the real impact of modern on-package memory on HPC scientific kernels. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’17). ACM, New York, NY, Article 26, 14 pages.

Digital Library

[30]

Weifeng Liu and Brian Vinter. 2015. CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS’15). ACM, New York, NY, 339--350.

Digital Library

[31]

Weifeng Liu and Brian Vinter. 2015. Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors. Parallel Comput. 49, C (Nov. 2015), 179--193.

[32]

Xing Liu, Mikhail Smelyanskiy, Edmond Chow, and Pradeep Dubey. 2013. Efficient sparse matrix-vector multiplication on x86-based many-core processors. In Proceedings of the 27th International ACM Conference on Supercomputing (ICS’13). ACM, New York, NY, 273--282.

Digital Library

[33]

Hao Lu, Mahantesh Halappanavar, Daniel Chavarría-Miranda, Assefaw H. Gebremedhin, Ajay Panyala, and Ananth Kalyanaraman. 2017. Algorithms for balanced graph colorings with applications in parallel computing. IEEE Trans. Parallel Distrib. Syst. 28, 5 (May 2017), 1240--1256.

Digital Library

[34]

Michele Martone. 2014. Efficient multithreaded untransposed, transposed, or symmetric sparse matrix-vector multiplication with the recursive sparse blocks format. Parallel Comput. 40, 7 (July 2014), 251--270.

Digital Library

[35]

James McQueen, Marina Meilă, Jacob VanderPlas, and Zhongyue Zhang. 2016. Megaman: Scalable manifold learning in Python. J. Mach. Learn. Res. 17, 148 (2016), 1--5. Retrieved from http://jmlr.org/papers/v17/16-109.html.

[36]

Piotr Mironowicz, Adam Dziekonski, and Michał Mrozowski. 2015. A task-scheduling approach for efficient sparse symmetric matrix-vector multiplication on a GPU. SIAM J. Sci. Comput. 37, 6 (2015), C643--C666.

[37]

Chao-Wei Ou and Sanjay Ranka. 1997. Parallel incremental graph partitioning. IEEE Trans. Parallel Distrib. Syst. 8, 8 (Aug. 1997), 884--896.

[38]

Jongsoo Park, Mikhail Smelyanskiy, Narayanan Sundaram, and Pradeep Dubey. 2014. Sparsifying synchronization for high-performance shared-memory sparse triangular solver. In Proceedings of the 29th International Conference on Supercomputing (ISC’14), Vol. 8488. Springer-Verlag New York, Inc., New York, NY, 124--140.

Digital Library

[39]

Jongsoo Park, Mikhail Smelyanskiy, Karthikeyan Vaidyanathan, Alexander Heinecke, Dhiraj D. Kalamkar, Xing Liu, Md. Mosotofa A. Patwary, Yutong Lu, and Pradeep Dubey. 2014. Efficient shared-memory implementation of high-performance conjugate gradient benchmark and its application to unstructured matrices. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 945--955.

Digital Library

[40]

Yousef Saad. 2003. Iterative Methods for Sparse Linear Systems (2nd ed.). Society for Industrial and Applied Mathematics.

Digital Library

[41]

Toby Simpson, Dimosthenis Pasadakis, Drosos Kourounis, Kohei Fujita, Takuma Yamaguchi, Tsuyoshi Ichimura, and Olaf Schenk. 2018. Balanced graph partition refinement using the graph p-Laplacian. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC’18). ACM, New York, NY, Article 8, 11 pages.

Digital Library

[42]

SpMP Development Team. 2015. Sparse Matrix Pre-processing Library. Retrieved from https://github.com/IntelLabs/SpMP.

[43]

Sivan Toledo. 1997. Improving the memory-system performance of sparse matrix-vector multiplication. IBM J. Res. Dev. 41, 6 (Nov. 1997), 711--726.

Digital Library

[44]

Jan Treibig, Georg Hager, and Gerhard Wellein. 2010. LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. In Proceedings of the 1st International Workshop on Parallel Software Tools and Tool Infrastructures.

Digital Library

[45]

Ulrike von Luxburg. 2007. A tutorial on spectral clustering. Stat. Comput. 17, 4 (01 Dec. 2007), 395--416.

[46]

Richard Vuduc, James W. Demmel, and Katherine A. Yelick. 2005. OSKI: A library of automatically tuned sparse matrix kernels. J. Phys.: Conf. Series 16, 1 (2005), 521. Retrieved from http://stacks.iop.org/1742-6596/16/i=1/a=071.

[47]

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. 2009. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Comput. 35, 3 (Mar. 2009), 178--194.

Digital Library

[48]

Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (Apr. 2009), 65--76.

Digital Library

[49]

Albert-Jan N. Yzelman. 2011. Fast Sparse Matrix-Vector Multiplication by Partitioning and Reordering. Ph.D. Dissertation. Utrecht University, Utrecht. Retrieved from https://dspace.library.uu.nl/handle/1874/210147.

[50]

Albert-Jan N. Yzelman. 2015. Generalised vectorisation for sparse matrix-vector multiplication. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms (IA3’15). ACM, New York, NY, Article 6, 8 pages.

[51]

Albert-Jan N. Yzelman and Rob H. Bisseling. 2009. Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods. SIAM J. Sci. Comput. 31, 4 (2009), 3128--3154.

Digital Library

Cited By

Alappat CThies JHager GFehske HWellein G(2024)Algebraic temporal blocking for sparse iterative solvers on multi-core CPUsThe International Journal of High Performance Computing Applications10.1177/10943420241283828Online publication date: 25-Sep-2024
https://doi.org/10.1177/10943420241283828
Huang ZPanozzo DZorin D(2024)Optimized shock-protecting microstructuresACM Transactions on Graphics10.1145/368776543:6(1-21)Online publication date: 19-Dec-2024
https://dl.acm.org/doi/10.1145/3687765
Huang ZTozoni DGjoka AFerguson ZSchneider TPanozzo DZorin D(2024)Differentiable solver for time-dependent deformation problems with contactACM Transactions on Graphics10.1145/365764843:3(1-30)Online publication date: 26-Apr-2024
https://dl.acm.org/doi/10.1145/3657648
Show More Cited By

Index Terms

A Recursive Algebraic Coloring Technique for Hardware-efficient Symmetric Sparse Matrix-vector Multiplication
1. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory
      1. Graph algorithms

Recommendations

Optimization of quasi-diagonal matrix-vector multiplication on GPU

Sparse matrix-vector multiplication SpMV is of singular importance in sparse linear algebra, which is an important issue in scientific computing and engineering practice. Much effort has been put into accelerating SpMV, and a few parallel solutions have ...
GPU accelerated sparse matrix-vector multiplication and sparse matrix-transpose vector multiplication

Many high performance computing applications require computing both sparse matrix-vector product SMVP and sparse matrix-transpose vector product SMTVP for better overall performance. Under such a circumstance, it is critical to maintain a similarly high ...
Sparse matrix-vector multiplication on the Single-Chip Cloud Computer many-core processor

The microprocessor industry has responded to memory, power and ILP walls by turning to many-core processors, increasing parallelism as the primary method to improve processor performance. These processors are expected to consist of tens or even hundreds ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Parallel Computing

ACM Transactions on Parallel Computing Volume 7, Issue 3

Special Issue on PPoPP 2017 (Part 2) and Regular Papers

September 2020

182 pages

ISSN:2329-4949

EISSN:2329-4957

DOI:10.1145/3407694

Editor:
David A. Bader
New Jersey Institute of Technology, USA

Issue’s Table of Contents

Copyright © 2020 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 June 2020

Accepted: 01 April 2020

Revised: 01 January 2020

Received: 01 June 2019

Published in TOPC Volume 7, Issue 3

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

Deutsche Forschungsgemeinschaft

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

73
Total Citations
View Citations
1,939
Total Downloads

Downloads (Last 12 months)604
Downloads (Last 6 weeks)82

Reflects downloads up to 14 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Alappat CThies JHager GFehske HWellein G(2024)Algebraic temporal blocking for sparse iterative solvers on multi-core CPUsThe International Journal of High Performance Computing Applications10.1177/10943420241283828Online publication date: 25-Sep-2024
https://doi.org/10.1177/10943420241283828
Huang ZPanozzo DZorin D(2024)Optimized shock-protecting microstructuresACM Transactions on Graphics10.1145/368776543:6(1-21)Online publication date: 19-Dec-2024
https://dl.acm.org/doi/10.1145/3687765
Huang ZTozoni DGjoka AFerguson ZSchneider TPanozzo DZorin D(2024)Differentiable solver for time-dependent deformation problems with contactACM Transactions on Graphics10.1145/365764843:3(1-30)Online publication date: 26-Apr-2024
https://dl.acm.org/doi/10.1145/3657648
Bruno OPandey A(2024)Direct/Iterative Hybrid Solver for Scattering by Inhomogeneous MediaSIAM Journal on Scientific Computing10.1137/22M152154746:2(A1298-A1326)Online publication date: 11-Apr-2024
https://doi.org/10.1137/22M1521547
Voitus AArghir MHassini M(2024)The Effect of Air–Water Mixture on the Dynamic Response of a Hydrodynamic Journal Bearing With Inclined GroovesJournal of Engineering for Gas Turbines and Power10.1115/1.4066716147:4Online publication date: 26-Oct-2024
https://doi.org/10.1115/1.4066716
Zhao RDong MChen LHu JBagci H(2024)A Coupled Hybridizable Discontinuous Galerkin and Boundary Integral Method for Analyzing Electromagnetic ScatteringIEEE Transactions on Antennas and Propagation10.1109/TAP.2023.329174672:1(75-88)Online publication date: Jan-2024
https://doi.org/10.1109/TAP.2023.3291746
Okanovic PKwasniewski GLabini PBesta MVella FHoefler T(2024)High Performance Unstructured SpMM Computation Using Tensor CoresProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00060(1-14)Online publication date: 17-Nov-2024
https://dl.acm.org/doi/10.1109/SC41406.2024.00060
Lee CLee J(2024)A Near-Field Preconditioner for Surface Integral Equations Based on Nested Dissection ReorderingIEEE Antennas and Wireless Propagation Letters10.1109/LAWP.2024.335878523:5(1448-1452)Online publication date: May-2024
https://doi.org/10.1109/LAWP.2024.3358785
Tareen AMeyer MPlessl CKenter T(2024)HiHiSpMV: Sparse Matrix Vector Multiplication with Hierarchical Row Reductions on FPGAs with High Bandwidth Memory2024 IEEE 32nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)10.1109/FCCM60383.2024.00014(32-42)Online publication date: 5-May-2024
https://doi.org/10.1109/FCCM60383.2024.00014
Wang JZhou LGao ZLiu P(2024)The phase-field fracture model enriched by interpolation cover functions for brittle fracture problemsThin-Walled Structures10.1016/j.tws.2024.111724197(111724)Online publication date: Apr-2024
https://doi.org/10.1016/j.tws.2024.111724
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

HTML Format

View this article in HTML Format.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents