[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article
Open access

A Recursive Algebraic Coloring Technique for Hardware-efficient Symmetric Sparse Matrix-vector Multiplication

Published: 29 June 2020 Publication History

Abstract

The symmetric sparse matrix-vector multiplication (SymmSpMV) is an important building block for many numerical linear algebra kernel operations or graph traversal applications. Parallelizing SymmSpMV on today’s multicore platforms with up to 100 cores is difficult due to the need to manage conflicting updates on the result vector. Coloring approaches can be used to solve this problem without data duplication, but existing coloring algorithms do not take load balancing and deep memory hierarchies into account, hampering scalability and full-chip performance. In this work, we propose the recursive algebraic coloring engine (RACE), a novel coloring algorithm and open-source library implementation that eliminates the shortcomings of previous coloring methods in terms of hardware efficiency and parallelization overhead. We describe the level construction, distance-k coloring, and load balancing steps in RACE, use it to parallelize SymmSpMV, and compare its performance on 31 sparse matrices with other state-of-the-art coloring techniques and Intel MKL on two modern multicore processors. RACE outperforms all other approaches substantially. By means of a parameterized roofline model, we analyze the SymmSpMV performance in detail and discuss outliers. While we focus on SymmSpMV in this article, our algorithm and software are applicable to any sparse matrix operation with data dependencies that can be resolved by distance-k coloring.

References

[1]
Andreas Alvermann. 2019. ScaMaC: The Scalable Matrix Collection. Retrieved from https://bitbucket.org/essex/matrixcollection/.
[2]
Doruk Bozdağ, Ümit Çatalyürek, Assefaw H. Gebremedhin, Fredrik Manne, Erik G. Boman, and Füsun Özgüner. 2010. Distributed-memory parallel algorithms for distance-2 coloring and related problems in derivative computation. SIAM J. Sci. Comput. 32, 4 (2010), 2418--2446.
[3]
Doruk Bozdağ, Assefaw H. Gebremedhin, Fredrik Manne, Erik G. Boman, and Umit V. Catalyurek. 2008. A framework for scalable greedy coloring on distributed-memory parallel computers. J. Parallel Distrib. Comput. 68, 4 (2008), 515--535.
[4]
Aydin Buluç, Jeremy T. Fineman, Matteo Frigo, John R. Gilbert, and Charles E. Leiserson. 2009. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In Proceedings of the 21st Symposium on Parallelism in Algorithms and Architectures (SPAA’09). ACM, New York, NY, 233--244.
[5]
Aydin Buluç, Samuel Williams, Leonid Oliker, and James Demmel. 2011. Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication. In Proceedings of the IEEE International Parallel 8 Distributed Processing Symposium (IPDPS’11). IEEE Computer Society, Washington, DC, 721--733.
[6]
Ümit V. Catalyurek and Cevdet Aykanat. 1999. Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Trans. Parallel Distrib. Syst. 10, 7 (1999), 673--693.
[7]
Elizabeth Cuthill. 1972. Several Strategies for Reducing the Bandwidth of Matrices. Springer US, Boston, MA, 157--166.
[8]
Timothy A. Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38, 1, Article 1 (Dec. 2011), 25 pages.
[9]
Josep Díaz, Jordi Petit, and Maria Serna. 2002. A survey of graph layout problems. ACM Comput. Surv. 34, 3 (Sept. 2002), 313--356.
[10]
Athena Elafrou, Vasileios Karakasis, Theodoros Gkountouvas, Kornilios Kourtis, Georgios Goumas, and Nectarios Koziris. 2018. SparseX: A library for high-performance sparse matrix-vector multiplication on multicore platforms. ACM Trans. Math. Softw. 44, 3, Article 26 (Jan. 2018), 32 pages.
[11]
David J. Evans. 1984. Parallel S.O.R. iterative methods. Parallel Comput. 1, 1 (Aug. 1984), 3--18.
[12]
Fujitsu. 2019. Retrieved from https://www.fujitsu.com/global/Images/post-k_supercomputer_with_fujitsu%27s_original_cpu_a64fx_powered_by_arm_isa.pdf.
[13]
Martin Galgon, Lukas Krämer, Jonas Thies, Achim Basermann, and Bruno Lang. 2015. On the parallel iterative solution of linear systems arising in the FEAST algorithm for computing inner eigenvalues. Parallel Comput. 49, C (Nov. 2015), 153--163.
[14]
Assefaw H. Gebremedhin and Fredrik Manne. 2000. Scalable parallel graph coloring algorithms. Concurr.: Pract. Exper. 12, 12 (2000), 1131--1146.
[15]
Assefaw Hadish Gebremedhin, Fredrik Manne, and Alex Pothen. 2002. Parallel distance-k coloring algorithms for numerical optimization. In Proceedings of the 8th International Euro-Par Conference on Parallel Processing (Euro-Par’02). Springer-Verlag, London, UK, 912--921. Retrieved from http://dl.acm.org/citation.cfm?id=646667.699892.
[16]
Assefaw H. Gebremedhin, Duc Nguyen, Md. Mostofa Ali Patwary, and Alex Pothen. 2013. ColPack: Software for graph coloring and related problems in scientific computing. ACM Trans. Math. Softw. 40, 1, Article 1 (Oct. 2013), 31 pages.
[17]
Theodoros Gkountouvas, Vasileios Karakasis, Kornilios Kourtis, Georgios Goumas, and Nectarios Koziris. 2013. Improving the performance of the symmetric sparse matrix-vector multiplication in multicore. In Proceedings of the IEEE 27th International Symposium on Parallel and Distributed Processing. 273--283.
[18]
Willam D. Gropp, Dinesh K. Kaushik, David E. Keyes, and Barry F. Smith. 2000. Towards realistic performance bounds for implicit CFD codes. In Parallel Computational Fluid Dynamics 1999, D. Keyes, J. Periaux, A. Ecer, N. Satofuka, and P. Fox (Eds.). Elsevier, 241--248.
[19]
Eun-Jin Im, Katherine Yelick, and Richard Vuduc. 2004. Sparsity: Optimization framework for sparse matrix kernels. Int. J. High Perf. Comput. Applic. 18, 1 (2004), 135--158.
[20]
Intel. 2019. Intel Math Kernel Library. Retrieved from https://software.intel.com/en-us/mkl.
[21]
Takeshi Iwashita, Hiroshi Nakashima, and Yasuhito Takahashi. 2012. Algebraic block multi-color ordering method for parallel multi-threaded sparse triangular solver in ICCG method. In Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium (IPDPS’12). IEEE Computer Society, Washington, DC, 474--483.
[22]
Mark T. Jones and Paul E. Plassmann. 1994. Scalable iterative solution of sparse linear systems. Parallel Comput. 20, 5 (May 1994), 753--773.
[23]
George Karypis and Vipin Kumar. 1998. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20, 1 (1998), 359--392.
[24]
Kokkos. 2018. Kokkos C++ Performance Portability Programming EcoSystem: The Programming Model—Parallel Execution and Memory Abstraction. Retrieved from http://trilinos.sandia.gov/packages/kokkos.
[25]
Moritz Kreutzer, Dominik Ernst, Alan R. Bishop, Holger Fehske, Georg Hager, Kengo Nakajima, and Gerhard Wellein. 2018. Chebyshev filter diagonalization on modern manycore processors and GPGPUs. In High Performance Computing, Rio Yokota, Michèle Weiland, David Keyes, and Carsten Trinitis (Eds.). Springer International Publishing, Cham, 329--349.
[26]
Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, and Alan R. Bishop. 2014. A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units. SIAM J. Sci. Comput. 36, 5 (2014), C401--C423.
[27]
Marcin Krotkiewski and Marcin Dabrowski. 2010. Parallel symmetric sparse matrix-vector product on scalar multi-core CPUs. Parallel Comput. 36, 4 (Apr. 2010), 181--198.
[28]
C. Y. Lee. 1961. An algorithm for path connections and its applications. IRE Trans. Electron. Comput. EC-10, 3 (Sept. 1961), 346--365.
[29]
Ang Li, Weifeng Liu, Mads R. B. Kristensen, Brian Vinter, Hao Wang, Kaixi Hou, Andres Marquez, and Shuaiwen Leon Song. 2017. Exploring and analyzing the real impact of modern on-package memory on HPC scientific kernels. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’17). ACM, New York, NY, Article 26, 14 pages.
[30]
Weifeng Liu and Brian Vinter. 2015. CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS’15). ACM, New York, NY, 339--350.
[31]
Weifeng Liu and Brian Vinter. 2015. Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors. Parallel Comput. 49, C (Nov. 2015), 179--193.
[32]
Xing Liu, Mikhail Smelyanskiy, Edmond Chow, and Pradeep Dubey. 2013. Efficient sparse matrix-vector multiplication on x86-based many-core processors. In Proceedings of the 27th International ACM Conference on Supercomputing (ICS’13). ACM, New York, NY, 273--282.
[33]
Hao Lu, Mahantesh Halappanavar, Daniel Chavarría-Miranda, Assefaw H. Gebremedhin, Ajay Panyala, and Ananth Kalyanaraman. 2017. Algorithms for balanced graph colorings with applications in parallel computing. IEEE Trans. Parallel Distrib. Syst. 28, 5 (May 2017), 1240--1256.
[34]
Michele Martone. 2014. Efficient multithreaded untransposed, transposed, or symmetric sparse matrix-vector multiplication with the recursive sparse blocks format. Parallel Comput. 40, 7 (July 2014), 251--270.
[35]
James McQueen, Marina Meilă, Jacob VanderPlas, and Zhongyue Zhang. 2016. Megaman: Scalable manifold learning in Python. J. Mach. Learn. Res. 17, 148 (2016), 1--5. Retrieved from http://jmlr.org/papers/v17/16-109.html.
[36]
Piotr Mironowicz, Adam Dziekonski, and Michał Mrozowski. 2015. A task-scheduling approach for efficient sparse symmetric matrix-vector multiplication on a GPU. SIAM J. Sci. Comput. 37, 6 (2015), C643--C666.
[37]
Chao-Wei Ou and Sanjay Ranka. 1997. Parallel incremental graph partitioning. IEEE Trans. Parallel Distrib. Syst. 8, 8 (Aug. 1997), 884--896.
[38]
Jongsoo Park, Mikhail Smelyanskiy, Narayanan Sundaram, and Pradeep Dubey. 2014. Sparsifying synchronization for high-performance shared-memory sparse triangular solver. In Proceedings of the 29th International Conference on Supercomputing (ISC’14), Vol. 8488. Springer-Verlag New York, Inc., New York, NY, 124--140.
[39]
Jongsoo Park, Mikhail Smelyanskiy, Karthikeyan Vaidyanathan, Alexander Heinecke, Dhiraj D. Kalamkar, Xing Liu, Md. Mosotofa A. Patwary, Yutong Lu, and Pradeep Dubey. 2014. Efficient shared-memory implementation of high-performance conjugate gradient benchmark and its application to unstructured matrices. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 945--955.
[40]
Yousef Saad. 2003. Iterative Methods for Sparse Linear Systems (2nd ed.). Society for Industrial and Applied Mathematics.
[41]
Toby Simpson, Dimosthenis Pasadakis, Drosos Kourounis, Kohei Fujita, Takuma Yamaguchi, Tsuyoshi Ichimura, and Olaf Schenk. 2018. Balanced graph partition refinement using the graph p-Laplacian. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC’18). ACM, New York, NY, Article 8, 11 pages.
[42]
SpMP Development Team. 2015. Sparse Matrix Pre-processing Library. Retrieved from https://github.com/IntelLabs/SpMP.
[43]
Sivan Toledo. 1997. Improving the memory-system performance of sparse matrix-vector multiplication. IBM J. Res. Dev. 41, 6 (Nov. 1997), 711--726.
[44]
Jan Treibig, Georg Hager, and Gerhard Wellein. 2010. LIKWID: A lightweight performance-oriented tool suite for x86 multicore environments. In Proceedings of the 1st International Workshop on Parallel Software Tools and Tool Infrastructures.
[45]
Ulrike von Luxburg. 2007. A tutorial on spectral clustering. Stat. Comput. 17, 4 (01 Dec. 2007), 395--416.
[46]
Richard Vuduc, James W. Demmel, and Katherine A. Yelick. 2005. OSKI: A library of automatically tuned sparse matrix kernels. J. Phys.: Conf. Series 16, 1 (2005), 521. Retrieved from http://stacks.iop.org/1742-6596/16/i=1/a=071.
[47]
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, and James Demmel. 2009. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Comput. 35, 3 (Mar. 2009), 178--194.
[48]
Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (Apr. 2009), 65--76.
[49]
Albert-Jan N. Yzelman. 2011. Fast Sparse Matrix-Vector Multiplication by Partitioning and Reordering. Ph.D. Dissertation. Utrecht University, Utrecht. Retrieved from https://dspace.library.uu.nl/handle/1874/210147.
[50]
Albert-Jan N. Yzelman. 2015. Generalised vectorisation for sparse matrix-vector multiplication. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms (IA3’15). ACM, New York, NY, Article 6, 8 pages.
[51]
Albert-Jan N. Yzelman and Rob H. Bisseling. 2009. Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods. SIAM J. Sci. Comput. 31, 4 (2009), 3128--3154.

Cited By

View all
  • (2024)Algebraic temporal blocking for sparse iterative solvers on multi-core CPUsThe International Journal of High Performance Computing Applications10.1177/10943420241283828Online publication date: 25-Sep-2024
  • (2024)Optimized shock-protecting microstructuresACM Transactions on Graphics10.1145/368776543:6(1-21)Online publication date: 19-Dec-2024
  • (2024)Differentiable solver for time-dependent deformation problems with contactACM Transactions on Graphics10.1145/365764843:3(1-30)Online publication date: 26-Apr-2024
  • Show More Cited By

Index Terms

  1. A Recursive Algebraic Coloring Technique for Hardware-efficient Symmetric Sparse Matrix-vector Multiplication

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Parallel Computing
    ACM Transactions on Parallel Computing  Volume 7, Issue 3
    Special Issue on PPoPP 2017 (Part 2) and Regular Papers
    September 2020
    182 pages
    ISSN:2329-4949
    EISSN:2329-4957
    DOI:10.1145/3407694
    Issue’s Table of Contents
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 29 June 2020
    Accepted: 01 April 2020
    Revised: 01 January 2020
    Received: 01 June 2019
    Published in TOPC Volume 7, Issue 3

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Sparse matrix
    2. graph algorithms
    3. graph coloring
    4. memory hierarchies
    5. scheduling
    6. sparse symmetric matrix-vector multiplication

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)604
    • Downloads (Last 6 weeks)82
    Reflects downloads up to 14 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Algebraic temporal blocking for sparse iterative solvers on multi-core CPUsThe International Journal of High Performance Computing Applications10.1177/10943420241283828Online publication date: 25-Sep-2024
    • (2024)Optimized shock-protecting microstructuresACM Transactions on Graphics10.1145/368776543:6(1-21)Online publication date: 19-Dec-2024
    • (2024)Differentiable solver for time-dependent deformation problems with contactACM Transactions on Graphics10.1145/365764843:3(1-30)Online publication date: 26-Apr-2024
    • (2024)Direct/Iterative Hybrid Solver for Scattering by Inhomogeneous MediaSIAM Journal on Scientific Computing10.1137/22M152154746:2(A1298-A1326)Online publication date: 11-Apr-2024
    • (2024)The Effect of Air–Water Mixture on the Dynamic Response of a Hydrodynamic Journal Bearing With Inclined GroovesJournal of Engineering for Gas Turbines and Power10.1115/1.4066716147:4Online publication date: 26-Oct-2024
    • (2024)A Coupled Hybridizable Discontinuous Galerkin and Boundary Integral Method for Analyzing Electromagnetic ScatteringIEEE Transactions on Antennas and Propagation10.1109/TAP.2023.329174672:1(75-88)Online publication date: Jan-2024
    • (2024)High Performance Unstructured SpMM Computation Using Tensor CoresProceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis10.1109/SC41406.2024.00060(1-14)Online publication date: 17-Nov-2024
    • (2024)A Near-Field Preconditioner for Surface Integral Equations Based on Nested Dissection ReorderingIEEE Antennas and Wireless Propagation Letters10.1109/LAWP.2024.335878523:5(1448-1452)Online publication date: May-2024
    • (2024)HiHiSpMV: Sparse Matrix Vector Multiplication with Hierarchical Row Reductions on FPGAs with High Bandwidth Memory2024 IEEE 32nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)10.1109/FCCM60383.2024.00014(32-42)Online publication date: 5-May-2024
    • (2024)The phase-field fracture model enriched by interpolation cover functions for brittle fracture problemsThin-Walled Structures10.1016/j.tws.2024.111724197(111724)Online publication date: Apr-2024
    • Show More Cited By

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    HTML Format

    View this article in HTML Format.

    HTML Format

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media