Research Article | Open Access

BestSF: A Sparse Meta-Format for Optimizing SpMV on GPU

Published: 04 September 2018

Abstract

The Sparse Matrix-Vector Multiplication (SpMV) kernel dominates the computing cost in numerous scientific applications. Many implementations based on different sparse formats have been proposed to improve this kernel on recent GPU architectures. However, it has been widely observed that there is no “best-for-all” sparse format for the SpMV kernel on GPU; indeed, without a careful selection of the sparse format to use, performance can degrade by an order of magnitude. To address this problem, we propose in this article BestSF (Best Sparse Format), a new learning-based sparse meta-format that automatically selects the most appropriate sparse format for a given input matrix. To do so, BestSF relies on a cost-sensitive classification system, trained using Weighted Support Vector Machines (WSVMs), to predict the best sparse format for each input sparse matrix. Our experimental results on two different NVIDIA GPU architectures, using a large number of real-world sparse matrices, show that BestSF achieves a noticeable overall performance improvement over using any single sparse format. While BestSF is trained to select the best sparse format in terms of performance (GFLOPS), our further experimental investigations reveal that it also leads, in most of the test cases, to the best energy efficiency (MFLOPS/W). To demonstrate its practical effectiveness, we also evaluate the performance and energy-efficiency improvements achieved when using BestSF as a building block in a GPU-based Preconditioned Conjugate Gradient (PCG) iterative solver.



Published In

ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 3 (September 2018), 322 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3274266

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 September 2018
Accepted: 01 May 2018
Revised: 01 April 2018
Received: 01 November 2017
Published in TACO Volume 15, Issue 3

Author Tags

  1. GPU computing
  2. Sparse matrix-vector multiplication (SpMV)
  3. energy efficiency
  4. iterative solvers
  5. performance modeling

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Key R&D Program of China
  • National Science Foundation of China

Cited By

  • (2024) Optimization of Large-Scale Sparse Matrix-Vector Multiplication on Multi-GPU Systems. ACM Transactions on Architecture and Code Optimization 21, 4, 1-24. DOI: 10.1145/3676847. Online publication date: 8-Jul-2024.
  • (2024) Matrix-free SBP-SAT finite difference methods and the multigrid preconditioner on GPUs. Proceedings of the 38th ACM International Conference on Supercomputing, 400-412. DOI: 10.1145/3650200.3656614. Online publication date: 30-May-2024.
  • (2024) Revisiting thread configuration of SpMV kernels on GPU. Journal of Parallel and Distributed Computing 185, C. DOI: 10.1016/j.jpdc.2023.104799. Online publication date: 4-Mar-2024.
  • (2023) A Survey of Accelerating Parallel Sparse Linear Algebra. ACM Computing Surveys 56, 1, 1-38. DOI: 10.1145/3604606. Online publication date: 28-Aug-2023.
  • (2023) GPU Sparse Matrix Vector Multiplication Optimization Based on ELLB Storage Format. Proceedings of the 2023 12th International Conference on Software and Computer Applications, 34-40. DOI: 10.1145/3587828.3587834. Online publication date: 23-Feb-2023.
  • (2023) Feature-based SpMV Performance Analysis on Contemporary Devices. 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 668-679. DOI: 10.1109/IPDPS54959.2023.00072. Online publication date: May 2023.
  • (2023) An adaptive approach for compression format based on bagging algorithm. International Journal of Parallel, Emergent and Distributed Systems 38, 5, 401-423. DOI: 10.1080/17445760.2023.2231291. Online publication date: 13-Jul-2023.
  • (2022) Mapping and Optimization Method of SpMV on Multi-DSP Accelerator. Electronics 11, 22, 3699. DOI: 10.3390/electronics11223699. Online publication date: 11-Nov-2022.
  • (2022) Auto-Selection of an Optimal Sparse Matrix Format in the Neuro-Simulator ANNarchy. Frontiers in Neuroinformatics 16. DOI: 10.3389/fninf.2022.877945. Online publication date: 23-May-2022.
  • (2022) fgSpMSpV: A Fine-grained Parallel SpMSpV Framework on HPC Platforms. ACM Transactions on Parallel Computing 9, 2, 1-29. DOI: 10.1145/3512770. Online publication date: 11-Apr-2022.
