More Web Proxy on the site http://driver.im/

research-article

yaSpMV: yet another SpMV framework on GPUs

Authors:

Huiyang ZhouAuthors Info & Claims

PPoPP '14: Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming

Pages 107 - 118

https://doi.org/10.1145/2555243.2555255

Published: 06 February 2014 Publication History

Abstract

SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although the previous work has shown impressive progress, load imbalance and high memory bandwidth remain the critical performance bottlenecks for SpMV. In this paper, we present our novel solutions to these problems. First, we devise a new SpMV format, called blocked compressed common coordinate (BCCOO), which uses bit flags to store the row indices in a blocked common coordinate (COO) format so as to alleviate the bandwidth problem. We further improve this format by partitioning the matrix into vertical slices to enhance the cache hit rates when accessing the vector to be multiplied. Second, we revisit the segmented scan approach for SpMV to address the load imbalance problem. We propose a highly efficient matrix-based segmented sum/scan for SpMV and further improve it by eliminating global synchronization. Then, we introduce an auto-tuning framework to choose optimization parameters based on the characteristics of input sparse matrices and target hardware platforms. Our experimental results on GTX680 GPUs and GTX480 GPUs show that our proposed framework achieves significant performance improvement over the vendor tuned CUSPARSE V5.0 (up to 229% and 65% on average on GTX680 GPUs, up to 150% and 42% on average on GTX480 GPUs) and some most recently proposed schemes (e.g., up to 195% and 70% on average over clSpMV on GTX680 GPUs, up to 162% and 40% on average over clSpMV on GTX480 GPUs).

References

[1]

N. Bell and M. Garland. Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors. SC, 2009.

Digital Library

[2]

A. Buluç, S. Williams, L. Oliker and J. Demmel. Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication. IPDPS, 2011.

Digital Library

[3]

A. Buluć, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. SPAA, 2009.

Digital Library

[4]

G. E. Blelloch, M. A. Heroux and M. Zagha. Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors. Tech. Rep. Carnegie Mellon University-CS-93-173, School of Computer Science, Carnegie Mellon University, Aug 1993.

[5]

G. E. Blelloch. Scans as Primitive Parallel Operations. IEEE Transactions on Computers, 1989.

Digital Library

[6]

J. Bolz, I. Farmer, E. Grinspun and P. Schröder. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. ACM Transactions on Graphics (TOG), July 2003.

Digital Library

[7]

J. W. Choi, A. Singh and R. W. Vuduc. Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs. PPoPP, 2010.

Digital Library

[8]

Y. Dotsenko, N. K. Govindaraju, P.-P. Sloan, C. Boyd and J. Manferdelli. Fast Scan Algorithms on Graphics Processors. ICS, 2008.

Digital Library

[9]

M. Harris, S. Sengupta, and J. D. Owens. CUDPP:CUDA Data Parallel Primitives Library. http://gpgpu.org/developer/cudpp

[10]

Z. Koza, M. Matyka, S. Szkoda and L. Miroslaw. Compressed Multiple-Row Storage Format. CoRR 2008.

[11]

K. Kourtis, V. Karakasis, G. Goumas and N. Koziris. CSX: An Extended Compression Format for SpMV on Shared Memory Systems. PPoPP, 2011.

Digital Library

[12]

A. Monakov, A. Lokhmotov and A. Avetisyan. Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures. HiPEAC, 2010.

Digital Library

[13]

Nvidia. CUSPARSE. https://developer.nvidia.com/ cusparse. http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

[14]

J. C. Pichel, F. F. Rivera, M. Fernández and A. Rodríguez. Optimization of sparse matrix-vector multiplication using reordering techniques on GPUs. Microprocessors and Microsystems, 36(2), 65--77, Mar 2012.

Digital Library

[15]

M. M. Baskaran and R. Bordawekar. Optimizing Sparse Matrix-Vector Multiplication on GPUs using Compile-time and Run-time Strategies. Technical Report RC24704 (W0812-047), IBM, Dec 2008.

[16]

B.-Y. Su and K. Keutzer. clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs. ICS, 2012.

Digital Library

[17]

X. Sun, Y. Zhang, T. Wang, X. Zhang, L. Yuan and L. Rao. Optimizing SpMV for Diagonal Sparse Matrices on GPU. ICPP, 2011.

Digital Library

[18]

S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan primitives for GPU computing. In Graphics Hardware 2007.

Digital Library

[19]

The Khronos OpenCL Working Group OpenCL. The Open Standard for Parallel Programming of Heterogeneous Systems. http://www.khronos.org/opencl/

[20]

W. Tang et al., Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes, SC 2013.

Digital Library

[21]

F. Vázquez, J. J. Fernández and E. M. Garzón. A new approach for sparse matrix vector product on NVIDIA GPUs. Concurrency Computat.: Pract. Exper. Sep 2010.

[22]

R. Vuduc, J. W. Demmel and K. A. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. SciDAC 2005.

[23]

S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. A. Yelick and J. W. Demmel. Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms. SC, 2007.

Digital Library

[24]

S. Yan, G. Long and Y. Zhang. StreamScan: Fast Scan Algorithms for GPUs without Global Barrier Synchronization, PPoPP, 2013.

Digital Library

Cited By

Gao JJi WWang Y(2024)Optimization of Large-Scale Sparse Matrix-Vector Multiplication on Multi-GPU SystemsACM Transactions on Architecture and Code Optimization10.1145/367684721:4(1-24)Online publication date: 8-Jul-2024
https://dl.acm.org/doi/10.1145/3676847
Chen AErickson BKozdon JChoi J(2024)Matrix-free SBP-SAT finite difference methods and the multigrid preconditioner on GPUsProceedings of the 38th ACM International Conference on Supercomputing10.1145/3650200.3656614(400-412)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3650200.3656614
Laut SBorrell RCasas MMencagli GDazzi PLowenthal DBadia R(2024)Extending Sparse Patterns to Improve Inverse Preconditioning on GPU ArchitecturesProceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing10.1145/3625549.3658683(200-213)Online publication date: 3-Jun-2024
https://dl.acm.org/doi/10.1145/3625549.3658683
Show More Cited By

Index Terms

yaSpMV: yet another SpMV framework on GPUs
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages

Recommendations

yaSpMV: yet another SpMV framework on GPUs
PPoPP '14

SpMV is a key linear algebra algorithm and has been widely used in many important application domains. As a result, numerous attempts have been made to optimize SpMV on GPUs to leverage their massive computational throughput. Although the previous work ...
clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs
ICS '12: Proceedings of the 26th ACM international conference on Supercomputing

Sparse matrix vector multiplication (SpMV) kernel is a key computation in linear algebra. Most iterative methods are composed of SpMV operations with BLAS1 updates. Therefore, researchers make extensive efforts to optimize the SpMV kernel in sparse ...
A Cross-Platform SpMV Framework on Many-Core Architectures

Sparse Matrix-Vector multiplication (SpMV) is a key operation in engineering and scientific computing. Although the previous work has shown impressive progress in optimizing SpMV on many-core architectures, load imbalance and high memory bandwidth ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

PPoPP '14: Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming

February 2014

412 pages

ISBN:9781450326568

DOI:10.1145/2555243

General Chair:
José Moreira
IBM Research, USA
,
Program Chair:
James Larus
EPFL, Switzerland

ACM SIGPLAN Notices Volume 49, Issue 8
PPoPP '14
August 2014
390 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2692916
Editors:
Mark W. Bailey
Hamilton College, Clinton, NY
,
Rajeev Balasubramonian
University of Utah
,
Al Davis
University of Utah
,
Sarita Adve
University of Illinois at Urbana-Champ
Issue’s Table of Contents

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 February 2014

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

PPoPP '14

Sponsor:

SIGPLAN

PPoPP '14: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 15 - 19, 2014

Florida, Orlando, USA

Acceptance Rates

PPoPP '14 Paper Acceptance Rate 28 of 184 submissions, 15%;

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

124
Total Citations
View Citations
879
Total Downloads

Downloads (Last 12 months)78
Downloads (Last 6 weeks)7

Reflects downloads up to 11 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Gao JJi WWang Y(2024)Optimization of Large-Scale Sparse Matrix-Vector Multiplication on Multi-GPU SystemsACM Transactions on Architecture and Code Optimization10.1145/367684721:4(1-24)Online publication date: 8-Jul-2024
https://dl.acm.org/doi/10.1145/3676847
Chen AErickson BKozdon JChoi J(2024)Matrix-free SBP-SAT finite difference methods and the multigrid preconditioner on GPUsProceedings of the 38th ACM International Conference on Supercomputing10.1145/3650200.3656614(400-412)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3650200.3656614
Laut SBorrell RCasas MMencagli GDazzi PLowenthal DBadia R(2024)Extending Sparse Patterns to Improve Inverse Preconditioning on GPU ArchitecturesProceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing10.1145/3625549.3658683(200-213)Online publication date: 3-Jun-2024
https://dl.acm.org/doi/10.1145/3625549.3658683
Wang LJia HXu LWei CLi KJiang XZhang Y(2024)VNEC: A Vectorized Non-Empty Column Format for SpMV on CPUs2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS57955.2024.00011(14-25)Online publication date: 27-May-2024
https://doi.org/10.1109/IPDPS57955.2024.00011
Zhao ZZhang GWu YHong RYang YFu Y(2024)Block-wise dynamic mixed-precision for sparse matrix-vector multiplication on GPUsThe Journal of Supercomputing10.1007/s11227-024-05949-680:10(13681-13713)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.1007/s11227-024-05949-6
Guo JLiu JWang QZhu X(2024)Optimizing CSR-Based SpMV on a New MIMD Architecture Pezy-SC3sAlgorithms and Architectures for Parallel Processing10.1007/978-981-97-0801-7_2(22-39)Online publication date: 1-Mar-2024
https://doi.org/10.1007/978-981-97-0801-7_2
Chu GHe YDong LDing ZChen DBai HWang XHu CButt AMi NChard K(2023)Efficient Algorithm Design of Optimizing SpMV on GPUProceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing10.1145/3588195.3593002(115-128)Online publication date: 7-Aug-2023
https://dl.acm.org/doi/10.1145/3588195.3593002
Lu YLiu WMohror KArnold DBadia R(2023)DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector MultiplicationProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607051(1-14)Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3581784.3607051
Hijma PHeldens SSclocco Avan Werkhoven BBal H(2023)Optimization Techniques for GPU ProgrammingACM Computing Surveys10.1145/357063855:11(1-81)Online publication date: 16-Mar-2023
https://dl.acm.org/doi/10.1145/3570638
Pan JXiao LTian MWang LYang CChen RRen ZLiu AZhu G(2023)hsSpMV: A Heterogeneous and SPM-aggregated SpMV for SW26010-Pro many-core processor2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)10.1109/CCGrid57682.2023.00016(62-70)Online publication date: May-2023
https://doi.org/10.1109/CCGrid57682.2023.00016
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents