
Efficient FPGA-Based Sparse Matrix–Vector Multiplication With Data Reuse-Aware Compression

Published: 31 May 2023

Abstract

Sparse matrix–vector multiplication (SpMV) on FPGAs has gained much attention. The performance of SpMV is mainly determined by the number of multiplications between nonzero matrix elements and the corresponding vector values per cycle. On the one hand, the off-chip memory bandwidth limits the number of nonzero matrix elements transferred from off-chip DDR to the FPGA chip per cycle. On the other hand, the irregular vector access pattern makes it challenging to fetch the corresponding vector values. In addition, the read-after-write (RAW) dependency in the accumulation process must be resolved to enable a fully pipelined design. In this work, we propose an efficient FPGA-based SpMV accelerator with data reuse-aware compression. The key observation is that repeated accesses to a vector value can be omitted by reusing the already-fetched data. Based on this observation, we propose a reordering algorithm that explicitly exploits the reuse of fetched vector values. Further, we propose a novel compressed format, called data reuse-aware compressed (DRC), to take full advantage of the data reuse, together with a fast format-conversion algorithm to shorten the preprocessing time. We also propose an HLS-friendly accumulator to resolve the RAW dependency. Finally, we implement and evaluate the proposed design on the Xilinx Zynq UltraScale+ ZCU106 platform with a set of sparse matrices from the SuiteSparse matrix collection. Compared with the state-of-the-art work, our design achieves an average 1.18× speedup without the DRC format and an average 1.57× speedup with it.
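The key observation above (a repeated access to a vector value can be served from already-fetched data if equal column indices are processed back-to-back) can be sketched in software. The following is a minimal illustration with hypothetical names, not the paper's DRC format or reordering algorithm: SpMV over a stream of COO nonzeros with a single-entry reuse register that skips the vector fetch when the column index has not changed.

```python
# Illustrative sketch (hypothetical names, not the paper's DRC format):
# stream SpMV over COO nonzeros with a single-entry reuse register, so
# consecutive nonzeros sharing a column index skip the vector fetch.
def spmv_stream(entries, x, n_rows):
    """entries: iterable of (row, col, val) nonzeros in processing order."""
    y = [0.0] * n_rows
    last_col = -1
    last_val = 0.0
    fetches = 0                      # vector-memory reads actually issued
    for r, c, v in entries:
        if c != last_col:            # miss: fetch x[c] from memory
            last_val = x[c]
            last_col = c
            fetches += 1
        y[r] += v * last_val         # hit: reuse the registered value
    return y, fetches
```

Reordering the nonzeros so that equal column indices are adjacent (e.g. sorting by column) reduces `fetches` while leaving `y` unchanged; the paper's reordering algorithm pursues the same effect subject to hardware constraints.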


Cited By

  • (2024) FPGA-Based Sparse Matrix Multiplication Accelerators: From State-of-the-Art to Future Opportunities. ACM Transactions on Reconfigurable Technology and Systems, 17(4), pp. 1–37. DOI: 10.1145/3687480. Online publication date: 28 Aug 2024.


Published In

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 42, Issue 12 (Dec. 2023), 830 pages

Publisher

IEEE Press

Qualifiers

• Research-article
