
Efficient FPGA-Based Sparse Matrix–Vector Multiplication With Data Reuse-Aware Compression

Published: 31 May 2023

Abstract

Sparse matrix–vector multiplication (SpMV) on FPGAs has gained much attention. The performance of SpMV is mainly determined by the number of multiplications between nonzero matrix elements and the corresponding vector values per cycle. On the one hand, the off-chip memory bandwidth limits the number of nonzero matrix elements transferred from off-chip DDR to the FPGA chip per cycle. On the other hand, the irregular vector access pattern makes it challenging to fetch the corresponding vector values. In addition, the read-after-write (RAW) dependency in the accumulation process must be resolved to enable a fully pipelined design. In this work, we propose an efficient FPGA-based SpMV accelerator with data reuse-aware compression. The key observation is that repeated accesses to a vector value can be omitted by reusing the already-fetched data. Based on this observation, we propose a reordering algorithm that explicitly exploits the reuse of fetched vector values. Further, we propose a novel compressed format, called data reuse-aware compressed (DRC), to take full advantage of the data reuse, together with a fast format-conversion algorithm to shorten the preprocessing time. We also propose an HLS-friendly accumulator to resolve the RAW dependency. Finally, we implement and evaluate the proposed design on the Xilinx Zynq UltraScale+ ZCU106 platform with a set of sparse matrices from the SuiteSparse matrix collection. Compared with the state-of-the-art work, our design achieves an average 1.18× speedup without the DRC format and an average 1.57× speedup with it.
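The key observation above (a repeated access to a vector value can be served from already-fetched data if equal column indices are processed back-to-back) can be sketched in software. The following is a minimal illustration with hypothetical names, not the paper's DRC format or reordering algorithm: SpMV over a stream of COO nonzeros with a single-entry reuse register that skips the vector fetch when the column index has not changed.

```python
# Illustrative sketch (hypothetical names, not the paper's DRC format):
# stream SpMV over COO nonzeros with a single-entry reuse register, so
# consecutive nonzeros sharing a column index skip the vector fetch.
def spmv_stream(entries, x, n_rows):
    """entries: iterable of (row, col, val) nonzeros in processing order."""
    y = [0.0] * n_rows
    last_col = -1
    last_val = 0.0
    fetches = 0                      # vector-memory reads actually issued
    for r, c, v in entries:
        if c != last_col:            # miss: fetch x[c] from memory
            last_val = x[c]
            last_col = c
            fetches += 1
        y[r] += v * last_val         # hit: reuse the registered value
    return y, fetches
```

Reordering the nonzeros so that equal column indices are adjacent (e.g. sorting by column) reduces `fetches` while leaving `y` unchanged; the paper's reordering algorithm pursues the same effect subject to hardware constraints.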


Cited By

  • (2024) FPGA-Based Sparse Matrix Multiplication Accelerators: From State-of-the-Art to Future Opportunities. ACM Transactions on Reconfigurable Technology and Systems, 17(4), pp. 1–37. DOI: 10.1145/3687480. Online publication date: 28 Aug 2024.


Published In

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 42, Issue 12 (Dec. 2023), 830 pages

Publisher

IEEE Press

Qualifiers

• Research-article
