[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3626202.3637568acmconferencesArticle/Chapter ViewAbstractPublication PagesfpgaConference Proceedingsconference-collections
research-article
Open access

LevelST: Stream-based Accelerator for Sparse Triangular Solver

Published: 02 April 2024 Publication History

Abstract

Over the past decade, much progress has been made to advance the acceleration of sparse linear operators such as SpMM and SpMV on FPGAs. Nevertheless, few works have attempted to address sparse triangular solver (SpTRSV) acceleration, and the performance boost is limited. SpTRSV is an elementary linear operator for many numerical methods, such as the least-square method. These methods, among others, are widely used in various areas, such as physical simulation and signal processing. Therefore, accelerating SpTRSV is crucial. However, many challenges impede accelerating SpTRSV, including (1) resolving dependencies between elements during forward or backward substitutions, (2) random access and unbalanced workloads across memory channels due to sparsity, (3) latency incurred by off-chip memory access for large matrices or vectors, and (4) data reuse for an unpredictable data sharing pattern. To address these issues, we have designed LevelST, the first FPGA accelerator leveraging high bandwidth memory (HBM) for solving sparse triangular systems. LevelST features (1) algorithm-hardware co-design of stream-based dependency resolution with reduced off-chip data movement, (2) resource sharing that improves resource utilization to scale up the architecture, (3) index modulo scheduling to balance workload, and (4) selective data prefetching from off-chip memory. LevelST is prototyped on an AMD Xilinx U280 HBM FPGA and evaluated with 16 sparse triangular matrices. Compared with the NVIDIA V100 and RTX 3060 GPUs over the cuSPARSE library, LevelST achieves a 2.65x speedup and 9.82x higher energy efficiency than the best of the V100 GPU and RTX 3060 GPU. The code is released on https://github.com/OswaldHe/LevelST (DOI: https://doi.org/10.5281/zenodo.10463345).

References

[1]
[n. d.]. Alveo U280 Data Center Accelerator Card Data Sheet. https://docs.xilinx. com/r/en-US/ds963-u280
[2]
[n. d.]. NVIDIA V100. https://www.nvidia.com/en-us/data-center/v100/
[3]
[n. d.]. UltraScale Architecture Libraries Guide (UG974). https://docs.xilinx.com/ r/2021.1-English/ug974-vivado-ultrascale-libraries/RAMB36E2
[4]
Patrick R. Amestoy, Timothy A. Davis, and Iain S. Duff. 1996. An approximate minimum degree ordering algorithm. SIAM J. Matrix Anal. Appl. 17, 4 (1996), 886--905.
[5]
Mario Baldi, Guido Marchetto, and Yoram Ofek. 2007. A scalable solution for engineering streaming traffic in the future internet. Computer Networks 51, 14 (2007), 4092--4111.
[6]
Richard H. Bartels and Gene H. Golub. 1969. The simplex method of linear programming using LU decomposition. Commun. ACM 12, 5 (1969), 266--268.
[7]
Edmond Chow and Aftab Patel. 2015. Fine-grained parallel incomplete LU factorization. SIAM Journal on Scientific Computing 37, 2 (2015), C169--C193.
[8]
Jason Cong and Yi Zou. 2010. A comparative study on the architecture templates for dynamic nested loops. In 2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines. IEEE, 251--254.
[9]
Elizabeth Cuthill and James McKee. 1969. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th national conference. 157--172.
[10]
Timothy A Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 38, 1 (2011), 1--25.
[11]
Federico Favaro, Ernesto Dufrechou, Pablo Ezzatti, and Juan P. Oliver. 2020. Exploring FPGA optimizations to compute sparse numerical linear algebra kernels. In Applied Reconfigurable Computing. Architectures, Tools, and Applications: 16th International Symposium, ARC 2020, Toledo, Spain, April 1--3, 2020, Proceedings 16. Springer, 258--268.
[12]
Alessandra Flammini, Daniele Marioli, Emiliano Sisinni, and Andrea Taroni. 2007. Least mean square method for LVDT signal processing. IEEE Transactions on Instrumentation and Measurement 56, 6 (2007), 2294--2300.
[13]
Jeremy Fowers, Kalin Ovtcharov, Karin Strauss, Eric S. Chung, and Greg Stitt. 2014. A high memory bandwidth fpga accelerator for sparse matrix-vector multiplication. In 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines. IEEE, 36--43.
[14]
John R Gilbert, Cleve Moler, and Robert Schreiber. 1992. Sparse matrices in MATLAB: Design and implementation. SIAM journal on matrix analysis and applications 13, 1 (1992), 333--356.
[15]
Licheng Guo, Yuze Chi, Jason Lau, Linghao Song, Xingyu Tian, Moazin Khatti, Weikang Qiao, Jie Wang, Ecenur Ustun, Zhenman Fang, et al. 2022. TAPA: a scalable task-parallel dataflow programming framework for modern FPGAs with co-optimization of HLS and physical design. arXiv preprint arXiv:2209.02663 (2022).
[16]
Licheng Guo, Yuze Chi, Jie Wang, Jason Lau, Weikang Qiao, Ecenur Ustun, Zhiru Zhang, and Jason Cong. 2021. AutoBridge: Coupling coarse-grained floorplanning and pipelining for high-frequency HLS design on multi-die FPGAs. In The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 81--92.
[17]
Licheng Guo, Jason Lau, Zhenyuan Ruan, Peng Wei, and Jason Cong. 2019. Hardware acceleration of long read pairwise overlapping in genome sequencing: A race between fpga and gpu. In 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 127--135.
[18]
Per Christian Hansen, Victor Pereyra, and Godela Scherer. 2013. Least squares data fitting with applications. JHU Press.
[19]
Magnus R Hestenes, Eduard Stiefel, et al. 1952. Methods of conjugate gradients for solving linear systems. Journal of research of the National Bureau of Standards 49, 6 (1952), 409--436.
[20]
Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th annual international symposium on Computer architecture. 152--163.
[21]
Intel. [n. d.]. Intel Stratix 10 Datasheet. https://www.intel.com/content/www/ us/en/docs/programmable/683181/current/dsp-block-specifications.html
[22]
Intel. 2023. Intel Agilex 9 FPGA and SoC FPGA. https://www.intel.com/content/ www/us/en/products/details/fpga/agilex/9.html
[23]
Hongshin Jun, Jinhee Cho, Kangseol Lee, Ho-Young Son, Kwiwook Kim, Hanho Jin, and Keith Kim. 2017. Hbm (high bandwidth memory) dram technology and architecture. In 2017 IEEE International Memory Workshop (IMW). IEEE, 1--4.
[24]
David S Kershaw. 1978. The incomplete Cholesky-conjugate gradient method for the iterative solution of systems of linear equations. Journal of computational physics 26, 1 (1978), 43--65.
[25]
Süreyya Emre Kurt, Aravind Sukumaran-Rajam, Fabrice Rastello, and Ponnuswamy Sadayyapan. 2020. Efficient tiled sparse matrix multiplication through matrix signatures. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1--14.
[26]
Yih-Yih Lin. [n. d.]. A comprehensive study on the performance of implicit LSDYNA. https://www.dynalook.com/conferences/12th-international-ls-dynaconference/ computing-technologies27-a.pdf/view
[27]
Weifeng Liu, Ang Li, Jonathan Hogg, Iain S. Duff, and Brian Vinter. 2016. A synchronization-free algorithm for parallel sparse triangular solves. In Euro-Par 2016: Parallel Processing: 22nd International Conference on Parallel and Distributed Computing, Grenoble, France, August 24--26, 2016, Proceedings 22. Springer, 617-- 630.
[28]
Zhengyang Lu and Weifeng Liu. 2023. TileSpTRSV: a tiled algorithm for parallel sparse triangular solve on GPUs. CCF Transactions on High Performance Computing (2023), 1--15.
[29]
Zhengyang Lu, Yuyao Niu, and Weifeng Liu. 2020. Efficient block algorithms for parallel sparse triangular solve. In Proceedings of the 49th International Conference on Parallel Processing. 1--11.
[30]
Matt Martineau, Patrick Atkinson, and Simon McIntosh-Smith. 2018. Benchmarking the NVIDIA v100 GPU and tensor cores. In European Conference on Parallel Processing. Springer, 444--455.
[31]
Starting Matlab. 2012. Matlab. The MathWorks, Natick, MA (2012).
[32]
Aiichiro Nakano. 1997. Parallel multilevel preconditioned conjugate-gradient approach to variable-charge molecular dynamics. Computer Physics Communications 104, 1--3 (1997), 59--69.
[33]
Maxim Naumov, L Chien, Philippe Vandermersch, and Ujval Kapasi. 2010. Cusparse library. In GPU Technology Conference.
[34]
Nvidia. [n. d.]. NVIDIA AMPERE GA102 GPU ARCHITECTURE. https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpuarchitecture- whitepaper-v2.pdf
[35]
Lyle Pursell and SY Trimble. 1991. Gram-Schmidt orthogonalization by Gauss elimination. The American Mathematical Monthly 98, 6 (1991), 544--549.
[36]
Alvin C Rencher and MG Schimek. 1997. Methods of multivariate analysis. Computational Statistics 12, 4 (1997), 422--422.
[37]
Linghao Song, Yuze Chi, Licheng Guo, and Jason Cong. 2022. Serpens: A High Bandwidth Memory Based Accelerator for General-Purpose Sparse Matrix-Vector Multiplication. Association for Computing Machinery, New York, NY, USA, 211--216. https://doi.org/10.1145/3489517.3530420
[38]
Linghao Song, Yuze Chi, Atefeh Sohrabizadeh, Young-kyu Choi, Jason Lau, and Jason Cong. 2022. Sextans: A Streaming Accelerator for General-Purpose Sparse- Matrix Dense-Matrix Multiplication. In Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (Virtual Event, USA) (FPGA '22). Association for Computing Machinery, New York, NY, USA, 65--77. https://doi.org/10.1145/3490422.3502357
[39]
Linghao Song, Licheng Guo, Suhail Basalama, Yuze Chi, Robert F. Lucas, and Jason Cong. 2023. Callipepla: Stream Centric Instruction Set and Mixed Precision for Accelerating Conjugate Gradient Solver. In Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays. 247--258.
[40]
Jiya Su, Feng Zhang, Weifeng Liu, Bingsheng He, Ruofan Wu, Xiaoyong Du, and Rujia Wang. 2020. CapelliniSpTRSV: A thread-level synchronization-free sparse triangular solve on GPUs. In Proceedings of the 49th International Conference on Parallel Processing. 1--11.
[41]
Livermore Software Technology. 2022. LS-DYNA. https://www.lstc.com/ products/ls-dyna
[42]
Andreas Wachter. 2002. An interior point algorithm for large-scale nonlinear optimization with applications in process engineering. Ph.D. Dissertation. Carnegie Mellon University.
[43]
Xinliang Wang, Weifeng Liu, Wei Xue, and Li Wu. 2018. swSpTRSV: A fast sparse triangular solve with sparse level tile layout on Sunway architectures. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 338--353.
[44]
Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (2009), 65--76.
[45]
Yue Yu, Jie Chen, Tian Gao, and Mo Yu. 2019. DAG-GNN: DAG Structure Learning with Graph Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, 7154-- 7163. https://proceedings.mlr.press/v97/yu19a.html
[46]
Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA international symposium on field-programmable gate arrays. 161--170.

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
FPGA '24: Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays
April 2024
300 pages
ISBN:9798400704185
DOI:10.1145/3626202
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike International 4.0 License.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 April 2024

Check for updates

Badges

Author Tags

  1. accelerator
  2. sptrsv
  3. stream-based architecture

Qualifiers

  • Research-article

Funding Sources

Conference

FPGA '24
Sponsor:

Acceptance Rates

Overall Acceptance Rate 125 of 627 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 563
    Total Downloads
  • Downloads (Last 12 months)563
  • Downloads (Last 6 weeks)79
Reflects downloads up to 13 Dec 2024

Other Metrics

Citations

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media