More Web Proxy on the site http://driver.im/

research-article

Open access

LevelST: Stream-based Accelerator for Sparse Triangular Solver

Authors:

Robert F. Lucas,

Jason CongAuthors Info & Claims

FPGA '24: Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays

Pages 67 - 77

https://doi.org/10.1145/3626202.3637568

Published: 02 April 2024 Publication History

Abstract

Over the past decade, much progress has been made to advance the acceleration of sparse linear operators such as SpMM and SpMV on FPGAs. Nevertheless, few works have attempted to address sparse triangular solver (SpTRSV) acceleration, and the performance boost is limited. SpTRSV is an elementary linear operator for many numerical methods, such as the least-square method. These methods, among others, are widely used in various areas, such as physical simulation and signal processing. Therefore, accelerating SpTRSV is crucial. However, many challenges impede accelerating SpTRSV, including (1) resolving dependencies between elements during forward or backward substitutions, (2) random access and unbalanced workloads across memory channels due to sparsity, (3) latency incurred by off-chip memory access for large matrices or vectors, and (4) data reuse for an unpredictable data sharing pattern. To address these issues, we have designed LevelST, the first FPGA accelerator leveraging high bandwidth memory (HBM) for solving sparse triangular systems. LevelST features (1) algorithm-hardware co-design of stream-based dependency resolution with reduced off-chip data movement, (2) resource sharing that improves resource utilization to scale up the architecture, (3) index modulo scheduling to balance workload, and (4) selective data prefetching from off-chip memory. LevelST is prototyped on an AMD Xilinx U280 HBM FPGA and evaluated with 16 sparse triangular matrices. Compared with the NVIDIA V100 and RTX 3060 GPUs over the cuSPARSE library, LevelST achieves a 2.65x speedup and 9.82x higher energy efficiency than the best of the V100 GPU and RTX 3060 GPU. The code is released on https://github.com/OswaldHe/LevelST (DOI: https://doi.org/10.5281/zenodo.10463345).

References

[1]

[n. d.]. Alveo U280 Data Center Accelerator Card Data Sheet. https://docs.xilinx. com/r/en-US/ds963-u280

[2]

[n. d.]. NVIDIA V100. https://www.nvidia.com/en-us/data-center/v100/

[3]

[n. d.]. UltraScale Architecture Libraries Guide (UG974). https://docs.xilinx.com/ r/2021.1-English/ug974-vivado-ultrascale-libraries/RAMB36E2

[4]

Patrick R. Amestoy, Timothy A. Davis, and Iain S. Duff. 1996. An approximate minimum degree ordering algorithm. SIAM J. Matrix Anal. Appl. 17, 4 (1996), 886--905.

Digital Library

[5]

Mario Baldi, Guido Marchetto, and Yoram Ofek. 2007. A scalable solution for engineering streaming traffic in the future internet. Computer Networks 51, 14 (2007), 4092--4111.

Digital Library

[6]

Richard H. Bartels and Gene H. Golub. 1969. The simplex method of linear programming using LU decomposition. Commun. ACM 12, 5 (1969), 266--268.

Digital Library

[7]

Edmond Chow and Aftab Patel. 2015. Fine-grained parallel incomplete LU factorization. SIAM Journal on Scientific Computing 37, 2 (2015), C169--C193.

Digital Library

[8]

Jason Cong and Yi Zou. 2010. A comparative study on the architecture templates for dynamic nested loops. In 2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines. IEEE, 251--254.

Digital Library

[9]

Elizabeth Cuthill and James McKee. 1969. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th national conference. 157--172.

Digital Library

[10]

Timothy A Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 38, 1 (2011), 1--25.

Digital Library

[11]

Federico Favaro, Ernesto Dufrechou, Pablo Ezzatti, and Juan P. Oliver. 2020. Exploring FPGA optimizations to compute sparse numerical linear algebra kernels. In Applied Reconfigurable Computing. Architectures, Tools, and Applications: 16th International Symposium, ARC 2020, Toledo, Spain, April 1--3, 2020, Proceedings 16. Springer, 258--268.

[12]

Alessandra Flammini, Daniele Marioli, Emiliano Sisinni, and Andrea Taroni. 2007. Least mean square method for LVDT signal processing. IEEE Transactions on Instrumentation and Measurement 56, 6 (2007), 2294--2300.

[13]

Jeremy Fowers, Kalin Ovtcharov, Karin Strauss, Eric S. Chung, and Greg Stitt. 2014. A high memory bandwidth fpga accelerator for sparse matrix-vector multiplication. In 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines. IEEE, 36--43.

Digital Library

[14]

John R Gilbert, Cleve Moler, and Robert Schreiber. 1992. Sparse matrices in MATLAB: Design and implementation. SIAM journal on matrix analysis and applications 13, 1 (1992), 333--356.

[15]

Licheng Guo, Yuze Chi, Jason Lau, Linghao Song, Xingyu Tian, Moazin Khatti, Weikang Qiao, Jie Wang, Ecenur Ustun, Zhenman Fang, et al. 2022. TAPA: a scalable task-parallel dataflow programming framework for modern FPGAs with co-optimization of HLS and physical design. arXiv preprint arXiv:2209.02663 (2022).

[16]

Licheng Guo, Yuze Chi, Jie Wang, Jason Lau, Weikang Qiao, Ecenur Ustun, Zhiru Zhang, and Jason Cong. 2021. AutoBridge: Coupling coarse-grained floorplanning and pipelining for high-frequency HLS design on multi-die FPGAs. In The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 81--92.

Digital Library

[17]

Licheng Guo, Jason Lau, Zhenyuan Ruan, Peng Wei, and Jason Cong. 2019. Hardware acceleration of long read pairwise overlapping in genome sequencing: A race between fpga and gpu. In 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 127--135.

[18]

Per Christian Hansen, Victor Pereyra, and Godela Scherer. 2013. Least squares data fitting with applications. JHU Press.

[19]

Magnus R Hestenes, Eduard Stiefel, et al. 1952. Methods of conjugate gradients for solving linear systems. Journal of research of the National Bureau of Standards 49, 6 (1952), 409--436.

[20]

Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th annual international symposium on Computer architecture. 152--163.

Digital Library

[21]

Intel. [n. d.]. Intel Stratix 10 Datasheet. https://www.intel.com/content/www/ us/en/docs/programmable/683181/current/dsp-block-specifications.html

[22]

Intel. 2023. Intel Agilex 9 FPGA and SoC FPGA. https://www.intel.com/content/ www/us/en/products/details/fpga/agilex/9.html

[23]

Hongshin Jun, Jinhee Cho, Kangseol Lee, Ho-Young Son, Kwiwook Kim, Hanho Jin, and Keith Kim. 2017. Hbm (high bandwidth memory) dram technology and architecture. In 2017 IEEE International Memory Workshop (IMW). IEEE, 1--4.

[24]

David S Kershaw. 1978. The incomplete Cholesky-conjugate gradient method for the iterative solution of systems of linear equations. Journal of computational physics 26, 1 (1978), 43--65.

[25]

Süreyya Emre Kurt, Aravind Sukumaran-Rajam, Fabrice Rastello, and Ponnuswamy Sadayyapan. 2020. Efficient tiled sparse matrix multiplication through matrix signatures. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1--14.

Digital Library

[26]

Yih-Yih Lin. [n. d.]. A comprehensive study on the performance of implicit LSDYNA. https://www.dynalook.com/conferences/12th-international-ls-dynaconference/ computing-technologies27-a.pdf/view

[27]

Weifeng Liu, Ang Li, Jonathan Hogg, Iain S. Duff, and Brian Vinter. 2016. A synchronization-free algorithm for parallel sparse triangular solves. In Euro-Par 2016: Parallel Processing: 22nd International Conference on Parallel and Distributed Computing, Grenoble, France, August 24--26, 2016, Proceedings 22. Springer, 617-- 630.

Digital Library

[28]

Zhengyang Lu and Weifeng Liu. 2023. TileSpTRSV: a tiled algorithm for parallel sparse triangular solve on GPUs. CCF Transactions on High Performance Computing (2023), 1--15.

[29]

Zhengyang Lu, Yuyao Niu, and Weifeng Liu. 2020. Efficient block algorithms for parallel sparse triangular solve. In Proceedings of the 49th International Conference on Parallel Processing. 1--11.

Digital Library

[30]

Matt Martineau, Patrick Atkinson, and Simon McIntosh-Smith. 2018. Benchmarking the NVIDIA v100 GPU and tensor cores. In European Conference on Parallel Processing. Springer, 444--455.

[31]

Starting Matlab. 2012. Matlab. The MathWorks, Natick, MA (2012).

[32]

Aiichiro Nakano. 1997. Parallel multilevel preconditioned conjugate-gradient approach to variable-charge molecular dynamics. Computer Physics Communications 104, 1--3 (1997), 59--69.

[33]

Maxim Naumov, L Chien, Philippe Vandermersch, and Ujval Kapasi. 2010. Cusparse library. In GPU Technology Conference.

[34]

Nvidia. [n. d.]. NVIDIA AMPERE GA102 GPU ARCHITECTURE. https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpuarchitecture- whitepaper-v2.pdf

[35]

Lyle Pursell and SY Trimble. 1991. Gram-Schmidt orthogonalization by Gauss elimination. The American Mathematical Monthly 98, 6 (1991), 544--549.

[36]

Alvin C Rencher and MG Schimek. 1997. Methods of multivariate analysis. Computational Statistics 12, 4 (1997), 422--422.

[37]

Linghao Song, Yuze Chi, Licheng Guo, and Jason Cong. 2022. Serpens: A High Bandwidth Memory Based Accelerator for General-Purpose Sparse Matrix-Vector Multiplication. Association for Computing Machinery, New York, NY, USA, 211--216. https://doi.org/10.1145/3489517.3530420

Digital Library

[38]

Linghao Song, Yuze Chi, Atefeh Sohrabizadeh, Young-kyu Choi, Jason Lau, and Jason Cong. 2022. Sextans: A Streaming Accelerator for General-Purpose Sparse- Matrix Dense-Matrix Multiplication. In Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (Virtual Event, USA) (FPGA '22). Association for Computing Machinery, New York, NY, USA, 65--77. https://doi.org/10.1145/3490422.3502357

Digital Library

[39]

Linghao Song, Licheng Guo, Suhail Basalama, Yuze Chi, Robert F. Lucas, and Jason Cong. 2023. Callipepla: Stream Centric Instruction Set and Mixed Precision for Accelerating Conjugate Gradient Solver. In Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays. 247--258.

Digital Library

[40]

Jiya Su, Feng Zhang, Weifeng Liu, Bingsheng He, Ruofan Wu, Xiaoyong Du, and Rujia Wang. 2020. CapelliniSpTRSV: A thread-level synchronization-free sparse triangular solve on GPUs. In Proceedings of the 49th International Conference on Parallel Processing. 1--11.

Digital Library

[41]

Livermore Software Technology. 2022. LS-DYNA. https://www.lstc.com/ products/ls-dyna

[42]

Andreas Wachter. 2002. An interior point algorithm for large-scale nonlinear optimization with applications in process engineering. Ph.D. Dissertation. Carnegie Mellon University.

[43]

Xinliang Wang, Weifeng Liu, Wei Xue, and Li Wu. 2018. swSpTRSV: A fast sparse triangular solve with sparse level tile layout on Sunway architectures. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 338--353.

Digital Library

[44]

Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, 4 (2009), 65--76.

Digital Library

[45]

Yue Yu, Jie Chen, Tian Gao, and Mo Yu. 2019. DAG-GNN: DAG Structure Learning with Graph Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, 7154-- 7163. https://proceedings.mlr.press/v97/yu19a.html

[46]

Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA international symposium on field-programmable gate arrays. 161--170.

Digital Library

Index Terms

LevelST: Stream-based Accelerator for Sparse Triangular Solver
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Reconfigurable computing
2. Hardware
  1. Integrated circuits
    1. Reconfigurable logic and FPGAs
      1. Hardware accelerators

Recommendations

Sextans: A Streaming Accelerator for General-Purpose Sparse-Matrix Dense-Matrix Multiplication
FPGA '22: Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

Sparse-Matrix Dense-Matrix multiplication (SpMM) is the key operator for a wide range of applications including scientific computing, graph processing, and deep learning. Architecting accelerators for SpMM is faced with three challenges - (1) the random ...
Unified Communication Optimization Strategies for Sparse Triangular Solver on CPU and GPU Clusters
SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

This paper presents a unified communication optimization framework for sparse triangular solve (SpTRSV) algorithms on CPU and GPU clusters. The framework builds upon a 3D communication-avoiding (CA) layout of P_x × P_y × P_z processes that divides a sparse ...
A Fast Sparse Triangular Solver for Structured-grid Problems on Sunway Many-core Processor SW26010
ICPP '18: Proceedings of the 47th International Conference on Parallel Processing

The sparse triangular solver (SpTRSV) is one of the most essential kernels in many scientific and engineering applications. Efficiently parallelizing the SpTRSV on modern many-core architectures is considerably difficult due to inherent dependency of ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

FPGA '24: Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays

April 2024

300 pages

ISBN:9798400704185

DOI:10.1145/3626202

General Chair:
Zhiru Zhang
Cornell University, USA
,
Program Chair:
Andrew Putnam
Microsoft, USA

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike International 4.0 License.

Sponsors

SIGDA: ACM Special Interest Group on Design Automation

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 April 2024

Check for updates

Badges

Author Tags

Qualifiers

Research-article

Funding Sources

Semiconductor Research Corporation

Conference

FPGA '24

Sponsor:

SIGDA

FPGA '24: The 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays

March 3 - 5, 2024

CA, Monterey, USA

Acceptance Rates

Overall Acceptance Rate 125 of 627 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
563
Total Downloads

Downloads (Last 12 months)563
Downloads (Last 6 weeks)79

Reflects downloads up to 13 Dec 2024

Other Metrics

View Author Metrics

Citations

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents