[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3543622.3573179acmconferencesArticle/Chapter ViewAbstractPublication PagesfpgaConference Proceedingsconference-collections
research-article
Open access

Accelerating Sparse MTTKRP for Tensor Decomposition on FPGA

Published: 12 February 2023 Publication History

Abstract

Sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP) is the most computationally intensive kernel in sparse tensor decomposition. In this paper, we propose a hardware-algorithm co-design on FPGA to minimize the execution time of spMTTKRP along all modes of an input tensor. We introduce FLYCOO, a novel tensor format that eliminates the communication of intermediate values to the FPGA external memory during the computation of spMTTKRP along all the modes. Our remapping of the tensor using FLYCOO also balances the workload among multiple Processing Engines (PEs). We propose a parallel algorithm that can concurrently process multiple partitions of the input tensor independent of each other. The proposed algorithm also orders the tensor dynamically during runtime to increase the data locality of the external memory accesses. We develop a custom FPGA accelerator design with (1) PEs consisting of a collection of pipelines that can concurrently process multiple elements of the input tensor and (2) memory controllers to exploit the spatial and temporal locality of the external memory accesses of the computation. Our work achieves a geometric mean of 8.8X and 3.8X speedup in execution time compared with the state-of-the-art CPU and GPU implementations on widely-used real-world sparse tensor datasets.

References

[1]
Alfred V Aho and John E Hopcroft. 1974. The design and analysis of computer algorithms. (1974).
[2]
Zhiyu Cheng, Baopu Li, Yanwen Fan, and Yingze Bao. 2020. A novel rank selection scheme in tensor ring decomposition based on reinforcement learning for deep neural networks. In ICASSP 2020--2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3292--3296.
[3]
Sofia Fernandes, Hadi Fanaee-T, and João Gama. 2020. Tensor decomposition for analysing time-evolving social networks: An overview. Artificial Intelligence Review (2020), 1--26.
[4]
Ronald L. Graham. 1969. Bounds on multiprocessing timing anomalies. SIAM journal on Applied Mathematics 17, 2 (1969), 416--429.
[5]
Ahmed E. Helal, Jan Laukemann, Fabio Checconi, Jesmin Jahan Tithi, Teresa Ranadive, Fabrizio Petrini, and Jeewhan Choi. 2021. ALTO: Adaptive Linearized Storage of Sparse Tensors. In Proceedings of the ACM International Conference on Supercomputing (Virtual Event, USA) (ICS '21). Association for Computing Machinery, New York, NY, USA, 404--416. https://doi.org/10.1145/3447818.3461703
[6]
David Hong, Tamara G. Kolda, and Jed A. Duersch. 2020. Generalized Canonical Polyadic Tensor Decomposition. SIAM Rev. 62, 1 (2020), 133--163. https://doi.org/10.1137/18M1203626 arXiv:https://doi.org/10.1137/18M1203626
[7]
Intel. 2022. External Memory Interfaces Intel Arria 10 FPGA IP User Guide. https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug-20115.pdf. Online; accessed 29 January 2022.
[8]
Oguz Kaya and Bora Uçar. 2015. Scalable sparse tensor decompositions in distributed memory systems. In SC'15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1--11.
[9]
Fredrik Kjolstad, Stephen Chou, David Lugato, Shoaib Kamil, and Saman Amarasinghe. 2017. Taco: A tool to generate tensor algebra kernels. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 943--948.
[10]
Tamara G Kolda and Brett W Bader. 2009. Tensor decompositions and applications. SIAM review 51, 3 (2009), 455--500.
[11]
Jiajia Li, Yuchen Ma, and Richard Vuduc. 2018. ParTI!: A Parallel Tensor Infrastructure for multicore CPUs and GPUs. http://parti-project.org Last updated: Jan 2020.
[12]
Jiajia Li, Jimeng Sun, and Richard Vuduc. 2018. HiCOO: Hierarchical Storage of Sparse Tensors. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. 238--252. https://doi.org/10.1109/SC.2018.00022
[13]
Jiajia Li, Bora Uçar, Ümit V Çatalyürek, Jimeng Sun, Kevin Barker, and Richard Vuduc. 2019. Efficient and effective sparse tensor reordering. In Proceedings of the ACM International Conference on Supercomputing. 227--237.
[14]
Jiajia Li, Bora Uçar, Ümit V. Çatalyürek, Jimeng Sun, Kevin Barker, and Richard Vuduc. 2019. Efficient and Effective Sparse Tensor Reordering. In Proceedings of the ACM International Conference on Supercomputing (Phoenix, Arizona) (ICS '19). Association for Computing Machinery, New York, NY, USA, 227--237. https://doi.org/10.1145/3330345.3330366
[15]
Marco Mondelli and Andrea Montanari. 2019. On the connection between learning two-layer neural networks and tensor decomposition. In The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 1051--1060.
[16]
Israt Nisa, Jiajia Li, Aravind Sukumaran-Rajam, Prasant Singh Rawat, Sriram Krishnamoorthy, and P. Sadayappan. 2019. An Efficient Mixed-Mode Representation of Sparse Tensors. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Denver, Colorado) (SC '19). Association for Computing Machinery, New York, NY, USA, Article 49, 25 pages. https://doi.org/10.1145/3295500.3356216
[17]
Israt Nisa, Jiajia Li, Aravind Sukumaran-Rajam, Richard Vuduc, and P. Sadayappan. 2019. Load-Balanced Sparse MTTKRP on GPUs. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 123--133. https://doi.org/10.1109/IPDPS.2019.00023
[18]
Nicholas D. Sidiropoulos, Lieven De Lathauwer, Xiao Fu, Kejun Huang, Evangelos E. Papalexakis, and Christos Faloutsos. 2017. Tensor Decomposition for Signal Processing and Machine Learning. IEEE Transactions on Signal Processing 65, 13 (2017), 3551--3582. https://doi.org/10.1109/TSP.2017.2690524
[19]
Shaden Smith, Jee W. Choi, Jiajia Li, Richard Vuduc, Jongsoo Park, Xing Liu, and George Karypis. 2017. FROSTT: The Formidable Repository of Open Sparse Tensors and Tools. http://frostt.io/
[20]
Shaden Smith, Niranjay Ravindran, Nicholas D. Sidiropoulos, and George Karypis. 2015. SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication. In 2015 IEEE International Parallel and Distributed Processing Symposium. 61--70. https://doi.org/10.1109/IPDPS.2015.27
[21]
Nitish Srivastava, Hanchen Jin, Shaden Smith, Hongbo Rong, David Albonesi, and Zhiru Zhang. 2020. Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). 689--702. https://doi.org/10.1109/HPCA47549.2020.00062
[22]
Fuxi Wen, Hing Cheung So, and Henk Wymeersch. 2020. Tensor decomposition-based beamspace esprit algorithm for multidimensional harmonic retrieval. In ICASSP 2020--2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4572--4576.
[23]
Sasindu Wijeratne, Rajgopal Kannan, and Viktor Prasanna. 2021. Reconfigurable Low-latency Memory System for Sparse Matricized Tensor Times Khatri-Rao Product on FPGA. In 2021 IEEE High Performance Extreme Computing Conference (HPEC). 1--7. https://doi.org/10.1109/HPEC49654.2021.9622851
[24]
Xilinx. 2022. Alveo U250 Data Center Accelerator Card. https://www.xilinx.com/products/boards-and-kits/alveo/u250.html. Online; accessed 29 January 2022.
[25]
Xilinx. 2022. UltraFast Design Methodology Guide for Xilinx FPGAs and SoCs. https://docs.xilinx.com/r/2021.2-English/ug949-vivado-design-methodology/. Online; accessed 1 August 2022.
[26]
Xilinx. 2022. UltraScale Architecture-Based FPGAs Memory IP v1.4. https://www.xilinx.com/support/documentation/ip_documentation/ultrascale_memory_ip/v1_4/pg150-ultrascale-memory-ip.pdf. Online; accessed 29 January 2022.
[27]
Xilinx. 2022. Vivado Design Suite User Guide. https://docs.xilinx.com/v/u/2018.2- English/ug910-vivado-getting-started. Online; accessed 29 January 2022.

Cited By

View all
  • (2024)Sparse MTTKRP Acceleration for Tensor Decomposition on GPUProceedings of the 21st ACM International Conference on Computing Frontiers10.1145/3649153.3649187(88-96)Online publication date: 7-May-2024
  • (2024)Enhancing Graph Random Walk Acceleration via Efficient Dataflow and Hybrid Memory ArchitectureIEEE Transactions on Computers10.1109/TC.2023.334767473:3(887-901)Online publication date: Mar-2024
  • (2024)ScalFrag: Efficient Tiled-MTTKRP with Adaptive Launching on GPUs2024 IEEE International Conference on Cluster Computing (CLUSTER)10.1109/CLUSTER59578.2024.00036(335-345)Online publication date: 24-Sep-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
FPGA '23: Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays
February 2023
283 pages
ISBN:9781450394178
DOI:10.1145/3543622
This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 February 2023

Check for updates

Author Tags

  1. FPGA
  2. hardware accelerators
  3. sparse MTTKRP
  4. tensor decomposition

Qualifiers

  • Research-article

Funding Sources

Conference

FPGA '23
Sponsor:

Acceptance Rates

Overall Acceptance Rate 125 of 627 submissions, 20%

Upcoming Conference

FPGA '25

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)445
  • Downloads (Last 6 weeks)30
Reflects downloads up to 25 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Sparse MTTKRP Acceleration for Tensor Decomposition on GPUProceedings of the 21st ACM International Conference on Computing Frontiers10.1145/3649153.3649187(88-96)Online publication date: 7-May-2024
  • (2024)Enhancing Graph Random Walk Acceleration via Efficient Dataflow and Hybrid Memory ArchitectureIEEE Transactions on Computers10.1109/TC.2023.334767473:3(887-901)Online publication date: Mar-2024
  • (2024)ScalFrag: Efficient Tiled-MTTKRP with Adaptive Launching on GPUs2024 IEEE International Conference on Cluster Computing (CLUSTER)10.1109/CLUSTER59578.2024.00036(335-345)Online publication date: 24-Sep-2024
  • (2023)Dynasor: A Dynamic Memory Layout for Accelerating Sparse MTTKRP for Tensor Decomposition on Multi-core CPU2023 IEEE 35th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)10.1109/SBAC-PAD59825.2023.00012(23-33)Online publication date: 17-Oct-2023

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media