More Web Proxy on the site http://driver.im/

research-article

Open access

Accelerating Sparse MTTKRP for Tensor Decomposition on FPGA

Authors:

Sasindu Wijeratne,

Rajgopal Kannan,

Viktor PrasannaAuthors Info & Claims

FPGA '23: Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays

Pages 259 - 269

https://doi.org/10.1145/3543622.3573179

Published: 12 February 2023 Publication History

Abstract

Sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP) is the most computationally intensive kernel in sparse tensor decomposition. In this paper, we propose a hardware-algorithm co-design on FPGA to minimize the execution time of spMTTKRP along all modes of an input tensor. We introduce FLYCOO, a novel tensor format that eliminates the communication of intermediate values to the FPGA external memory during the computation of spMTTKRP along all the modes. Our remapping of the tensor using FLYCOO also balances the workload among multiple Processing Engines (PEs). We propose a parallel algorithm that can concurrently process multiple partitions of the input tensor independent of each other. The proposed algorithm also orders the tensor dynamically during runtime to increase the data locality of the external memory accesses. We develop a custom FPGA accelerator design with (1) PEs consisting of a collection of pipelines that can concurrently process multiple elements of the input tensor and (2) memory controllers to exploit the spatial and temporal locality of the external memory accesses of the computation. Our work achieves a geometric mean of 8.8X and 3.8X speedup in execution time compared with the state-of-the-art CPU and GPU implementations on widely-used real-world sparse tensor datasets.

References

[1]

Alfred V Aho and John E Hopcroft. 1974. The design and analysis of computer algorithms. (1974).

[2]

Zhiyu Cheng, Baopu Li, Yanwen Fan, and Yingze Bao. 2020. A novel rank selection scheme in tensor ring decomposition based on reinforcement learning for deep neural networks. In ICASSP 2020--2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 3292--3296.

[3]

Sofia Fernandes, Hadi Fanaee-T, and João Gama. 2020. Tensor decomposition for analysing time-evolving social networks: An overview. Artificial Intelligence Review (2020), 1--26.

[4]

Ronald L. Graham. 1969. Bounds on multiprocessing timing anomalies. SIAM journal on Applied Mathematics 17, 2 (1969), 416--429.

Digital Library

[5]

Ahmed E. Helal, Jan Laukemann, Fabio Checconi, Jesmin Jahan Tithi, Teresa Ranadive, Fabrizio Petrini, and Jeewhan Choi. 2021. ALTO: Adaptive Linearized Storage of Sparse Tensors. In Proceedings of the ACM International Conference on Supercomputing (Virtual Event, USA) (ICS '21). Association for Computing Machinery, New York, NY, USA, 404--416. https://doi.org/10.1145/3447818.3461703

Digital Library

[6]

David Hong, Tamara G. Kolda, and Jed A. Duersch. 2020. Generalized Canonical Polyadic Tensor Decomposition. SIAM Rev. 62, 1 (2020), 133--163. https://doi.org/10.1137/18M1203626 arXiv:https://doi.org/10.1137/18M1203626

Digital Library

[7]

Intel. 2022. External Memory Interfaces Intel Arria 10 FPGA IP User Guide. https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug-20115.pdf. Online; accessed 29 January 2022.

[8]

Oguz Kaya and Bora Uçar. 2015. Scalable sparse tensor decompositions in distributed memory systems. In SC'15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1--11.

Digital Library

[9]

Fredrik Kjolstad, Stephen Chou, David Lugato, Shoaib Kamil, and Saman Amarasinghe. 2017. Taco: A tool to generate tensor algebra kernels. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 943--948.

[10]

Tamara G Kolda and Brett W Bader. 2009. Tensor decompositions and applications. SIAM review 51, 3 (2009), 455--500.

Digital Library

[11]

Jiajia Li, Yuchen Ma, and Richard Vuduc. 2018. ParTI!: A Parallel Tensor Infrastructure for multicore CPUs and GPUs. http://parti-project.org Last updated: Jan 2020.

[12]

Jiajia Li, Jimeng Sun, and Richard Vuduc. 2018. HiCOO: Hierarchical Storage of Sparse Tensors. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. 238--252. https://doi.org/10.1109/SC.2018.00022

Digital Library

[13]

Jiajia Li, Bora Uçar, Ümit V Çatalyürek, Jimeng Sun, Kevin Barker, and Richard Vuduc. 2019. Efficient and effective sparse tensor reordering. In Proceedings of the ACM International Conference on Supercomputing. 227--237.

Digital Library

[14]

Jiajia Li, Bora Uçar, Ümit V. Çatalyürek, Jimeng Sun, Kevin Barker, and Richard Vuduc. 2019. Efficient and Effective Sparse Tensor Reordering. In Proceedings of the ACM International Conference on Supercomputing (Phoenix, Arizona) (ICS '19). Association for Computing Machinery, New York, NY, USA, 227--237. https://doi.org/10.1145/3330345.3330366

Digital Library

[15]

Marco Mondelli and Andrea Montanari. 2019. On the connection between learning two-layer neural networks and tensor decomposition. In The 22nd International Conference on Artificial Intelligence and Statistics. PMLR, 1051--1060.

[16]

Israt Nisa, Jiajia Li, Aravind Sukumaran-Rajam, Prasant Singh Rawat, Sriram Krishnamoorthy, and P. Sadayappan. 2019. An Efficient Mixed-Mode Representation of Sparse Tensors. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Denver, Colorado) (SC '19). Association for Computing Machinery, New York, NY, USA, Article 49, 25 pages. https://doi.org/10.1145/3295500.3356216

Digital Library

[17]

Israt Nisa, Jiajia Li, Aravind Sukumaran-Rajam, Richard Vuduc, and P. Sadayappan. 2019. Load-Balanced Sparse MTTKRP on GPUs. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 123--133. https://doi.org/10.1109/IPDPS.2019.00023

[18]

Nicholas D. Sidiropoulos, Lieven De Lathauwer, Xiao Fu, Kejun Huang, Evangelos E. Papalexakis, and Christos Faloutsos. 2017. Tensor Decomposition for Signal Processing and Machine Learning. IEEE Transactions on Signal Processing 65, 13 (2017), 3551--3582. https://doi.org/10.1109/TSP.2017.2690524

Digital Library

[19]

Shaden Smith, Jee W. Choi, Jiajia Li, Richard Vuduc, Jongsoo Park, Xing Liu, and George Karypis. 2017. FROSTT: The Formidable Repository of Open Sparse Tensors and Tools. http://frostt.io/

[20]

Shaden Smith, Niranjay Ravindran, Nicholas D. Sidiropoulos, and George Karypis. 2015. SPLATT: Efficient and Parallel Sparse Tensor-Matrix Multiplication. In 2015 IEEE International Parallel and Distributed Processing Symposium. 61--70. https://doi.org/10.1109/IPDPS.2015.27

Digital Library

[21]

Nitish Srivastava, Hanchen Jin, Shaden Smith, Hongbo Rong, David Albonesi, and Zhiru Zhang. 2020. Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). 689--702. https://doi.org/10.1109/HPCA47549.2020.00062

[22]

Fuxi Wen, Hing Cheung So, and Henk Wymeersch. 2020. Tensor decomposition-based beamspace esprit algorithm for multidimensional harmonic retrieval. In ICASSP 2020--2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4572--4576.

[23]

Sasindu Wijeratne, Rajgopal Kannan, and Viktor Prasanna. 2021. Reconfigurable Low-latency Memory System for Sparse Matricized Tensor Times Khatri-Rao Product on FPGA. In 2021 IEEE High Performance Extreme Computing Conference (HPEC). 1--7. https://doi.org/10.1109/HPEC49654.2021.9622851

[24]

Xilinx. 2022. Alveo U250 Data Center Accelerator Card. https://www.xilinx.com/products/boards-and-kits/alveo/u250.html. Online; accessed 29 January 2022.

[25]

Xilinx. 2022. UltraFast Design Methodology Guide for Xilinx FPGAs and SoCs. https://docs.xilinx.com/r/2021.2-English/ug949-vivado-design-methodology/. Online; accessed 1 August 2022.

[26]

Xilinx. 2022. UltraScale Architecture-Based FPGAs Memory IP v1.4. https://www.xilinx.com/support/documentation/ip_documentation/ultrascale_memory_ip/v1_4/pg150-ultrascale-memory-ip.pdf. Online; accessed 29 January 2022.

[27]

Xilinx. 2022. Vivado Design Suite User Guide. https://docs.xilinx.com/v/u/2018.2- English/ug910-vivado-getting-started. Online; accessed 29 January 2022.

Cited By

Wijeratne SKannan RPrasanna V(2024)Sparse MTTKRP Acceleration for Tensor Decomposition on GPUProceedings of the 21st ACM International Conference on Computing Frontiers10.1145/3649153.3649187(88-96)Online publication date: 7-May-2024
https://dl.acm.org/doi/10.1145/3649153.3649187
Gao YWang TGong LWang CHu YYang YLiu ZLi XZhou X(2024)Enhancing Graph Random Walk Acceleration via Efficient Dataflow and Hybrid Memory ArchitectureIEEE Transactions on Computers10.1109/TC.2023.334767473:3(887-901)Online publication date: Mar-2024
https://doi.org/10.1109/TC.2023.3347674
Lin WWang HDeng HSun Q(2024)ScalFrag: Efficient Tiled-MTTKRP with Adaptive Launching on GPUs2024 IEEE International Conference on Cluster Computing (CLUSTER)10.1109/CLUSTER59578.2024.00036(335-345)Online publication date: 24-Sep-2024
https://doi.org/10.1109/CLUSTER59578.2024.00036
Show More Cited By

Index Terms

Accelerating Sparse MTTKRP for Tensor Decomposition on FPGA
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Reconfigurable computing
      2. Special purpose systems
2. Theory of computation
  1. Models of computation
    1. Concurrency
      1. Parallel computing models

Recommendations

Sparse MTTKRP Acceleration for Tensor Decomposition on GPU
CF '24: Proceedings of the 21st ACM International Conference on Computing Frontiers

Sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP) is the bottleneck kernel of sparse tensor decomposition. In this work, we propose a GPU-based algorithm design to address the key challenges in accelerating spMTTKRP computation, including (1) ...
Tensor Completion via Fully-Connected Tensor Network Decomposition with Regularized Factors
Abstract
The recently proposed fully-connected tensor network (FCTN) decomposition has a powerful ability to capture the low-rankness of tensors and has achieved great success in tensor completion. However, the FCTN decomposition-based method is highly ...
Sparse nonnegative tensor decomposition using proximal algorithm and inexact block coordinate descent scheme
Abstract
Nonnegative tensor decomposition is a versatile tool for multiway data analysis, by which the extracted components are nonnegative and usually sparse. Nevertheless, the sparsity is only a side effect and cannot be explicitly controlled without ... $_{}$

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

FPGA '23: Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays

February 2023

283 pages

ISBN:9781450394178

DOI:10.1145/3543622

General Chair:
Paolo Ienne
EPFL, Switzerland
,
Program Chair:
Zhiru Zhang
Cornell University, USA

Copyright © 2023 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGDA: ACM Special Interest Group on Design Automation

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 February 2023

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

NSF (National Science Foundation)

Conference

FPGA '23

Sponsor:

SIGDA

FPGA '23: The 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays

February 12 - 14, 2023

CA, Monterey, USA

Acceptance Rates

Overall Acceptance Rate 125 of 627 submissions, 20%

Upcoming Conference

FPGA '25

Sponsor:
sigda

The 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays

February 27 - March 1, 2025

Monterey , CA , USA

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

4
Total Citations
View Citations
853
Total Downloads

Downloads (Last 12 months)445
Downloads (Last 6 weeks)30

Reflects downloads up to 25 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Wijeratne SKannan RPrasanna V(2024)Sparse MTTKRP Acceleration for Tensor Decomposition on GPUProceedings of the 21st ACM International Conference on Computing Frontiers10.1145/3649153.3649187(88-96)Online publication date: 7-May-2024
https://dl.acm.org/doi/10.1145/3649153.3649187
Gao YWang TGong LWang CHu YYang YLiu ZLi XZhou X(2024)Enhancing Graph Random Walk Acceleration via Efficient Dataflow and Hybrid Memory ArchitectureIEEE Transactions on Computers10.1109/TC.2023.334767473:3(887-901)Online publication date: Mar-2024
https://doi.org/10.1109/TC.2023.3347674
Lin WWang HDeng HSun Q(2024)ScalFrag: Efficient Tiled-MTTKRP with Adaptive Launching on GPUs2024 IEEE International Conference on Cluster Computing (CLUSTER)10.1109/CLUSTER59578.2024.00036(335-345)Online publication date: 24-Sep-2024
https://doi.org/10.1109/CLUSTER59578.2024.00036
Wijeratne SKannan RPrasanna V(2023)Dynasor: A Dynamic Memory Layout for Accelerating Sparse MTTKRP for Tensor Decomposition on Multi-core CPU2023 IEEE 35th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)10.1109/SBAC-PAD59825.2023.00012(23-33)Online publication date: 17-Oct-2023
https://doi.org/10.1109/SBAC-PAD59825.2023.00012

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Figures

Tables

Media

View Table of Conten