More Web Proxy on the site http://driver.im/

Article

Algorithm Design for Tensor Units

Authors:

Rezaul Chowdhury,

Francesco Silvestri,

Flavio VellaAuthors Info & Claims

Euro-Par 2021: Parallel Processing: 27th International Conference on Parallel and Distributed Computing, Lisbon, Portugal, September 1–3, 2021, Proceedings

Pages 353 - 367

https://doi.org/10.1007/978-3-030-85665-6_22

Published: 01 September 2021 Publication History

Abstract

To respond to the intense computational load of deep neural networks, a plethora of domain-specific architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature of these architectures is a hardware circuit for efficiently computing a dense matrix multiplication of a given small size. In order to broaden the class of algorithms that exploit these systems, we propose a computational model, named the TCU model, that captures the ability to natively multiply small matrices. We then use the TCU model for designing fast algorithms for several problems, including matrix operations (dense and sparse multiplication, Gaussian Elimination), graph algorithms (transitive closure, all pairs shortest distances), Discrete Fourier Transform, stencil computations, integer multiplication, and polynomial evaluation. We finally highlight a relation between the TCU model and the external memory model.

References

[1]

Afshani, P., Sitchinava, N.: Sorting and permuting without bank conflicts on GPUs. In: Proceedings European Symposium on Algorithms (ESA), pp. 13–24 (2015)

[2]

Ahle, T.D., Silvestri, F.: Similarity search with tensor core units. In: Proceedings of the 13th International Conference on Similarity Search and Application (SISAP), vol. 12440, pp. 76–84 (2020)

[3]

Ahmad, Z., Chowdhury, R., Das, R., Ganapathi, P., Gregory, A., Zhu, Y.: Fast stencil computations using fast Fourier transforms. In: Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) (2021)

[4]

Ballard, G., Demmel, J., Holtz, O., Schwartz, O.: Graph expansion and communication costs of fast matrix multiplication. J. ACM 59(6), 32:1–32:23 (2013)

[5]

Chowdhury, R., Silvestri, F., Vella, F.: Brief announcement: a computational model for tensor core units. In: Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) (2020)

[6]

Chowdhury, R.A., Silvestri, F., Vella, F.: A computational model for tensor core units arxiv preprint arxiv: 1908.06649 (2020)

[7]

Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. The MIT Press, Cambridge (2001)

[8]

Dakkak, A., Li, C., Xiong, J., Gelado, I., Hwu, W.M.: Accelerating reduction and scan using tensor core units. In: Proceedings of the International Conference on Supercomputing (ICS), pp. 46–57 (2019)

[9]

Firoz, J.S., Li, A., Li, J., Barker, K.: On the feasibility of using reduced-precision tensor core operations for graph analytics. In: 2020 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7 (2020)

[10]

Jacob, R., Stöckel, M.: Fast output-sensitive matrix multiplication. In: Proceedings of European Symposium on Algorithms (ESA), pp. 766–778 (2015)

[11]

Jouppi, N.P., et al.: In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the 44th International Symposium on Computer Architecture (ISCA), pp. 1–12 (2017)

[12]

Karatsuba A and Ofman Y Multiplication of multidigit numbers on automata Soviet Physics Doklady 1963 7 595

[13]

Karsin, B., Weichert, V., Casanova, H., Iacono, J., Sitchinava, N.: Analysis-driven engineering of comparison-based sorting algorithms on GPUs. In: Proceedings of the 32nd International Conference on Supercomputing (ICS), pp. 86–95 (2018)

[14]

Kleinberg, J., Tardos, E.: Algorithm Design. Addison Wesley, Boston (2006)

[15]

Lu, T., Chen, Y., Hechtman, B.A., Wang, T., Anderson, J.R.: Large-scale discrete Fourier transform on TPUs. In: arXiv preprint arXiv: 2002.03260

[16]

Nvidia Tesla V100 GPU architecture. http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

[17]

Raz R On the complexity of matrix product SIAM J. Comput. 2003 32 5 1356-1369

[18]

Seidel R On the all-pairs-shortest-path problem in unweighted undirected graphs J. Comput. Syst. Sci. 1995 51 3 400-403

[19]

Sorna, A., Cheng, X., D’Azevedo, E., Won, K., Tomov, S.: Optimizing the fast Fourier transform using mixed precision on tensor core hardware. In: Proceedings of the 25th International Conference on High Performance Computing Workshops (HiPCW), pp. 3–7 (2018)

[20]

Strassen, V.: Gaussian elimination is not optimal. Numer. Math. 13(4), 354–356 (1969).

[21]

Vitter JS Algorithms and data structures for external memory Found. Trends Theor. Comput. Sci. 2006 2 4 305-474

[22]

Zachariadis, O., Satpute, N., Gómez-Luna, J., Olivares, J.: Accelerating sparse matrix–matrix multiplication with GPU tensor cores. Comput. Electr. Eng. 88, 106848 (2020)

Cited By

Lu YLiu WMohror KArnold DBadia R(2023)DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector MultiplicationProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607051(1-14)Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3581784.3607051
Zouzias AMcColl W(2023)A Parallel Scan Algorithm in the Tensor Core Unit ModelEuro-Par 2023: Parallel Processing10.1007/978-3-031-39698-4_33(489-502)Online publication date: 28-Aug-2023
https://dl.acm.org/doi/10.1007/978-3-031-39698-4_33

Index Terms

Algorithm Design for Tensor Units

Index terms have been assigned to the content through auto-classification.

Recommendations

A Computational Model for Tensor Core Units
SPAA '20: Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures

To respond to the need for efficient training and inference of deep neural networks, a plethora of domain-specific architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature of these ...
An iterative algorithm for third-order tensor multi-rank minimization

Recent work by Kilmer et al. (A third-order generalization of the matrix SVD as a product of third-order tensors, Department of Computer Science, Tufts University, Medford, MA, 2008; Linear Algebra Appl 435(3):641---658, 2011; SIAM J Matrix Anal Appl 34(...
Robust low tubal rank tensor completion via factor tensor norm minimization
Highlights
- We give the definitions of tensor double norm and tensor Frobenius/nuclear hybrid norm, and regard them as low-rank regularization penalty of tensor completion.
- We prove that tensor double norm and tensor Frobenius/nuclear hybrid norm ...
Abstract
Recent research has demonstrated that low tubal rank recovery based on tensor has received extensive attention. In this correspondence, we define tensor double nuclear norm and tensor Frobenius/nuclear hybrid norm to induce a surrogate for tensor ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Guide Proceedings

Euro-Par 2021: Parallel Processing: 27th International Conference on Parallel and Distributed Computing, Lisbon, Portugal, September 1–3, 2021, Proceedings

Sep 2021

651 pages

ISBN:978-3-030-85664-9

DOI:10.1007/978-3-030-85665-6

Editors:
Leonel Sousa
Universidade de Lisboa, Lisbon, Portugal
,
Nuno Roma
Universidade de Lisboa, Lisbon, Portugal
,
Pedro Tomás
Universidade de Lisboa, Lisbon, Portugal

© Springer Nature Switzerland AG 2021.

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 01 September 2021

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

2
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Lu YLiu WMohror KArnold DBadia R(2023)DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector MultiplicationProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607051(1-14)Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3581784.3607051
Zouzias AMcColl W(2023)A Parallel Scan Algorithm in the Tensor Core Unit ModelEuro-Par 2023: Parallel Processing10.1007/978-3-031-39698-4_33(489-502)Online publication date: 28-Aug-2023
https://dl.acm.org/doi/10.1007/978-3-031-39698-4_33

View Options

View options

Figures

Tables

Media

View Table of Conten