[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1007/978-3-030-85665-6_22guideproceedingsArticle/Chapter ViewAbstractPublication PagesConference Proceedingsacm-pubtype
Article

Algorithm Design for Tensor Units

Published: 01 September 2021 Publication History

Abstract

To respond to the intense computational load of deep neural networks, a plethora of domain-specific architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature of these architectures is a hardware circuit for efficiently computing a dense matrix multiplication of a given small size. In order to broaden the class of algorithms that exploit these systems, we propose a computational model, named the TCU model, that captures the ability to natively multiply small matrices. We then use the TCU model for designing fast algorithms for several problems, including matrix operations (dense and sparse multiplication, Gaussian Elimination), graph algorithms (transitive closure, all pairs shortest distances), Discrete Fourier Transform, stencil computations, integer multiplication, and polynomial evaluation. We finally highlight a relation between the TCU model and the external memory model.

References

[1]
Afshani, P., Sitchinava, N.: Sorting and permuting without bank conflicts on GPUs. In: Proceedings European Symposium on Algorithms (ESA), pp. 13–24 (2015)
[2]
Ahle, T.D., Silvestri, F.: Similarity search with tensor core units. In: Proceedings of the 13th International Conference on Similarity Search and Application (SISAP), vol. 12440, pp. 76–84 (2020)
[3]
Ahmad, Z., Chowdhury, R., Das, R., Ganapathi, P., Gregory, A., Zhu, Y.: Fast stencil computations using fast Fourier transforms. In: Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) (2021)
[4]
Ballard, G., Demmel, J., Holtz, O., Schwartz, O.: Graph expansion and communication costs of fast matrix multiplication. J. ACM 59(6), 32:1–32:23 (2013)
[5]
Chowdhury, R., Silvestri, F., Vella, F.: Brief announcement: a computational model for tensor core units. In: Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) (2020)
[6]
Chowdhury, R.A., Silvestri, F., Vella, F.: A computational model for tensor core units arxiv preprint arxiv: 1908.06649 (2020)
[7]
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. The MIT Press, Cambridge (2001)
[8]
Dakkak, A., Li, C., Xiong, J., Gelado, I., Hwu, W.M.: Accelerating reduction and scan using tensor core units. In: Proceedings of the International Conference on Supercomputing (ICS), pp. 46–57 (2019)
[9]
Firoz, J.S., Li, A., Li, J., Barker, K.: On the feasibility of using reduced-precision tensor core operations for graph analytics. In: 2020 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7 (2020)
[10]
Jacob, R., Stöckel, M.: Fast output-sensitive matrix multiplication. In: Proceedings of European Symposium on Algorithms (ESA), pp. 766–778 (2015)
[11]
Jouppi, N.P., et al.: In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the 44th International Symposium on Computer Architecture (ISCA), pp. 1–12 (2017)
[12]
Karatsuba A and Ofman Y Multiplication of multidigit numbers on automata Soviet Physics Doklady 1963 7 595
[13]
Karsin, B., Weichert, V., Casanova, H., Iacono, J., Sitchinava, N.: Analysis-driven engineering of comparison-based sorting algorithms on GPUs. In: Proceedings of the 32nd International Conference on Supercomputing (ICS), pp. 86–95 (2018)
[14]
Kleinberg, J., Tardos, E.: Algorithm Design. Addison Wesley, Boston (2006)
[15]
Lu, T., Chen, Y., Hechtman, B.A., Wang, T., Anderson, J.R.: Large-scale discrete Fourier transform on TPUs. In: arXiv preprint arXiv: 2002.03260
[17]
Raz R On the complexity of matrix product SIAM J. Comput. 2003 32 5 1356-1369
[18]
Seidel R On the all-pairs-shortest-path problem in unweighted undirected graphs J. Comput. Syst. Sci. 1995 51 3 400-403
[19]
Sorna, A., Cheng, X., D’Azevedo, E., Won, K., Tomov, S.: Optimizing the fast Fourier transform using mixed precision on tensor core hardware. In: Proceedings of the 25th International Conference on High Performance Computing Workshops (HiPCW), pp. 3–7 (2018)
[20]
Strassen, V.: Gaussian elimination is not optimal. Numer. Math. 13(4), 354–356 (1969).
[21]
Vitter JS Algorithms and data structures for external memory Found. Trends Theor. Comput. Sci. 2006 2 4 305-474
[22]
Zachariadis, O., Satpute, N., Gómez-Luna, J., Olivares, J.: Accelerating sparse matrix–matrix multiplication with GPU tensor cores. Comput. Electr. Eng. 88, 106848 (2020)

Cited By

View all
  • (2023)DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector MultiplicationProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607051(1-14)Online publication date: 12-Nov-2023
  • (2023)A Parallel Scan Algorithm in the Tensor Core Unit ModelEuro-Par 2023: Parallel Processing10.1007/978-3-031-39698-4_33(489-502)Online publication date: 28-Aug-2023

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Guide Proceedings
Euro-Par 2021: Parallel Processing: 27th International Conference on Parallel and Distributed Computing, Lisbon, Portugal, September 1–3, 2021, Proceedings
Sep 2021
651 pages
ISBN:978-3-030-85664-9
DOI:10.1007/978-3-030-85665-6

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 01 September 2021

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 05 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2023)DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector MultiplicationProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607051(1-14)Online publication date: 12-Nov-2023
  • (2023)A Parallel Scan Algorithm in the Tensor Core Unit ModelEuro-Par 2023: Parallel Processing10.1007/978-3-031-39698-4_33(489-502)Online publication date: 28-Aug-2023

View Options

View options

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media