[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1007/978-3-030-85665-6_22guideproceedingsArticle/Chapter ViewAbstractPublication PagesConference Proceedingsacm-pubtype
Article

Algorithm Design for Tensor Units

Published: 01 September 2021 Publication History

Abstract

To respond to the intense computational load of deep neural networks, a plethora of domain-specific architectures have been introduced, such as Google Tensor Processing Units and NVIDIA Tensor Cores. A common feature of these architectures is a hardware circuit for efficiently computing a dense matrix multiplication of a given small size. In order to broaden the class of algorithms that exploit these systems, we propose a computational model, named the TCU model, that captures the ability to natively multiply small matrices. We then use the TCU model for designing fast algorithms for several problems, including matrix operations (dense and sparse multiplication, Gaussian Elimination), graph algorithms (transitive closure, all pairs shortest distances), Discrete Fourier Transform, stencil computations, integer multiplication, and polynomial evaluation. We finally highlight a relation between the TCU model and the external memory model.

References

[1]
Afshani, P., Sitchinava, N.: Sorting and permuting without bank conflicts on GPUs. In: Proceedings European Symposium on Algorithms (ESA), pp. 13–24 (2015)
[2]
Ahle, T.D., Silvestri, F.: Similarity search with tensor core units. In: Proceedings of the 13th International Conference on Similarity Search and Application (SISAP), vol. 12440, pp. 76–84 (2020)
[3]
Ahmad, Z., Chowdhury, R., Das, R., Ganapathi, P., Gregory, A., Zhu, Y.: Fast stencil computations using fast Fourier transforms. In: Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) (2021)
[4]
Ballard, G., Demmel, J., Holtz, O., Schwartz, O.: Graph expansion and communication costs of fast matrix multiplication. J. ACM 59(6), 32:1–32:23 (2013)
[5]
Chowdhury, R., Silvestri, F., Vella, F.: Brief announcement: a computational model for tensor core units. In: Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA) (2020)
[6]
Chowdhury, R.A., Silvestri, F., Vella, F.: A computational model for tensor core units arxiv preprint arxiv: 1908.06649 (2020)
[7]
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. The MIT Press, Cambridge (2001)
[8]
Dakkak, A., Li, C., Xiong, J., Gelado, I., Hwu, W.M.: Accelerating reduction and scan using tensor core units. In: Proceedings of the International Conference on Supercomputing (ICS), pp. 46–57 (2019)
[9]
Firoz, J.S., Li, A., Li, J., Barker, K.: On the feasibility of using reduced-precision tensor core operations for graph analytics. In: 2020 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7 (2020)
[10]
Jacob, R., Stöckel, M.: Fast output-sensitive matrix multiplication. In: Proceedings of European Symposium on Algorithms (ESA), pp. 766–778 (2015)
[11]
Jouppi, N.P., et al.: In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the 44th International Symposium on Computer Architecture (ISCA), pp. 1–12 (2017)
[12]
Karatsuba A and Ofman Y Multiplication of multidigit numbers on automata Soviet Physics Doklady 1963 7 595
[13]
Karsin, B., Weichert, V., Casanova, H., Iacono, J., Sitchinava, N.: Analysis-driven engineering of comparison-based sorting algorithms on GPUs. In: Proceedings of the 32nd International Conference on Supercomputing (ICS), pp. 86–95 (2018)
[14]
Kleinberg, J., Tardos, E.: Algorithm Design. Addison Wesley, Boston (2006)
[15]
Lu, T., Chen, Y., Hechtman, B.A., Wang, T., Anderson, J.R.: Large-scale discrete Fourier transform on TPUs. In: arXiv preprint arXiv: 2002.03260
[17]
Raz R On the complexity of matrix product SIAM J. Comput. 2003 32 5 1356-1369
[18]
Seidel R On the all-pairs-shortest-path problem in unweighted undirected graphs J. Comput. Syst. Sci. 1995 51 3 400-403
[19]
Sorna, A., Cheng, X., D’Azevedo, E., Won, K., Tomov, S.: Optimizing the fast Fourier transform using mixed precision on tensor core hardware. In: Proceedings of the 25th International Conference on High Performance Computing Workshops (HiPCW), pp. 3–7 (2018)
[20]
Strassen, V.: Gaussian elimination is not optimal. Numer. Math. 13(4), 354–356 (1969).
[21]
Vitter JS Algorithms and data structures for external memory Found. Trends Theor. Comput. Sci. 2006 2 4 305-474
[22]
Zachariadis, O., Satpute, N., Gómez-Luna, J., Olivares, J.: Accelerating sparse matrix–matrix multiplication with GPU tensor cores. Comput. Electr. Eng. 88, 106848 (2020)

Cited By

View all
  • (2023)DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector MultiplicationProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607051(1-14)Online publication date: 12-Nov-2023
  • (2023)A Parallel Scan Algorithm in the Tensor Core Unit ModelEuro-Par 2023: Parallel Processing10.1007/978-3-031-39698-4_33(489-502)Online publication date: 28-Aug-2023

Index Terms

  1. Algorithm Design for Tensor Units
    Index terms have been assigned to the content through auto-classification.

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image Guide Proceedings
    Euro-Par 2021: Parallel Processing: 27th International Conference on Parallel and Distributed Computing, Lisbon, Portugal, September 1–3, 2021, Proceedings
    Sep 2021
    651 pages
    ISBN:978-3-030-85664-9
    DOI:10.1007/978-3-030-85665-6

    Publisher

    Springer-Verlag

    Berlin, Heidelberg

    Publication History

    Published: 01 September 2021

    Qualifiers

    • Article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)0
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 11 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector MultiplicationProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3581784.3607051(1-14)Online publication date: 12-Nov-2023
    • (2023)A Parallel Scan Algorithm in the Tensor Core Unit ModelEuro-Par 2023: Parallel Processing10.1007/978-3-031-39698-4_33(489-502)Online publication date: 28-Aug-2023

    View Options

    View options

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media