DOI: 10.1145/3577193.3593707
research-article

Towards a Unified Implementation of GEMM in BLIS

Published: 21 June 2023

Abstract

Matrix libraries often focus on achieving high performance for problems considered either "small" or "large", as these two scenarios tend to respond best to different optimization strategies. We propose a unified technique for implementing matrix operations such as general matrix multiplication (gemm) that can achieve high performance for both small and large problem sizes. The key is to fuse packing, an operation that copies data to a contiguous layout in memory and is critical for large-matrix performance, with the first computational "pass" over that data. This boosts performance across the problem-size spectrum. It also simplifies the tuning of general-purpose libraries, because it obviates the need to carefully express and parameterize logic that chooses between a "small matrix" strategy and a "large matrix" strategy. We describe a prototype implementation of the technique built with the BLAS-like Library Instantiation Software (BLIS) framework and report its performance on a range of architectures.
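The fusion idea in the abstract can be illustrated with a small sketch. The code below is not the BLIS implementation and uses no BLIS API; the function name `gemm_fused_pack`, the block sizes `mr` and `nr`, and the choice to pack only panels of B are all assumptions made for illustration. It computes C += A·B on row-major nested lists. For each column panel of B, the first block of rows of A reads B from its original layout and writes each row slice into a contiguous buffer as a side effect; every later block of rows reads from that buffer instead.

```python
def gemm_fused_pack(A, B, C, mr=2, nr=2):
    """Compute C += A @ B in place on row-major nested lists.

    Illustrative sketch only: packing of each B panel is fused with the
    first computational pass that uses the panel (hypothetical names).
    """
    m, k = len(A), len(B)
    n = len(B[0]) if k else 0
    for j0 in range(0, n, nr):              # loop over column panels of B
        jw = min(nr, n - j0)                # panel width (handles edge panel)
        packed = [0] * (k * jw)             # contiguous buffer for the panel
        for i0 in range(0, m, mr):          # loop over row blocks of A
            rows = range(i0, min(i0 + mr, m))
            for p in range(k):
                if i0 == 0:
                    # First pass: read B in its original layout and
                    # store the slice into the packed buffer as we go.
                    b_row = B[p][j0:j0 + jw]
                    packed[p * jw:(p + 1) * jw] = b_row
                else:
                    # Later passes: reuse the contiguous packed copy.
                    b_row = packed[p * jw:(p + 1) * jw]
                # Rank-1 update of the (up to) mr x jw tile of C.
                for i in rows:
                    a = A[i][p]
                    c_row = C[i]
                    for j in range(jw):
                        c_row[j0 + j] += a * b_row[j]
    return C
```

In this sketch, a problem with a single row block (m ≤ `mr`) never re-reads the packed buffer, so the only packing cost it pays is the store into that buffer, while larger problems still get contiguous access on every later pass. This is one way to see why a single code path might serve both regimes without a separate "small matrix" branch.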

    Published In

    ICS '23: Proceedings of the 37th ACM International Conference on Supercomputing
    June 2023
    505 pages
    ISBN:9798400700569
    DOI:10.1145/3577193
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. BLAS
    2. BLIS
    3. matrix multiplication
    4. performance
    5. CPU

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%
