research-article

Towards a Unified Implementation of GEMM in BLIS

Authors:

Field G. Van Zee,

Robert A. van de Geijn

ICS '23: Proceedings of the 37th International Conference on Supercomputing

Pages 111 - 121

https://doi.org/10.1145/3577193.3593707

Published: 21 June 2023

Abstract

Matrix libraries often focus on achieving high performance for problems considered to be either "small" or "large", as these two scenarios tend to respond best to different optimization strategies. We propose a unified technique for implementing matrix operations like general matrix multiplication (gemm) that can achieve high performance for both small and large problem sizes. The key is to fuse packing (an operation that copies data to a contiguous layout in memory and is critical for large-matrix performance) with the first computational "pass" over that data. This boosts performance across the problem size spectrum. As a result, tuning general-purpose libraries becomes simpler, since the unified technique obviates the need to carefully express and parameterize logic that chooses between a "small matrix" strategy and a "large matrix" strategy. We describe a prototype implementation of the technique built with the BLAS-like Library Instantiation Software (BLIS) framework and report its performance on a range of architectures.
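
The idea of fusing packing with the first computational pass can be illustrated with a minimal sketch. This is not the authors' BLIS implementation: the function names (`gemm_fused_pack`, `gemm_reference`), the block sizes `mr` and `nr`, and the choice to pack only panels of B are assumptions made for illustration, and pure Python lists stand in for strided memory. The first row block of A that touches a column panel of B reads that panel from its original layout and writes the contiguous copy as a side effect; later row blocks reuse the packed copy.

```python
def gemm_reference(A, B, m, n, k):
    """Plain triple loop, used only to check the sketch below."""
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def gemm_fused_pack(A, B, m, n, k, mr=2, nr=2):
    """Compute C = A * B for nested lists A (m x k) and B (k x n).

    Returns (C, b_reads), where b_reads counts how many elements were
    read from B in its original (unpacked) layout.
    Block sizes mr and nr are illustrative, not tuned values.
    """
    C = [[0.0] * n for _ in range(m)]
    b_reads = 0
    for j0 in range(0, n, nr):              # column panel of B
        nb = min(nr, n - j0)
        packed = [0.0] * (k * nb)           # contiguous buffer for this panel
        for i0 in range(0, m, mr):          # row block of A
            mb = min(mr, m - i0)
            for p in range(k):
                if i0 == 0:
                    # First pass over this panel: read B from its original
                    # layout and store the packed copy while computing.
                    b_row = [B[p][j0 + j] for j in range(nb)]
                    packed[p * nb:(p + 1) * nb] = b_row
                    b_reads += nb
                else:
                    # Later passes: reuse the contiguous packed copy.
                    b_row = packed[p * nb:(p + 1) * nb]
                for i in range(mb):
                    a = A[i0 + i][p]
                    for j in range(nb):
                        C[i0 + i][j0 + j] += a * b_row[j]
    return C, b_reads
```

In this sketch there is no separate packing loop that must finish before computation starts, so a problem with a single row block never pays for a standalone copy, while a problem with many row blocks still gets the packed layout on every pass after the first. Each element of B is read from its original layout exactly once.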

Index Terms

Towards a Unified Implementation of GEMM in BLIS
1. Mathematics of computing
  1. Mathematical software
    1. Mathematical software performance

Published In

ICS '23: Proceedings of the 37th ACM International Conference on Supercomputing

June 2023

505 pages

ISBN:9798400700569

DOI:10.1145/3577193

Chair:
Kyle Gallivan,
Co-chair:
Efstratios Gallopoulos,
Program Co-chairs:
Dimitrios S. Nikolopoulos,
Ramon Beivide

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICS '23

Sponsor:

SIGARCH

ICS '23: 37th International Conference on Supercomputing

June 21 - 23, 2023

Orlando, FL, USA

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
