Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods

Authors:

Field G. Van Zee,

Tyler M. Smith

ACM Transactions on Mathematical Software (TOMS), Volume 44, Issue 1

Article No.: 7, Pages 1 - 36

https://doi.org/10.1145/3086466

Published: 24 July 2017

Abstract

In this article, we explore the implementation of complex matrix multiplication. We begin by briefly identifying various challenges associated with the conventional approach, which calls for a carefully written kernel that implements complex arithmetic at the lowest possible level (i.e., assembly language). We then set out to develop a method of complex matrix multiplication that avoids the need for complex kernels altogether. This constraint promotes code reuse and portability within libraries such as Basic Linear Algebra Subprograms and BLAS-Like Library Instantiation Software (BLIS) and allows kernel developers to focus their efforts on fewer and simpler kernels. We develop two alternative approaches—one based on the 3m method and one that reflects the classic 4m formulation—each with multiple variants, all of which rely only on real matrix multiplication kernels. We discuss the performance characteristics of these “induced” methods and observe that the assembly-level method actually resides along the 4m spectrum of algorithmic variants. Implementations are developed within the BLIS framework, and testing on modern hardware confirms that while the less numerically stable 3m method yields the fastest runtimes, the more stable (and thus widely applicable) 4m method’s performance is somewhat limited due to implementation challenges that appear inherent in nature.
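
As a rough illustration of the two formulations named in the abstract, the sketch below computes a complex product C = AB using only real matrix products. Writing A = Ar + i Ai and B = Br + i Bi, the 4m formulation uses four real products (Cr = Ar Br - Ai Bi, Ci = Ar Bi + Ai Br), while the 3m formulation uses three (P1 = Ar Br, P2 = Ai Bi, P3 = (Ar + Ai)(Br + Bi), giving Cr = P1 - P2 and Ci = P3 - P1 - P2). The helper names `real_gemm`, `gemm_4m`, and `gemm_3m` are hypothetical and are not BLIS functions; plain Python lists stand in for a tuned real kernel, and this is not a reconstruction of any specific variant developed in the article.

```python
# Illustrative sketch only: pure-Python stand-ins, not BLIS code.
# Matrices are lists of rows; real and imaginary parts are stored separately.

def real_gemm(a, b):
    """Real matrix product a @ b (stand-in for a real-domain kernel)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]

def add(a, b):
    """Elementwise sum of two equally sized real matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    """Elementwise difference of two equally sized real matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def gemm_4m(ar, ai, br, bi):
    """Four real products: Cr = Ar Br - Ai Bi, Ci = Ar Bi + Ai Br."""
    cr = sub(real_gemm(ar, br), real_gemm(ai, bi))
    ci = add(real_gemm(ar, bi), real_gemm(ai, br))
    return cr, ci

def gemm_3m(ar, ai, br, bi):
    """Three real products, at the cost of extra additions."""
    p1 = real_gemm(ar, br)
    p2 = real_gemm(ai, bi)
    p3 = real_gemm(add(ar, ai), add(br, bi))
    cr = sub(p1, p2)
    ci = sub(sub(p3, p1), p2)  # P3 - P1 - P2 = Ar Bi + Ai Br
    return cr, ci

# Small example whose entries are exactly representable in binary floating point.
ar = [[1.0, 2.0], [3.0, 4.0]]
ai = [[0.5, -1.0], [2.0, 0.0]]
br = [[2.0, 0.0], [1.0, -1.0]]
bi = [[1.0, 3.0], [0.0, 2.0]]

print(gemm_4m(ar, ai, br, bi))
print(gemm_3m(ar, ai, br, bi))
```

On this small input both functions return identical results. In general the two can differ in rounding, since the 3m imaginary part is formed by subtracting products of summed operands, which is the source of the stability difference the abstract mentions.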

Index Terms

Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods
1. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
      1. Computations on matrices
  2. Mathematical software
    1. Mathematical software performance

Information & Contributors

Information

Published In

ACM Transactions on Mathematical Software Volume 44, Issue 1

March 2018

308 pages

ISSN:0098-3500

EISSN:1557-7295

DOI:10.1145/3071076

Editor:
Daniel Kressner
EPF Lausanne, Switzerland

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 July 2017

Accepted: 01 April 2017

Revised: 01 February 2017

Received: 01 May 2015

Published in TOMS Volume 44, Issue 1

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Science Foundation

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 12
Total Downloads: 1,663
Downloads (Last 12 months): 398
Downloads (Last 6 weeks): 29

Reflects downloads up to 11 Dec 2024

Citations

Cited By

Błażejowski M. (2024). Which C compiler and BLAS/LAPACK library should I use: gretl’s numerical efficiency in different configurations. Computational Statistics. Online publication date: 22-Mar-2024.
https://doi.org/10.1007/s00180-024-01461-w
Kouya T. (2024). Performance Evaluation of Accelerated Complex Multiple-Precision LU Decomposition. Computational Science and Its Applications – ICCSA 2024 Workshops, 3-19. Online publication date: 1-Jul-2024.
https://dl.acm.org/doi/10.1007/978-3-031-65273-8_1
Andersson M, Markidis S. (2023). A Case Study on DaCe Portability & Performance for Batched Discrete Fourier Transforms. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 55-63. Online publication date: 27-Feb-2023.
https://dl.acm.org/doi/10.1145/3578178.3578239
Xu R, Van Zee F, van de Geijn R, Gallivan K, Nikolopoulos D, Beivide R, Gallopoulos E. (2023). Towards a Unified Implementation of GEMM in BLIS. Proceedings of the 37th International Conference on Supercomputing, 111-121. Online publication date: 21-Jun-2023.
https://dl.acm.org/doi/10.1145/3577193.3593707
Kouya T. (2023). Acceleration of Complex Matrix Multiplication Using Arbitrary Precision Floating-Point Arithmetic. 2023 International Conference on Engineering and Emerging Technologies (ICEET), 1-6. Online publication date: 27-Oct-2023.
https://doi.org/10.1109/ICEET60227.2023.10525846
Tukanov N, Srinivasaraghavan R, Moreira J, Low T. (2022). Modeling Matrix Engines for Portability and Performance. 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 1173-1183. Online publication date: May-2022.
https://doi.org/10.1109/IPDPS53621.2022.00117
Van Zee F, Parikh D, Geijn R. (2021). Supporting Mixed-domain Mixed-precision Matrix Multiplication within the BLIS Framework. ACM Transactions on Mathematical Software 47:2, 1-26. Online publication date: 20-Apr-2021.
https://dl.acm.org/doi/10.1145/3402225
Rudow M, Rashmi K, Guruswami V. (2021). A locality-based lens for coded computation. 2021 IEEE International Symposium on Information Theory (ISIT), 1070-1075. Online publication date: 12-Jul-2021.
https://doi.org/10.1109/ISIT45174.2021.9518056
Van Zee F. (2020). Implementing High-Performance Complex Matrix Multiplication via the 1M Method. SIAM Journal on Scientific Computing 42:5, C221-C244. Online publication date: 15-Sep-2020.
https://doi.org/10.1137/19M1282040
Rodríguez-Sánchez R, Igual F, Quintana-Ortí E. (2020). Integration and exploitation of intra-routine malleability in BLIS. The Journal of Supercomputing 76:4, 2860-2875. Online publication date: 1-Apr-2020.
https://dl.acm.org/doi/10.1007/s11227-019-03078-z
