
Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems: Matrix-multiplication and matrix-addition algorithm optimizations by software pipelining and threads allocation

Published: 07 December 2011

Abstract

We present a simple and efficient methodology for the development, tuning, and installation of matrix algorithms such as the hybrid Strassen and Winograd fast matrix multiplications, or their combination with the 3M algorithm for complex matrices (hybrid in the sense that the algorithm recurses as in Strassen's method until a highly tuned BLAS matrix multiplication offers a performance advantage). We investigate how modern Symmetric Multiprocessor (SMP) architectures present old and new challenges that can be addressed by combining algorithm design with careful and natural exploitation of parallelism at the function level, through optimizations such as function-call parallelism, function percolation, and function software pipelining.
We make three contributions. First, we present a performance overview for double- and double-complex-precision matrices on state-of-the-art SMP systems. Second, we introduce new algorithm implementations: a variant of the 3M algorithm and two new schedules of Winograd's matrix multiplication (achieving up to 20% speedup with respect to regular matrix multiplication); of these two Winograd schedules, one is designed to minimize the number of matrix additions and the other to minimize the computation latency of the matrix additions. Third, we apply software pipelining and threads allocation to all the algorithms and show how this yields up to 10% further performance improvement.
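
The hybrid recursion described in the abstract's first paragraph can be illustrated with a minimal NumPy sketch (not the authors' tuned implementation): it recurses with Winograd's seven-multiplication variant of Strassen's algorithm and falls back to a highly tuned BLAS multiplication (NumPy's @ operator here) once blocks reach a hypothetical cutoff LEAF, whose real value is machine-dependent and found by tuning.

    import numpy as np

    # Hypothetical cutoff below which the tuned BLAS kernel wins.
    LEAF = 256

    def winograd_mm(A, B):
        """C = A @ B for square matrices, recursing while the size is even."""
        n = A.shape[0]
        if n <= LEAF or n % 2:
            return A @ B                      # leaf: highly tuned BLAS GEMM
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # Pre-additions on the operands (Winograd's schedule)
        S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
        T1 = B12 - B11; T2 = B22 - T1; T3 = B22 - B12; T4 = T2 - B21
        # Seven recursive products instead of the classical eight
        P1 = winograd_mm(A11, B11); P2 = winograd_mm(A12, B21)
        P3 = winograd_mm(S4, B22);  P4 = winograd_mm(A22, T4)
        P5 = winograd_mm(S1, T1);   P6 = winograd_mm(S2, T2)
        P7 = winograd_mm(S3, T3)
        # Post-additions assemble the four result quadrants
        U2 = P1 + P6; U3 = U2 + P7; U4 = U2 + P5
        C = np.empty_like(A)
        C[:h, :h] = P1 + P2
        C[:h, h:] = U4 + P3
        C[h:, :h] = U3 - P4
        C[h:, h:] = U3 + P5
        return C

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        A = rng.standard_normal((1024, 1024))
        B = rng.standard_normal((1024, 1024))
        assert np.allclose(winograd_mm(A, B), A @ B)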

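Similarly, the 3M variant mentioned in the second contribution builds on the classical identity that forms a complex matrix product from three real products instead of four. The following is a minimal sketch of that identity only (not the paper's variant or its schedule), again leaning on the BLAS-backed real multiply:

    import numpy as np

    def mm_3m(A, B):
        """Complex C = A @ B from three real products (classical 3M identity)."""
        Ar, Ai = np.ascontiguousarray(A.real), np.ascontiguousarray(A.imag)
        Br, Bi = np.ascontiguousarray(B.real), np.ascontiguousarray(B.imag)
        T1 = Ar @ Br                  # each real product is one tuned DGEMM
        T2 = Ai @ Bi
        T3 = (Ar + Ai) @ (Br + Bi)
        # real(C) = Ar@Br - Ai@Bi,  imag(C) = T3 - T1 - T2
        return (T1 - T2) + 1j * (T3 - T1 - T2)

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        A = rng.standard_normal((512, 512)) + 1j * rng.standard_normal((512, 512))
        B = rng.standard_normal((512, 512)) + 1j * rng.standard_normal((512, 512))
        assert np.allclose(mm_3m(A, B), A @ B)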

Published In

ACM Transactions on Mathematical Software, Volume 38, Issue 1
November 2011
144 pages
ISSN: 0098-3500
EISSN: 1557-7295
DOI: 10.1145/2049662

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

Published: 07 December 2011
Accepted: 01 November 2010
Revised: 01 September 2010
Received: 01 December 2009
Published in TOMS Volume 38, Issue 1

Author Tags

1. Matrix multiplications
2. fast algorithms
3. parallelism
4. software pipeline

Qualifiers

• Research-article
• Research
• Refereed

Cited By

• (2023) Pebbling Game and Alternative Basis for High Performance Matrix Multiplication. SIAM Journal on Scientific Computing 45:6, C277-C303. DOI: 10.1137/22M1502719. Online publication date: 15-Nov-2023.
• (2020) Strassen's Algorithm Reloaded on GPUs. ACM Transactions on Mathematical Software 46:1, 1-22. DOI: 10.1145/3372419. Online publication date: 20-Mar-2020.
• (2020) Matrix Multiplication, a Little Faster. Journal of the ACM 67:1, 1-31. DOI: 10.1145/3364504. Online publication date: 15-Jan-2020.
• (2019) Probabilistic tensors and opportunistic boolean matrix multiplication. Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, 496-515. DOI: 10.5555/3310435.3310466. Online publication date: 6-Jan-2019.
• (2019) Improving the Complexity of Block Low-Rank Factorizations with Fast Matrix Arithmetic. SIAM Journal on Matrix Analysis and Applications 40:4, 1478-1496. DOI: 10.1137/19M1255628. Online publication date: 26-Nov-2019.
• (2019) Multilevel Approaches to Fine Tune Performance of Linear Algebra Libraries. 2019 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), 1-6. DOI: 10.1109/ISSPIT47144.2019.9001832. Online publication date: Dec-2019.
• (2018) Strassen's Algorithm for Tensor Contraction. SIAM Journal on Scientific Computing 40:3, C305-C326. DOI: 10.1137/17M1135578. Online publication date: Jan-2018.
• (2017) Matrix Multiplication, a Little Faster. Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, 101-110. DOI: 10.1145/3087556.3087579. Online publication date: 24-Jul-2017.
• (2017) Fast Matrix Operations in Computer Algebra. 2017 19th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 67-70. DOI: 10.1109/SYNASC.2017.00021. Online publication date: Sep-2017.
• (2017) Generating Families of Practical Fast Matrix Multiplication Algorithms. 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 656-667. DOI: 10.1109/IPDPS.2017.56. Online publication date: May-2017.
