
Exploiting parallelism in matrix-computation kernels for symmetric multiprocessor systems: Matrix-multiplication and matrix-addition algorithm optimizations by software pipelining and threads allocation

Published: 07 December 2011

Abstract

We present a simple and efficient methodology for the development, tuning, and installation of matrix algorithms such as the hybrid Strassen and Winograd fast matrix multiplications, or their combination with the 3M algorithm for complex matrices (hybrid in the sense that the algorithm recurses as in Strassen's method until a highly tuned BLAS matrix multiplication offers a performance advantage). We investigate how modern Symmetric Multiprocessor (SMP) architectures present old and new challenges that can be addressed by combining algorithm design with careful and natural exploitation of parallelism at the function level, through optimizations such as function-call parallelism, function percolation, and function software pipelining.
We make three contributions. First, we present a performance overview for double- and double-complex-precision matrices on state-of-the-art SMP systems. Second, we introduce new algorithm implementations: a variant of the 3M algorithm and two new schedules of Winograd's matrix multiplication (achieving up to 20% speedup with respect to regular matrix multiplication); of these two Winograd schedules, one is designed to minimize the number of matrix additions and the other to minimize the computation latency of the matrix additions. Third, we apply software pipelining and threads allocation to all the algorithms and show how this yields up to 10% further performance improvement.
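
The hybrid recursion described in the abstract's first paragraph can be illustrated with a minimal NumPy sketch (not the authors' tuned implementation): it recurses with Winograd's seven-multiplication variant of Strassen's algorithm and falls back to a highly tuned BLAS multiplication (NumPy's @ operator here) once blocks reach a hypothetical cutoff LEAF, whose real value is machine-dependent and found by tuning.

    import numpy as np

    # Hypothetical cutoff below which the tuned BLAS kernel wins.
    LEAF = 256

    def winograd_mm(A, B):
        """C = A @ B for square matrices, recursing while the size is even."""
        n = A.shape[0]
        if n <= LEAF or n % 2:
            return A @ B                      # leaf: highly tuned BLAS GEMM
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # Pre-additions on the operands (Winograd's schedule)
        S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
        T1 = B12 - B11; T2 = B22 - T1; T3 = B22 - B12; T4 = T2 - B21
        # Seven recursive products instead of the classical eight
        P1 = winograd_mm(A11, B11); P2 = winograd_mm(A12, B21)
        P3 = winograd_mm(S4, B22);  P4 = winograd_mm(A22, T4)
        P5 = winograd_mm(S1, T1);   P6 = winograd_mm(S2, T2)
        P7 = winograd_mm(S3, T3)
        # Post-additions assemble the four result quadrants
        U2 = P1 + P6; U3 = U2 + P7; U4 = U2 + P5
        C = np.empty_like(A)
        C[:h, :h] = P1 + P2
        C[:h, h:] = U4 + P3
        C[h:, :h] = U3 - P4
        C[h:, h:] = U3 + P5
        return C

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        A = rng.standard_normal((1024, 1024))
        B = rng.standard_normal((1024, 1024))
        assert np.allclose(winograd_mm(A, B), A @ B)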

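Similarly, the 3M variant mentioned in the second contribution builds on the classical identity that forms a complex matrix product from three real products instead of four. The following is a minimal sketch of that identity only (not the paper's variant or its schedule), again leaning on the BLAS-backed real multiply:

    import numpy as np

    def mm_3m(A, B):
        """Complex C = A @ B from three real products (classical 3M identity)."""
        Ar, Ai = np.ascontiguousarray(A.real), np.ascontiguousarray(A.imag)
        Br, Bi = np.ascontiguousarray(B.real), np.ascontiguousarray(B.imag)
        T1 = Ar @ Br                  # each real product is one tuned DGEMM
        T2 = Ai @ Bi
        T3 = (Ar + Ai) @ (Br + Bi)
        # real(C) = Ar@Br - Ai@Bi,  imag(C) = T3 - T1 - T2
        return (T1 - T2) + 1j * (T3 - T1 - T2)

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        A = rng.standard_normal((512, 512)) + 1j * rng.standard_normal((512, 512))
        B = rng.standard_normal((512, 512)) + 1j * rng.standard_normal((512, 512))
        assert np.allclose(mm_3m(A, B), A @ B)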

Published In

ACM Transactions on Mathematical Software, Volume 38, Issue 1
November 2011
144 pages
ISSN: 0098-3500
EISSN: 1557-7295
DOI: 10.1145/2049662

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

Published: 07 December 2011
Accepted: 01 November 2010
Revised: 01 September 2010
Received: 01 December 2009
Published in TOMS Volume 38, Issue 1

Author Tags

1. Matrix multiplications
2. fast algorithms
3. parallelism
4. software pipeline

Qualifiers

• Research-article
• Research
• Refereed

Cited By

• (2023) Pebbling Game and Alternative Basis for High Performance Matrix Multiplication. SIAM Journal on Scientific Computing 45:6, C277-C303. DOI: 10.1137/22M1502719. Online publication date: 15-Nov-2023.
• (2020) Strassen's Algorithm Reloaded on GPUs. ACM Transactions on Mathematical Software 46:1, 1-22. DOI: 10.1145/3372419. Online publication date: 20-Mar-2020.
• (2020) Matrix Multiplication, a Little Faster. Journal of the ACM 67:1, 1-31. DOI: 10.1145/3364504. Online publication date: 15-Jan-2020.
• (2019) Probabilistic tensors and opportunistic boolean matrix multiplication. Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, 496-515. DOI: 10.5555/3310435.3310466. Online publication date: 6-Jan-2019.
• (2019) Improving the Complexity of Block Low-Rank Factorizations with Fast Matrix Arithmetic. SIAM Journal on Matrix Analysis and Applications 40:4, 1478-1496. DOI: 10.1137/19M1255628. Online publication date: 26-Nov-2019.
• (2019) Multilevel Approaches to Fine Tune Performance of Linear Algebra Libraries. 2019 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), 1-6. DOI: 10.1109/ISSPIT47144.2019.9001832. Online publication date: Dec-2019.
• (2018) Strassen's Algorithm for Tensor Contraction. SIAM Journal on Scientific Computing 40:3, C305-C326. DOI: 10.1137/17M1135578. Online publication date: Jan-2018.
• (2017) Matrix Multiplication, a Little Faster. Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, 101-110. DOI: 10.1145/3087556.3087579. Online publication date: 24-Jul-2017.
• (2017) Fast Matrix Operations in Computer Algebra. 2017 19th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 67-70. DOI: 10.1109/SYNASC.2017.00021. Online publication date: Sep-2017.
• (2017) Generating Families of Practical Fast Matrix Multiplication Algorithms. 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 656-667. DOI: 10.1109/IPDPS.2017.56. Online publication date: May-2017.
