DOI: 10.1145/3472456.3473519 (ICPP Conference Proceedings, research article)

Efficiently Parallelizable Strassen-Based Multiplication of a Matrix by its Transpose

Published: 05 October 2021

Abstract

The multiplication of a matrix by its transpose, AᵀA, appears as an intermediate operation in the solution of a wide set of problems. In this paper, we propose a new cache-oblivious algorithm (AtA) for computing this product, built upon the classical Strassen algorithm as a subroutine. In particular, we decrease the computational cost to that of Strassen's algorithm, i.e., O(n^(log₂ 7)) ≈ O(n^2.81) floating point operations. AtA works for generic rectangular matrices and exploits the peculiar symmetry of the resulting product matrix to save memory. In addition, we provide an extensive implementation study of AtA in a shared-memory system, and extend its applicability to a distributed environment. To support our findings, we compare our algorithm with state-of-the-art solutions specialized in the computation of AᵀA. Our experiments highlight good scalability with respect to both the matrix size and the number of involved processes, as well as favorable performance for both the parallel paradigms and the sequential implementation, when compared with other methods in the literature.
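The symmetry the abstract mentions can be illustrated with a minimal sketch of the block recursion such an algorithm can use. This is not the authors' implementation: the function name `ata` and the `cutoff` parameter are chosen here for illustration, and NumPy's `@` stands in where a Strassen subroutine would be called.

```python
import numpy as np

def ata(a, cutoff=64):
    """Sketch: compute a.T @ a recursively, exploiting symmetry.

    Splitting a = [a1 | a2] by columns gives the 2x2 block result
        [[a1.T a1,     a1.T a2],
         [(a1.T a2).T, a2.T a2]]
    so only the two diagonal blocks recurse, and a single
    off-diagonal block needs a general multiply (where a Strassen
    routine would be plugged in).
    """
    n = a.shape[1]
    if n <= cutoff:
        return a.T @ a                 # base case: plain multiply
    k = n // 2
    a1, a2 = a[:, :k], a[:, k:]
    c11 = ata(a1, cutoff)              # symmetric block, recurse
    c22 = ata(a2, cutoff)              # symmetric block, recurse
    c12 = a1.T @ a2                    # general product; Strassen would go here
    top = np.hstack([c11, c12])
    bot = np.hstack([c12.T, c22])      # reuse c12: the transpose is free
    return np.vstack([top, bot])

rng = np.random.default_rng(0)
a = rng.standard_normal((300, 200))    # works for rectangular inputs
assert np.allclose(ata(a), a.T @ a)
```

Because the off-diagonal block is computed once and mirrored, roughly half of the general multiplications of a naive block product are avoided, which is the source of the memory and work savings described above.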





Published In

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing
August 2021
927 pages
ISBN:9781450390682
DOI:10.1145/3472456
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. C/C++
  2. Cache-oblivious algorithm
  3. Fast Linear Algebra
  4. MPI
  5. Matrix Multiplication
  6. OpenMP

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • ERC

Conference

ICPP 2021

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%


Cited By
  • (2023) "An Efficient Parallel Divide-and-Conquer Algorithm for Generalized Matrix Multiplication." 2023 IEEE 13th Annual Computing and Communication Workshop and Conference (CCWC), 0442–0449. DOI: 10.1109/CCWC57344.2023.10099141. Online publication date: 8-Mar-2023.
  • (2023) "Acceleration of Fully Connected Layers on FPGA using the Strassen Matrix Multiplication." 2023 IEEE 5th International Conference on BioInspired Processing (BIP), 1–6. DOI: 10.1109/BIP60195.2023.10379257. Online publication date: 28-Nov-2023.
  • (2022) "Fast Blockwise Matrix-Matrix Multiplication Using AVX and Prefetching on Shared Memory." 2022 13th Asian Control Conference (ASCC), 448–452. DOI: 10.23919/ASCC56756.2022.9828222. Online publication date: 4-May-2022.
  • (2022) "pylspack: Parallel Algorithms and Data Structures for Sketching, Column Subset Selection, Regression, and Leverage Scores." ACM Transactions on Mathematical Software 48, 4, 1–27. DOI: 10.1145/3555370. Online publication date: 19-Dec-2022.
