DOI: 10.1145/3472456.3473519 (ICPP Conference Proceedings, research article)

Efficiently Parallelizable Strassen-Based Multiplication of a Matrix by its Transpose

Published: 05 October 2021

Abstract

The multiplication of a matrix by its transpose, AᵀA, appears as an intermediate operation in the solution of a wide set of problems. In this paper, we propose a new cache-oblivious algorithm (AtA) for computing this product, built upon the classical Strassen algorithm as a subroutine. In particular, we decrease the computational cost to that of Strassen's algorithm, i.e., O(n^(log₂ 7)) ≈ O(n^2.81) floating point operations. AtA works for generic rectangular matrices and exploits the peculiar symmetry of the resulting product matrix to save memory. In addition, we provide an extensive implementation study of AtA in a shared-memory system, and extend its applicability to a distributed environment. To support our findings, we compare our algorithm with state-of-the-art solutions specialized in the computation of AᵀA. Our experiments highlight good scalability with respect to both the matrix size and the number of involved processes, as well as favorable performance for both the parallel paradigms and the sequential implementation, when compared with other methods in the literature.
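The symmetry the abstract mentions can be illustrated with a minimal sketch of the block recursion such an algorithm can use. This is not the authors' implementation: the function name `ata` and the `cutoff` parameter are chosen here for illustration, and NumPy's `@` stands in where a Strassen subroutine would be called.

```python
import numpy as np

def ata(a, cutoff=64):
    """Sketch: compute a.T @ a recursively, exploiting symmetry.

    Splitting a = [a1 | a2] by columns gives the 2x2 block result
        [[a1.T a1,     a1.T a2],
         [(a1.T a2).T, a2.T a2]]
    so only the two diagonal blocks recurse, and a single
    off-diagonal block needs a general multiply (where a Strassen
    routine would be plugged in).
    """
    n = a.shape[1]
    if n <= cutoff:
        return a.T @ a                 # base case: plain multiply
    k = n // 2
    a1, a2 = a[:, :k], a[:, k:]
    c11 = ata(a1, cutoff)              # symmetric block, recurse
    c22 = ata(a2, cutoff)              # symmetric block, recurse
    c12 = a1.T @ a2                    # general product; Strassen would go here
    top = np.hstack([c11, c12])
    bot = np.hstack([c12.T, c22])      # reuse c12: the transpose is free
    return np.vstack([top, bot])

rng = np.random.default_rng(0)
a = rng.standard_normal((300, 200))    # works for rectangular inputs
assert np.allclose(ata(a), a.T @ a)
```

Because the off-diagonal block is computed once and mirrored, roughly half of the general multiplications of a naive block product are avoided, which is the source of the memory and work savings described above.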





Published In

ICPP '21: Proceedings of the 50th International Conference on Parallel Processing
August 2021
927 pages
ISBN:9781450390682
DOI:10.1145/3472456
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. C/C++
  2. Cache-oblivious algorithm
  3. Fast Linear Algebra
  4. MPI
  5. Matrix Multiplication
  6. OpenMP

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • ERC

Conference

ICPP 2021

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%


Cited By
  • (2023) "An Efficient Parallel Divide-and-Conquer Algorithm for Generalized Matrix Multiplication." 2023 IEEE 13th Annual Computing and Communication Workshop and Conference (CCWC), 0442–0449. DOI: 10.1109/CCWC57344.2023.10099141. Online publication date: 8-Mar-2023.
  • (2023) "Acceleration of Fully Connected Layers on FPGA using the Strassen Matrix Multiplication." 2023 IEEE 5th International Conference on BioInspired Processing (BIP), 1–6. DOI: 10.1109/BIP60195.2023.10379257. Online publication date: 28-Nov-2023.
  • (2022) "Fast Blockwise Matrix-Matrix Multiplication Using AVX and Prefetching on Shared Memory." 2022 13th Asian Control Conference (ASCC), 448–452. DOI: 10.23919/ASCC56756.2022.9828222. Online publication date: 4-May-2022.
  • (2022) "pylspack: Parallel Algorithms and Data Structures for Sketching, Column Subset Selection, Regression, and Leverage Scores." ACM Transactions on Mathematical Software 48, 4, 1–27. DOI: 10.1145/3555370. Online publication date: 19-Dec-2022.
