More Web Proxy on the site http://driver.im/

research-article

Block-Cyclic Dense Linear Algebra

Authors:

Woody Lichtenstein,

S. Lennart JohnssonAuthors Info & Claims

SIAM Journal on Scientific Computing, Volume 14, Issue 6

Pages 1259 - 1288

https://doi.org/10.1137/0914075

Published: 01 November 1993 Publication History

Abstract

Block-cyclic order elimination algorithms for LU and OR factorization and solve routines are described for distributed memory architectures with processing nodes configured as two-dimensional arrays of arbitrary shape. The cyclic-order elimination, together with a consecutive data allocation, yields good load balance for both the factorization and solution phases for the solution of dense systems of equations by LU and OR decomposition. Blocking may offer a substantial performance enhancement on architectures for which the level-2 or level-3 BLAS (basic linear algebra subroutines) are ideal for operations local to a node. High-rank updates local to a node may have a performance that is a factor of four or more higher than a rank-1 update.

This paper shows that in many parallel implementations, the $O(N^2 )$ work in the factorization may be of the same significance as the $O(N^3 )$ work, even for large matrices. The $O(N^2 )$ work is poorly load balanced in two-dimensional nodal arrays, which are shown to be optimal with respect to communication for consecutive data allocation, block-cyclic order elimination, and a simple, but fairly general, communications model.

In this Connection Machine system CM-200 implementation, the peak performance for LU factorization is about 9.4 Gflops/s in 64-bit precision and 16 Gflops/s in 32-bit precision. Blocking offers an overall performance enhancement of an approximate factor of two. The broadcast-and-reduce operations fully utilize the bandwidth available in the Boolean cube network interconnecting the nodes along each axis of the two-dimensional nodal array embedded in the cube network.

References

[1]

E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Ducroz, A. Greenbaum, S. Hammarling, A. McKenney, D. Sorensen, Preliminary LAPACK User's Guide, Tech. Rep., University of Tennessee, 1991, July

[2]

J.-P. Brunet, S. L. Johnsson, All-to-all broadcast with applications on the Connection Machine, Internat. J. Supercomputer Appl., 3 (1992),

[3]

Mee Yee Chan, Embedding of grids into optimal hypercubes, SIAM J. Comput., 20 (1991), 834–864

Digital Library

[4]

J. J. Dongarra, J. DuCroz, I. Duff, S. Hammarling, A Set of Level 3 Basic Linear Algebra Subprograms, Tech. Rep. Reprint, 1, Argonne National Laboratories, Mathematics and Computer Science Division, Argonne, IL, 1988, August

[5]

J. J. Dongarra, J. DuCroz, I. Duff, S. Hammarling, A Set of Level 3 Basic Linear Algebra Subprograms: Model Implementation and Test Programs, Tech. Rep. Reprint, 2, Argonne National Laboratories, Mathematics and Computer Science Division, Argonne, IL, 1988, August

[6]

J. J. Dongarra, J. DuCroz, S. Hammarling, R. J. Hanson, An Extended Set of Fortran Basic Linear Algebra Subprograms, Tech. Rep. Tech. Memo., 41, Argonne National Laboratories, Mathematics and Computer Science Division, Argonne, IL, 1986, November

[7]

G. C. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, M. Wu, Fortran D language specification, Tech. Rep., TR90-141, Department of Computer Science, Rice University, Houston, TX, 1990, December

[8]

G. C. Fox, M. A. Johnsson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, W. Furmanski, Solving Problems on Concurrent Processors, Prentice-Hall, Englewood Cliffs, NJ, 1988

Digital Library

[9]

W. Morven Gentleman, Some complexity results for matrix computations on parallel processors, J. Assoc. Comput. Mach., 25 (1978), 112–115

Digital Library

[10]

P. Harten, H.-Y. Lin, H. Rajic, D. Wells, BLAS and FFT performance on the Intel i860 microprocessor, Tech. Rep., 1991

[11]

Ivan Havel, Jaroslav Morávek, B-valuations of graphs, Czechoslovak Math. J., 22(97) (1972), 338–351

[12]

C.-T. Ho, S. L. Johnsson, Embedding meshes in Boolean cubes by graph decomposition, J. Parallel Distributed Computing, 4 (1990), 325–339

Digital Library

[13]

S. L. Johnsson, Band matrix systems solvers on ensemble architecturesAlgorithms, Architecture, and the Future of Scientific Computation, University of Texas Press, Austin, TX, 1985, 195–216

[14]

S. L. Johnsson, Dense matrix operations on a torus and a Boolean cube, 1985, in The National Computer Conference, July

[15]

S. L. Johnsson, Fast banded systems solvers for ensemble architectures, Tech. Rep., YALEU/DCS/RR-379, Department of Computer Science, Yale University, New Haven, CT, 1985, March

[16]

S. Lennart Johnsson, Solving narrow banded systems on ensemble architectures, ACM Trans. Math. Software, 11 (1985), 271–288

Digital Library

[17]

S. L. Johnsson, Communication efficient basic linear algebra computations on hypercube architectures, J. Parallel Distributed Computing, 2 (1987), 133–172

[18]

S. Lennart Johnsson, Ching-Tien Ho, Optimum broadcasting and personalized communication in hypercubes, IEEE Trans. Comput., 38 (1989), 1249–1268

Digital Library

[19]

S. L. Johnsson, Ching-Tien Ho, Generalized shuffle permutations on Boolean cubes, J. Parallel Distributed Computing, 1 (1992), 1–14

[20]

S. L. Johnsson, L. F. Ortiz, Local Basic Linear Algebra Subroutines (BLAS) on the Connection Machine System CM-200, Tech. Rep., TMC-226, Thinking Machines Corp., Cambridge, MA, 1992, April

[21]

C. L. Lawson, R. J. Hanson, D. R. Kincaid, F. T. Krogh, Basic Linear Algebra Subprograms for Fortran Usage, ACM Trans. on Math. Software, 3 (1979), 308–323

Digital Library

[22]

G. Li, T. F. Coleman, A parallel triangular solver for a distributed-memory multiprocessor, SIAM J. Sci. Statist. Comput., 9 (1988), 485–502

Digital Library

[23]

G. Li, T. F. Coleman, A new method for solving triangular systems on distributed-memory message-passing multiprocessors, SIAM J. Sci. Statist. Comput., 10 (1989), 382–396

Digital Library

[24]

K. K. Mathur, S. L. Johnsson, Multiplication of Matrices of Arbitrary Shape on a Data Parallel Computer, Tech. Rep., 216, Thinking Machines Corp., Cambridge, MA, 1991, December

[25]

Edward M. Reingold, Jurg Nievergelt, Narsingh Deo, Combinatorial algorithms: theory and practice, Prentice-Hall Inc., Englewood Cliffs, N.J., 1977xii+433

Digital Library

[26]

B. T. Smith, J. M. Boyle, J. J. Dongarra, B. S. Garbov, Y. Ikebe, V. C. Klema, C. B. Moler, Matrix eigensystem routines—EISPACK guide, Springer-Verlag, Berlin, 1976ix+551, Lecture Notes in Computer Science, Vol. 6, New York

[27]

CMSSL for Fortran, 1990

[28]

CM-200 Technical Summary, 1991

[29]

, CM Fortran optimization notes: slicewise model, version 1.0, 1991

[30]

R. A. van de Geijn, Massively parallel LINPACK benchmark on the Intel Touchstone Delta and iPSC/860 systems, Tech. Rep., The University of Texas, Austin, 1991, July

[31]

H. Zima, P. Brezany, B. Chapman, P. Mehrotra, A. Schwald, Vienna Fortran—A Language Specification version 1.1, Tech. Rep., 21, Institute for Computer Applications in Science and Engineering, Hampton, VA, 1992, Interim Rep., March

Cited By

Zhang YChen YTang YKahaner D(2007)Block size selection of parallel LU and QR on PVP-based and RISC-based supercomputersProceedings of the 2007 Asian technology information program's (ATIP's) 3rd workshop on High performance computing in China: solution approaches to impediments for high performance computing10.1145/1375783.1375809(115-125)Online publication date: 11-Nov-2007
https://dl.acm.org/doi/10.1145/1375783.1375809
Chen ZDongarra JLuszczek PRoche K(2003)Self-adapting software for numerical linear algebra and LAPACK for clustersParallel Computing10.1016/j.parco.2003.05.01429:11-12(1723-1743)Online publication date: 1-Nov-2003
https://dl.acm.org/doi/10.1016/j.parco.2003.05.014
Al-Ayyoub ADay K(2003)Node-ranking schemes for the star networksJournal of Parallel and Distributed Computing10.1016/S0743-7315(02)00056-463:3(239-250)Online publication date: 1-Mar-2003
https://dl.acm.org/doi/10.1016/S0743-7315%2802%2900056-4
Show More Cited By

Index Terms

Block-Cyclic Dense Linear Algebra

Index terms have been assigned to the content through auto-classification.

Recommendations

Parallel Randomized and Matrix-Free Direct Solvers for Large Structured Dense Linear Systems
^† Special Section on Two Themes: CSE Software and Big Data in CSE

We design efficient and distributed-memory parallel randomized direct solvers for large structured dense linear systems, including a fully matrix-free version based on matrix-vector multiplications and a partially matrix-free one. The dense coefficient ...
Some results on determinants and inverses of nonsingular pentadiagonal matrices

A block matrix analysis is proposed to justify, and modify, a known algorithm for computing in O ( n ) time the determinant of a nonsingular n n pentadiagonal matrix ( n 6 ) having nonzero entries on its second subdiagonal. Also, we describe a procedure ...
Improving the Complexity of Block Low-Rank Factorizations with Fast Matrix Arithmetic

We consider the LU factorization of an $n\times n$ matrix $A$ represented as a block low-rank (BLR) matrix: most of its off-diagonal blocks are approximated by matrices of small rank $r$, which reduces the asymptotic complexity of computing the LU ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image SIAM Journal on Scientific Computing

SIAM Journal on Scientific Computing Volume 14, Issue 6

Nov 1993

245 pages

ISSN:1064-8275

Issue’s Table of Contents

Copyright © 1993 Society for Industrial and Applied Mathematics.

Publisher

Society for Industrial and Applied Mathematics

United States

Publication History

Published: 01 November 1993

Author Tags

Author Tags

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

13
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 02 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Zhang YChen YTang YKahaner D(2007)Block size selection of parallel LU and QR on PVP-based and RISC-based supercomputersProceedings of the 2007 Asian technology information program's (ATIP's) 3rd workshop on High performance computing in China: solution approaches to impediments for high performance computing10.1145/1375783.1375809(115-125)Online publication date: 11-Nov-2007
https://dl.acm.org/doi/10.1145/1375783.1375809
Chen ZDongarra JLuszczek PRoche K(2003)Self-adapting software for numerical linear algebra and LAPACK for clustersParallel Computing10.1016/j.parco.2003.05.01429:11-12(1723-1743)Online publication date: 1-Nov-2003
https://dl.acm.org/doi/10.1016/j.parco.2003.05.014
Al-Ayyoub ADay K(2003)Node-ranking schemes for the star networksJournal of Parallel and Distributed Computing10.1016/S0743-7315(02)00056-463:3(239-250)Online publication date: 1-Mar-2003
https://dl.acm.org/doi/10.1016/S0743-7315%2802%2900056-4
Choi J(2001)Parallel Factorizations with Algorithmic BlockingProceedings of the International Conference on Computational Sciences-Part I10.5555/645455.653746(802-811)Online publication date: 28-May-2001
https://dl.acm.org/doi/10.5555/645455.653746
Al-Ayyoub AOuld-Khaoua MDay K(2001)On the Performance of Parallel Matrix Factorisation on the HypermeshThe Journal of Supercomputing10.1023/A:101114020352820:1(37-53)Online publication date: 1-Aug-2001
https://dl.acm.org/doi/10.1023/A%3A1011140203528
Petitet ADongarra J(1999)Algorithmic Redistribution Methods for Block-Cyclic DecompositionsIEEE Transactions on Parallel and Distributed Systems10.1109/71.81994410:12(1201-1216)Online publication date: 1-Dec-1999
https://dl.acm.org/doi/10.1109/71.819944
Huang YMcColl W(1999)A BSP Bareiss Algorithm for Toeplitz SystemsJournal of Parallel and Distributed Computing10.1006/jpdc.1998.150956:2(99-121)Online publication date: 1-Feb-1999
https://dl.acm.org/doi/10.1006/jpdc.1998.1509
Balle SPedersen P(1997)SVD Computations on the Connection Machine CM-5/CM-5ESIAM Journal on Scientific Computing10.1137/S106482759528135618:5(1462-1478)Online publication date: 1-Sep-1997
https://dl.acm.org/doi/10.1137/S1064827595281356
Al-Ayyoub ADay K(1997)Matrix Decomposition on the Star GraphIEEE Transactions on Parallel and Distributed Systems10.1109/71.6057678:8(803-812)Online publication date: 1-Aug-1997
https://dl.acm.org/doi/10.1109/71.605767
Choi JDongarra JPozo RSorensen DWalker D(1994)Crpc Research Into Linear Algebra Software for High Performance ComputersInternational Journal of High Performance Computing Applications10.1177/1094342094008002038:2(99-118)Online publication date: 1-Jun-1994
https://dl.acm.org/doi/10.1177/109434209400800203
Show More Cited By

View Options

View options

Figures

Tables

Media

View Issue’s Table of Contents