[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Block-Cyclic Dense Linear Algebra

Published: 01 November 1993 Publication History

Abstract

Block-cyclic order elimination algorithms for LU and OR factorization and solve routines are described for distributed memory architectures with processing nodes configured as two-dimensional arrays of arbitrary shape. The cyclic-order elimination, together with a consecutive data allocation, yields good load balance for both the factorization and solution phases for the solution of dense systems of equations by LU and OR decomposition. Blocking may offer a substantial performance enhancement on architectures for which the level-2 or level-3 BLAS (basic linear algebra subroutines) are ideal for operations local to a node. High-rank updates local to a node may have a performance that is a factor of four or more higher than a rank-1 update.
This paper shows that in many parallel implementations, the $O(N^2 )$ work in the factorization may be of the same significance as the $O(N^3 )$ work, even for large matrices. The $O(N^2 )$ work is poorly load balanced in two-dimensional nodal arrays, which are shown to be optimal with respect to communication for consecutive data allocation, block-cyclic order elimination, and a simple, but fairly general, communications model.
In this Connection Machine system CM-200 implementation, the peak performance for LU factorization is about 9.4 Gflops/s in 64-bit precision and 16 Gflops/s in 32-bit precision. Blocking offers an overall performance enhancement of an approximate factor of two. The broadcast-and-reduce operations fully utilize the bandwidth available in the Boolean cube network interconnecting the nodes along each axis of the two-dimensional nodal array embedded in the cube network.

References

[1]
E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Ducroz, A. Greenbaum, S. Hammarling, A. McKenney, D. Sorensen, Preliminary LAPACK User's Guide, Tech. Rep., University of Tennessee, 1991, July
[2]
J.-P. Brunet, S. L. Johnsson, All-to-all broadcast with applications on the Connection Machine, Internat. J. Supercomputer Appl., 3 (1992),
[3]
Mee Yee Chan, Embedding of grids into optimal hypercubes, SIAM J. Comput., 20 (1991), 834–864
[4]
J. J. Dongarra, J. DuCroz, I. Duff, S. Hammarling, A Set of Level 3 Basic Linear Algebra Subprograms, Tech. Rep. Reprint, 1, Argonne National Laboratories, Mathematics and Computer Science Division, Argonne, IL, 1988, August
[5]
J. J. Dongarra, J. DuCroz, I. Duff, S. Hammarling, A Set of Level 3 Basic Linear Algebra Subprograms: Model Implementation and Test Programs, Tech. Rep. Reprint, 2, Argonne National Laboratories, Mathematics and Computer Science Division, Argonne, IL, 1988, August
[6]
J. J. Dongarra, J. DuCroz, S. Hammarling, R. J. Hanson, An Extended Set of Fortran Basic Linear Algebra Subprograms, Tech. Rep. Tech. Memo., 41, Argonne National Laboratories, Mathematics and Computer Science Division, Argonne, IL, 1986, November
[7]
G. C. Fox, S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, M. Wu, Fortran D language specification, Tech. Rep., TR90-141, Department of Computer Science, Rice University, Houston, TX, 1990, December
[8]
G. C. Fox, M. A. Johnsson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, W. Furmanski, Solving Problems on Concurrent Processors, Prentice-Hall, Englewood Cliffs, NJ, 1988
[9]
W. Morven Gentleman, Some complexity results for matrix computations on parallel processors, J. Assoc. Comput. Mach., 25 (1978), 112–115
[10]
P. Harten, H.-Y. Lin, H. Rajic, D. Wells, BLAS and FFT performance on the Intel i860 microprocessor, Tech. Rep., 1991
[11]
Ivan Havel, Jaroslav Morávek, B-valuations of graphs, Czechoslovak Math. J., 22(97) (1972), 338–351
[12]
C.-T. Ho, S. L. Johnsson, Embedding meshes in Boolean cubes by graph decomposition, J. Parallel Distributed Computing, 4 (1990), 325–339
[13]
S. L. Johnsson, Band matrix systems solvers on ensemble architecturesAlgorithms, Architecture, and the Future of Scientific Computation, University of Texas Press, Austin, TX, 1985, 195–216
[14]
S. L. Johnsson, Dense matrix operations on a torus and a Boolean cube, 1985, in The National Computer Conference, July
[15]
S. L. Johnsson, Fast banded systems solvers for ensemble architectures, Tech. Rep., YALEU/DCS/RR-379, Department of Computer Science, Yale University, New Haven, CT, 1985, March
[16]
S. Lennart Johnsson, Solving narrow banded systems on ensemble architectures, ACM Trans. Math. Software, 11 (1985), 271–288
[17]
S. L. Johnsson, Communication efficient basic linear algebra computations on hypercube architectures, J. Parallel Distributed Computing, 2 (1987), 133–172
[18]
S. Lennart Johnsson, Ching-Tien Ho, Optimum broadcasting and personalized communication in hypercubes, IEEE Trans. Comput., 38 (1989), 1249–1268
[19]
S. L. Johnsson, Ching-Tien Ho, Generalized shuffle permutations on Boolean cubes, J. Parallel Distributed Computing, 1 (1992), 1–14
[20]
S. L. Johnsson, L. F. Ortiz, Local Basic Linear Algebra Subroutines (BLAS) on the Connection Machine System CM-200, Tech. Rep., TMC-226, Thinking Machines Corp., Cambridge, MA, 1992, April
[21]
C. L. Lawson, R. J. Hanson, D. R. Kincaid, F. T. Krogh, Basic Linear Algebra Subprograms for Fortran Usage, ACM Trans. on Math. Software, 3 (1979), 308–323
[22]
G. Li, T. F. Coleman, A parallel triangular solver for a distributed-memory multiprocessor, SIAM J. Sci. Statist. Comput., 9 (1988), 485–502
[23]
G. Li, T. F. Coleman, A new method for solving triangular systems on distributed-memory message-passing multiprocessors, SIAM J. Sci. Statist. Comput., 10 (1989), 382–396
[24]
K. K. Mathur, S. L. Johnsson, Multiplication of Matrices of Arbitrary Shape on a Data Parallel Computer, Tech. Rep., 216, Thinking Machines Corp., Cambridge, MA, 1991, December
[25]
Edward M. Reingold, Jurg Nievergelt, Narsingh Deo, Combinatorial algorithms: theory and practice, Prentice-Hall Inc., Englewood Cliffs, N.J., 1977xii+433
[26]
B. T. Smith, J. M. Boyle, J. J. Dongarra, B. S. Garbov, Y. Ikebe, V. C. Klema, C. B. Moler, Matrix eigensystem routines—EISPACK guide, Springer-Verlag, Berlin, 1976ix+551, Lecture Notes in Computer Science, Vol. 6, New York
[27]
CMSSL for Fortran, 1990
[28]
CM-200 Technical Summary, 1991
[29]
, CM Fortran optimization notes: slicewise model, version 1.0, 1991
[30]
R. A. van de Geijn, Massively parallel LINPACK benchmark on the Intel Touchstone Delta and iPSC/860 systems, Tech. Rep., The University of Texas, Austin, 1991, July
[31]
H. Zima, P. Brezany, B. Chapman, P. Mehrotra, A. Schwald, Vienna Fortran—A Language Specification version 1.1, Tech. Rep., 21, Institute for Computer Applications in Science and Engineering, Hampton, VA, 1992, Interim Rep., March

Cited By

View all
  • (2007)Block size selection of parallel LU and QR on PVP-based and RISC-based supercomputersProceedings of the 2007 Asian technology information program's (ATIP's) 3rd workshop on High performance computing in China: solution approaches to impediments for high performance computing10.1145/1375783.1375809(115-125)Online publication date: 11-Nov-2007
  • (2003)Self-adapting software for numerical linear algebra and LAPACK for clustersParallel Computing10.1016/j.parco.2003.05.01429:11-12(1723-1743)Online publication date: 1-Nov-2003
  • (2003)Node-ranking schemes for the star networksJournal of Parallel and Distributed Computing10.1016/S0743-7315(02)00056-463:3(239-250)Online publication date: 1-Mar-2003
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image SIAM Journal on Scientific Computing
SIAM Journal on Scientific Computing  Volume 14, Issue 6
Nov 1993
245 pages

Publisher

Society for Industrial and Applied Mathematics

United States

Publication History

Published: 01 November 1993

Author Tags

  1. 15A23
  2. 65F05
  3. 68-04
  4. 68N99
  5. 68Q25

Author Tags

  1. linear algebra
  2. distributed memory
  3. LU
  4. OR
  5. cyclic order

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 02 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2007)Block size selection of parallel LU and QR on PVP-based and RISC-based supercomputersProceedings of the 2007 Asian technology information program's (ATIP's) 3rd workshop on High performance computing in China: solution approaches to impediments for high performance computing10.1145/1375783.1375809(115-125)Online publication date: 11-Nov-2007
  • (2003)Self-adapting software for numerical linear algebra and LAPACK for clustersParallel Computing10.1016/j.parco.2003.05.01429:11-12(1723-1743)Online publication date: 1-Nov-2003
  • (2003)Node-ranking schemes for the star networksJournal of Parallel and Distributed Computing10.1016/S0743-7315(02)00056-463:3(239-250)Online publication date: 1-Mar-2003
  • (2001)Parallel Factorizations with Algorithmic BlockingProceedings of the International Conference on Computational Sciences-Part I10.5555/645455.653746(802-811)Online publication date: 28-May-2001
  • (2001)On the Performance of Parallel Matrix Factorisation on the HypermeshThe Journal of Supercomputing10.1023/A:101114020352820:1(37-53)Online publication date: 1-Aug-2001
  • (1999)Algorithmic Redistribution Methods for Block-Cyclic DecompositionsIEEE Transactions on Parallel and Distributed Systems10.1109/71.81994410:12(1201-1216)Online publication date: 1-Dec-1999
  • (1999)A BSP Bareiss Algorithm for Toeplitz SystemsJournal of Parallel and Distributed Computing10.1006/jpdc.1998.150956:2(99-121)Online publication date: 1-Feb-1999
  • (1997)SVD Computations on the Connection Machine CM-5/CM-5ESIAM Journal on Scientific Computing10.1137/S106482759528135618:5(1462-1478)Online publication date: 1-Sep-1997
  • (1997)Matrix Decomposition on the Star GraphIEEE Transactions on Parallel and Distributed Systems10.1109/71.6057678:8(803-812)Online publication date: 1-Aug-1997
  • (1994)Crpc Research Into Linear Algebra Software for High Performance ComputersInternational Journal of High Performance Computing Applications10.1177/1094342094008002038:2(99-118)Online publication date: 1-Jun-1994
  • Show More Cited By

View Options

View options

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media