[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Algorithmic Redistribution Methods for Block-Cyclic Decompositions

Published: 01 December 1999 Publication History

Abstract

This article presents various data redistribution methods for block-partitioned linear algebra algorithms operating on dense matrices that are distributed in a block-cyclic fashion. Because the algorithmic partitioning unit and the distribution blocking factor are most often chosen to be equal, severe alignment restrictions are induced on the operands, and optimal values with respect to performance are architecture dependent. The techniques presented in this paper redistribute data on the fly, so that the user's data distribution blocking factor becomes independent from the architecture dependent algorithmic partitioning. These techniques are applied to the matrix-matrix multiplication operation. A performance analysis along with experimental results shows that alignment restrictions can then be removed and that high performance can be maintained across platforms independently from the user's data distribution blocking factor.

References

[1]
M. Aboelaze N. Chrisochoides and E. Houstis, “The Parallelization of Level 2 and 3 BLAS Operations on Distributed-Memory Machines,” Technical Report CSD-TR-91-007, Purdue Univ., West Lafayette, Ind., 1991.
[2]
R. Agarwal F. Gustavson and M. Zubair, “A High Performance Matrix Multiplication Algorithm on a Distributed-Memory Parallel Computer, Using Overlapped Communication,” IBM J. Research and Development, vol. 38, no. 6, pp.673–681, 1994.
[3]
T. Agerwala J. Martin J. Mirza D. Sadler D. Dias and M. Snir, “SP2 System Architecture,” IBM Systems J., vol. 34, no. 2, pp. 153–184, 1995.
[4]
C. Ancourt F. Coelho F. Irigoin R. Keryell, “A linear Algebra Framework for Static HPF Code Distribution,” Technical Report A-278-CRI, CRI-Ecole des Mines, Fontainebleau, France, 1995. (Available at http://www.cri.ensmp.fr.)
[5]
E. Anderson Z. Bai C. Bischof J. Demmel J. Dongarra J. Du Croz A. Greenbaum S. Hammarling A. McKenney S. Ostrouchov and D. Sorensen, LAPACK Users' Guide. Philadelphia, Penn.: SIAM, 1995.
[6]
C. Ashcraft, “The Distributed Solution of Linear Systems Using the Torus-Wrap Data Mapping,” Technical Report ECA-TR-147, Boeing Computer Services, Seattle, Wash., 1990.
[7]
P. Bangalore, “The Data-Distribution-Independent Approach to Scalable Parallel Libraries,” master's thesis, Mississippi State Univ., 1995.
[8]
J. Bilmes K. Asanovic J. Demmel D. Lam and C. Chin, “Optimizing Matrix Multiply using PHiPAC: A Portable, High-Performance, ANSI C Coding Methodology,” Technical Report UT CS-96-326, LAPACK Working Note 111, Univ. Tennessee, 1996.
[9]
R. Bisseling and J. van der Vorst, “Parallel LU Decomposition on a Transputer Network,” Lecture Notes in Computer Sciences, G. van Zee and J. van der Vorst, eds., vol. 384, pp. 61–77, 1989.
[10]
R. Bisseling and J. van der Vorst, “Parallel Triangular System Solving on a Mesh Network of Transputers,” SIAM J. Scientific and Statistical Computing, vol. 12, pp. 787–799, 1991.
[11]
L. Blackford J. Choi A. Cleary E. D'Azevedo J. Demmel I. Dhillon J. Dongarra S. Hammarling G. Henry A. Petitet K. Stanley D. Walker and R.C. Whaley, ScaLAPACK Users' Guide. Philadelphia, Penn.: SIAM, 1997.
[12]
R. Brent and P. Strazdins, “Implementation of BLAS Level 3 and LINPACK Benchmark on the AP1000,” Fujitsu Scientific and Technical J., vol. 5, no. 1, pp. 61–70, 1993.
[13]
S. Chatterjee J. Gilbert F. Long R. Schreiber and S. Tseng, “Generating Local Adresses and Communication Sets for Data Parallel Programs,” J. Parallel and Distributed Computing, vol. 26, pp. 72–84, 1995.
[14]
J. Choi, “A New Parallel Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers,” Technical Report UT CS-97-369, LAPACK Working Note 129, Univ. Tennessee, 1997.
[15]
J. Choi J. Dongarra and D. Walker, “PUMMA: Parallel Universal Matrix Multiplication Algorithms on Distributed-Memory Concurrent Computers,” Concurrency: Practice and Experience, vol. 6, no. 7, pp. 543–570, 1994.
[16]
J. Choi J. Dongarra and D. Walker, “PB-BLAS: A Set of Parallel Block Basic Linear Algebra Subroutines” Concurrency: Practice and Experience, vol. 8, no. 7, pp. 517–535, 1996.
[17]
A. Chtchelkanova J. Gunnels G. Morrow J. Overfelt and R. van de Geijn, “Parallel Implementation of BLAS: General Techniques for Level 3 BLAS,” Concurrency: Practice and Experience, vol. 9, no. 9, pp. 837–857, 1997.
[18]
E. Chu and A. George, “QR Factorization of a Dense Matrix on a Hypercube Multiprocessor,” SIAM J. Scientific and Statistical Computing, vol. 11, pp. 990–1,028, 1990.
[19]
M. Dayde I. Duff and A. Petitet, “A Parallel Block Implementation of Level 3 BLAS for MIMD Vector Processors,” ACM Trans. Mathematical Software, vol. 20, no. 2, pp. 178–193, 1994.
[20]
F. Desprez J. Dongarra and A. Petitet C. Randriamaro Y. Robert, “Scheduling Block-Cyclic Array Redistribution,” IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 2, pp. 192–205 1998.
[21]
J. Dongarra R. van de Geijn and D. Walker, “Scalability Issues in the Design of a Library for Dense Linear Algebra,” J. Parallel and Distributed Computing, vol. 22, no. 3, pp. 523–537, 1994.
[22]
J. Dongarra and D. Walker, “Software Libraries for Linear Algebra Computations on High Performance Computers,” SIAM Review, vol. 37, no. 2, pp. 151–180, 1995.
[23]
J. Dongarra and R.C. Whaley, “A User's Guide to the BLACS v1.0,” Technical Report UT CS-95-281, LAPACK Working Note 94, Univ. Tennessee, 1995. (http://www.netlib.org/blacs/)
[24]
G. Fox M. Johnson G. Lyzenga S. Otto J. Salmon and D. Walker, Solving Problems on Concurrent Processors. Englewood Cliffs, N.J.: Prentice Hall, 1988.
[25]
G. Fox S. Otto and A. Hey, “Matrix Algorithms on a Hyper-cube I: Matrix Multiplication,” Parallel Computing, vol. 3, pp. 17–31, 1987.
[26]
G. Geist and C. Romine, “LU Factorization Algorithms on Distributed-Memory Multiprocessor Architectures,” SIAM J. Scientific and Statistical Computing, vol. 9, pp. 639–649, 1988.
[27]
M. Heath and C. Romine, “Parallel Solution Triangular Systems on Distributed-Memory Multiprocessors,” SIAM J. Scientific and Statistical Computing, vol. 9, pp. 558–588, 1988.
[28]
B. Hendrickson E. Jessup and C. Smith, “A Parallel Eigensolver for Dense Symmetric Matrices,” personal communication, 1996.
[29]
B. Hendrickson and D. Womble, “The Torus-Wrap Mapping for Dense Matrix Calculations on Massively Parallel Computers,” J. Scientific and Statistical Computing, vol. 15, no. 5, pp. 1,201–1,226, Sept. 1994.
[30]
G. Henry and R. van de Geijn, “Parallelizing the QR Algorithm for the Unsymmetric Algebraic Eigenvalue Problem: Myths and Reality,” Technical Report UT CS-94-244, LAPACK Working Note 79, Univ. Tennessee, 1994.
[31]
S. Huss-Lederman E. Jacobson A. Tsao and G. Zhang, “Matrix Multiplication on the Intel Touchstone DELTA,” Concurrency: Practice and Experience, vol. 6, no. 7, pp. 571–594, 1994.
[32]
B. Kågström P. Ling and C. van Loan, “GEMM-Based Level 3 BLAS: High-Performance Model Implementations and Performance Evaluation Benchmark,” Technical Report UMINF 95-18, Dept. Computing Science, Umeå Univ., 1995.
[33]
E. Kalns and L. Ni, “Processor Mapping Techniques towards Efficient Data Redistribution,” IEEE Trans. Parallel and Distributed Systems, vol. 12, no. 6, pp. 1,234–1,247, 1995.
[34]
K. Kennedy N. Nedeljkovic and A. Sethi, “A Linear-Time Algorithm for Computing the Memory Access Sequence in Data Parallel Programs,” Proc. Fifth ACM SIGPLAN, Symp. Principles and Practice of Parallel Programming, 1995.
[35]
C. Koebel D. Loveman R. Schreiber G. Steele and M. Zosel, The High Performance Fortran Handbook. Cambridge, Mass.: MIT Press, 1994.
[36]
V. Kumar A. Grama A. Gupta and G. Karypis, Introduction to Parallel Computing. Redwood City, Calif.: Benjamin/Cummings Publishing Company, Inc., 1994.
[37]
G. Li and T. Coleman, “A New Method for Solving Triangular Systems on Distributed-Memory Message-Passing Multiprocessor,” SIAM J. Scientific and Statistical Computing, vol. 10, no. 2, pp. 382–396, 1989.
[38]
W. Lichtenstein and S.L. Johnsson, “Block-Cyclic Dense Linear Algebra,” SIAM J. Scientific and Statistical Computing, vol. 14, no. 6, pp. 1,259–1,288 1993.
[39]
Y. Lim P. Bhat and V. Prasanna, “Efficient Algorithms for Block-Cyclic Redistribution of Arrays,” Technical Report CENG 97-10, Dept. Electrical Engineering-Systems, Univ. Southern California, Los Angeles, Calif., 1997.
[40]
K. Mathur S.L. Johnsson, “Multiplication of Matrices of Arbitrary Shapes on a Data Parallel Computer,” Parallel Computing, vol. 20, pp. 919–951, 1994.
[41]
A. Petitet, Algorithmic Redistribution Methods for Block Cyclic Decompositions, doctoral thesis, Univ. Tennessee, Knoxville, 1996.
[42]
L. Prylli and B. Tourancheau, “Fast Runtime Block Cyclic Data Redistribution on Multiprocessors,” J. Parallel and Distributed Computing, vol. 45, 1997.
[43]
P. Strazdins, “Matrix Factorization using Distributed Panels on the Fujitsu AP1000,” Proc. IEEE First Int'l Conf. Algorithms and Architectures for Parallel Processing (ICA3PP-95), 1995.
[44]
P. Strazdins and H. Koesmarno, “A High Performance Version of Parallel LAPACK: Preliminary Report,” Proc. Sixth Parallel Computing Workshop, Fujitsu Parallel Computing Center, 1996.
[45]
C. Stunkel D. Shea B. Abali M. Atkins C. Bender D. Grice P. Hochshild D. Joseph B. Nathanson R. Swetz R. Stucke M. Tsao and P. Varker, “The SP2 High-Performance Switch,” IBM Systems J., vol. 34, no. 2, pp. 185–204, 1995.
[46]
A. Thirumalai and J. Ramanujam, “Fast Address Sequence Generation for Data Parallel Programs Using Integer Lattices,” Languages and Compilers for Parallel Computing: Lecture Notes in Computer Science. P. Sadayappan et al., eds., Springer-Verlag, 1996.
[47]
R. van de Geijn and J. Watts, “SUMMA: Scalable Universal Matrix Multiplication Algorithm,” Concurrency: Practice and Experience, vol. 9, no. 4, pp. 255–274, 1997.
[48]
E. van de Velde, “Experiments with Multicomputer LU Decomposition,” Concurrency: Practice and Experience, vol. 2, pp. 1–26, 1990.
[49]
D. Walker and S. Otto, “Redistribution of Block-Cyclic Data Distributions Using MPI,” Concurrency: Practice and Experience, vol. 8, no. 9, pp. 707–728, 1996.
[50]
L. Wang J. Stichnoth S. Chatterjee, “Runtime Performance of Par-allel Array Assignment: An Empirical Study,” Proc. Supercomputing, 1996. (http://www.supercomp.org/sc96/proceedings/).
[51]
R. Whaley and J. Dongarra, “Automatically Tuned Linear Algebra Software,” Technical Report UT CS-97-366, LAPACK Working Note 131, Univ. Tennessee, 1997.

Cited By

View all
  • (2023)Efficient Parallelization of Dynamic Programming for Large ApplicationsPractice and Experience in Advanced Research Computing 2023: Computing for the Common Good10.1145/3569951.3593600(1-9)Online publication date: 23-Jul-2023
  • (2022)DeinsumProceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis10.5555/3571885.3571918(1-15)Online publication date: 13-Nov-2022
  • (2018)ASPEN: An Efficient Algorithm for Data Redistribution Between Producer and Consumer GridsEuro-Par 2018: Parallel Processing Workshops10.1007/978-3-030-10549-5_14(171-182)Online publication date: 27-Aug-2018
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image IEEE Transactions on Parallel and Distributed Systems
IEEE Transactions on Parallel and Distributed Systems  Volume 10, Issue 12
December 1999
152 pages
ISSN:1045-9219
Issue’s Table of Contents

Publisher

IEEE Press

Publication History

Published: 01 December 1999

Author Tags

  1. Algorithmic blocking
  2. block-cyclic decomposition.
  3. redistribution

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 13 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Efficient Parallelization of Dynamic Programming for Large ApplicationsPractice and Experience in Advanced Research Computing 2023: Computing for the Common Good10.1145/3569951.3593600(1-9)Online publication date: 23-Jul-2023
  • (2022)DeinsumProceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis10.5555/3571885.3571918(1-15)Online publication date: 13-Nov-2022
  • (2018)ASPEN: An Efficient Algorithm for Data Redistribution Between Producer and Consumer GridsEuro-Par 2018: Parallel Processing Workshops10.1007/978-3-030-10549-5_14(171-182)Online publication date: 27-Aug-2018
  • (2009)Two-dimensional matrix partitioning for parallel computing on heterogeneous processors based on their functional performance modelsProceedings of the 2009 international conference on Parallel processing10.5555/1884795.1884811(112-121)Online publication date: 25-Aug-2009
  • (2008)Scalability of Neville elimination using checkerboard partitioningInternational Journal of Computer Mathematics10.1080/0020716060116707885:3-4(309-317)Online publication date: 1-Mar-2008
  • (2007)On the complexity of the max-edge-coloring problem with its variantsProceedings of the First international conference on Combinatorics, Algorithms, Probabilistic and Experimental Methodologies10.5555/2399256.2399288(350-361)Online publication date: 7-Apr-2007
  • (2007)ISOProceedings of the 9th international conference on Parallel Computing Technologies10.5555/2392094.2392149(507-515)Online publication date: 3-Sep-2007
  • (2007)Data Partitioning with a Functional Performance Model of Heterogeneous ProcessorsInternational Journal of High Performance Computing Applications10.1177/109434200607486421:1(76-90)Online publication date: 1-Feb-2007
  • (2007)Scheduling contention-free irregular redistributions in parallelizing compilersThe Journal of Supercomputing10.1007/s11227-006-0024-140:3(229-247)Online publication date: 1-Jun-2007
  • (2006)Optimizing Communications of Dynamic Data Redistribution on Symmetrical Matrices in Parallelizing CompilersIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2006.16217:11(1226-1241)Online publication date: 1-Nov-2006
  • Show More Cited By

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media