More Web Proxy on the site http://driver.im/

research-article

Algorithmic Redistribution Methods for Block-Cyclic Decompositions

Authors:

Antoine P. Petitet,

Jack J. DongarraAuthors Info & Claims

IEEE Transactions on Parallel and Distributed Systems, Volume 10, Issue 12

Pages 1201 - 1216

https://doi.org/10.1109/71.819944

Published: 01 December 1999 Publication History

Abstract

This article presents various data redistribution methods for block-partitioned linear algebra algorithms operating on dense matrices that are distributed in a block-cyclic fashion. Because the algorithmic partitioning unit and the distribution blocking factor are most often chosen to be equal, severe alignment restrictions are induced on the operands, and optimal values with respect to performance are architecture dependent. The techniques presented in this paper redistribute data on the fly, so that the user's data distribution blocking factor becomes independent from the architecture dependent algorithmic partitioning. These techniques are applied to the matrix-matrix multiplication operation. A performance analysis along with experimental results shows that alignment restrictions can then be removed and that high performance can be maintained across platforms independently from the user's data distribution blocking factor.

References

[1]

M. Aboelaze N. Chrisochoides and E. Houstis, “The Parallelization of Level 2 and 3 BLAS Operations on Distributed-Memory Machines,” Technical Report CSD-TR-91-007, Purdue Univ., West Lafayette, Ind., 1991.

[2]

R. Agarwal F. Gustavson and M. Zubair, “A High Performance Matrix Multiplication Algorithm on a Distributed-Memory Parallel Computer, Using Overlapped Communication,” IBM J. Research and Development, vol. 38, no. 6, pp.673–681, 1994.

Digital Library

[3]

T. Agerwala J. Martin J. Mirza D. Sadler D. Dias and M. Snir, “SP2 System Architecture,” IBM Systems J., vol. 34, no. 2, pp. 153–184, 1995.

Digital Library

[4]

C. Ancourt F. Coelho F. Irigoin R. Keryell, “A linear Algebra Framework for Static HPF Code Distribution,” Technical Report A-278-CRI, CRI-Ecole des Mines, Fontainebleau, France, 1995. (Available at http://www.cri.ensmp.fr.)

[5]

E. Anderson Z. Bai C. Bischof J. Demmel J. Dongarra J. Du Croz A. Greenbaum S. Hammarling A. McKenney S. Ostrouchov and D. Sorensen, LAPACK Users' Guide. Philadelphia, Penn.: SIAM, 1995.

Digital Library

[6]

C. Ashcraft, “The Distributed Solution of Linear Systems Using the Torus-Wrap Data Mapping,” Technical Report ECA-TR-147, Boeing Computer Services, Seattle, Wash., 1990.

[7]

P. Bangalore, “The Data-Distribution-Independent Approach to Scalable Parallel Libraries,” master's thesis, Mississippi State Univ., 1995.

[8]

J. Bilmes K. Asanovic J. Demmel D. Lam and C. Chin, “Optimizing Matrix Multiply using PHiPAC: A Portable, High-Performance, ANSI C Coding Methodology,” Technical Report UT CS-96-326, LAPACK Working Note 111, Univ. Tennessee, 1996.

Digital Library

[9]

R. Bisseling and J. van der Vorst, “Parallel LU Decomposition on a Transputer Network,” Lecture Notes in Computer Sciences, G. van Zee and J. van der Vorst, eds., vol. 384, pp. 61–77, 1989.

Digital Library

[10]

R. Bisseling and J. van der Vorst, “Parallel Triangular System Solving on a Mesh Network of Transputers,” SIAM J. Scientific and Statistical Computing, vol. 12, pp. 787–799, 1991.

Digital Library

[11]

L. Blackford J. Choi A. Cleary E. D'Azevedo J. Demmel I. Dhillon J. Dongarra S. Hammarling G. Henry A. Petitet K. Stanley D. Walker and R.C. Whaley, ScaLAPACK Users' Guide. Philadelphia, Penn.: SIAM, 1997.

Digital Library

[12]

R. Brent and P. Strazdins, “Implementation of BLAS Level 3 and LINPACK Benchmark on the AP1000,” Fujitsu Scientific and Technical J., vol. 5, no. 1, pp. 61–70, 1993.

[13]

S. Chatterjee J. Gilbert F. Long R. Schreiber and S. Tseng, “Generating Local Adresses and Communication Sets for Data Parallel Programs,” J. Parallel and Distributed Computing, vol. 26, pp. 72–84, 1995.

Digital Library

[14]

J. Choi, “A New Parallel Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers,” Technical Report UT CS-97-369, LAPACK Working Note 129, Univ. Tennessee, 1997.

Digital Library

[15]

J. Choi J. Dongarra and D. Walker, “PUMMA: Parallel Universal Matrix Multiplication Algorithms on Distributed-Memory Concurrent Computers,” Concurrency: Practice and Experience, vol. 6, no. 7, pp. 543–570, 1994.

[16]

J. Choi J. Dongarra and D. Walker, “PB-BLAS: A Set of Parallel Block Basic Linear Algebra Subroutines” Concurrency: Practice and Experience, vol. 8, no. 7, pp. 517–535, 1996.

[17]

A. Chtchelkanova J. Gunnels G. Morrow J. Overfelt and R. van de Geijn, “Parallel Implementation of BLAS: General Techniques for Level 3 BLAS,” Concurrency: Practice and Experience, vol. 9, no. 9, pp. 837–857, 1997.

[18]

E. Chu and A. George, “QR Factorization of a Dense Matrix on a Hypercube Multiprocessor,” SIAM J. Scientific and Statistical Computing, vol. 11, pp. 990–1,028, 1990.

Digital Library

[19]

M. Dayde I. Duff and A. Petitet, “A Parallel Block Implementation of Level 3 BLAS for MIMD Vector Processors,” ACM Trans. Mathematical Software, vol. 20, no. 2, pp. 178–193, 1994.

Digital Library

[20]

F. Desprez J. Dongarra and A. Petitet C. Randriamaro Y. Robert, “Scheduling Block-Cyclic Array Redistribution,” IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 2, pp. 192–205 1998.

Digital Library

[21]

J. Dongarra R. van de Geijn and D. Walker, “Scalability Issues in the Design of a Library for Dense Linear Algebra,” J. Parallel and Distributed Computing, vol. 22, no. 3, pp. 523–537, 1994.

Digital Library

[22]

J. Dongarra and D. Walker, “Software Libraries for Linear Algebra Computations on High Performance Computers,” SIAM Review, vol. 37, no. 2, pp. 151–180, 1995.

Digital Library

[23]

J. Dongarra and R.C. Whaley, “A User's Guide to the BLACS v1.0,” Technical Report UT CS-95-281, LAPACK Working Note 94, Univ. Tennessee, 1995. (http://www.netlib.org/blacs/)

Digital Library

[24]

G. Fox M. Johnson G. Lyzenga S. Otto J. Salmon and D. Walker, Solving Problems on Concurrent Processors. Englewood Cliffs, N.J.: Prentice Hall, 1988.

Digital Library

[25]

G. Fox S. Otto and A. Hey, “Matrix Algorithms on a Hyper-cube I: Matrix Multiplication,” Parallel Computing, vol. 3, pp. 17–31, 1987.

[26]

G. Geist and C. Romine, “LU Factorization Algorithms on Distributed-Memory Multiprocessor Architectures,” SIAM J. Scientific and Statistical Computing, vol. 9, pp. 639–649, 1988.

Digital Library

[27]

M. Heath and C. Romine, “Parallel Solution Triangular Systems on Distributed-Memory Multiprocessors,” SIAM J. Scientific and Statistical Computing, vol. 9, pp. 558–588, 1988.

Digital Library

[28]

B. Hendrickson E. Jessup and C. Smith, “A Parallel Eigensolver for Dense Symmetric Matrices,” personal communication, 1996.

[29]

B. Hendrickson and D. Womble, “The Torus-Wrap Mapping for Dense Matrix Calculations on Massively Parallel Computers,” J. Scientific and Statistical Computing, vol. 15, no. 5, pp. 1,201–1,226, Sept. 1994.

Digital Library

[30]

G. Henry and R. van de Geijn, “Parallelizing the QR Algorithm for the Unsymmetric Algebraic Eigenvalue Problem: Myths and Reality,” Technical Report UT CS-94-244, LAPACK Working Note 79, Univ. Tennessee, 1994.

Digital Library

[31]

S. Huss-Lederman E. Jacobson A. Tsao and G. Zhang, “Matrix Multiplication on the Intel Touchstone DELTA,” Concurrency: Practice and Experience, vol. 6, no. 7, pp. 571–594, 1994.

[32]

B. Kågström P. Ling and C. van Loan, “GEMM-Based Level 3 BLAS: High-Performance Model Implementations and Performance Evaluation Benchmark,” Technical Report UMINF 95-18, Dept. Computing Science, Umeå Univ., 1995.

Digital Library

[33]

E. Kalns and L. Ni, “Processor Mapping Techniques towards Efficient Data Redistribution,” IEEE Trans. Parallel and Distributed Systems, vol. 12, no. 6, pp. 1,234–1,247, 1995.

Digital Library

[34]

K. Kennedy N. Nedeljkovic and A. Sethi, “A Linear-Time Algorithm for Computing the Memory Access Sequence in Data Parallel Programs,” Proc. Fifth ACM SIGPLAN, Symp. Principles and Practice of Parallel Programming, 1995.

Digital Library

[35]

C. Koebel D. Loveman R. Schreiber G. Steele and M. Zosel, The High Performance Fortran Handbook. Cambridge, Mass.: MIT Press, 1994.

Digital Library

[36]

V. Kumar A. Grama A. Gupta and G. Karypis, Introduction to Parallel Computing. Redwood City, Calif.: Benjamin/Cummings Publishing Company, Inc., 1994.

Digital Library

[37]

G. Li and T. Coleman, “A New Method for Solving Triangular Systems on Distributed-Memory Message-Passing Multiprocessor,” SIAM J. Scientific and Statistical Computing, vol. 10, no. 2, pp. 382–396, 1989.

Digital Library

[38]

W. Lichtenstein and S.L. Johnsson, “Block-Cyclic Dense Linear Algebra,” SIAM J. Scientific and Statistical Computing, vol. 14, no. 6, pp. 1,259–1,288 1993.

Digital Library

[39]

Y. Lim P. Bhat and V. Prasanna, “Efficient Algorithms for Block-Cyclic Redistribution of Arrays,” Technical Report CENG 97-10, Dept. Electrical Engineering-Systems, Univ. Southern California, Los Angeles, Calif., 1997.

[40]

K. Mathur S.L. Johnsson, “Multiplication of Matrices of Arbitrary Shapes on a Data Parallel Computer,” Parallel Computing, vol. 20, pp. 919–951, 1994.

Digital Library

[41]

A. Petitet, Algorithmic Redistribution Methods for Block Cyclic Decompositions, doctoral thesis, Univ. Tennessee, Knoxville, 1996.

Digital Library

[42]

L. Prylli and B. Tourancheau, “Fast Runtime Block Cyclic Data Redistribution on Multiprocessors,” J. Parallel and Distributed Computing, vol. 45, 1997.

Digital Library

[43]

P. Strazdins, “Matrix Factorization using Distributed Panels on the Fujitsu AP1000,” Proc. IEEE First Int'l Conf. Algorithms and Architectures for Parallel Processing (ICA3PP-95), 1995.

[44]

P. Strazdins and H. Koesmarno, “A High Performance Version of Parallel LAPACK: Preliminary Report,” Proc. Sixth Parallel Computing Workshop, Fujitsu Parallel Computing Center, 1996.

[45]

C. Stunkel D. Shea B. Abali M. Atkins C. Bender D. Grice P. Hochshild D. Joseph B. Nathanson R. Swetz R. Stucke M. Tsao and P. Varker, “The SP2 High-Performance Switch,” IBM Systems J., vol. 34, no. 2, pp. 185–204, 1995.

Digital Library

[46]

A. Thirumalai and J. Ramanujam, “Fast Address Sequence Generation for Data Parallel Programs Using Integer Lattices,” Languages and Compilers for Parallel Computing: Lecture Notes in Computer Science. P. Sadayappan et al., eds., Springer-Verlag, 1996.

Digital Library

[47]

R. van de Geijn and J. Watts, “SUMMA: Scalable Universal Matrix Multiplication Algorithm,” Concurrency: Practice and Experience, vol. 9, no. 4, pp. 255–274, 1997.

Digital Library

[48]

E. van de Velde, “Experiments with Multicomputer LU Decomposition,” Concurrency: Practice and Experience, vol. 2, pp. 1–26, 1990.

Digital Library

[49]

D. Walker and S. Otto, “Redistribution of Block-Cyclic Data Distributions Using MPI,” Concurrency: Practice and Experience, vol. 8, no. 9, pp. 707–728, 1996.

[50]

L. Wang J. Stichnoth S. Chatterjee, “Runtime Performance of Par-allel Array Assignment: An Empirical Study,” Proc. Supercomputing, 1996. (http://www.supercomp.org/sc96/proceedings/).

Digital Library

[51]

R. Whaley and J. Dongarra, “Automatically Tuned Linear Algebra Software,” Technical Report UT CS-97-366, LAPACK Working Note 131, Univ. Tennessee, 1997.

Digital Library

Cited By

Ferguson MFontes LNewhall T(2023)Efficient Parallelization of Dynamic Programming for Large ApplicationsPractice and Experience in Advanced Research Computing 2023: Computing for the Common Good10.1145/3569951.3593600(1-9)Online publication date: 23-Jul-2023
https://dl.acm.org/doi/10.1145/3569951.3593600
Ziogas AKwasniewski GBen-Nun TSchneider THoefler TWolf FShende SCulhane CAlam SJagode H(2022)DeinsumProceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis10.5555/3571885.3571918(1-15)Online publication date: 13-Nov-2022
https://dl.acm.org/doi/10.5555/3571885.3571918
Foyer CTate AMcIntosh-Smith S(2018)ASPEN: An Efficient Algorithm for Data Redistribution Between Producer and Consumer GridsEuro-Par 2018: Parallel Processing Workshops10.1007/978-3-030-10549-5_14(171-182)Online publication date: 27-Aug-2018
https://dl.acm.org/doi/10.1007/978-3-030-10549-5_14
Show More Cited By

Index Terms

Algorithmic Redistribution Methods for Block-Cyclic Decompositions
1. Computing methodologies
  1. Symbolic and algebraic manipulation
    1. Symbolic and algebraic algorithms
      1. Linear algebra algorithms
2. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
      1. Computations on matrices

Recommendations

Scheduling Block-Cyclic Array Redistribution

This article is devoted to the run-time redistribution of one-dimensional arrays that are distributed in a block-cyclic fashion over a processor grid. While previous studies have concentrated on efficiently generating the communication messages to be ...
A Generalized Basic-Cycle Calculation Method for Efficient Array Redistribution

In many scientific applications, dynamic array redistribution is usually required to enhance the performance of an algorithm. In this paper, we present a generalized basic-cycle calculation (GBCC) method to efficiently perform a BLOCK-CYCLIC(s) over P ...
Algorithmic redistribution methods for block cyclic decompositions

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image IEEE Transactions on Parallel and Distributed Systems

IEEE Transactions on Parallel and Distributed Systems Volume 10, Issue 12

December 1999

152 pages

ISSN:1045-9219

Editor:
John A. Stankovic
Univ. of Virginia, Charlottesville

Issue’s Table of Contents

Copyright © Copyright © 1999 IEEE. All Rights Reserved.

Publisher

IEEE Press

Publication History

Published: 01 December 1999

Author Tags

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

18
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 13 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Ferguson MFontes LNewhall T(2023)Efficient Parallelization of Dynamic Programming for Large ApplicationsPractice and Experience in Advanced Research Computing 2023: Computing for the Common Good10.1145/3569951.3593600(1-9)Online publication date: 23-Jul-2023
https://dl.acm.org/doi/10.1145/3569951.3593600
Ziogas AKwasniewski GBen-Nun TSchneider THoefler TWolf FShende SCulhane CAlam SJagode H(2022)DeinsumProceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis10.5555/3571885.3571918(1-15)Online publication date: 13-Nov-2022
https://dl.acm.org/doi/10.5555/3571885.3571918
Foyer CTate AMcIntosh-Smith S(2018)ASPEN: An Efficient Algorithm for Data Redistribution Between Producer and Consumer GridsEuro-Par 2018: Parallel Processing Workshops10.1007/978-3-030-10549-5_14(171-182)Online publication date: 27-Aug-2018
https://dl.acm.org/doi/10.1007/978-3-030-10549-5_14
Lastovetsky AReddy R(2009)Two-dimensional matrix partitioning for parallel computing on heterogeneous processors based on their functional performance modelsProceedings of the 2009 international conference on Parallel processing10.5555/1884795.1884811(112-121)Online publication date: 25-Aug-2009
https://dl.acm.org/doi/10.5555/1884795.1884811
Alonso PCortina RDíaz IRanilla J(2008)Scalability of Neville elimination using checkerboard partitioningInternational Journal of Computer Mathematics10.1080/0020716060116707885:3-4(309-317)Online publication date: 1-Mar-2008
https://dl.acm.org/doi/10.1080/00207160601167078
Yu C(2007)On the complexity of the max-edge-coloring problem with its variantsProceedings of the First international conference on Combinatorics, Algorithms, Probabilistic and Experimental Methodologies10.5555/2399256.2399288(350-361)Online publication date: 7-Apr-2007
https://dl.acm.org/doi/10.5555/2399256.2399288
Chen SHsu C(2007)ISOProceedings of the 9th international conference on Parallel Computing Technologies10.5555/2392094.2392149(507-515)Online publication date: 3-Sep-2007
https://dl.acm.org/doi/10.5555/2392094.2392149
Lastovetsky AReddy R(2007)Data Partitioning with a Functional Performance Model of Heterogeneous ProcessorsInternational Journal of High Performance Computing Applications10.1177/109434200607486421:1(76-90)Online publication date: 1-Feb-2007
https://dl.acm.org/doi/10.1177/1094342006074864
Hsu CChen SLan C(2007)Scheduling contention-free irregular redistributions in parallelizing compilersThe Journal of Supercomputing10.1007/s11227-006-0024-140:3(229-247)Online publication date: 1-Jun-2007
https://dl.acm.org/doi/10.1007/s11227-006-0024-1
Hsu CChen MYang CLi K(2006)Optimizing Communications of Dynamic Data Redistribution on Symmetrical Matrices in Parallelizing CompilersIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2006.16217:11(1226-1241)Online publication date: 1-Nov-2006
https://dl.acm.org/doi/10.1109/TPDS.2006.162
Show More Cited By

View Options

View options

Media

Figures

Other

Tables

View Issue’s Table of Contents