
Time Stamp Algorithms for Runtime Parallelization of DOACROSS Loops with Dynamic Dependences

Published: 01 May 2001

Abstract

This paper presents a time stamp algorithm for runtime parallelization of general DOACROSS loops that have indirect access patterns. The algorithm follows the INSPECTOR/EXECUTOR scheme and exploits parallelism at a fine-grained memory reference level. It features a parallel inspector and improves upon previous algorithms of the same generality by exploiting parallelism among consecutive reads of the same memory element. Two variants of the algorithm are considered: One allows partially concurrent reads (PCR) and the other allows fully concurrent reads (FCR). Analyses of their time complexities derive a necessary condition with respect to the iteration workload for runtime parallelization. Experimental results for a Gaussian elimination loop, as well as an extensive set of synthetic loops on a 12-way SMP server, show that the time stamp algorithms outperform iteration-level parallelization techniques in most test cases and gain speedups over sequential execution for loops that have heavy iteration workloads. The PCR algorithm performs best because it makes a better trade-off between maximizing the parallelism and minimizing the analysis overhead. For loops with light or unknown iteration loads, an alternative speculative runtime parallelization technique is preferred.
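The inspector/executor idea described above can be sketched informally. The code below is an illustrative sketch, not the paper's actual algorithm (all names are hypothetical): an inspector walks the loop's memory accesses in iteration order and assigns each access a per-element time stamp. A write always takes a fresh stamp, while a run of consecutive reads between two writes shares a single stamp, so those reads may proceed concurrently. This is the source of the extra parallelism that the PCR and FCR variants exploit over earlier reference-level schemes.

```python
from collections import defaultdict

def inspect(accesses):
    """Assign a time stamp to each access (iteration, element, kind).

    Writes to an element are serialized by giving each a fresh stamp;
    consecutive reads between two writes share one stamp, marking them
    as safe to execute concurrently.
    Returns a list of (iteration, element, kind, stamp) tuples.
    """
    stamp = defaultdict(int)           # current stamp per element
    last_was_read = defaultdict(bool)  # was the previous access a read?
    schedule = []
    for it, elem, kind in accesses:    # accesses listed in iteration order
        if kind == 'read':
            if not last_was_read[elem]:
                stamp[elem] += 1       # first read after a write: new stamp
            last_was_read[elem] = True # later reads reuse this stamp
        else:                          # write: must follow all prior accesses
            stamp[elem] += 1
            last_was_read[elem] = False
        schedule.append((it, elem, kind, stamp[elem]))
    return schedule
```

An executor would then release each access only when its element's runtime counter reaches the assigned stamp; since the two middle reads below receive the same stamp, they need not wait on each other:

```python
schedule = inspect([(0, 'a', 'write'), (1, 'a', 'read'),
                    (2, 'a', 'read'),  (3, 'a', 'write')])
# stamps: write -> 1, read -> 2, read -> 2 (shared), write -> 3
```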





Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 12, Issue 5
May 2001
95 pages
ISSN:1045-9219

Publisher

IEEE Press


Author Tags

  1. compiler
  2. doacross loop
  3. dynamic dependence
  4. inspector-executor
  5. parallelizing compiler
  6. runtime support

Qualifiers

  • Research-article


Cited By

  • (2018) Unconventional Parallelization of Nondeterministic Applications, ACM SIGPLAN Notices, 53(2), pp. 432-447, doi:10.1145/3296957.3173181, 19 Mar. 2018
  • (2018) Unconventional Parallelization of Nondeterministic Applications, Proc. 23rd Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 432-447, doi:10.1145/3173162.3173181, 19 Mar. 2018
  • (2016) The UTFLA, The Journal of Supercomputing, 72(6), pp. 2221-2234, doi:10.1007/s11227-016-1725-8, 1 June 2016
  • (2015) HELIX-UP, Proc. 13th Ann. IEEE/ACM Int'l Symp. Code Generation and Optimization (CGO), pp. 235-245, doi:10.5555/2738600.2738630, 7 Feb. 2015
  • (2013) GPU-TLS, Proc. 13th IEEE/ACM Int'l Symp. Cluster, Cloud, and Grid Computing (CCGrid), pp. 120-127, doi:10.1109/CCGrid.2013.34, 13 May 2013
  • (2012) HELIX, Proc. 10th Int'l Symp. Code Generation and Optimization (CGO), pp. 84-93, doi:10.1145/2259016.2259028, 31 Mar. 2012
  • (2004) An Inspector-Executor Algorithm for Irregular Assignment Parallelization, Proc. Second Int'l Conf. Parallel and Distributed Processing and Applications (ISPA), pp. 4-15, doi:10.1007/978-3-540-30566-8_4, 13 Dec. 2004
  • (2003) A GSA-Based Compiler Infrastructure to Extract Parallelism from Complex Loops, Proc. 17th Ann. Int'l Conf. Supercomputing (ICS), pp. 193-204, doi:10.1145/782814.782842, 23 June 2003
  • (2002) Improving Locality in the Parallelization of Doacross Loops (Research Note), Proc. 8th Int'l Euro-Par Conf. Parallel Processing, pp. 275-279, doi:10.5555/646667.700009, 27 Aug. 2002
