
Time Stamp Algorithms for Runtime Parallelization of DOACROSS Loops with Dynamic Dependences

Published: 01 May 2001

Abstract

This paper presents a time stamp algorithm for runtime parallelization of general DOACROSS loops that have indirect access patterns. The algorithm follows the INSPECTOR/EXECUTOR scheme and exploits parallelism at a fine-grained memory reference level. It features a parallel inspector and improves upon previous algorithms of the same generality by exploiting parallelism among consecutive reads of the same memory element. Two variants of the algorithm are considered: One allows partially concurrent reads (PCR) and the other allows fully concurrent reads (FCR). Analyses of their time complexities derive a necessary condition with respect to the iteration workload for runtime parallelization. Experimental results for a Gaussian elimination loop, as well as an extensive set of synthetic loops on a 12-way SMP server, show that the time stamp algorithms outperform iteration-level parallelization techniques in most test cases and gain speedups over sequential execution for loops that have heavy iteration workloads. The PCR algorithm performs best because it makes a better trade-off between maximizing the parallelism and minimizing the analysis overhead. For loops with light or unknown iteration loads, an alternative speculative runtime parallelization technique is preferred.
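The inspector/executor idea described above can be sketched informally. The code below is an illustrative sketch, not the paper's actual algorithm (all names are hypothetical): an inspector walks the loop's memory accesses in iteration order and assigns each access a per-element time stamp. A write always takes a fresh stamp, while a run of consecutive reads between two writes shares a single stamp, so those reads may proceed concurrently. This is the source of the extra parallelism that the PCR and FCR variants exploit over earlier reference-level schemes.

```python
from collections import defaultdict

def inspect(accesses):
    """Assign a time stamp to each access (iteration, element, kind).

    Writes to an element are serialized by giving each a fresh stamp;
    consecutive reads between two writes share one stamp, marking them
    as safe to execute concurrently.
    Returns a list of (iteration, element, kind, stamp) tuples.
    """
    stamp = defaultdict(int)           # current stamp per element
    last_was_read = defaultdict(bool)  # was the previous access a read?
    schedule = []
    for it, elem, kind in accesses:    # accesses listed in iteration order
        if kind == 'read':
            if not last_was_read[elem]:
                stamp[elem] += 1       # first read after a write: new stamp
            last_was_read[elem] = True # later reads reuse this stamp
        else:                          # write: must follow all prior accesses
            stamp[elem] += 1
            last_was_read[elem] = False
        schedule.append((it, elem, kind, stamp[elem]))
    return schedule
```

An executor would then release each access only when its element's runtime counter reaches the assigned stamp; since the two middle reads below receive the same stamp, they need not wait on each other:

```python
schedule = inspect([(0, 'a', 'write'), (1, 'a', 'read'),
                    (2, 'a', 'read'),  (3, 'a', 'write')])
# stamps: write -> 1, read -> 2, read -> 2 (shared), write -> 3
```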





Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 12, Issue 5
May 2001
95 pages
ISSN:1045-9219

Publisher

IEEE Press


Author Tags

  1. compiler
  2. doacross loop
  3. dynamic dependence
  4. inspector-executor
  5. parallelizing compiler
  6. runtime support

Qualifiers

  • Research-article


Cited By

  • (2018) Unconventional Parallelization of Nondeterministic Applications, ACM SIGPLAN Notices, 53(2), pp. 432-447, doi:10.1145/3296957.3173181, 19 Mar. 2018
  • (2018) Unconventional Parallelization of Nondeterministic Applications, Proc. 23rd Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 432-447, doi:10.1145/3173162.3173181, 19 Mar. 2018
  • (2016) The UTFLA, The Journal of Supercomputing, 72(6), pp. 2221-2234, doi:10.1007/s11227-016-1725-8, 1 June 2016
  • (2015) HELIX-UP, Proc. 13th Ann. IEEE/ACM Int'l Symp. Code Generation and Optimization (CGO), pp. 235-245, doi:10.5555/2738600.2738630, 7 Feb. 2015
  • (2013) GPU-TLS, Proc. 13th IEEE/ACM Int'l Symp. Cluster, Cloud, and Grid Computing (CCGrid), pp. 120-127, doi:10.1109/CCGrid.2013.34, 13 May 2013
  • (2012) HELIX, Proc. 10th Int'l Symp. Code Generation and Optimization (CGO), pp. 84-93, doi:10.1145/2259016.2259028, 31 Mar. 2012
  • (2004) An Inspector-Executor Algorithm for Irregular Assignment Parallelization, Proc. Second Int'l Conf. Parallel and Distributed Processing and Applications (ISPA), pp. 4-15, doi:10.1007/978-3-540-30566-8_4, 13 Dec. 2004
  • (2003) A GSA-Based Compiler Infrastructure to Extract Parallelism from Complex Loops, Proc. 17th Ann. Int'l Conf. Supercomputing (ICS), pp. 193-204, doi:10.1145/782814.782842, 23 June 2003
  • (2002) Improving Locality in the Parallelization of Doacross Loops (Research Note), Proc. 8th Int'l Euro-Par Conf. Parallel Processing, pp. 275-279, doi:10.5555/646667.700009, 27 Aug. 2002
