Maximizing loop parallelism and improving data locality via loop fusion and distribution

Ken Kennedy¹ &
Kathryn S. McKinley²

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 768))

Included in the following conference series:

International Workshop on Languages and Compilers for Parallel Computing

Abstract

Loop fusion is a program transformation that merges multiple loops into one. It is effective for reducing the synchronization overhead of parallel loops and for improving data locality. This paper presents three results for fusion: (1) a new algorithm for fusing a collection of parallel and sequential loops, minimizing parallel loop synchronization while maximizing parallelism; (2) a proof that performing fusion to maximize data locality is NP-hard; and (3) two polynomial-time algorithms for improving data locality. These techniques also apply to loop distribution, which is shown to be essentially equivalent to loop fusion. Our approach is general enough to support other fusion heuristics. Preliminary experimental results validate our approach for improving performance by exploiting data locality and increasing the granularity of parallelism.
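As a rough illustration of the transformation the abstract describes, the sketch below fuses two loops that share an index range and an input array, then shows a case where naive fusion would change the result. It is not the algorithm from the paper. The function names and the arrays `a`, `b`, `c`, and `d` are hypothetical and chosen only for this example.

```python
# Illustrative sketch only: hand-written loop fusion, not the paper's algorithm.
# All function and variable names here are hypothetical.

def unfused(a, b):
    """Two separate loops over the same index range."""
    n = len(a)
    c = [0.0] * n
    d = [0.0] * n
    for i in range(n):          # loop 1: reads a and b, writes c
        c[i] = a[i] + b[i]
    for i in range(n):          # loop 2: reads a again and c, writes d
        d[i] = a[i] * c[i]
    return c, d


def fused(a, b):
    """The same computation as one loop: a[i] and c[i] are reused
    within a single iteration instead of across two passes."""
    n = len(a)
    c = [0.0] * n
    d = [0.0] * n
    for i in range(n):
        c[i] = a[i] + b[i]
        d[i] = a[i] * c[i]      # c[i] was written earlier in this iteration
    return c, d


def shifted_unfused(a):
    """Loop 2 reads c[i + 1], which loop 1 has already fully computed."""
    n = len(a)
    c = [0.0] * n
    d = [0.0] * (n - 1)
    for i in range(n):
        c[i] = a[i] * 2.0
    for i in range(n - 1):
        d[i] = c[i + 1] + 1.0
    return d


def shifted_naively_fused(a):
    """Naive fusion of the loops above: c[i + 1] is read before it is
    written, so the result differs from shifted_unfused."""
    n = len(a)
    c = [0.0] * n
    d = [0.0] * (n - 1)
    for i in range(n):
        c[i] = a[i] * 2.0
        if i < n - 1:
            d[i] = c[i + 1] + 1.0   # stale value: c[i + 1] is still 0.0
    return d
```

In the first pair, fusion is safe because each `d[i]` depends only on values produced in the same iteration. Reading the pair in the other direction, from `fused` back to `unfused`, is loop distribution. The second pair shows why a compiler has to check dependences between loops before fusing them.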

This research was supported by the Center for Research on Parallel Computation, an NSF Science and Technology Center. Use of the Sequent Symmetry S81 was provided under NSF Cooperative Agreement No. CDA-8619393.

Author information

Authors and Affiliations

¹ Rice University, Houston, TX 77251-1892 (Ken Kennedy)
² University of Massachusetts, Amherst, MA 01003-4610 (Kathryn S. McKinley)

Editor information

Utpal Banerjee, David Gelernter, Alex Nicolau, David Padua

Cite this paper

Kennedy, K., McKinley, K.S. (1994). Maximizing loop parallelism and improving data locality via loop fusion and distribution. In: Banerjee, U., Gelernter, D., Nicolau, A., Padua, D. (eds) Languages and Compilers for Parallel Computing. LCPC 1993. Lecture Notes in Computer Science, vol 768. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-57659-2_18

DOI: https://doi.org/10.1007/3-540-57659-2_18
Published: 31 May 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-57659-4
Online ISBN: 978-3-540-48308-3
eBook Packages: Springer Book Archive
