More Web Proxy on the site http://driver.im/

article

A compiler technique for improving whole-program locality

Author:

Mahmut Taylan KandemirAuthors Info & Claims

ACM SIGPLAN Notices, Volume 36, Issue 3

Pages 179 - 192

https://doi.org/10.1145/373243.360219

Published: 01 January 2001 Publication History

Abstract

Exploiting spatial and temporal locality is essential for obtaining high performance on modern computers. Writing programs that exhibit high locality of reference is difficult and error-prone. Compiler researchers have developed loop transformations that allow the conversion of programs to exploit locality. Recently, transformations that change the memory layouts of multi-dimensional arrays---called data transformations---have been proposed. Unfortunately, both data and loop transformations have some important draw-backs. In this work, we present an integrated framework that uses loop and data transformations in concert to exploit the benefits of both approaches while minimizing the impact of their disadvantages. Our approach works inter-procedurally on acyclic call graphs, uses profile data to eliminate layout conflicts, and is unique in its capability of resolving conflicting layout requirements of different references to the same array in the same nest and in different nests for regular array-based applications.The optimization technique presented in this paper has been implemented in a source-to-source translator. We evaluate its performance using standard benchmark suites and several math libraries (complete programs) with large input sizes. Experimental results show that our approach reduces the overall execution times of original codes by 17.5% on the average. This reduction comes from three important characteristics of the technique, namely, resolving layout conflicts between references to the same array in a loop nest, determining a suitable order to propagate layout modifications across loop nests, and propagating layouts between different procedures in the program --- all in a unified framework.

References

[1]

S. P. Amarasinghe, J. M. Anderson, M. S. Lain, and C. W. Tseng The SUIF compiler for scalable parallel machines. In Proc. the Seventh SIAM Conference on Parallel Processing for Scientific Computing, February, 1995.]]

[2]

J. Anderson. Automatic Computation and Data Decomposition for Multiprocessors. Ph.D. dissertation, Stanford University, CA, March 1997. Also available as Technical Report CSL-TR-97-179, Computer Systems Laboratory, Stanford University.]]

Digital Library

[3]

J. Anderson, S. Amarasinghe, and M. Lain. Data and computation transformations for multiprocessors. In Proc. 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, July 1995.]]

Digital Library

[4]

A. Bik, P. Knijnenburg, and H. Wijshoff. Reshaping access patterns for generating sparse codes. In Proc. the 7th International Workshop on Languages and Compilers for Parallel Computing, pages 406-422, 1994.]]

Digital Library

[5]

A. Bik and H. Wijshoff. On a completion method for unimodular matrices. Technical Report 94-14, Dept. of Computer Science, Leiden University, 1994.]]

[6]

R. Chandra, D. Chen, R. Cox, D. Maydan, N. Nedeljkovic, and J. Anderson. Data-distribution support on distributed-shared memory multi-processors. In Proc. Programming Language Design and Implementation, Las Vegas, NV, 1997.]]

Digital Library

[7]

S. Chatterjee, J. Gilbert, R. Schreiber, and S. Teng. Optimal evaluation of array expressions on massively parallel machines. A CM Transactions on Programming Languages and Systems, 17(1):123-156, January 1995.]]

Digital Library

[8]

M. Cierniak and W. Li. Unifying data and control transformations for distributed shared memory machines. In Proe. SIGPLAN '95 Conference on Programming Language Design and Implementation, June 1995.]]

Digital Library

[9]

M. Cierniak and W. Li. Inter-procedural array re-mapping. In Proc. the International Conference on Parallel Architectures and Compilation Techniques, Nov 1997.]]

Digital Library

[10]

S. Coleman and K. McKinley. Tile size selection using cache organization and data layout. In Proc. SIGPLAN '95 Conference on Programming Language Design and Implementation, June 1995.]]

Digital Library

[11]

K. Cooper, M. W. Hall, and K. Kennedy. A methodology for procedure cloning. Computer Languages, 19(2), April 1993.]]

[12]

C. Ding and K. Kennedy. Inter-array data regrouping. In Proc. Languages and Compilers for Parallel Computing, San Diego, CA, August 4-6, 1999.]]

Digital Library

[13]

J. Garcia, E. Ayguade, and J. Labarta. A novel approach towards automatic data distribution. In Proc. Supercomputing'95, San Diego, December 1995.]]

Digital Library

[14]

J. Garcia, E. Ayguade, and J. Labarta. Dynamic data distribution with control flow analysis. In Proc. Supercomputing'96, Pittsburgh, November 1996.]]

Digital Library

[15]

M. Gupta and P. Banerjee. Demonstration of automatic data partitioning techniques for parallelizing compilers on multicomputers. IEEE Transactions on Parallel and Distributed Systems, 3(2):179-193, March 1992.]]

Digital Library

[16]

S. Ghosh, M. Martonosi, and S. Malik. Precise miss analysis for program transformations with caches of arbitrary associativity. In Proc. 8th International Symposium on Architectural Support for Programming Languages and Operating Systems, October, 1998.]]

Digital Library

[17]

M. W. Hall, S. P. Amarasinghe, B. R. Murphy, S. Liao, and M. S. Lain. Detecting coarse-grain parallelism using an inter-procedural parallelizing compiler. In Proceedings of Supercomputin9 '95, December 1995.]]

Digital Library

[18]

M. Kandemir, P. Banerjee, A. Choudhary, J. Ramanujam, and E. Ayguade. An integer linear programming approach for optimizing cache locality. In Proc. 1999 ACM International Conference on Supercomputing, Rhodes, Greece, June 1999, pp. 500-509.]]

Digital Library

[19]

M. Kandemir, A. Choudhary, J. Ramanujam, and M. Kandaswamy. Locality optimization algorithms for compilation of out-of-core codes. Journal of Information Science and Engineering, 14(1):107-138, March 1998.]]

[20]

M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee. A matrix-based approach to the global locality optimization problem. In Proc. International Conference on Parallel Architectures and Compilation Techniques, October 1998, Paris, France.]]

Digital Library

[21]

M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee. Improving locality using loop and data transformations in an integrated framework. In Proc. International Symposium on Microarchitecture, Dallas, TX, December, 1998.]]

Digital Library

[22]

M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee. A framework for inter-procedural locality optimization. In Proe. International Conference on Parallel Processing, University of Aizu, Japan, Sept 1999.]]

Digital Library

[23]

M. Kandemir, A. Choudhary, N. Shenoy, P. Banerjee, and J. Ramanujam. A hyperplane based approach for optimizing spatial locality in loop nests. In Proc. 1998 A CM International Conference on Supercomputing, July 1998.]]

Digital Library

[24]

M. Kandemir, J. Ramanujam, and A. Choudhary. A compiler algorithm for optimizing locality in loop nests. In Proc. 11th ACM International Conference on Supercomputing, pages 269-276, Vienna, Austria, July 1997.]]

Digital Library

[25]

W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and David Wonnacott. The Omega Library interface guide. Technical Report CS-TR-3445, CS Dept., University of Maryland, College Park, March 1995.]]

Digital Library

[26]

K. Kennedy and U. Kremer. Automatic data layout for High Performance Fortran. In Proe. Supercomputing'95, San Diego, CA, December 1995.]]

Digital Library

[27]

K. Kennedy and K. McKinley. Optimizing for parallelism and data locality. In Proc. the 1992 ACM International Conference on Supercomputing, ACM, New York.]]

Digital Library

[28]

I. Kodukula, N. Ahmed, and K. Pingali. Data-centric multi-level blocking. In Proc. SIGPLAN Conf. Programming Language Design and Implementation, June 1997.]]

Digital Library

[29]

M. Lam, E. Rothberg, and M. Wolf. The cache performance of blocked algorithms. In Proc. 4th Int. Conf. Architectural Support for Programming Languages and Operating Systems, April 1991.]]

Digital Library

[30]

S.-T. Leung and J. Zahorjan. Optimizing data locality by array restructuring. Technical Report TR 95-09-01, Dept. of Computer Science and Engineering, University of Washington, September 1995.]]

[31]

W. Li. Compiling for NUMA Parallel Machines. Ph.D. Thesis, Computer Science Department, Cornell University, Ithaca, NY, 1993.]]

Digital Library

[32]

J. Li and M. Chen. Compiling communication efficient programs for massively parallel machines. Journal of Parallel and Distributed Computing, 2(3):361-376, 1991.]]

Digital Library

[33]

V. Maslov. De-linearization: An efficient way to break multi-loop dependence equations. In Proc. the SIGPLAN'92 Conference on Programming Language Design and Implementation, San Francisco, CA, June 1992.]]

Digital Library

[34]

K. McKinley, S. Carr, and C.W. Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems, 1996.]]

Digital Library

[35]

E. Minieka. Optimization Algorithms for Networks and Graphs, Marcel Dekker, Inc., 1978.]]

[36]

S. S. Muchnick. Advanced Compiler Design Implementation. Morgan Kaufmann Publishers, San Francisco, California, 1997.]]

Digital Library

[37]

NWChem: a computational chemistry package for parallel computers, version 1.1, 1995. High Performance Computational Chemistry Group, Pacific Northwest Laboratory.]]

[38]

M. O'Boyle and P. Knijnenburg. Non-singular data transformations: Definition, validity, applications. In Proc. 6th Workshop on Compilers for Parallel Computers, pages 287-297, Aachen, Germany, 1996.]]

[39]

M. O'Boyle and P. Knijnenburg. Integrating loop and data transformations for global optimisation. In Proc. International Conference on Parallel Architectures and Compilation Techniques, October 1998, Paris, France.]]

Digital Library

[40]

D. Palermo and P. Banerjee. Automatic selection of dynamic data partitioning schemes for distributed-memory multicomputers. In Proc. 8th Workshop on Languages and Compilers for Parallel Computing, Columbus, OH, pages 392-406, 1995.]]

Digital Library

[41]

J. Ramanujam and P. Sadayappan. Compile-time techniques for data distribution in distributed memory machines. In IEEE Transactions on Parallel and Distributed Systems, 2(4):472-482, October 1991.]]

Digital Library

[42]

G. Rivera and C.-W. Tseng. Data transformations for eliminating conflict misses. In Proc. the 1998 ACM SIGPLAN Conference on Programming Language Design and Implementation, Montreal, Canada, June 1998.]]

Digital Library

[43]

S. Tandri and T. Abdelrahman. Automatic partitioning of data and computations on scalable shared memory multiprocessors. In Proc. 1997 International Conference on Parallel Processing, Bloomingdale, IL, pages 64-73, August 1997.]]

Digital Library

[44]

M. Wolf and M. Lain. A data locality optimizing algorithm. In Proc. ACM SIGPLAN 91 Conf. Programming Language Design and Implementation, pages 30-44, June 1991.]]

Digital Library

[45]

M. Wolf, D. Maydan, and D. Chen. Combining loop transformations considering caches and scheduling. In Proe. International Symposium on Mieroarehitecture, pages 274-286, Paris, France, December 1996.]]

Digital Library

[46]

M. Wolfe. High Performance Compilers for Parallel Computing, Addison-Wesley Publishing Company, 1996.]]

Digital Library

Cited By

Kritikakou ACatthoor FKelefouras VGoutis C(2016)Array Size Computation under Uniform Overlapping and Irregular AccessesACM Transactions on Design Automation of Electronic Systems10.1145/281864321:2(1-35)Online publication date: 28-Jan-2016
https://dl.acm.org/doi/10.1145/2818643
Kritikakou ACatthoor FKelefouras VGoutis C(2014)A scalable and near-optimal representation of access schemes for memory managementACM Transactions on Architecture and Code Optimization10.1145/257967711:1(1-25)Online publication date: 1-Feb-2014
https://dl.acm.org/doi/10.1145/2579677
Kritikakou ACatthoor FGoutis CKritikakou ACatthoor FGoutis C(2014)Development of Intra-signal In-Place MethodologyScalable and Near-Optimal Design Space Exploration for Embedded Systems10.1007/978-3-319-04942-7_3(37-62)Online publication date: 21-Feb-2014
https://doi.org/10.1007/978-3-319-04942-7_3
Show More Cited By

Index Terms

A compiler technique for improving whole-program locality
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

A compiler technique for improving whole-program locality
POPL '01: Proceedings of the 28th ACM SIGPLAN-SIGACT symposium on Principles of programming languages

Exploiting spatial and temporal locality is essential for obtaining high performance on modern computers. Writing programs that exhibit high locality of reference is difficult and error-prone. Compiler researchers have developed loop transformations that ...
Improving whole-program locality using intra-procedural and inter-procedural transformations

Exploiting spatial and temporal locality is essential for obtaining high performance on modern computers. Writing programs that exhibit high locality of reference is difficult and error-prone. Compiler researchers have developed loop transformations that ...
A Layout-Conscious Iteration Space Transformation Technique

Exploiting locality of references has become extremely important in realizing the potential performance of modern machines with deep memory hierarchies. The data access patterns of programs and the memory layouts of the accessed data sets play a ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM SIGPLAN Notices

ACM SIGPLAN Notices Volume 36, Issue 3

March 2001

303 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/373243

Editors:
Cindy Norris
Appalachian State Univ., Boone, NC
,
James B. Fenwick
Appalachian State Univ., Boone, NC

Issue’s Table of Contents

POPL '01: Proceedings of the 28th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
January 2001
304 pages
ISBN:1581133367
DOI:10.1145/360204
Chairmen:
Chris Hankin
Imperial College, London, U. K.
,
Dave Schmidt
Kansas State Univ., Manhattan

Copyright © 2001 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 January 2001

Published in SIGPLAN Volume 36, Issue 3

Check for updates

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

25
Total Citations
View Citations
651
Total Downloads

Downloads (Last 12 months)5
Downloads (Last 6 weeks)1

Reflects downloads up to 06 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Kritikakou ACatthoor FKelefouras VGoutis C(2016)Array Size Computation under Uniform Overlapping and Irregular AccessesACM Transactions on Design Automation of Electronic Systems10.1145/281864321:2(1-35)Online publication date: 28-Jan-2016
https://dl.acm.org/doi/10.1145/2818643
Kritikakou ACatthoor FKelefouras VGoutis C(2014)A scalable and near-optimal representation of access schemes for memory managementACM Transactions on Architecture and Code Optimization10.1145/257967711:1(1-25)Online publication date: 1-Feb-2014
https://dl.acm.org/doi/10.1145/2579677
Kritikakou ACatthoor FGoutis CKritikakou ACatthoor FGoutis C(2014)Development of Intra-signal In-Place MethodologyScalable and Near-Optimal Design Space Exploration for Embedded Systems10.1007/978-3-319-04942-7_3(37-62)Online publication date: 21-Feb-2014
https://doi.org/10.1007/978-3-319-04942-7_3
Kritikakou ACatthoor FKelefouras VGoutis C(2013)Near-optimal and scalable intrasignal in-place optimization for non-overlapping and irregular access schemesACM Transactions on Design Automation of Electronic Systems10.1145/253438319:1(1-30)Online publication date: 20-Dec-2013
https://dl.acm.org/doi/10.1145/2534383
Kandemir MZhao HTang XKarakoy M(2015)Memory Row Reuse Distance and its Role in Optimizing Application PerformanceACM SIGMETRICS Performance Evaluation Review10.1145/2796314.274586743:1(137-149)Online publication date: 15-Jun-2015
https://dl.acm.org/doi/10.1145/2796314.2745867
Kandemir MZhao HTang XKarakoy MLin BXu JSengupta SShah D(2015)Memory Row Reuse Distance and its Role in Optimizing Application PerformanceProceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems10.1145/2745844.2745867(137-149)Online publication date: 15-Jun-2015
https://dl.acm.org/doi/10.1145/2745844.2745867
Wu BZhao ZZhang EJiang YShen X(2013)Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPUACM SIGPLAN Notices10.1145/2517327.244252348:8(57-68)Online publication date: 23-Feb-2013
https://dl.acm.org/doi/10.1145/2517327.2442523
Wu BZhao ZZhang EJiang YShen XNicolau AShen XAmarasinghe SVuduc R(2013)Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPUProceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming10.1145/2442516.2442523(57-68)Online publication date: 23-Feb-2013
https://dl.acm.org/doi/10.1145/2442516.2442523
Ding WKandemir MHu A(2012)Improving last level cache locality by integrating loop and data transformationsProceedings of the International Conference on Computer-Aided Design10.1145/2429384.2429398(65-72)Online publication date: 5-Nov-2012
https://dl.acm.org/doi/10.1145/2429384.2429398
Zhang YKandemir MYemliha T(2011)Studying inter-core data reuse in multicoresACM SIGMETRICS Performance Evaluation Review10.1145/2007116.200712039:1(25-36)Online publication date: 7-Jun-2011
https://dl.acm.org/doi/10.1145/2007116.2007120
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents