[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
article

A compiler technique for improving whole-program locality

Published: 01 January 2001 Publication History

Abstract

Exploiting spatial and temporal locality is essential for obtaining high performance on modern computers. Writing programs that exhibit high locality of reference is difficult and error-prone. Compiler researchers have developed loop transformations that allow the conversion of programs to exploit locality. Recently, transformations that change the memory layouts of multi-dimensional arrays---called data transformations---have been proposed. Unfortunately, both data and loop transformations have some important draw-backs. In this work, we present an integrated framework that uses loop and data transformations in concert to exploit the benefits of both approaches while minimizing the impact of their disadvantages. Our approach works inter-procedurally on acyclic call graphs, uses profile data to eliminate layout conflicts, and is unique in its capability of resolving conflicting layout requirements of different references to the same array in the same nest and in different nests for regular array-based applications.The optimization technique presented in this paper has been implemented in a source-to-source translator. We evaluate its performance using standard benchmark suites and several math libraries (complete programs) with large input sizes. Experimental results show that our approach reduces the overall execution times of original codes by 17.5% on the average. This reduction comes from three important characteristics of the technique, namely, resolving layout conflicts between references to the same array in a loop nest, determining a suitable order to propagate layout modifications across loop nests, and propagating layouts between different procedures in the program --- all in a unified framework.

References

[1]
S. P. Amarasinghe, J. M. Anderson, M. S. Lain, and C. W. Tseng The SUIF compiler for scalable parallel machines. In Proc. the Seventh SIAM Conference on Parallel Processing for Scientific Computing, February, 1995.]]
[2]
J. Anderson. Automatic Computation and Data Decomposition for Multiprocessors. Ph.D. dissertation, Stanford University, CA, March 1997. Also available as Technical Report CSL-TR-97-179, Computer Systems Laboratory, Stanford University.]]
[3]
J. Anderson, S. Amarasinghe, and M. Lain. Data and computation transformations for multiprocessors. In Proc. 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, July 1995.]]
[4]
A. Bik, P. Knijnenburg, and H. Wijshoff. Reshaping access patterns for generating sparse codes. In Proc. the 7th International Workshop on Languages and Compilers for Parallel Computing, pages 406-422, 1994.]]
[5]
A. Bik and H. Wijshoff. On a completion method for unimodular matrices. Technical Report 94-14, Dept. of Computer Science, Leiden University, 1994.]]
[6]
R. Chandra, D. Chen, R. Cox, D. Maydan, N. Nedeljkovic, and J. Anderson. Data-distribution support on distributed-shared memory multi-processors. In Proc. Programming Language Design and Implementation, Las Vegas, NV, 1997.]]
[7]
S. Chatterjee, J. Gilbert, R. Schreiber, and S. Teng. Optimal evaluation of array expressions on massively parallel machines. A CM Transactions on Programming Languages and Systems, 17(1):123-156, January 1995.]]
[8]
M. Cierniak and W. Li. Unifying data and control transformations for distributed shared memory machines. In Proe. SIGPLAN '95 Conference on Programming Language Design and Implementation, June 1995.]]
[9]
M. Cierniak and W. Li. Inter-procedural array re-mapping. In Proc. the International Conference on Parallel Architectures and Compilation Techniques, Nov 1997.]]
[10]
S. Coleman and K. McKinley. Tile size selection using cache organization and data layout. In Proc. SIGPLAN '95 Conference on Programming Language Design and Implementation, June 1995.]]
[11]
K. Cooper, M. W. Hall, and K. Kennedy. A methodology for procedure cloning. Computer Languages, 19(2), April 1993.]]
[12]
C. Ding and K. Kennedy. Inter-array data regrouping. In Proc. Languages and Compilers for Parallel Computing, San Diego, CA, August 4-6, 1999.]]
[13]
J. Garcia, E. Ayguade, and J. Labarta. A novel approach towards automatic data distribution. In Proc. Supercomputing'95, San Diego, December 1995.]]
[14]
J. Garcia, E. Ayguade, and J. Labarta. Dynamic data distribution with control flow analysis. In Proc. Supercomputing'96, Pittsburgh, November 1996.]]
[15]
M. Gupta and P. Banerjee. Demonstration of automatic data partitioning techniques for parallelizing compilers on multicomputers. IEEE Transactions on Parallel and Distributed Systems, 3(2):179-193, March 1992.]]
[16]
S. Ghosh, M. Martonosi, and S. Malik. Precise miss analysis for program transformations with caches of arbitrary associativity. In Proc. 8th International Symposium on Architectural Support for Programming Languages and Operating Systems, October, 1998.]]
[17]
M. W. Hall, S. P. Amarasinghe, B. R. Murphy, S. Liao, and M. S. Lain. Detecting coarse-grain parallelism using an inter-procedural parallelizing compiler. In Proceedings of Supercomputin9 '95, December 1995.]]
[18]
M. Kandemir, P. Banerjee, A. Choudhary, J. Ramanujam, and E. Ayguade. An integer linear programming approach for optimizing cache locality. In Proc. 1999 ACM International Conference on Supercomputing, Rhodes, Greece, June 1999, pp. 500-509.]]
[19]
M. Kandemir, A. Choudhary, J. Ramanujam, and M. Kandaswamy. Locality optimization algorithms for compilation of out-of-core codes. Journal of Information Science and Engineering, 14(1):107-138, March 1998.]]
[20]
M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee. A matrix-based approach to the global locality optimization problem. In Proc. International Conference on Parallel Architectures and Compilation Techniques, October 1998, Paris, France.]]
[21]
M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee. Improving locality using loop and data transformations in an integrated framework. In Proc. International Symposium on Microarchitecture, Dallas, TX, December, 1998.]]
[22]
M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee. A framework for inter-procedural locality optimization. In Proe. International Conference on Parallel Processing, University of Aizu, Japan, Sept 1999.]]
[23]
M. Kandemir, A. Choudhary, N. Shenoy, P. Banerjee, and J. Ramanujam. A hyperplane based approach for optimizing spatial locality in loop nests. In Proc. 1998 A CM International Conference on Supercomputing, July 1998.]]
[24]
M. Kandemir, J. Ramanujam, and A. Choudhary. A compiler algorithm for optimizing locality in loop nests. In Proc. 11th ACM International Conference on Supercomputing, pages 269-276, Vienna, Austria, July 1997.]]
[25]
W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and David Wonnacott. The Omega Library interface guide. Technical Report CS-TR-3445, CS Dept., University of Maryland, College Park, March 1995.]]
[26]
K. Kennedy and U. Kremer. Automatic data layout for High Performance Fortran. In Proe. Supercomputing'95, San Diego, CA, December 1995.]]
[27]
K. Kennedy and K. McKinley. Optimizing for parallelism and data locality. In Proc. the 1992 ACM International Conference on Supercomputing, ACM, New York.]]
[28]
I. Kodukula, N. Ahmed, and K. Pingali. Data-centric multi-level blocking. In Proc. SIGPLAN Conf. Programming Language Design and Implementation, June 1997.]]
[29]
M. Lam, E. Rothberg, and M. Wolf. The cache performance of blocked algorithms. In Proc. 4th Int. Conf. Architectural Support for Programming Languages and Operating Systems, April 1991.]]
[30]
S.-T. Leung and J. Zahorjan. Optimizing data locality by array restructuring. Technical Report TR 95-09-01, Dept. of Computer Science and Engineering, University of Washington, September 1995.]]
[31]
W. Li. Compiling for NUMA Parallel Machines. Ph.D. Thesis, Computer Science Department, Cornell University, Ithaca, NY, 1993.]]
[32]
J. Li and M. Chen. Compiling communication efficient programs for massively parallel machines. Journal of Parallel and Distributed Computing, 2(3):361-376, 1991.]]
[33]
V. Maslov. De-linearization: An efficient way to break multi-loop dependence equations. In Proc. the SIGPLAN'92 Conference on Programming Language Design and Implementation, San Francisco, CA, June 1992.]]
[34]
K. McKinley, S. Carr, and C.W. Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems, 1996.]]
[35]
E. Minieka. Optimization Algorithms for Networks and Graphs, Marcel Dekker, Inc., 1978.]]
[36]
S. S. Muchnick. Advanced Compiler Design Implementation. Morgan Kaufmann Publishers, San Francisco, California, 1997.]]
[37]
NWChem: a computational chemistry package for parallel computers, version 1.1, 1995. High Performance Computational Chemistry Group, Pacific Northwest Laboratory.]]
[38]
M. O'Boyle and P. Knijnenburg. Non-singular data transformations: Definition, validity, applications. In Proc. 6th Workshop on Compilers for Parallel Computers, pages 287-297, Aachen, Germany, 1996.]]
[39]
M. O'Boyle and P. Knijnenburg. Integrating loop and data transformations for global optimisation. In Proc. International Conference on Parallel Architectures and Compilation Techniques, October 1998, Paris, France.]]
[40]
D. Palermo and P. Banerjee. Automatic selection of dynamic data partitioning schemes for distributed-memory multicomputers. In Proc. 8th Workshop on Languages and Compilers for Parallel Computing, Columbus, OH, pages 392-406, 1995.]]
[41]
J. Ramanujam and P. Sadayappan. Compile-time techniques for data distribution in distributed memory machines. In IEEE Transactions on Parallel and Distributed Systems, 2(4):472-482, October 1991.]]
[42]
G. Rivera and C.-W. Tseng. Data transformations for eliminating conflict misses. In Proc. the 1998 ACM SIGPLAN Conference on Programming Language Design and Implementation, Montreal, Canada, June 1998.]]
[43]
S. Tandri and T. Abdelrahman. Automatic partitioning of data and computations on scalable shared memory multiprocessors. In Proc. 1997 International Conference on Parallel Processing, Bloomingdale, IL, pages 64-73, August 1997.]]
[44]
M. Wolf and M. Lain. A data locality optimizing algorithm. In Proc. ACM SIGPLAN 91 Conf. Programming Language Design and Implementation, pages 30-44, June 1991.]]
[45]
M. Wolf, D. Maydan, and D. Chen. Combining loop transformations considering caches and scheduling. In Proe. International Symposium on Mieroarehitecture, pages 274-286, Paris, France, December 1996.]]
[46]
M. Wolfe. High Performance Compilers for Parallel Computing, Addison-Wesley Publishing Company, 1996.]]

Cited By

View all
  • (2016)Array Size Computation under Uniform Overlapping and Irregular AccessesACM Transactions on Design Automation of Electronic Systems10.1145/281864321:2(1-35)Online publication date: 28-Jan-2016
  • (2014)A scalable and near-optimal representation of access schemes for memory managementACM Transactions on Architecture and Code Optimization10.1145/257967711:1(1-25)Online publication date: 1-Feb-2014
  • (2014)Development of Intra-signal In-Place MethodologyScalable and Near-Optimal Design Space Exploration for Embedded Systems10.1007/978-3-319-04942-7_3(37-62)Online publication date: 21-Feb-2014
  • Show More Cited By

Index Terms

  1. A compiler technique for improving whole-program locality

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM SIGPLAN Notices
    ACM SIGPLAN Notices  Volume 36, Issue 3
    March 2001
    303 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/373243
    Issue’s Table of Contents
    • cover image ACM Conferences
      POPL '01: Proceedings of the 28th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
      January 2001
      304 pages
      ISBN:1581133367
      DOI:10.1145/360204
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 January 2001
    Published in SIGPLAN Volume 36, Issue 3

    Check for updates

    Author Tags

    1. cache locality
    2. data reuse
    3. memory layouts
    4. optimizing compilers
    5. static optimizations

    Qualifiers

    • Article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)5
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 06 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2016)Array Size Computation under Uniform Overlapping and Irregular AccessesACM Transactions on Design Automation of Electronic Systems10.1145/281864321:2(1-35)Online publication date: 28-Jan-2016
    • (2014)A scalable and near-optimal representation of access schemes for memory managementACM Transactions on Architecture and Code Optimization10.1145/257967711:1(1-25)Online publication date: 1-Feb-2014
    • (2014)Development of Intra-signal In-Place MethodologyScalable and Near-Optimal Design Space Exploration for Embedded Systems10.1007/978-3-319-04942-7_3(37-62)Online publication date: 21-Feb-2014
    • (2013)Near-optimal and scalable intrasignal in-place optimization for non-overlapping and irregular access schemesACM Transactions on Design Automation of Electronic Systems10.1145/253438319:1(1-30)Online publication date: 20-Dec-2013
    • (2015)Memory Row Reuse Distance and its Role in Optimizing Application PerformanceACM SIGMETRICS Performance Evaluation Review10.1145/2796314.274586743:1(137-149)Online publication date: 15-Jun-2015
    • (2015)Memory Row Reuse Distance and its Role in Optimizing Application PerformanceProceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems10.1145/2745844.2745867(137-149)Online publication date: 15-Jun-2015
    • (2013)Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPUACM SIGPLAN Notices10.1145/2517327.244252348:8(57-68)Online publication date: 23-Feb-2013
    • (2013)Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPUProceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming10.1145/2442516.2442523(57-68)Online publication date: 23-Feb-2013
    • (2012)Improving last level cache locality by integrating loop and data transformationsProceedings of the International Conference on Computer-Aided Design10.1145/2429384.2429398(65-72)Online publication date: 5-Nov-2012
    • (2011)Studying inter-core data reuse in multicoresACM SIGMETRICS Performance Evaluation Review10.1145/2007116.200712039:1(25-36)Online publication date: 7-Jun-2011
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media