[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
article
Open access

Improving data locality with loop transformations

Published: 01 July 1996 Publication History

Abstract

In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In the this article, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments illustrate that for kernels our model and algorithm can select and achieve the best loop structure for a nest. For over 30 complete applications, we executed the original and transformed versions and simulated cache hit rates. We collected statistics about the inherent characteristics of these programs and our ability to improve their data locality. To our knowledge, these studies are the first of such breadth and depth. We found performance improvements were difficult to achieve bacause benchmark programs typically have high hit rates even for small data caches; however, our optimizations significanty improved several programs.

References

[1]
ABu-SUFAH, W. 1979. Improving the performance of virtual memory computers. Ph.D. thesis, Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign.
[2]
ALLEN, J. R. AND KENNEDY, K. 1984. Automatic loop interchange. In Proceedings of the SIC- PLAN '8~ Symposium on Compiler Construction. ACM, New York.
[3]
ALLEN, J. R. AND KENNEDY, K. 1987. Automatic translation of Fortran programs to vector form. A CM Trans. Program. Lang. Syst. 9, 4 (Oct.), 491-542.
[4]
BANERJEE, U. 1990. A theory of loop permutations. In Languages and Compilers for Parallel Computing, D. Gelernter, A. Nicolau, and D. Padua, Eds. The MIT Press, Cambridge, Mass., 54-74.
[5]
CALLAHAN, D., CARR, S., AND KENNEDY, K. 1990. Improving register allocation for subscripted variables. In Proceedings of the SIGPLAN '90 Conference on Programming Language Design and Implementation. ACM, New York.
[6]
CALLAHAN, D., COCKE, J., AND KENNEDY, K. 1988. Estimating interlock and improving balance for pipelined machines. J. Parall. Distrib. Comput. 5, 4 (Aug.), 334-358.
[7]
CARR, S. 1992. Memory-hierarchy management. Ph.D. thesis, Dept. of Computer Science, Rice Univ., Houston, Tex.
[8]
CARR, S. AND KENNEDY, K. 1994a. Improving the ratio of memory operations to floating-point operations in loops. ACM Trans. Program. Lang. Syst. 16, 6 (Nov.), 1769-1810.
[9]
CARR, S. AND KENNEDY, K. 1994b. Scalar replacement in the presence of conditional control flow. Softw. Prac. Exper. 2g, 1 (Jan.), 51-77.
[10]
CARR, S. AND WU, Q. 1995. An analysis of loop permutation on the HP PA-RISC. Tech. Rep. TR95-03, Michigan Technological Univ., Houghton, Mich. Feb.
[11]
COLEMAN, S. AND MCKINLEY, K. S. 1995. Tile size selection using cache organization and data layout. In Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation. ACM, New York.
[12]
COOPER, K., HALL, M. W., HOOD, R. T., KENNEDY, K., MCKINLEY, K. S., MELLOR-CRUMMEY, J. M., TORCZON, L., AND WARREN, S. K. 1993. The ParaScope parallel programming environmerit. Proc. IEEE 81, 2 (Feb.), 244-263.
[13]
COOPER, K., HALL, M. W., AND KENNEDY, K. 1993. A methodology for procedure cloning. Comput. Lang. 19, 2 (Feb.), 105-117.
[14]
FERRANTE, J., SARKAR, V., AND THRASH, W. 1991. On estimating and enhancing cache effectiveness. In Languages and Compilers for Parallel Computing, Jth International Workshop, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, Eds. Springer-Verlag, Berlin, 328-343.
[15]
CJANNON, D., JALBY, W., AND CJALLIVAN, K. 1988. Strategies for cache and local memory management by global program transformation. J. Parall. Distrib. Comput. 5, 5 (Oct.), 587-616.
[16]
CJOFF, CJ., KENNEDY, K., AND TSENG, C.-W. 1991. Practical dependence testing. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation. ACM, New York.
[17]
HALL, M. W., KENNEDY, K., AND MCKINLEY, K. S. 1991. Interprocedural transformations for parallel code generation. In Proceedings of Supercomputing '91. IEEE, New York.
[18]
IRIGOIN, F. AND TRIOLET, R. 1988. Supernode partitioning. In Proceedings of the 15th Annual A CM Symposium on the Principles of Programming Languages. ACM, New York.
[19]
KENNEDY, K. AND MCKINLEY, K. S. 1992. Optimizing for parallelism and data locality. In Proceedings of the 1992 ACM International Conference on Supercomputing. ACM, New York.
[20]
KENNEDY, K. AND MCKINLEY, K. S. 1993. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In Languages and Compilers for Parallel Computing, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua, Eds. Springer-Verlag, Berlin, 30
[21]
KENNEDY, K., MCKINLEY, K. S., AND TSENG, C.-W. 1993. Analysis and transformation in an interactive parallel programming tool. Concurrency Pract. Exper. 5, 7 (Oct.), 575-602.
[22]
KUCK, D., KUHN, R., PADUA, D., LEASURE, B., AND WOLFE, M. J. 1981. Dependence graphs and compiler optimizations. In Conference Record of the 8th Annual ACM Symposium on the Principles of Programming Languages. ACM, New York.
[23]
LAM, M., ROTHBERG, E., AND WOLF, M. E. 1991. The cache performance and optimizations of blocked algorithms. In Proceedings of the Jth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York.
[24]
LI, W. AND PINGALI, K. 1992. Access normalization: Loop restructuring for NUMA compilers. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York.
[25]
MCKINLEY, K. S. 1992. Automatic and interactive parallelization. Ph.D. thesis, Dept. of Computer Science, Rice Univ., Houston, Tex.
[26]
WARREN, J. 1984. A hierachical basis for reordering transformations. In Conference Record of the 11th Annual ACM Symposium on the Principles of Programming Languages. ACM, New York.
[27]
WOLF, M. E. 1992. Improving locality and parallelism in nested loops. Ph.D. thesis, Dept. of Computer Science, Stanford Univ., Stanford, Calif.
[28]
WOLF, M. E. AND LAM, M. 1991. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation. ACM, New York.
[29]
WOLFE, M. J. 1986. Advanced loop interchanging. In Proceedings of the 1986 International Conference on Parallel Processing. CRC Press, Boca Raton, Fla.
[30]
WOLFE, M. J. 1987. Iteration space tiling for memory hierarchies. In Proceedings of the 3rd SIAM Conference on Parallel Processing. SIAM, Philadelphia, Pa.
[31]
WOLFE, M. J. 1991. The Tiny loop restructuring research tool. In Proceedings of the 1991 International Conference on Parallel Processing. CRC Press, Boca Raton, Fla.

Cited By

View all
  • (2024)(De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional HomomorphismsACM Transactions on Programming Languages and Systems10.1145/366564346:3(1-74)Online publication date: 10-Oct-2024
  • (2024)A massive MPI parallel framework of smoothed particle hydrodynamics with optimized memory management for extreme mechanics problemsComputer Physics Communications10.1016/j.cpc.2023.108970295(108970)Online publication date: Feb-2024
  • (2024)LSIE: a fast and secure Latin square-based image encryption schemeMultimedia Tools and Applications10.1007/s11042-023-14786-383:3(7939-7979)Online publication date: 1-Jan-2024
  • Show More Cited By

Index Terms

  1. Improving data locality with loop transformations

    Recommendations

    Reviews

    Noah S. Prywes

    When using a processor with a cache memory of limited size, the order of memory referencing in the program can affect the overall computation time. It would be too difficult for a programmer to consider this effect in composing a program. Instead, the authors propose incorporating in the compiler the program optimization needed to accelerate program execution for the cache memory of the processor where the program will be executed. The optimization consists of reordering the loops in the program. The paper presents cost models to evaluate the impact of reordering the loops. These are presented for four schemes for reordering loops—loop permutation, fusion, distribution, and reversal. A cost model for combining such schemes is also presented. Finally, results of experimental work are provided to assess the effectiveness of using the loop reordering cost models. Experimental data are presented for a number of programs and for respective cache memories and processors. The cache memories and processors are 64 KB cache direct-mapped, 32-byte cache line for the HP 715/50; 64 KB cache, four-way, 128-byte cache line for the RS/6000; and 8 KB cache, two-way, 32-byte cache line for the i 860. The programs were executed on the RS/6000 and HP 715/50 as well as simulated for different cache memories of the RS/6000 and i 860. The experimental result validates the cost model. The improvement in execution is significant for only some of the programs studied, and particularly for the use of smaller cache memory. The paper reports on important concepts and experiments, not only for use in compiler construction (as suggested), but also for the design of cache memories. Its readers should already be familiar with the concepts of data dependence and temporal reuse.

    Access critical reviews of Computing literature here

    Become a reviewer for Computing Reviews.

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Programming Languages and Systems
    ACM Transactions on Programming Languages and Systems  Volume 18, Issue 4
    July 1996
    164 pages
    ISSN:0164-0925
    EISSN:1558-4593
    DOI:10.1145/233561
    Issue’s Table of Contents

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 July 1996
    Published in TOPLAS Volume 18, Issue 4

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Cache
    2. compiler optimization
    3. data locality
    4. loop distribution
    5. loop fusion
    6. loop permutation
    7. loop reversal
    8. loop transformations
    9. microprocessors
    10. simulation

    Qualifiers

    • Article

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)272
    • Downloads (Last 6 weeks)66
    Reflects downloads up to 07 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)(De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional HomomorphismsACM Transactions on Programming Languages and Systems10.1145/366564346:3(1-74)Online publication date: 10-Oct-2024
    • (2024)A massive MPI parallel framework of smoothed particle hydrodynamics with optimized memory management for extreme mechanics problemsComputer Physics Communications10.1016/j.cpc.2023.108970295(108970)Online publication date: Feb-2024
    • (2024)LSIE: a fast and secure Latin square-based image encryption schemeMultimedia Tools and Applications10.1007/s11042-023-14786-383:3(7939-7979)Online publication date: 1-Jan-2024
    • (2024)Enhancing a GPU-Based Wave Propagation Application Through Loop Tiling and Loop Fission OptimizationsHigh Performance Computing10.1007/978-3-031-52186-7_2(21-35)Online publication date: 28-Jan-2024
    • (2023)(De/Re)-Compositions Expressed Systematically via MDH-Based SchedulesProceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction10.1145/3578360.3580269(61-72)Online publication date: 17-Feb-2023
    • (2023)Sunstone: A Scalable and Versatile Scheduler for Mapping Tensor Algebra on Spatial Accelerators2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)10.1109/ISPASS57527.2023.00033(259-271)Online publication date: Apr-2023
    • (2023)Multidimensional query processing algorithm by dimension transformationScientific Reports10.1038/s41598-023-31758-713:1Online publication date: 11-Apr-2023
    • (2023)BibliographyEngineering a Compiler10.1016/B978-0-12-815412-0.00023-1(793-813)Online publication date: 2023
    • (2023)Scalar OptimizationEngineering a Compiler10.1016/B978-0-12-815412-0.00016-4(517-573)Online publication date: 2023
    • (2022)Parallelizing Neural Network Models Effectively on GPU by Implementing Reductions AtomicallyProceedings of the International Conference on Parallel Architectures and Compilation Techniques10.1145/3559009.3569656(451-466)Online publication date: 8-Oct-2022
    • Show More Cited By

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media