[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
Skip header Section
Loop tiling for parallelismAugust 2000
  • Author:
  • Jingling Xue
Publisher:
  • Kluwer Academic Publishers
  • 101 Philip Drive Assinippi Park Norwell, MA
  • United States
ISBN:978-0-7923-7933-1
Published:01 August 2000
Pages:
256
Skip Bibliometrics Section
Reflects downloads up to 26 Dec 2024Bibliometrics
Abstract

No abstract available.

Cited By

  1. ACM
    Mohanty S, Bhasi V, Son M, Kandemir M and Das C FAAStloop: Optimizing Loop-Based Applications for Serverless Computing Proceedings of the 2024 ACM Symposium on Cloud Computing, (943-960)
  2. Liu S, Zhang Z and Wu W (2023). DHTS: A Dynamic Hybrid Tiling Strategy for Optimizing Stencil Computation on GPUs, IEEE Transactions on Computers, 72:10, (2795-2807), Online publication date: 1-Oct-2023.
  3. ACM
    Reber B, Gould M, Kneipp A, Liu F, Prechtl I, Ding C, Chen L and Patru D (2023). Cache Programming for Scientific Loops Using Leases, ACM Transactions on Architecture and Code Optimization, 20:3, (1-25), Online publication date: 30-Sep-2023.
  4. ACM
    Gondimalla A, Liu J, Thottethodi M and Vijaykumar T (2023). Occam: Optimal Data Reuse for Convolutional Neural Networks, ACM Transactions on Architecture and Code Optimization, 20:1, (1-25), Online publication date: 31-Mar-2023.
  5. Saulo H, Balakrishnan N and Vila R (2023). On a quantile autoregressive conditional duration model, Mathematics and Computers in Simulation, 203:C, (425-448), Online publication date: 1-Jan-2023.
  6. ACM
    Susungi A and Tadonki C (2021). Intermediate Representations for Explicitly Parallel Programs, ACM Computing Surveys, 54:5, (1-24), Online publication date: 30-Jun-2022.
  7. ACM
    Yuan L, Cao H, Zhang Y, Li K, Lu P and Yue Y Temporal vectorization for stencils Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, (1-13)
  8. ACM
    Olivry A, Iooss G, Tollenaere N, Rountev A, Sadayappan P and Rastello F IOOpt: automatic derivation of I/O complexity bounds for affine programs Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, (1187-1202)
  9. ACM
    Şuşu A (2020). A Vector-Length Agnostic Compiler for the Connex-S Accelerator with Scratchpad Memory, ACM Transactions on Embedded Computing Systems, 19:6, (1-30), Online publication date: 30-Nov-2020.
  10. ACM
    Wu M, Liu Y, Cui H, Wei Q, Li Q, Li L, Lv F, Xue J and Feng X Bandwidth-Aware Loop Tiling for DMA-Supported Scratchpad Memory Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, (97-109)
  11. Palkowski M and Bielecki W (2022). Parallel tiled cache and energy efficient codes for O ( n 4 ) RNA folding algorithms, Journal of Parallel and Distributed Computing, 137:C, (252-258), Online publication date: 1-Mar-2020.
  12. Zarei Zefreh E, Lotfi S, Mohammad Khanli L and Karimpour J (2019). Topology and computational-power aware tile mapping of perfectly nested loops with dependencies on distributed systems, Journal of Parallel and Distributed Computing, 129:C, (14-35), Online publication date: 1-Jul-2019.
  13. Bielecki W and Skotnicki P (2019). Insight into tiles generated by means of a correction technique, The Journal of Supercomputing, 75:5, (2665-2690), Online publication date: 1-May-2019.
  14. ACM
    Zhang F and Xue J (2018). Poker, ACM Transactions on Architecture and Code Optimization, 15:4, (1-28), Online publication date: 8-Jan-2019.
  15. ACM
    Su X, Liao X, Jiang H, Yang C and Xue J (2018). SCP, ACM Transactions on Architecture and Code Optimization, 15:4, (1-21), Online publication date: 31-Dec-2019.
  16. Seyfari Y, Lotfi S and Karimpour J (2018). Optimizing inter-nest data locality in imperfect stencils based on loop blocking, The Journal of Supercomputing, 74:10, (5432-5460), Online publication date: 1-Oct-2018.
  17. Seyfari Y, Lotfi S and Karimpour J (2018). Optimizing inter-nest data locality in imperfect stencils based on loop blocking, The Journal of Supercomputing, 74:10, (5432-5460), Online publication date: 1-Oct-2018.
  18. ACM
    Zhao J, Cui H, Zhang Y, Xue J and Feng X Revisiting Loop Tiling for Datacenters Proceedings of the 2018 International Conference on Supercomputing, (328-340)
  19. Bielecki W, Palkowski M and Skotnicki P (2018). Generation of parallel synchronization-free tiled code, Computing, 100:3, (277-302), Online publication date: 1-Mar-2018.
  20. ACM
    Zhang F and Xue J Poker: permutation-based SIMD execution of intensive tree search by path encoding Proceedings of the 2018 International Symposium on Code Generation and Optimization, (87-99)
  21. ACM
    Cui Y, Liu S, Zou N and Wu W A Dynamic Parallel Strategy for DOACROSS Loops Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, (108-115)
  22. ACM
    Sampaio D, Pouchet L and Rastello F Simplification and runtime resolution of data dependence constraints for loop transformations Proceedings of the International Conference on Supercomputing, (1-11)
  23. Su X, Liao X and Xue J Automatic generation of fast BLAS3-GEMM: a portable compiler approach Proceedings of the 2017 International Symposium on Code Generation and Optimization, (122-133)
  24. ACM
    Vanderbruggen T, Cavazos J, Liao C and Quinlan D Directive-based tile abstraction to distribute loops on accelerators Proceedings of the General Purpose GPUs, (53-62)
  25. ACM
    Zhou H and Xue J (2016). A Compiler Approach for Exploiting Partial SIMD Parallelism, ACM Transactions on Architecture and Code Optimization, 13:1, (1-26), Online publication date: 5-Apr-2016.
  26. ACM
    Darte A, Isoard A and Yuki T Extended lattice-based memory allocation Proceedings of the 25th International Conference on Compiler Construction, (218-228)
  27. ACM
    Pananilath I, Acharya A, Vasista V and Bondhugula U (2015). An Optimizing Code Generator for a Class of Lattice-Boltzmann Computations, ACM Transactions on Architecture and Code Optimization, 12:2, (1-23), Online publication date: 8-Jul-2015.
  28. ACM
    Zou Y and Rajopadhye S Automatic Energy Efficient Parallelization of Uniform Dependence Computations Proceedings of the 29th ACM on International Conference on Supercomputing, (373-382)
  29. ACM
    Mullapudi R, Vasista V and Bondhugula U (2015). PolyMage, ACM SIGARCH Computer Architecture News, 43:1, (429-443), Online publication date: 29-May-2015.
  30. ACM
    Mullapudi R, Vasista V and Bondhugula U (2015). PolyMage, ACM SIGPLAN Notices, 50:4, (429-443), Online publication date: 12-May-2015.
  31. ACM
    Mullapudi R, Vasista V and Bondhugula U PolyMage Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, (429-443)
  32. Gheolbănoiu A, Petrică L and Coţofană S Hybrid adaptive clock management for FPGA processor acceleration Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, (1359-1364)
  33. ACM
    Feld D, Soddemann T, Jünger M and Mallach S Hardware-Aware Automatic Code-Transformation to Support Compilers in Exploiting the Multi-Level Parallel Potential of Modern CPUs Proceedings of the 2015 International Workshop on Code Optimisation for Multi and Many Cores, (1-10)
  34. ACM
    Bondhugula U, Bandishti V, Cohen A, Potron G and Vasilache N Tiling and optimizing time-iterated computations on periodic domains Proceedings of the 23rd international conference on Parallel architectures and compilation, (39-50)
  35. ACM
    Hannig F, Lari V, Boppu S, Tanase A and Reiche O (2014). Invasive Tightly-Coupled Processor Arrays, ACM Transactions on Embedded Computing Systems, 13:4s, (1-29), Online publication date: 1-Jul-2014.
  36. ACM
    Bondhugula U Compiling affine loop nests for distributed-memory parallel architectures Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, (1-12)
  37. Alias C, Darte A and Plesco A Optimizing remote accesses for offloaded kernels Proceedings of the Conference on Design, Automation and Test in Europe, (575-580)
  38. ACM
    Bathen L, Ahn Y, Pasricha S and Dutt N (2013). MultiMaKe, ACM Transactions on Embedded Computing Systems, 12:1s, (1-25), Online publication date: 1-Mar-2013.
  39. Bao B and Ding C Defensive loop tiling for shared cache Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), (1-11)
  40. ACM
    Cui H, Yi Q, Xue J and Feng X (2013). Layout-oblivious compiler optimization for matrix computations, ACM Transactions on Architecture and Code Optimization, 9:4, (1-20), Online publication date: 1-Jan-2013.
  41. Bandishti V, Pananilath I and Bondhugula U Tiling stencil computations to maximize parallelism Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, (1-11)
  42. ACM
    Cui H, Yi Q, Xue J and Feng X Layout-oblivious optimization for matrix computations Proceedings of the 21st international conference on Parallel architectures and compilation techniques, (429-430)
  43. ACM
    Alias C, Darte A and Plesco A (2012). Optimizing remote accesses for offloaded kernels, ACM SIGPLAN Notices, 47:8, (285-286), Online publication date: 11-Sep-2012.
  44. ACM
    Cui H, Xue J, Wang L, Yang Y, Feng X and Fan D (2012). Extendable pattern-oriented optimization directives, ACM Transactions on Architecture and Code Optimization, 9:3, (1-37), Online publication date: 1-Sep-2012.
  45. Tadonki C, Lacassagne L, Dadi E and El Daoudi M Accelerator-Based implementation of the harris algorithm Proceedings of the 5th international conference on Image and Signal Processing, (485-492)
  46. ACM
    Renganarayanan L, Kim D, Strout M and Rajopadhye S (2012). Parameterized loop tiling, ACM Transactions on Programming Languages and Systems, 34:1, (1-41), Online publication date: 1-Apr-2012.
  47. Shirako J, Sharma K, Fauzia N, Pouchet L, Ramanujam J, Sadayappan P and Sarkar V Analytical bounds for optimal tile size selection Proceedings of the 21st international conference on Compiler Construction, (101-121)
  48. ACM
    Alias C, Darte A and Plesco A Optimizing remote accesses for offloaded kernels Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, (285-286)
  49. Di P and Xue J Model-driven tile size selection for DOACROSS loops on GPUs Proceedings of the 17th international conference on Parallel processing - Volume Part II, (401-412)
  50. Cui H, Xue J, Wang L, Yang Y, Feng X and Fan D Extendable pattern-oriented optimization directives Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization, (107-118)
  51. Alias C, Pasca B and Plesco A Automatic generation of fpga-specific pipelined accelerators Proceedings of the 7th international conference on Reconfigurable computing: architectures, tools and applications, (53-66)
  52. ACM
    Li L, Xue J and Knoop J (2011). Scratchpad memory allocation for data aggregates via interval coloring in superperfect graphs, ACM Transactions on Embedded Computing Systems, 10:2, (1-42), Online publication date: 1-Dec-2010.
  53. Corvino R, Gamatié A and Boulet P Architecture exploration for efficient data transfer and storage in data-parallel applications Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I, (101-116)
  54. Smyk A and Tudruj M Streaming model computation of the FDTD problem Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume Part I, (184-192)
  55. ACM
    Baskaran M, Hartono A, Tavarageri S, Henretty T, Ramanujam J and Sadayappan P Parameterized tiling revisited Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, (200-209)
  56. ACM
    Yuki T, Renganarayanan L, Rajopadhye S, Anderson C, Eichenberger A and O'Brien K Automatic creation of tile size selection models Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, (190-199)
  57. ACM
    Renganarayana L, Bondhugula U, Derisavi S, Eichenberger A and O'Brien K Compact multi-dimensional kernel extraction for register tiling Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, (1-12)
  58. ACM
    Zhang Y, Kandemir M, Pitsianis N and Sun X Exploring parallelization strategies for NUFFT data translation Proceedings of the seventh ACM international conference on Embedded software, (187-196)
  59. Kim D and Rajopadhye S Efficient tiled loop generation Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing, (293-307)
  60. Smyk A and Tudruj M Optimization of FDTD computations in a streaming model architecture Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I, (547-556)
  61. ACM
    Hartono A, Baskaran M, Bastoul C, Cohen A, Krishnamoorthy S, Norris B, Ramanujam J and Sadayappan P Parametric multi-level tiling of imperfectly nested loops Proceedings of the 23rd international conference on Supercomputing, (147-157)
  62. Renganarayana L and Rajopadhye S Positivity, posynomials and tile size selection Proceedings of the 2008 ACM/IEEE conference on Supercomputing, (1-12)
  63. Andronikos T, Ciorba F, Theodoropoulos P, Kamenopoulos D and Papakonstantinou G (2008). Cronus, Journal of Systems and Software, 81:8, (1389-1405), Online publication date: 1-Aug-2008.
  64. ACM
    Wang L, Yang X, Xue J, Deng Y, Yan X, Tang T and Nguyen Q (2008). Optimizing scientific application loops on stream processors, ACM SIGPLAN Notices, 43:7, (161-170), Online publication date: 27-Jun-2008.
  65. ACM
    Wang L, Yang X, Xue J, Deng Y, Yan X, Tang T and Nguyen Q Optimizing scientific application loops on stream processors Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems, (161-170)
  66. ACM
    Bondhugula U, Hartono A, Ramanujam J and Sadayappan P A practical automatic polyhedral parallelizer and locality optimizer Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation, (101-113)
  67. ACM
    Bondhugula U, Hartono A, Ramanujam J and Sadayappan P (2008). A practical automatic polyhedral parallelizer and locality optimizer, ACM SIGPLAN Notices, 43:6, (101-113), Online publication date: 30-May-2008.
  68. ACM
    Vera X, Lisper B and Xue J (2007). Data cache locking for tight timing calculations, ACM Transactions on Embedded Computing Systems, 7:1, (1-38), Online publication date: 1-Dec-2007.
  69. ACM
    Kim D, Renganarayanan L, Rostron D, Rajopadhye S and Strout M Multi-level tiling Proceedings of the 2007 ACM/IEEE conference on Supercomputing, (1-12)
  70. ACM
    Bouchebaba Y, Girodias B, Nicolescu G, Aboulhamid E, Lavigueur B and Paulin P (2007). MPSoC memory optimization using program transformation, ACM Transactions on Design Automation of Electronic Systems, 12:4, (43-es), Online publication date: 1-Sep-2007.
  71. Du J, Yang X, Wang G, Tang T and Zeng K Architecture-based optimization for mapping scientific applications to imagine Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications, (32-43)
  72. Li L, Wu H, Feng H and Xue J Towards data tiling for whole programs in scratchpad memory allocation Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture, (63-74)
  73. ACM
    Renganarayanan L, Kim D, Rajopadhye S and Strout M Parameterized tiled loops for free Proceedings of the 28th ACM SIGPLAN Conference on Programming Language Design and Implementation, (405-414)
  74. ACM
    Renganarayanan L, Kim D, Rajopadhye S and Strout M (2007). Parameterized tiled loops for free, ACM SIGPLAN Notices, 42:6, (405-414), Online publication date: 10-Jun-2007.
  75. Yang X, Du J, Yan X and Deng Y Matrix-Based programming optimization for improving memory hierarchy performance on imagine Proceedings of the 4th international conference on Parallel and Distributed Processing and Applications, (782-793)
  76. Hu Z, del Cuvillo J, Zhu W and Gao G Optimization of dense matrix multiplication on IBM cyclops-64 Proceedings of the 12th international conference on Parallel Processing, (134-144)
  77. Ciorba F, Andronikos T, Riakiotakis I, Chronopoulos A and Papakonstantinou G Dynamic multi phase scheduling for heterogeneous cluste Proceedings of the 20th international conference on Parallel and distributed processing, (72-72)
  78. Pan L, Lai M, Dillencourt M and Bic L Mobile pipelines Proceedings of the 12th international conference on High Performance Computing, (201-212)
  79. Xue J Aggressive loop fusion for improving locality and parallelism Proceedings of the Third international conference on Parallel and Distributed Processing and Applications, (224-238)
  80. Renganarayana L, Ramakrishna U and Rajopadhye S Combined ILP and register tiling Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing, (244-258)
  81. ACM
    Zhang C and Kurdahi F On combining iteration space tiling with data space tiling for scratch-pad memory systems Proceedings of the 2005 Asia and South Pacific Design Automation Conference, (973-976)
  82. Renganarayana L and Rajopadhye S A Geometric Programming Framework for Optimal Multi-Level Tiling Proceedings of the 2004 ACM/IEEE conference on Supercomputing
  83. Xue J and Vera X (2004). Efficient and Accurate Analytical Modeling of Whole-Program Data Cache Behavior, IEEE Transactions on Computers, 53:5, (547-566), Online publication date: 1-May-2004.
Contributors
  • UNSW Sydney

Reviews

Hans J. Schneider

A technique to restructure nested loops for parallel machines is discussed. This monograph focuses on the use of loop tiling for minimizing synchronization and communication cost and maximizing parallelism. The author does not consider other kinds of loop transformation or optimizing for cache locality. The book is organized into three parts. The first part introduces the mathematical background as well as the theory of nonsingular loop transformations. The second deals with rectangular tiling and then addresses the general case of parallelepiped tiling. The last part focuses on minimizing the execution time of a loop nest on a distributed memory machine. Chapter 1 presents some mathematical concepts necessary to understand the subject. The author emphasizes convex cones, which are used to analyze data dependency, loop permutability, legality, and selection of tile size and shape. The next chapter formally introduces the basic concepts of loop transformations. It defines perfectly nested loops, dependence vectors, and their polyhedra, and it treats fully permutable loop nests. Nonsingular transformations, closely related to tiling, are briefly reviewed. Rectangular tiling, covered in chapter 3, uses squares or rectangles of the same size and shape to partition the iteration space; formally, a concrete partitioning is characterized by a tile size vector and a tile offset. In general, the legality of a rectangular tiling can be checked by testing the existence of integer solutions to a system of inequalities; the author also describes a simpler test that is sufficient in practical applications. Finally, he discusses several related transformations, such as strip-mining, loop coalescing, and loop skewing. Chapter 4 extends the investigations to parallelepiped tiling, which offers more opportunities for exposing parallelism, improving locality, and reducing communication overhead. After illustrating this technique with an example and giving the formal definition, the author again discusses how to use integer programming to test legality exactly and then describes a practical test. Finally, he presents an alternative formal model that can clarify the duality between loop tiling and loop partitioning. The final chapters consider implementation on a distributed memory machine. Chapter 5 describes a suite of compiler techniques to generate a single-program, multiple data (SPMD) program to execute a tiled iteration space. Tiles are assigned to processors as atomic units of computation and then data distribution is derived using the computer-owns rule, by which a processor owns the data it computes. A detailed formal discussion of message-passing code generation, local memory management, and global-to-local address translation follows. Some experimental results round out the chapter. Chapter 6 addresses the problem of determining the best tile shape to minimize inter-tile communication if the tile size is given; this analysis assumes constant distance vectors. Conversely, chapter 7 deals with the more difficult problem of finding the best tile size once the shape is known, but restricting discussion to a two-dimensional iteration space. The author writes well and, at the beginning of each chapter, states what that chapter will cover. The book is clearly organized, and “Further Reading” sections are helpful for readers who are entering the field. Overall, the author succeeds in giving a consistent presentation of the state of the art in a specialized topic. The book can be used as a reference by professionals in compiler techniques, but they should be very familiar with the notation and terminology of linear algebra.

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Please enable JavaScript to view thecomments powered by Disqus.

Recommendations