Abstract
Iterative stencil loops (ISLs) are used in many applications and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true in the case of graphics processors (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the selection of the optimal ghost zone size depends on the characteristics of both the architecture and the application, and it has only been studied for message-passing systems in distributed environments. To automate this process on shared memory systems, we establish a performance model using NVIDIA’s Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the ghost zone size that performs best and generate appropriate code. The modeling is validated by four diverse ISL applications, for which the predicted ghost zone configurations are able to achieve a speedup no less than 95% of the optimal speedup.
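To make the ghost-zone idea concrete, below is a minimal, illustrative CUDA sketch (not the paper's framework or its generated code) of a 1D 3-point averaging stencil. Each thread block loads its tile plus a ghost zone of STEPS cells on either side into shared memory, advances STEPS time steps locally while redundantly recomputing the halo, and writes back only its interior. The names TILE, STEPS, and stencil_ghost are hypothetical, and boundary handling is simplified to clamping.

```cuda
// Minimal sketch of a ghost-zone (redundant-computation) stencil, assuming a
// 1D 3-point averaging stencil and blockDim.x == TILE + 2*STEPS.
#include <cuda_runtime.h>

#define TILE  256   // interior elements owned by each block
#define STEPS 4     // time steps fused per launch == ghost zone width

__global__ void stencil_ghost(const float* in, float* out, int n) {
    __shared__ float buf[2][TILE + 2 * STEPS];

    const int halo = STEPS;
    const int base = blockIdx.x * TILE - halo;   // leftmost global index loaded
    const int tid  = threadIdx.x;
    const int g    = base + tid;                 // global index for this thread

    // Load the tile plus its ghost zone, clamping at the domain boundary.
    buf[0][tid] = in[min(max(g, 0), n - 1)];
    __syncthreads();

    // Advance STEPS iterations entirely in shared memory. The valid region
    // shrinks by one cell per step, so only intra-block syncs are needed;
    // no global synchronization or halo exchange occurs until write-back.
    int cur = 0;
    for (int s = 1; s <= STEPS; ++s) {
        const int nxt = 1 - cur;
        if (tid >= s && tid < TILE + 2 * halo - s) {
            buf[nxt][tid] = (buf[cur][tid - 1] + buf[cur][tid]
                             + buf[cur][tid + 1]) / 3.0f;
        }
        __syncthreads();
        cur = nxt;
    }

    // Write back only the interior this block owns; the halo results are
    // discarded because neighboring blocks recompute them redundantly.
    if (tid >= halo && tid < halo + TILE && g < n) {
        out[g] = buf[cur][tid];
    }
}
```

Under these assumptions the host would launch stencil_ghost<<<(n + TILE - 1) / TILE, TILE + 2 * STEPS>>>(d_in, d_out, n) once per STEPS time steps. Increasing STEPS widens the ghost zone, trading extra redundant arithmetic for fewer kernel launches and global synchronizations; choosing that trade-off automatically is what the performance model described in the abstract addresses.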
Cite this article
Meng, J., Skadron, K. A Performance Study for Iterative Stencil Loops on GPUs with Ghost Zone Optimizations. Int J Parallel Prog 39, 115–142 (2011). https://doi.org/10.1007/s10766-010-0142-5