More Web Proxy on the site http://driver.im/

research-article

Reducing redundancy in data organization and arithmetic calculation for stencil computations

Authors:

Yue YueAuthors Info & Claims

SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 84, Pages 1 - 15

https://doi.org/10.1145/3458817.3476154

Published: 13 November 2021 Publication History

Abstract

Stencil computation is one of the most important kernels in various scientific and engineering applications. A variety of work has focused on vectorization techniques, aiming at exploiting the in-core data parallelism. However, they either incur spatial data conflicts or hurt the data locality when integrated with tiling. In this paper, a novel spatial computation folding is devised to reduce the data reorganization overhead for vectorization and preserve the data locality for tiling in the data space simultaneously. We then propose an approach of temporal computation folding enhanced with shifts reusing, tessellate tiling, and semi-automatic code generation. It aims to further reduce the redundancy of arithmetic calculations and exploit the register reuse along the time dimension. Experimental results on the AVX2 and AVX-512 CPUs show that our approach obtains significant performance improvements compared with state-of-the-art techniques.

Supplementary Material

MP4 File (Reducing Redundancy in Data Organization and Arithmetic Calculation for Stencil Computations.mp4)

Presentation video

Download
248.99 MB

References

[1]

Alfred V Aho, Stephen C Johnson, and Jeffrey D Ullman. 1976. Code generation for expressions with common subexpressions. In Proceedings of the 3rd ACM SIGACT-SIGPLAN symposium on Principles on programming languages. 19--31.

Digital Library

[2]

Randy Allen and Ken Kennedy. 2002. Optimizing compilers for modern architectures: a dependence-based approach. Taylor & Francis US.

[3]

Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, et al. 2006. The landscape of parallel computing research: A view from berkeley. (2006).

[4]

Krste Asanovic, Ras Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John D Kubiatowicz, Edward A Lee, Nelson Morgan, George Necula, David A Patterson, et al. 2008. The parallel computing laboratory at UC Berkeley: A research agenda based on the Berkeley view. EECS Department, University of California, Berkeley, Tech. Rep (2008).

[5]

Vinayaka Bandishti, Irshad Pananilath, and Uday Bondhugula. 2012. Tiling stencil computations to maximize parallelism. In SC'12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 1--11.

Digital Library

[6]

P. Basu, M. Hall, S. Williams, B. V. Straalen, L. Oliker, and P. Colella. 2015. Compiler-Directed Transformation for Higher-Order Stencils. In 2015 IEEE International Parallel and Distributed Processing Symposium. 313--323.

[7]

Uday Bondhugula, Albert Hartono, Jagannathan Ramanujam, and Ponnuswamy Sadayappan. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation. 101--113.

Digital Library

[8]

Matthias Christen, Olaf Schenk, and Helmar Burkhart. 2011. Patus: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In 2011 IEEE International Parallel & Distributed Processing Symposium. IEEE, 676--687.

Digital Library

[9]

Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, and Katherine Yelick. 2008. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In SC'08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing. IEEE, 1--12.

[10]

Raúl de la Cruz and Mauricio Araya-Polo. 2014. Algorithm 942: Semi-Stencil. ACM Trans. Math. Softw. 40, 3, Article 23 (April 2014), 39 pages.

Digital Library

[11]

Steven J Deitz, Bradford L Chamberlain, and Lawrence Snyder. 2001. Eliminating redundancies in sum-of-product array computations. In Proceedings of the 15th international conference on Supercomputing. 65--77.

Digital Library

[12]

Chris Ding and Yun He. 2001. A Ghost Cell Expansion Method for Reducing Communications in Solving PDE Problems (SC '01). 50--50.

[13]

Matteo Frigo and Volker Strumpen. 2005. Cache oblivious stencil computations (ICS '05). 361--366.

[14]

Tobias Gysi, Tobias Grosser, and Torsten Hoefler. 2015. Modesto: Data-centric analytic optimization of complex stencil programs on heterogeneous architectures. In Proceedings of the 29th ACM on International Conference on Supercomputing. 177--186.

Digital Library

[15]

Tom Henretty, Kevin Stock, Louis-Noël Pouchet, Franz Franchetti, J Ramanujam, and P Sadayappan. 2011. Data layout transformation for stencil computations on short-vector simd architectures. In International Conference on Compiler Construction. Springer, 225--245.

Digital Library

[16]

Tom Henretty, Richard Veras, Franz Franchetti, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. 2013. A Stencil Compiler for Short-Vector SIMD Architectures. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing (ICS '13). Association for Computing Machinery, New York, NY, USA, 13--24.

Digital Library

[17]

F. Irigoin and R. Triolet. 1988. Supernode Partitioning (POPL '88). 319--329.

[18]

Guohua Jin, John Mellor-Crummey, and Robert Fowler. 2001. Increasing Temporal Locality with Skewing and Recursive Blocking (SC '01). 43--43.

[19]

Shoaib Kamil, Cy Chan, Leonid Oliker, John Shalf, and Samuel Williams. 2010. An auto-tuning framework for parallel multicore stencil computations. In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). IEEE, 1--12.

[20]

Shoaib Kamil, Kaushik Datta, Samuel Williams, Leonid Oliker, John Shalf, and Katherine Yelick. 2006. Implicit and explicit optimizations for stencil computations. In Proceedings of the 2006 workshop on Memory system performance and correctness. 51--60.

Digital Library

[21]

Sriram Krishnamoorthy, Muthu Baskaran, Uday Bondhugula, J. Ramanujam, Atanas Rountev, and P Sadayappan. 2007. Effective Automatic Parallelization of Stencil Computations. SIGPLAN Not. 42, 6 (June 2007), 235--244.

Digital Library

[22]

Monica D. Lam, Edward E. Rothberg, and Michael E. Wolf. 1991. The Cache Performance and Optimizations of Blocked Algorithms (ASPLOS IV). 63--74.

[23]

Kun Li, Honghui Shang, Yunquan Zhang, Shigang Li, Baodong Wu, Dong Wang, Libo Zhang, Fang Li, Dexun Chen, and Zhiqiang Wei. 2019. OpenKMC: a KMC design for hundred-billion-atom simulation using millions of cores on Sunway Taihulight. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1--16.

Digital Library

[24]

Tareq M. Malas, Georg Hager, Hatem Ltaief, and David E. Keyes. 2017. Multidimensional Intratile Parallelization for Memory-Starved Stencil Computations. ACM Trans. Parallel Comput. 4, 3, Article Article 12 (Dec. 2017), 32 pages.

Digital Library

[25]

A. C. McKellar and E. G. Coffman, Jr. 1969. Organizing Matrices and Matrix Operations for Paged Memory Systems. Commun. ACM 12, 3 (1969), 153--165.

Digital Library

[26]

Jiayuan Meng and Kevin Skadron. 2009. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs. In Proceedings of the 23rd international conference on Supercomputing. 256--265.

Digital Library

[27]

A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 2010. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs (SC '10). 1--13.

[28]

C.S. Department of University of Oregon. 2014. Stencil Pattern. https://ipcc.cs.uoregon.edu/lectures/lecture-8-stencil.pdf [Online; accessed 29-July-2020].

[29]

Fabrice Rastello and Thierry Dauxois. 2002. Efficient Tiling for an ODE Discrete Integration Program: Redundant Tasks Instead of Trapezoidal Shaped-Tiles (IPDPS '02). 138--.

[30]

Prashant Singh Rawat, Fabrice Rastello, Aravind Sukumaran-Rajam, Louis-Noël Pouchet, Atanas Rountev, and P Sadayappan. 2018. Register optimizations for stencils on GPUs. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 168--182.

Digital Library

[31]

P. S. Rawat, A. Sukumaran-Rajam, A. Rountev, F. Rastello, L. Pouchet, and P. Sadayappan. 2018. Associative Instruction Reordering to Alleviate Register Pressure. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. 590--602.

[32]

Gabriel Rivera and Chau-Wen Tseng. 2000. Tiling Optimizations for 3D Scientific Computations (SC '00). Article 32.

[33]

Aaron Sawdey, Matthew O'Keefe, Rainer Bleck, and Robert W Numrich. 1995. The design, implementation, and performance of a parallel ocean circulation model. In Proceedings of 6th ECMWF Workshop on the Use of Parallel Processors in Meteorology: Coming of Age. 523--550.

[34]

Yonghong Song and Zhiyuan Li. 1999. New Tiling Techniques to Improve Cache Temporal Locality (PLDI '99). 215--228.

[35]

Kevin Stock, Martin Kong, Tobias Grosser, Louis-Noël Pouchet, Fabrice Rastello, Jagannathan Ramanujam, and Ponnuswamy Sadayappan. 2014. A framework for enhancing data reuse via associative reordering. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation. 65--76.

Digital Library

[36]

Robert Strzodka, Mohammed Shaheen, Dawid Pajak, and Hans-Peter Seidel. 2010. Cache oblivious parallelograms in iterative stencil computations. In Proceedings of the 24th ACM International Conference on Supercomputing. 49--59.

Digital Library

[37]

Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. 2011. The Pochoir Stencil Compiler. In Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '11). Association for Computing Machinery, New York, NY, USA, 117--128.

Digital Library

[38]

Xinmin Tian, Aart Bik, Milind Girkar, Paul Grey, Hideki Saito, and Ernesto Su. 2002. Intel® OpenMP C++/Fortran Compiler for Hyper-Threading Technology: Implementation and Performance. Intel Technology Journal 6, 1 (2002).

[39]

Sundaresan Venkatasubramanian, Richard W Vuduc, and none none. 2009. Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems. In Proceedings of the 23rd international conference on Supercomputing. 244--255.

[40]

Wikichip.org. 2019. Wikichip of Intel Xeon Gold 6140. https://en.wikichip.org/wiki/intel/xeon_gold/6140 [Online; accessed 29-July-2020].

[41]

Michael E. Wolf and Monica S. Lam. 1991. A Data Locality Optimizing Algorithm (PLDI '91). 30--44.

Digital Library

[42]

M. Wolfe. 1989. More Iteration Space Tiling (Supercomputing '89). 655--664.

[43]

David Wonnacott. 2002. Achieving Scalable Locality with Time Skewing. Int. J. Parallel Program. 30, 3 (June 2002), 181--221.

Digital Library

[44]

David G Wonnacott and Michelle Mills Strout. 2013. On the scalability of loop tiling techniques. IMPACT 2013 (2013).

[45]

Charles Yount. 2015. Vector Folding: improving stencil performance via multidimensional SIMD-vector representation. In 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems. IEEE, 865--870.

[46]

Charles Yount, Josh Tobin, Alexander Breuer, and Alejandro Duran. 2016. YASK-Yet another stencil kernel: A framework for HPC stencil code-generation and tuning. In 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC). IEEE, 30--39.

[47]

Liang Yuan, Shan Huang, Yunquan Zhang, and Hang Cao. 2019. Tessellating Star Stencils. In Proceedings of the 48th International Conference on Parallel Processing. 1--10.

Digital Library

[48]

Liang Yuan, Yunquan Zhang, Peng Guo, and Shan Huang. 2017. Tessellating Stencils. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17). Association for Computing Machinery, New York, NY, USA, Article Article 49, 13 pages.

Digital Library

[49]

Yongpeng Zhang and Frank Mueller. 2012. Auto-generation and auto-tuning of 3D stencil codes on GPU clusters. In Proceedings of the Tenth International Symposium on Code Generation and Optimization. 155--164.

Digital Library

[50]

Tuowen Zhao, Protonu Basu, Samuel Williams, Mary Hall, and Hans Johansen. 2019. Exploiting reuse and vectorization in blocked stencil computations on CPUs and GPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1--44.

Digital Library

Cited By

Han HLi KCui WBai DZhang YYuan LChen YZhang YCao TYang M(2025)FlashFFTStencil: Bridging Fast Fourier Transforms to Memory-Efficient Stencil Computations on Tensor Core UnitsProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3710848.3710897(355-368)Online publication date: 28-Feb-2025
https://dl.acm.org/doi/10.1145/3710848.3710897
Zhang YLi KYuan LHan HZhang YCao TYang M(2025)Jigsaw: Toward Conflict-free Vectorized Stencil Computation by Tessellating Swizzled RegistersProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3710848.3710886(481-495)Online publication date: 28-Feb-2025
https://dl.acm.org/doi/10.1145/3710848.3710886
TAN PXUE CZHOU L(2024)Acceleration of Remote Sensing Image Filtering Based on Embedded CPU+GPU Heterogeneous PlatformChinese Journal of Space Science10.11728/cjss2024.01.2023-003344:1(95)Online publication date: 2024
https://doi.org/10.11728/cjss2024.01.2023-0033
Show More Cited By

Index Terms

Reducing redundancy in data organization and arithmetic calculation for stencil computations
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms
      1. Vector / streaming algorithms
2. Theory of computation
  1. Design and analysis of algorithms
    1. Parallel algorithms
      1. Vector / streaming algorithms

Recommendations

Optimizing Stencil Computation on Multi-core DSPs
ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing

Stencil is a common computation pattern in high-performance computing (HPC) applications. While extensive work has been proposed to optimize stencil kernels on CPUs and GPUs, there is no consensus on how to best optimize stencils on multi-core Digital ...
High-performance code generation for stencil computations on GPU architectures
ICS '12: Proceedings of the 26th ACM international conference on Supercomputing

Stencil computations arise in many scientific computing domains, and often represent time-critical portions of applications. There is significant interest in offloading these computations to high-performance devices such as GPU accelerators, but these ...
Optimizing stencil application on multi-thread GPU architecture using stream programming model
ARCS'10: Proceedings of the 23rd international conference on Architecture of Computing Systems

With fast development of GPU hardware and software, using GPUs to accelerate non-graphics CPU applications is becoming inevitable trend. GPUs are good at performing ALU-intensive computation and feature high peak performance; however, how to harness ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2021

1493 pages

ISBN:9781450384421

DOI:10.1145/3458817

General Chair:
Bronis R. de Supinski,
Program Chairs:
Mary Hall,
Todd Gamblin

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGHPC: ACM Special Interest Group on High Performance Computing, Special Interest Group on High Performance Computing

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 November 2021

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Badges

Artifacts Available / v1.1

Author Tags

Qualifiers

Research-article

Conference

SC '21

Sponsor:

SIGHPC

SC '21: The International Conference for High Performance Computing, Networking, Storage and Analysis

November 14 - 19, 2021

Missouri, St. Louis

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Upcoming Conference

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

5
Total Citations
View Citations
309
Total Downloads

Downloads (Last 12 months)41
Downloads (Last 6 weeks)4

Reflects downloads up to 02 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Han HLi KCui WBai DZhang YYuan LChen YZhang YCao TYang M(2025)FlashFFTStencil: Bridging Fast Fourier Transforms to Memory-Efficient Stencil Computations on Tensor Core UnitsProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3710848.3710897(355-368)Online publication date: 28-Feb-2025
https://dl.acm.org/doi/10.1145/3710848.3710897
Zhang YLi KYuan LHan HZhang YCao TYang M(2025)Jigsaw: Toward Conflict-free Vectorized Stencil Computation by Tessellating Swizzled RegistersProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3710848.3710886(481-495)Online publication date: 28-Feb-2025
https://dl.acm.org/doi/10.1145/3710848.3710886
TAN PXUE CZHOU L(2024)Acceleration of Remote Sensing Image Filtering Based on Embedded CPU+GPU Heterogeneous PlatformChinese Journal of Space Science10.11728/cjss2024.01.2023-003344:1(95)Online publication date: 2024
https://doi.org/10.11728/cjss2024.01.2023-0033
Sun BLi MYang HXu JLuan ZQian D(2023)Adapting combined tiling to stencil optimizations on sunway processorCCF Transactions on High Performance Computing10.1007/s42514-023-00147-x5:3(322-333)Online publication date: 17-May-2023
https://doi.org/10.1007/s42514-023-00147-x
Li KYuan LZhang YChen G(2021)An Accurate and Efficient Large-scale Regression Method through Best Friend ClusteringIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.3134336(1-1)Online publication date: 2021
https://doi.org/10.1109/TPDS.2021.3134336

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten