[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3458817.3476154acmconferencesArticle/Chapter ViewAbstractPublication PagesscConference Proceedingsconference-collections
research-article

Reducing redundancy in data organization and arithmetic calculation for stencil computations

Published: 13 November 2021 Publication History

Abstract

Stencil computation is one of the most important kernels in various scientific and engineering applications. A variety of work has focused on vectorization techniques, aiming at exploiting the in-core data parallelism. However, they either incur spatial data conflicts or hurt the data locality when integrated with tiling. In this paper, a novel spatial computation folding is devised to reduce the data reorganization overhead for vectorization and preserve the data locality for tiling in the data space simultaneously. We then propose an approach of temporal computation folding enhanced with shifts reusing, tessellate tiling, and semi-automatic code generation. It aims to further reduce the redundancy of arithmetic calculations and exploit the register reuse along the time dimension. Experimental results on the AVX2 and AVX-512 CPUs show that our approach obtains significant performance improvements compared with state-of-the-art techniques.

Supplementary Material

MP4 File (Reducing Redundancy in Data Organization and Arithmetic Calculation for Stencil Computations.mp4)
Presentation video

References

[1]
Alfred V Aho, Stephen C Johnson, and Jeffrey D Ullman. 1976. Code generation for expressions with common subexpressions. In Proceedings of the 3rd ACM SIGACT-SIGPLAN symposium on Principles on programming languages. 19--31.
[2]
Randy Allen and Ken Kennedy. 2002. Optimizing compilers for modern architectures: a dependence-based approach. Taylor & Francis US.
[3]
Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, et al. 2006. The landscape of parallel computing research: A view from berkeley. (2006).
[4]
Krste Asanovic, Ras Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John D Kubiatowicz, Edward A Lee, Nelson Morgan, George Necula, David A Patterson, et al. 2008. The parallel computing laboratory at UC Berkeley: A research agenda based on the Berkeley view. EECS Department, University of California, Berkeley, Tech. Rep (2008).
[5]
Vinayaka Bandishti, Irshad Pananilath, and Uday Bondhugula. 2012. Tiling stencil computations to maximize parallelism. In SC'12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 1--11.
[6]
P. Basu, M. Hall, S. Williams, B. V. Straalen, L. Oliker, and P. Colella. 2015. Compiler-Directed Transformation for Higher-Order Stencils. In 2015 IEEE International Parallel and Distributed Processing Symposium. 313--323.
[7]
Uday Bondhugula, Albert Hartono, Jagannathan Ramanujam, and Ponnuswamy Sadayappan. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation. 101--113.
[8]
Matthias Christen, Olaf Schenk, and Helmar Burkhart. 2011. Patus: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In 2011 IEEE International Parallel & Distributed Processing Symposium. IEEE, 676--687.
[9]
Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, and Katherine Yelick. 2008. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In SC'08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing. IEEE, 1--12.
[10]
Raúl de la Cruz and Mauricio Araya-Polo. 2014. Algorithm 942: Semi-Stencil. ACM Trans. Math. Softw. 40, 3, Article 23 (April 2014), 39 pages.
[11]
Steven J Deitz, Bradford L Chamberlain, and Lawrence Snyder. 2001. Eliminating redundancies in sum-of-product array computations. In Proceedings of the 15th international conference on Supercomputing. 65--77.
[12]
Chris Ding and Yun He. 2001. A Ghost Cell Expansion Method for Reducing Communications in Solving PDE Problems (SC '01). 50--50.
[13]
Matteo Frigo and Volker Strumpen. 2005. Cache oblivious stencil computations (ICS '05). 361--366.
[14]
Tobias Gysi, Tobias Grosser, and Torsten Hoefler. 2015. Modesto: Data-centric analytic optimization of complex stencil programs on heterogeneous architectures. In Proceedings of the 29th ACM on International Conference on Supercomputing. 177--186.
[15]
Tom Henretty, Kevin Stock, Louis-Noël Pouchet, Franz Franchetti, J Ramanujam, and P Sadayappan. 2011. Data layout transformation for stencil computations on short-vector simd architectures. In International Conference on Compiler Construction. Springer, 225--245.
[16]
Tom Henretty, Richard Veras, Franz Franchetti, Louis-Noël Pouchet, J. Ramanujam, and P. Sadayappan. 2013. A Stencil Compiler for Short-Vector SIMD Architectures. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing (ICS '13). Association for Computing Machinery, New York, NY, USA, 13--24.
[17]
F. Irigoin and R. Triolet. 1988. Supernode Partitioning (POPL '88). 319--329.
[18]
Guohua Jin, John Mellor-Crummey, and Robert Fowler. 2001. Increasing Temporal Locality with Skewing and Recursive Blocking (SC '01). 43--43.
[19]
Shoaib Kamil, Cy Chan, Leonid Oliker, John Shalf, and Samuel Williams. 2010. An auto-tuning framework for parallel multicore stencil computations. In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). IEEE, 1--12.
[20]
Shoaib Kamil, Kaushik Datta, Samuel Williams, Leonid Oliker, John Shalf, and Katherine Yelick. 2006. Implicit and explicit optimizations for stencil computations. In Proceedings of the 2006 workshop on Memory system performance and correctness. 51--60.
[21]
Sriram Krishnamoorthy, Muthu Baskaran, Uday Bondhugula, J. Ramanujam, Atanas Rountev, and P Sadayappan. 2007. Effective Automatic Parallelization of Stencil Computations. SIGPLAN Not. 42, 6 (June 2007), 235--244.
[22]
Monica D. Lam, Edward E. Rothberg, and Michael E. Wolf. 1991. The Cache Performance and Optimizations of Blocked Algorithms (ASPLOS IV). 63--74.
[23]
Kun Li, Honghui Shang, Yunquan Zhang, Shigang Li, Baodong Wu, Dong Wang, Libo Zhang, Fang Li, Dexun Chen, and Zhiqiang Wei. 2019. OpenKMC: a KMC design for hundred-billion-atom simulation using millions of cores on Sunway Taihulight. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1--16.
[24]
Tareq M. Malas, Georg Hager, Hatem Ltaief, and David E. Keyes. 2017. Multidimensional Intratile Parallelization for Memory-Starved Stencil Computations. ACM Trans. Parallel Comput. 4, 3, Article Article 12 (Dec. 2017), 32 pages.
[25]
A. C. McKellar and E. G. Coffman, Jr. 1969. Organizing Matrices and Matrix Operations for Paged Memory Systems. Commun. ACM 12, 3 (1969), 153--165.
[26]
Jiayuan Meng and Kevin Skadron. 2009. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs. In Proceedings of the 23rd international conference on Supercomputing. 256--265.
[27]
A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 2010. 3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs (SC '10). 1--13.
[28]
C.S. Department of University of Oregon. 2014. Stencil Pattern. https://ipcc.cs.uoregon.edu/lectures/lecture-8-stencil.pdf [Online; accessed 29-July-2020].
[29]
Fabrice Rastello and Thierry Dauxois. 2002. Efficient Tiling for an ODE Discrete Integration Program: Redundant Tasks Instead of Trapezoidal Shaped-Tiles (IPDPS '02). 138--.
[30]
Prashant Singh Rawat, Fabrice Rastello, Aravind Sukumaran-Rajam, Louis-Noël Pouchet, Atanas Rountev, and P Sadayappan. 2018. Register optimizations for stencils on GPUs. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 168--182.
[31]
P. S. Rawat, A. Sukumaran-Rajam, A. Rountev, F. Rastello, L. Pouchet, and P. Sadayappan. 2018. Associative Instruction Reordering to Alleviate Register Pressure. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. 590--602.
[32]
Gabriel Rivera and Chau-Wen Tseng. 2000. Tiling Optimizations for 3D Scientific Computations (SC '00). Article 32.
[33]
Aaron Sawdey, Matthew O'Keefe, Rainer Bleck, and Robert W Numrich. 1995. The design, implementation, and performance of a parallel ocean circulation model. In Proceedings of 6th ECMWF Workshop on the Use of Parallel Processors in Meteorology: Coming of Age. 523--550.
[34]
Yonghong Song and Zhiyuan Li. 1999. New Tiling Techniques to Improve Cache Temporal Locality (PLDI '99). 215--228.
[35]
Kevin Stock, Martin Kong, Tobias Grosser, Louis-Noël Pouchet, Fabrice Rastello, Jagannathan Ramanujam, and Ponnuswamy Sadayappan. 2014. A framework for enhancing data reuse via associative reordering. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation. 65--76.
[36]
Robert Strzodka, Mohammed Shaheen, Dawid Pajak, and Hans-Peter Seidel. 2010. Cache oblivious parallelograms in iterative stencil computations. In Proceedings of the 24th ACM International Conference on Supercomputing. 49--59.
[37]
Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and Charles E. Leiserson. 2011. The Pochoir Stencil Compiler. In Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '11). Association for Computing Machinery, New York, NY, USA, 117--128.
[38]
Xinmin Tian, Aart Bik, Milind Girkar, Paul Grey, Hideki Saito, and Ernesto Su. 2002. Intel® OpenMP C++/Fortran Compiler for Hyper-Threading Technology: Implementation and Performance. Intel Technology Journal 6, 1 (2002).
[39]
Sundaresan Venkatasubramanian, Richard W Vuduc, and none none. 2009. Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU systems. In Proceedings of the 23rd international conference on Supercomputing. 244--255.
[40]
Wikichip.org. 2019. Wikichip of Intel Xeon Gold 6140. https://en.wikichip.org/wiki/intel/xeon_gold/6140 [Online; accessed 29-July-2020].
[41]
Michael E. Wolf and Monica S. Lam. 1991. A Data Locality Optimizing Algorithm (PLDI '91). 30--44.
[42]
M. Wolfe. 1989. More Iteration Space Tiling (Supercomputing '89). 655--664.
[43]
David Wonnacott. 2002. Achieving Scalable Locality with Time Skewing. Int. J. Parallel Program. 30, 3 (June 2002), 181--221.
[44]
David G Wonnacott and Michelle Mills Strout. 2013. On the scalability of loop tiling techniques. IMPACT 2013 (2013).
[45]
Charles Yount. 2015. Vector Folding: improving stencil performance via multidimensional SIMD-vector representation. In 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems. IEEE, 865--870.
[46]
Charles Yount, Josh Tobin, Alexander Breuer, and Alejandro Duran. 2016. YASK-Yet another stencil kernel: A framework for HPC stencil code-generation and tuning. In 2016 Sixth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing (WOLFHPC). IEEE, 30--39.
[47]
Liang Yuan, Shan Huang, Yunquan Zhang, and Hang Cao. 2019. Tessellating Star Stencils. In Proceedings of the 48th International Conference on Parallel Processing. 1--10.
[48]
Liang Yuan, Yunquan Zhang, Peng Guo, and Shan Huang. 2017. Tessellating Stencils. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17). Association for Computing Machinery, New York, NY, USA, Article Article 49, 13 pages.
[49]
Yongpeng Zhang and Frank Mueller. 2012. Auto-generation and auto-tuning of 3D stencil codes on GPU clusters. In Proceedings of the Tenth International Symposium on Code Generation and Optimization. 155--164.
[50]
Tuowen Zhao, Protonu Basu, Samuel Williams, Mary Hall, and Hans Johansen. 2019. Exploiting reuse and vectorization in blocked stencil computations on CPUs and GPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1--44.

Cited By

View all
  • (2025)FlashFFTStencil: Bridging Fast Fourier Transforms to Memory-Efficient Stencil Computations on Tensor Core UnitsProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3710848.3710897(355-368)Online publication date: 28-Feb-2025
  • (2025)Jigsaw: Toward Conflict-free Vectorized Stencil Computation by Tessellating Swizzled RegistersProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3710848.3710886(481-495)Online publication date: 28-Feb-2025
  • (2024)Acceleration of Remote Sensing Image Filtering Based on Embedded CPU+GPU Heterogeneous PlatformChinese Journal of Space Science10.11728/cjss2024.01.2023-003344:1(95)Online publication date: 2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2021
1493 pages
ISBN:9781450384421
DOI:10.1145/3458817
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 November 2021

Permissions

Request permissions for this article.

Check for updates

Badges

Author Tags

  1. data locality
  2. register reuse
  3. stencil
  4. vectorization

Qualifiers

  • Research-article

Conference

SC '21
Sponsor:

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)41
  • Downloads (Last 6 weeks)4
Reflects downloads up to 02 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2025)FlashFFTStencil: Bridging Fast Fourier Transforms to Memory-Efficient Stencil Computations on Tensor Core UnitsProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3710848.3710897(355-368)Online publication date: 28-Feb-2025
  • (2025)Jigsaw: Toward Conflict-free Vectorized Stencil Computation by Tessellating Swizzled RegistersProceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3710848.3710886(481-495)Online publication date: 28-Feb-2025
  • (2024)Acceleration of Remote Sensing Image Filtering Based on Embedded CPU+GPU Heterogeneous PlatformChinese Journal of Space Science10.11728/cjss2024.01.2023-003344:1(95)Online publication date: 2024
  • (2023)Adapting combined tiling to stencil optimizations on sunway processorCCF Transactions on High Performance Computing10.1007/s42514-023-00147-x5:3(322-333)Online publication date: 17-May-2023
  • (2021)An Accurate and Efficient Large-scale Regression Method through Best Friend ClusteringIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.3134336(1-1)Online publication date: 2021

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media