[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Automated Tiling of Unstructured Mesh Computations with Application to Seismological Modeling

Published: 03 May 2019 Publication History

Abstract

Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]]—hence the name “sparse.” One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of the computation of numerical integrals over different domains (e.g., cells, facets). The major challenge with sparse tiling is implementation—not only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it requires a complete rewrite of the bulk of the numerical computation. In this article, we propose an approach to extend the applicability of sparse tiling based on raising the level of abstraction. Through a sequence of compiler passes, the mathematical specification of a problem is progressively lowered, and eventually sparse-tiled C for-loops are generated. Besides automation, we advance the state-of-the-art by introducing a revisited, more efficient sparse tiling algorithm; support for distributed-memory parallelism; a range of fine-grained optimizations for increased runtime performance; implementation in a publicly available library, SLOPE; and an in-depth study of the performance impact in Seigen, a real-world elastic wave equation solver for seismological problems, which shows speed-ups up to 1.28× on a platform consisting of 896 Intel Broadwell cores.

References

[1]
M. F. Adams and J. Demmel. 1999. Parallel multigrid solver algorithms and implementations for 3D unstructured finite element problem. In Proceedings of SC’99. Portland, Oregon.
[2]
Utkarsh Ayachit. 2015. The ParaView Guide (Full Color Version): A Parallel Visualization Application (paraview 4.3 ed.). Kitware, Incorporated. http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path===ASIN/1930934300.
[3]
Federico Bassetti, Kei Davis, and Daniel J. Quinlan. 1998. Optimizing transformations of stencil operations for parallel object-oriented scientific frameworks on cache-based architectures. In Proceedings of the 2nd International Symposium on Computing in Object-Oriented Parallel Environments (ISCOPE’98). Springer-Verlag, London, UK, 107--118. http://dl.acm.org/citation.cfm?id=646894.709707.
[4]
Uday Bondhugula, Vinayaka Bandishti, Albert Cohen, Guillain Potron, and Nicolas Vasilache. 2014. Tiling and optimizing time-iterated computations on periodic domains. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT’14). ACM, New York, 39--50.
[5]
Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’08). ACM, New York, 101--113.
[6]
Li Chen, Zhao-Qing Zhang, and Xiao-Bing Feng. 2002. Redundant computation partition on distributed-memory systems. In Proceedings of the 5th International Conference on Algorithms and Architectures for Parallel Processing, 2002. 252--260.
[7]
A. T. Chronopoulos. 1991. s-step iterative methods for (non)symmetric (in)definite linear systems. SIAM Journal on Numerical Analysis. 28, 6 (1991), 1776--1789. arXiv:http://dx.doi.org/10.1137/0728088
[8]
Sarah Delcourte, Loula Fezoui, and Nathalie Glinsky-Olivier. 2009. A high-order Discontinuous Galerkin method for the seismic wave propagation. ESAIM: Proceedings 27 (2009), 70--89.
[9]
James Demmel, Mark Hoemmen, Marghoob Mohiyuddin, and Katherine Yelick. 2008. Avoiding communication in sparse matrix computations. In Proceedings of International Parallel and Distributed Processing Symposium (IPDPS’08). IEEE Computer Society.
[10]
Craig C. Douglas, Jonathan Hu, Markus Kowarschik, Ulrich Rüde, and Christian Weiß. 2000. Cache optimization for structured and unstructured grid multigrid. Electronic Transaction on Numerical Analysis (February 2000), 21--40.
[11]
Denys Dutykh, Raphaël Poncet, and Frédéric Dias. 2011. The VOLNA code for the numerical modeling of tsunami waves: Generation, propagation and inundation. European Journal of Mechanics - B/Fluids 30, 6 (2011), 598--615. Special Issue: Nearshore Hydrodynamics.
[12]
W. W. Garvin. 1956. Exact transient solution of the buried line source problem. Proceedings of the Royal Society of London, Series A 234, 1199 (1956).
[13]
M. B. Giles, G. R. Mudalige, C. Bertolli, P. H. J. Kelly, E. Laszlo, and I. Reguly. 2012. An analytical study of loop tiling for a large-scale unstructured mesh application. In Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (SCC’12). IEEE Computer Society, 477--482.
[14]
M. B. Giles, G. R. Mudalige, Z. Sharif, G. Markall, and P. H. J. Kelly. 2011. Performance analysis of the OP2 framework on many-core architectures. SIGMETRICS Performance Evaluation Review 38, 4 (March 2011), 9--15.
[15]
Gerard Gorman, Marcos de Aguiar, David Ham, Felix Herrmann, Christian Jacobs, Paul Kelly, Michael Lange, Chris Pain, Matthew Piggott, Spencer Sherwin, Felippe Vieira Zacarias, Mike Warner, Fabio Luporini, and Navjot Kukreja. 2015. OPESCI project web page. http://www.opesci.org.
[16]
Intel Corporation. 2016. Intel VTune Performance Analyzer. software.intel.com/en-us/intel-vtune-amplifier-xe.
[17]
Christian T. Jacobs, Michael Lange, Fabio Luporini, and Gerard J. Gorman. 2017. Application of code generation to high-order seismic modelling with the discontinuous Galerkin finite element method. In preparation.
[18]
George Karypis and Vipin Kumar. 2011. MeTis: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 5.0. http://www.cs.umn.edu/∼metis.
[19]
Matthew G. Knepley, Michael Lange, and Gerard Gorman. 2015. Unstructured overlapping mesh distribution in parallel. Submitted to ACM Transactions on Mathematical Software (TOMS'15) arxiv:1506.06194 http://arxiv.org/abs/1506.06194
[20]
Christopher D. Krieger and Michelle Mills Strout. 2012. Executing optimized irregular applications using task graphs within existing parallel models. In Proceedings of the 2nd Workshop on Irregular Applications: Architectures and Algorithms (IA<sup>3</sup>) Held in Conjunction with SC12.
[21]
Christopher D. Krieger, Michelle Mills Strout, Catherine Olschanowsky, Andrew Stone, Stephen Guzik, Xinfeng Gao, Carlo Bertolli, Paul Kelly, Gihan Mudalige, Brian Van Straalen, and Sam Williams. 2013. Loop chaining: A programming abstraction for balancing locality and parallelism. In Proceedings of the 18th International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS’13). Boston, Massachusetts.
[22]
Sriram Krishnamoorthy, Muthu Baskaran, Uday Bondhugula, J. Ramanujam, Atanas Rountev, and P. Sadayappan. 2007. Effective automatic parallelization of stencil computations. SIGPLAN Notices 42, 6 (June 2007), 235--244.
[23]
Michael Lange, Lawrence Mitchell, Matthew G. Knepley, and Gerard J. Gorman. 2016. Efficient mesh management in Firedrake using PETSc-DMPlex. SIAM Journal on Scientific Computing 38, 5 (2016), S143--S155. arXiv:cs.MS/1506.07749
[24]
Fabio Luporini, Ana Lucia Varbanescu, Florian Rathgeber, Gheorghe-Teodor Bercea, J. Ramanujam, David A. Ham, and Paul H. J. Kelly. 2015. Cross-loop optimization of arithmetic intensity for finite element local assembly. ACM Transactions on Architecture and Code Optimization 11, 4, Article 57 (January 2015), 25 pages.
[25]
Jiayuan Meng and Kevin Skadron. 2009. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs. In Proceedings of the 23rd International Conference on Supercomputing (ICS’09). ACM, New York, 256--265.
[26]
Marghoob Mohiyuddin, Mark Hoemmen, James Demmel, and Katherine Yelick. 2009. Minimizing communication in sparse matrix solvers. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC’09). ACM, Article 36, 12 pages.
[27]
Florian Rathgeber. 2014. Productive and Efficient Computational Science Through Domain-Specific Abstractions. Ph.D. Dissertation. Imperial College London.
[28]
Florian Rathgeber, David A. Ham, Lawrence Mitchell, Michael Lange, Fabio Luporini, Andrew T. T. Mcrae, Gheorghe-Teodor Bercea, Graham R. Markall, and Paul H. J. Kelly. 2016. Firedrake: Automating the finite element method by composing abstractions. ACM Transactions on Mathematical Software 43, 3 (Dec. 2016), Article 24, 27 pages.
[29]
Mahesh Ravishankar, John Eisenlohr, Louis-Noël Pouchet, J. Ramanujam, Atanas Rountev, and P. Sadayappan. 2012. Code generation for parallel execution of a class of irregular loops on distributed memory systems. In Proceedings of the International Confrencedel.e on High Performance Computing, Networking, Storage and Analysis. Article 72, 11 pages.
[30]
I. Z. Reguly, G. R. Mudalige, C. Bertolli, M. B. Giles, A. Betts, P. H. J. Kelly, and D. Radford. 2016. Acceleration of a full-scale industrial CFD application with OP2. IEEE Transactions on Parallel and Distributed Systems 27, 5 (May 2016), 1265--1278.
[31]
István Z. Reguly, Gihan R. Mudalige, and Mike B. Giles. 2017. Loop tiling in large-scale stencil codes at runtime with OPS. arXiv preprint arXiv:1704.00693.
[32]
Joel H. Salz, Ravi Mirchandaney, and Kay Crowley. 1991. Run-time parallelization and scheduling of loops. IEEE Transactions on Computers 40, 5 (1991), 603--612.
[33]
Imperial College High Performance Computing Service. 2017. The cx2/Helen Cluster.
[34]
M. M. Strout, F. Luporini, C. D. Krieger, C. Bertolli, G.-T. Bercea, C. Olschanowsky, J. Ramanujam, and P. H. J. Kelly. 2014. Generalizing run-time tiling with the loop chain abstraction. In Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium. 1136--1145.
[35]
Michelle Mills Strout, Larry Carter, and Jeanne Ferrante. 2003. Compile-time composition of run-time data and iteration reorderings. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’03). ACM, New York.
[36]
Michelle Mills Strout, Larry Carter, Jeanne Ferrante, Jonathan Freeman, and Barbara Kreaseck. 2002. Combining performance aspects of irregular Gauss-Seidel via sparse tiling. In Proceedings of the 15th Workshop on Languages and Compilers for Parallel Computing (LCPC’02). Springer.
[37]
Michelle Mills Strout, Larry Carter, Jeanne Ferrante, and Barbara Kreaseck. 2004. Sparse tiling for stationary iterative methods. International Journal of High Performance Computing Applications 18, 1 (February 2004), 95--114.
[38]
TOP500. 2016. cx2/Helen Cluster Specification in the TOP500 Ranking. https://www.top500.org/system/178845.
[39]
Jean Virieux. 1986. P-SV wave propagation in heterogeneous media; velocity-stress finite-difference method. Geophysics 51, 4 (1986), 889--901. arXiv:http://geophysics.geoscienceworld.org/content/51/4/889.full.pdf.
[40]
Zenodo/COFFEE. 2017. coneoproject/COFFEE: A Compiler for Fast Expression Evaluation.
[41]
Zenodo/FIAT. 2017. firedrakeproject/fiat: The Finite Element Automated Tabulator.
[42]
Zenodo/Firedrake. 2017. firedrakeproject/firedrake: An Automated Finite Element System.
[43]
Zenodo/PETSc. 2017. firedrakeproject/petsc: Portable, Extensible Toolkit for Scientific Computation.
[44]
Zenodo/PETSc4Py. 2017. firedrakeproject/petsc4py: The Python Interface to PETSc.
[45]
Zenodo/PyOP2. 2017. OP2/PyOP2: Framework for Performance-Portable Parallel Computations on Unstructured Meshes.
[46]
Zenodo/Seigen. 2017. OPESCI/Seigen: An Elastic Wave Equation Solver for Seismological Problems Based on the Finite Element Method.
[47]
Zenodo/SLOPE. 2017. coneoproject/SLOPE: A Run-Time System to Tile Irregular Loops.
[48]
Zenodo/TSFC. 2017. firedrakeproject/tsfc: The Two Stage Form Compiler firedrakeproject/tsfc: The Two Stage Form Compiler.
[49]
Zenodo/UFL. 2017. firedrakeproject/ufl: The Unified Form Language.
[50]
Xing Zhou, Jean-Pierre Giacalone, María Jesús Garzarán, Robert H. Kuhn, Yang Ni, and David Padua. 2012. Hierarchical overlapped tiling. In Proceedings of the 10th International Symposium on Code Generation and Optimization (CGO’12). ACM, New York, 207--218.

Cited By

View all
  • (2023)Communication-Avoiding Optimizations for Large-Scale Unstructured-Mesh Applications with OP2Proceedings of the 52nd International Conference on Parallel Processing10.1145/3605573.3605604(380-391)Online publication date: 7-Aug-2023
  • (2021)Temporal blocking of finite-difference stencil operators with sparse “off-the-grid” sources2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS49936.2021.00058(497-506)Online publication date: May-2021

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Mathematical Software
ACM Transactions on Mathematical Software  Volume 45, Issue 2
June 2019
255 pages
ISSN:0098-3500
EISSN:1557-7295
DOI:10.1145/3326465
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 May 2019
Accepted: 01 December 2018
Revised: 01 November 2018
Received: 01 August 2017
Published in TOMS Volume 45, Issue 2

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Finite element method
  2. compiler
  3. loop fusion
  4. loop tiling
  5. performance optimization
  6. sparse tiling
  7. unstructured mesh

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)20
  • Downloads (Last 6 weeks)1
Reflects downloads up to 16 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Communication-Avoiding Optimizations for Large-Scale Unstructured-Mesh Applications with OP2Proceedings of the 52nd International Conference on Parallel Processing10.1145/3605573.3605604(380-391)Online publication date: 7-Aug-2023
  • (2021)Temporal blocking of finite-difference stencil operators with sparse “off-the-grid” sources2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS49936.2021.00058(497-506)Online publication date: May-2021

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media