More Web Proxy on the site http://driver.im/

research-article

Automated Tiling of Unstructured Mesh Computations with Application to Seismological Modeling

Authors:

Fabio Luporini,

Christian T. Jacobs,

Gerard J. Gorman,

Paul H. J. KellyAuthors Info & Claims

ACM Transactions on Mathematical Software (TOMS), Volume 45, Issue 2

Article No.: 17, Pages 1 - 30

https://doi.org/10.1145/3302256

Published: 03 May 2019 Publication History

Abstract

Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]]—hence the name “sparse.” One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of the computation of numerical integrals over different domains (e.g., cells, facets). The major challenge with sparse tiling is implementation—not only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it requires a complete rewrite of the bulk of the numerical computation. In this article, we propose an approach to extend the applicability of sparse tiling based on raising the level of abstraction. Through a sequence of compiler passes, the mathematical specification of a problem is progressively lowered, and eventually sparse-tiled C for-loops are generated. Besides automation, we advance the state-of-the-art by introducing a revisited, more efficient sparse tiling algorithm; support for distributed-memory parallelism; a range of fine-grained optimizations for increased runtime performance; implementation in a publicly available library, SLOPE; and an in-depth study of the performance impact in Seigen, a real-world elastic wave equation solver for seismological problems, which shows speed-ups up to 1.28× on a platform consisting of 896 Intel Broadwell cores.

References

[1]

M. F. Adams and J. Demmel. 1999. Parallel multigrid solver algorithms and implementations for 3D unstructured finite element problem. In Proceedings of SC’99. Portland, Oregon.

Digital Library

[2]

Utkarsh Ayachit. 2015. The ParaView Guide (Full Color Version): A Parallel Visualization Application (paraview 4.3 ed.). Kitware, Incorporated. http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path===ASIN/1930934300.

Digital Library

[3]

Federico Bassetti, Kei Davis, and Daniel J. Quinlan. 1998. Optimizing transformations of stencil operations for parallel object-oriented scientific frameworks on cache-based architectures. In Proceedings of the 2nd International Symposium on Computing in Object-Oriented Parallel Environments (ISCOPE’98). Springer-Verlag, London, UK, 107--118. http://dl.acm.org/citation.cfm?id=646894.709707.

Digital Library

[4]

Uday Bondhugula, Vinayaka Bandishti, Albert Cohen, Guillain Potron, and Nicolas Vasilache. 2014. Tiling and optimizing time-iterated computations on periodic domains. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT’14). ACM, New York, 39--50.

Digital Library

[5]

Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A practical automatic polyhedral parallelizer and locality optimizer. In Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’08). ACM, New York, 101--113.

Digital Library

[6]

Li Chen, Zhao-Qing Zhang, and Xiao-Bing Feng. 2002. Redundant computation partition on distributed-memory systems. In Proceedings of the 5th International Conference on Algorithms and Architectures for Parallel Processing, 2002. 252--260.

Digital Library

[7]

A. T. Chronopoulos. 1991. s-step iterative methods for (non)symmetric (in)definite linear systems. SIAM Journal on Numerical Analysis. 28, 6 (1991), 1776--1789. arXiv:http://dx.doi.org/10.1137/0728088

Digital Library

[8]

Sarah Delcourte, Loula Fezoui, and Nathalie Glinsky-Olivier. 2009. A high-order Discontinuous Galerkin method for the seismic wave propagation. ESAIM: Proceedings 27 (2009), 70--89.

[9]

James Demmel, Mark Hoemmen, Marghoob Mohiyuddin, and Katherine Yelick. 2008. Avoiding communication in sparse matrix computations. In Proceedings of International Parallel and Distributed Processing Symposium (IPDPS’08). IEEE Computer Society.

[10]

Craig C. Douglas, Jonathan Hu, Markus Kowarschik, Ulrich Rüde, and Christian Weiß. 2000. Cache optimization for structured and unstructured grid multigrid. Electronic Transaction on Numerical Analysis (February 2000), 21--40.

[11]

Denys Dutykh, Raphaël Poncet, and Frédéric Dias. 2011. The VOLNA code for the numerical modeling of tsunami waves: Generation, propagation and inundation. European Journal of Mechanics - B/Fluids 30, 6 (2011), 598--615. Special Issue: Nearshore Hydrodynamics.

[12]

W. W. Garvin. 1956. Exact transient solution of the buried line source problem. Proceedings of the Royal Society of London, Series A 234, 1199 (1956).

[13]

M. B. Giles, G. R. Mudalige, C. Bertolli, P. H. J. Kelly, E. Laszlo, and I. Reguly. 2012. An analytical study of loop tiling for a large-scale unstructured mesh application. In Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis (SCC’12). IEEE Computer Society, 477--482.

Digital Library

[14]

M. B. Giles, G. R. Mudalige, Z. Sharif, G. Markall, and P. H. J. Kelly. 2011. Performance analysis of the OP2 framework on many-core architectures. SIGMETRICS Performance Evaluation Review 38, 4 (March 2011), 9--15.

Digital Library

[15]

Gerard Gorman, Marcos de Aguiar, David Ham, Felix Herrmann, Christian Jacobs, Paul Kelly, Michael Lange, Chris Pain, Matthew Piggott, Spencer Sherwin, Felippe Vieira Zacarias, Mike Warner, Fabio Luporini, and Navjot Kukreja. 2015. OPESCI project web page. http://www.opesci.org.

[16]

Intel Corporation. 2016. Intel VTune Performance Analyzer. software.intel.com/en-us/intel-vtune-amplifier-xe.

[17]

Christian T. Jacobs, Michael Lange, Fabio Luporini, and Gerard J. Gorman. 2017. Application of code generation to high-order seismic modelling with the discontinuous Galerkin finite element method. In preparation.

[18]

George Karypis and Vipin Kumar. 2011. MeTis: Unstructured Graph Partitioning and Sparse Matrix Ordering System, Version 5.0. http://www.cs.umn.edu/&sim;metis.

[19]

Matthew G. Knepley, Michael Lange, and Gerard Gorman. 2015. Unstructured overlapping mesh distribution in parallel. Submitted to ACM Transactions on Mathematical Software (TOMS'15) arxiv:1506.06194 http://arxiv.org/abs/1506.06194

[20]

Christopher D. Krieger and Michelle Mills Strout. 2012. Executing optimized irregular applications using task graphs within existing parallel models. In Proceedings of the 2nd Workshop on Irregular Applications: Architectures and Algorithms (IA<sup>3</sup>) Held in Conjunction with SC12.

[21]

Christopher D. Krieger, Michelle Mills Strout, Catherine Olschanowsky, Andrew Stone, Stephen Guzik, Xinfeng Gao, Carlo Bertolli, Paul Kelly, Gihan Mudalige, Brian Van Straalen, and Sam Williams. 2013. Loop chaining: A programming abstraction for balancing locality and parallelism. In Proceedings of the 18th International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS’13). Boston, Massachusetts.

Digital Library

[22]

Sriram Krishnamoorthy, Muthu Baskaran, Uday Bondhugula, J. Ramanujam, Atanas Rountev, and P. Sadayappan. 2007. Effective automatic parallelization of stencil computations. SIGPLAN Notices 42, 6 (June 2007), 235--244.

Digital Library

[23]

Michael Lange, Lawrence Mitchell, Matthew G. Knepley, and Gerard J. Gorman. 2016. Efficient mesh management in Firedrake using PETSc-DMPlex. SIAM Journal on Scientific Computing 38, 5 (2016), S143--S155. arXiv:cs.MS/1506.07749

Digital Library

[24]

Fabio Luporini, Ana Lucia Varbanescu, Florian Rathgeber, Gheorghe-Teodor Bercea, J. Ramanujam, David A. Ham, and Paul H. J. Kelly. 2015. Cross-loop optimization of arithmetic intensity for finite element local assembly. ACM Transactions on Architecture and Code Optimization 11, 4, Article 57 (January 2015), 25 pages.

Digital Library

[25]

Jiayuan Meng and Kevin Skadron. 2009. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs. In Proceedings of the 23rd International Conference on Supercomputing (ICS’09). ACM, New York, 256--265.

Digital Library

[26]

Marghoob Mohiyuddin, Mark Hoemmen, James Demmel, and Katherine Yelick. 2009. Minimizing communication in sparse matrix solvers. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC’09). ACM, Article 36, 12 pages.

Digital Library

[27]

Florian Rathgeber. 2014. Productive and Efficient Computational Science Through Domain-Specific Abstractions. Ph.D. Dissertation. Imperial College London.

[28]

Florian Rathgeber, David A. Ham, Lawrence Mitchell, Michael Lange, Fabio Luporini, Andrew T. T. Mcrae, Gheorghe-Teodor Bercea, Graham R. Markall, and Paul H. J. Kelly. 2016. Firedrake: Automating the finite element method by composing abstractions. ACM Transactions on Mathematical Software 43, 3 (Dec. 2016), Article 24, 27 pages.

Digital Library

[29]

Mahesh Ravishankar, John Eisenlohr, Louis-Noël Pouchet, J. Ramanujam, Atanas Rountev, and P. Sadayappan. 2012. Code generation for parallel execution of a class of irregular loops on distributed memory systems. In Proceedings of the International Confrencedel.e on High Performance Computing, Networking, Storage and Analysis. Article 72, 11 pages.

Digital Library

[30]

I. Z. Reguly, G. R. Mudalige, C. Bertolli, M. B. Giles, A. Betts, P. H. J. Kelly, and D. Radford. 2016. Acceleration of a full-scale industrial CFD application with OP2. IEEE Transactions on Parallel and Distributed Systems 27, 5 (May 2016), 1265--1278.

Digital Library

[31]

István Z. Reguly, Gihan R. Mudalige, and Mike B. Giles. 2017. Loop tiling in large-scale stencil codes at runtime with OPS. arXiv preprint arXiv:1704.00693.

[32]

Joel H. Salz, Ravi Mirchandaney, and Kay Crowley. 1991. Run-time parallelization and scheduling of loops. IEEE Transactions on Computers 40, 5 (1991), 603--612.

Digital Library

[33]

Imperial College High Performance Computing Service. 2017. The cx2/Helen Cluster.

[34]

M. M. Strout, F. Luporini, C. D. Krieger, C. Bertolli, G.-T. Bercea, C. Olschanowsky, J. Ramanujam, and P. H. J. Kelly. 2014. Generalizing run-time tiling with the loop chain abstraction. In Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium. 1136--1145.

Digital Library

[35]

Michelle Mills Strout, Larry Carter, and Jeanne Ferrante. 2003. Compile-time composition of run-time data and iteration reorderings. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’03). ACM, New York.

Digital Library

[36]

Michelle Mills Strout, Larry Carter, Jeanne Ferrante, Jonathan Freeman, and Barbara Kreaseck. 2002. Combining performance aspects of irregular Gauss-Seidel via sparse tiling. In Proceedings of the 15th Workshop on Languages and Compilers for Parallel Computing (LCPC’02). Springer.

Digital Library

[37]

Michelle Mills Strout, Larry Carter, Jeanne Ferrante, and Barbara Kreaseck. 2004. Sparse tiling for stationary iterative methods. International Journal of High Performance Computing Applications 18, 1 (February 2004), 95--114.

Digital Library

[38]

TOP500. 2016. cx2/Helen Cluster Specification in the TOP500 Ranking. https://www.top500.org/system/178845.

[39]

Jean Virieux. 1986. P-SV wave propagation in heterogeneous media; velocity-stress finite-difference method. Geophysics 51, 4 (1986), 889--901. arXiv:http://geophysics.geoscienceworld.org/content/51/4/889.full.pdf.

[40]

Zenodo/COFFEE. 2017. coneoproject/COFFEE: A Compiler for Fast Expression Evaluation.

[41]

Zenodo/FIAT. 2017. firedrakeproject/fiat: The Finite Element Automated Tabulator.

[42]

Zenodo/Firedrake. 2017. firedrakeproject/firedrake: An Automated Finite Element System.

[43]

Zenodo/PETSc. 2017. firedrakeproject/petsc: Portable, Extensible Toolkit for Scientific Computation.

[44]

Zenodo/PETSc4Py. 2017. firedrakeproject/petsc4py: The Python Interface to PETSc.

[45]

Zenodo/PyOP2. 2017. OP2/PyOP2: Framework for Performance-Portable Parallel Computations on Unstructured Meshes.

[46]

Zenodo/Seigen. 2017. OPESCI/Seigen: An Elastic Wave Equation Solver for Seismological Problems Based on the Finite Element Method.

[47]

Zenodo/SLOPE. 2017. coneoproject/SLOPE: A Run-Time System to Tile Irregular Loops.

[48]

Zenodo/TSFC. 2017. firedrakeproject/tsfc: The Two Stage Form Compiler firedrakeproject/tsfc: The Two Stage Form Compiler.

[49]

Zenodo/UFL. 2017. firedrakeproject/ufl: The Unified Form Language.

[50]

Xing Zhou, Jean-Pierre Giacalone, María Jesús Garzarán, Robert H. Kuhn, Yang Ni, and David Padua. 2012. Hierarchical overlapped tiling. In Proceedings of the 10th International Symposium on Code Generation and Optimization (CGO’12). ACM, New York, 207--218.

Digital Library

Cited By

Ekanayake SReguly ILuporini FMudalige G(2023)Communication-Avoiding Optimizations for Large-Scale Unstructured-Mesh Applications with OP2Proceedings of the 52nd International Conference on Parallel Processing10.1145/3605573.3605604(380-391)Online publication date: 7-Aug-2023
https://dl.acm.org/doi/10.1145/3605573.3605604
Bisbas GLuporini FLouboutin MNelson RGorman GKelly P(2021)Temporal blocking of finite-difference stencil operators with sparse “off-the-grid” sources2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS49936.2021.00058(497-506)Online publication date: May-2021
https://doi.org/10.1109/IPDPS49936.2021.00058

Index Terms

Automated Tiling of Unstructured Mesh Computations with Application to Seismological Modeling

Recommendations

Register tiling in nonrectangular iteration spaces

Loop tiling is a well-known loop transformation generally used to expose coarse-grain parallelism and to exploit data reuse at the cache level. Tiling can also be used to exploit data reuse at the register level and to improve a program's ILP. However, ...
Mesh independent loop fusion for unstructured mesh applications
CF '12: Proceedings of the 9th conference on Computing Frontiers

Applications based on unstructured meshes are typically compute intensive, leading to long running times. In principle, state-of-the-art hardware, such as multi-core CPUs and many-core GPUs, could be used for their acceleration but these esoteric ...
Parallelizing SOR for GPGPUs using alternate loop tiling

Gauss-Seidel and SOR, which are widely used smoothers in multigrid methods, are difficult to parallelize, particularly on GPGPUs due to the existence of DOACROSS data dependences. In this paper, we present a new parallel SOR method that admits more ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Mathematical Software

ACM Transactions on Mathematical Software Volume 45, Issue 2

June 2019

255 pages

ISSN:0098-3500

EISSN:1557-7295

DOI:10.1145/3326465

Editors:
Zhaojun Bai
University of California at Davis, USA
,
Wolfgang Bangerth
Colorado State University, USA

Issue’s Table of Contents

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 May 2019

Accepted: 01 December 2018

Revised: 01 November 2018

Received: 01 August 2017

Published in TOMS Volume 45, Issue 2

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

Engineering and Physical Sciences Research Council

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1
Total Citations
View Citations
210
Total Downloads

Downloads (Last 12 months)20
Downloads (Last 6 weeks)1

Reflects downloads up to 16 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Ekanayake SReguly ILuporini FMudalige G(2023)Communication-Avoiding Optimizations for Large-Scale Unstructured-Mesh Applications with OP2Proceedings of the 52nd International Conference on Parallel Processing10.1145/3605573.3605604(380-391)Online publication date: 7-Aug-2023
https://dl.acm.org/doi/10.1145/3605573.3605604
Bisbas GLuporini FLouboutin MNelson RGorman GKelly P(2021)Temporal blocking of finite-difference stencil operators with sparse “off-the-grid” sources2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS49936.2021.00058(497-506)Online publication date: May-2021
https://doi.org/10.1109/IPDPS49936.2021.00058

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

HTML Format

View this article in HTML Format.

Media

Figures

Other

Tables

View Issue’s Table of Contents