Communication-Avoiding Optimizations for Large-Scale Unstructured-Mesh Applications with OP2

Authors: Suneth Dasantha Ekanayake, István Zoltán Reguly, Fabio Luporini, Gihan Ravideva Mudalige

ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing

Pages 380 - 391

https://doi.org/10.1145/3605573.3605604

Published: 13 September 2023

Abstract

In this paper, we investigate optimizations that reduce data movement and avoid communication, and their practical implementation for large-scale unstructured-mesh applications. Using the high-level abstraction of the OP2 DSL for the unstructured-mesh class of codes, we reason about techniques for reducing communication across a consecutive sequence of loops, a loop-chain. We analyze the trade-off between increased redundant computation and reduced data movement for distributed-memory parallelization. We design a new communication-avoiding (CA) back-end for OP2 that codifies these techniques so that they can be applied automatically to any OP2 application. The back-end is extended to operate on a cluster of GPUs, integrating GPU-to-GPU communication with CUDA in combination with MPI. The new CA back-end is applied automatically to two non-trivial applications, including the OP2 version of Rolls-Royce's production CFD application, Hydra. Performance is investigated on both CPU and GPU clusters on representative problems with mesh sizes of 8M and 24M nodes. Results show that, for select configurations, the new CA back-end reduces the runtime of the loop-chains in these applications by 30 to 65% for these mesh sizes on both an HPE Cray EX system and an NVIDIA V100 GPU cluster. We model and examine the determinants and characteristics of a given unstructured-mesh loop-chain that can lead to performance benefits with CA techniques, providing insights into the general feasibility and profitability of these optimizations for this class of applications.
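The trade-off described in the abstract, fewer halo exchanges in return for redundant computation over a loop-chain, can be illustrated with a small toy model. The sketch below is not OP2 code and does not use the OP2 API: the ring-shaped mesh, the averaging kernel, and all function names (`ring_neighbours`, `sweep`, `halo_layers`, `run_per_loop_exchange`, `run_comm_avoiding`) are assumptions made purely for illustration, and "exchanges" are simply counted rather than sent over MPI.

```python
# Toy model (hypothetical, not the OP2 back-end): a chain of identical loops
# over a mesh with indirect neighbour access, partitioned across "ranks".

def ring_neighbours(n):
    """Adjacency map of a ring mesh: node i is connected to i-1 and i+1."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def sweep(x, nodes, nbrs):
    """One loop of the chain: average each node with its neighbours."""
    return {i: (x[i] + sum(x[j] for j in nbrs[i])) / (1 + len(nbrs[i]))
            for i in nodes}

def halo_layers(owned, nbrs, depth):
    """layers[d] holds the nodes at graph distance d+1 from the owned set."""
    seen, frontier, layers = set(owned), set(owned), []
    for _ in range(depth):
        nxt = {j for i in frontier for j in nbrs[i]} - seen
        layers.append(nxt)
        seen |= nxt
        frontier = nxt
    return layers

def run_per_loop_exchange(x0, parts, nbrs, n_loops):
    """Baseline: every rank imports a depth-1 halo before every loop."""
    x, exchanges, computed = dict(x0), 0, 0
    for _ in range(n_loops):
        new = {}
        for owned in parts:
            halo = halo_layers(owned, nbrs, 1)[0]
            local = {i: x[i] for i in set(owned) | halo}  # modelled exchange
            exchanges += 1
            new.update(sweep(local, owned, nbrs))
            computed += len(owned)
        x = new
    return x, exchanges, computed

def run_comm_avoiding(x0, parts, nbrs, n_loops):
    """CA variant: one deeper exchange, then redundant work on halo nodes."""
    result, exchanges, computed = {}, 0, 0
    for owned in parts:
        layers = halo_layers(owned, nbrs, n_loops)
        local = {i: x0[i] for i in set(owned).union(*layers)}  # one exchange
        exchanges += 1
        for s in range(n_loops):
            # Loop s must also update the halo layers that later loops read.
            active = set(owned).union(*layers[: n_loops - s - 1])
            local = sweep(local, active, nbrs)
            computed += len(active)
        result.update({i: local[i] for i in owned})
    return result, exchanges, computed

if __name__ == "__main__":
    n, n_loops = 16, 3
    nbrs = ring_neighbours(n)
    parts = [list(range(0, 8)), list(range(8, 16))]
    x0 = {i: float(i * i % 7) for i in range(n)}
    base = run_per_loop_exchange(x0, parts, nbrs, n_loops)
    ca = run_comm_avoiding(x0, parts, nbrs, n_loops)
    print("results match:", base[0] == ca[0])
    print("exchanges (baseline, CA):", base[1], ca[1])
    print("node updates (baseline, CA):", base[2], ca[2])
```

In this model the CA variant produces the same values on owned nodes while performing fewer exchange phases and more node updates. Whether that pays off in a real application depends on the relative cost of communication and the redundant work, which is the question the paper's performance model addresses.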

Cited By

Zhang W, Liu Y, Zang T, Bao Z (2024). EA4RCA: Efficient AIE accelerator design framework for regular Communication-Avoiding Algorithm. ACM Transactions on Architecture and Code Optimization 21:4 (1-24). Online publication date: 15-Jul-2024. https://dl.acm.org/doi/10.1145/3678010

Index Terms

Communication-Avoiding Optimizations for Large-Scale Unstructured-Mesh Applications with OP2
1. Applied computing
  1. Physical sciences and engineering
    1. Aerospace
2. Computing methodologies
  1. Modeling and simulation
    1. Model development and analysis
  2. Parallel computing methodologies
    1. Parallel algorithms
      1. Massively parallel algorithms

Information & Contributors

Information

Published In

ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing

August 2023

858 pages

ISBN: 9798400708435

DOI: 10.1145/3605573

Copyright © 2023 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 September 2023

Qualifiers

Research-article
Research
Refereed limited

Conference

ICPP 2023

ICPP 2023: 52nd International Conference on Parallel Processing

August 7 - 10, 2023

Salt Lake City, UT, USA

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
