DOI: 10.1145/3605573.3605604
research-article
Open access

Communication-Avoiding Optimizations for Large-Scale Unstructured-Mesh Applications with OP2

Published: 13 September 2023

Abstract

In this paper, we investigate data-movement-reducing and communication-avoiding optimizations, and their practicable implementation, for large-scale unstructured-mesh applications. Using the high-level abstraction that the OP2 DSL provides for the unstructured-mesh class of codes, we reason about techniques for reducing communication across a consecutive sequence of loops, known as a loop-chain. We analyze the trade-off of performing additional redundant computation in place of data movement for distributed-memory parallelization. A new communication-avoiding (CA) back-end for OP2 is designed, codifying these techniques so that they can be applied automatically to any OP2 application. The back-end is extended to operate on a cluster of GPUs, integrating GPU-to-GPU communication with CUDA in combination with MPI. The new CA back-end is applied automatically to two non-trivial applications, including the OP2 version of Rolls-Royce’s production CFD application, Hydra. Performance is investigated on both CPU and GPU clusters on representative problems with mesh sizes of 8M and 24M nodes. Results show that, for select configurations, the new CA back-end reduces the runtime of the loop-chains in these applications by 30–65% for these mesh sizes on both an HPE Cray EX system and an NVIDIA V100 GPU cluster. We model and examine the determinants and characteristics of a given unstructured-mesh loop-chain that can lead to performance benefits with CA techniques, providing insights into the general feasibility and profitability of using the optimizations for this class of applications.
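The trade-off described in the abstract, fewer message rounds in exchange for redundant computation, can be illustrated with a toy model. The sketch below is not the OP2 API and not the paper's implementation: it simulates ranks in plain Python on a 1D periodic mesh, and every name in it (`owned_block`, `halo`, `exchange`, `run_standard`, `run_ca`) is hypothetical. It assumes a two-loop chain in which each loop writes the sum of a node's two neighbours, and that the node count divides evenly among ranks.

```python
"""Toy model of a communication-avoiding loop-chain (illustrative only)."""


def owned_block(rank, nranks, n):
    """Contiguous block of global node indices owned by `rank`."""
    size = n // nranks  # assumption: n is divisible by nranks
    return list(range(rank * size, (rank + 1) * size))


def halo(own, depth, n):
    """Global indices up to `depth` hops outside a contiguous owned block."""
    lo, hi = own[0], own[-1]
    left = [(lo - d) % n for d in range(1, depth + 1)]
    right = [(hi + d) % n for d in range(1, depth + 1)]
    return left + right


def exchange(local, owner_data, indices, stats):
    """Simulated halo exchange: one message from each of the two neighbours."""
    for i in indices:
        local[i] = owner_data[i]
    stats["messages"] += 2
    stats["values"] += len(indices)


def neighbour_sum(local, i, n, stats):
    """Kernel shared by both loops: sum of the two ring neighbours of node i."""
    stats["kernel_calls"] += 1
    return local[(i - 1) % n] + local[(i + 1) % n]


def run_standard(a, nranks):
    """Exchange a depth-1 halo before every loop in the chain."""
    n = len(a)
    stats = {"messages": 0, "values": 0, "kernel_calls": 0}
    data = a
    for _loop in range(2):
        result = [0.0] * n
        for rank in range(nranks):
            own = owned_block(rank, nranks, n)
            local = {i: data[i] for i in own}
            exchange(local, data, halo(own, 1, n), stats)
            for i in own:
                result[i] = neighbour_sum(local, i, n, stats)
        data = result
    return data, stats


def run_ca(a, nranks):
    """Exchange a depth-2 halo once, then recompute loop 1 on the depth-1 halo."""
    n = len(a)
    stats = {"messages": 0, "values": 0, "kernel_calls": 0}
    c = [0.0] * n
    for rank in range(nranks):
        own = owned_block(rank, nranks, n)
        local_a = {i: a[i] for i in own}
        exchange(local_a, a, halo(own, 2, n), stats)
        # Loop 1 runs on owned nodes plus the depth-1 halo (redundant work).
        region = own + halo(own, 1, n)
        local_b = {i: neighbour_sum(local_a, i, n, stats) for i in region}
        # Loop 2 needs no further communication.
        for i in own:
            c[i] = neighbour_sum(local_b, i, n, stats)
    return c, stats


a = [float((3 * i) % 5) for i in range(16)]
c_std, stats_std = run_standard(a, 4)
c_ca, stats_ca = run_ca(a, 4)
print("results match:", c_std == c_ca)
print("standard:", stats_std)
print("comm-avoiding:", stats_ca)
```

In this toy setting the CA variant sends half as many messages while making more kernel calls, which mirrors the balance the abstract refers to. Whether that balance pays off on a real unstructured mesh depends on factors this model ignores, such as halo growth with depth, kernel cost, and network latency.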

Cited By

  • (2024) EA4RCA: Efficient AIE accelerator design framework for regular Communication-Avoiding Algorithm. ACM Transactions on Architecture and Code Optimization 21:4 (1-24). https://doi.org/10.1145/3678010. Online publication date: 15-Jul-2024


Published In

ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing
August 2023
858 pages
ISBN:9798400708435
DOI:10.1145/3605573
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Communication-avoiding algorithms
  2. OP2
  3. Unstructured-mesh

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2023
ICPP 2023: 52nd International Conference on Parallel Processing
August 7 - 10, 2023
Salt Lake City, UT, USA

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
