Optimizing Communication in 2D Grid-Based MPI Applications at Exascale

Published: 21 September 2023

Abstract

Exascale computing poses many challenges for achieving optimal performance on large numbers of nodes. A key challenge is the efficient use of the Message Passing Interface (MPI), a critical component for inter-process communication. This paper explores communication optimization strategies that harness the GPU-accelerated architectures of these supercomputers. We focus on MPI applications in which the processes form a two-dimensional grid, a common arrangement in applications built on dense matrix operations. This configuration offers a unique opportunity to implement innovative strategies that improve performance while maintaining effective load distribution. We study two applications, Dist-FW (all-pairs shortest paths) and HPL-MxP (mixed-precision LU factorization), on two accelerated systems: Summit (IBM Power and NVIDIA V100) and Frontier (AMD EPYC and MI250X). Both supercomputers are operated by the Oak Ridge Leadership Computing Facility (OLCF) and are currently ranked #1 (Frontier) and #5 (Summit) on the Top500 list. We show how to scale both applications to exascale levels and tackle the MPI challenges related to implementation, synchronization, and performance. We also compare the performance of several communication strategies at an unprecedented scale. As the computational scale grows, accurately predicting application performance becomes crucial for cost reduction. To address this, we propose a hyperbolic model as a better alternative to the traditional one-sided asymptotic model for predicting application performance at such large scales.
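The 2D process-grid layout described in the abstract is commonly built with MPI's Cartesian topology routines. The following minimal C sketch is an illustrative assumption, not the paper's implementation: it factors the ranks into a near-square grid and splits it into row and column sub-communicators, the pattern that panel broadcasts in LU factorization or row/column exchanges in a blocked all-pairs shortest-path solver would typically use.

/* Hypothetical sketch: a 2D process grid with row/column sub-communicators. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Let MPI choose a near-square factorization pr x pc = nprocs. */
    int dims[2] = {0, 0};
    MPI_Dims_create(nprocs, 2, dims);

    /* Non-periodic 2D Cartesian grid; allow rank reordering. */
    int periods[2] = {0, 0};
    MPI_Comm grid_comm;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid_comm);

    /* Row communicator: collapse the row dimension, keep the column dimension. */
    int remain_row[2] = {0, 1};
    MPI_Comm row_comm;
    MPI_Cart_sub(grid_comm, remain_row, &row_comm);

    /* Column communicator: keep the row dimension. */
    int remain_col[2] = {1, 0};
    MPI_Comm col_comm;
    MPI_Cart_sub(grid_comm, remain_col, &col_comm);

    if (rank == 0)
        printf("Process grid: %d x %d\n", dims[0], dims[1]);

    /* Collectives along row_comm / col_comm would carry the panel and
       pivot exchanges of the blocked algorithms discussed in the paper. */

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Comm_free(&grid_comm);
    MPI_Finalize();
    return 0;
}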

Published In

EuroMPI '23: Proceedings of the 30th European MPI Users' Group Meeting
September 2023
123 pages
ISBN: 9798400709135
DOI: 10.1145/3615318
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 September 2023

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • DOE

Conference

EuroMPI '23: 30th European MPI Users' Group Meeting
September 11 - 13, 2023
Bristol, United Kingdom

Acceptance Rates

Overall Acceptance Rate 66 of 139 submissions, 47%
