Research article · Open access
DOI: 10.1145/3635035.3635036

Non-Blocking GPU-CPU Notifications to Enable More GPU-CPU Parallelism

Published: 19 January 2024

Abstract

GPUs are increasingly popular in HPC systems, and more applications adopt GPUs every day. However, control synchronization between GPU and CPU is suboptimal: it is only possible at GPU kernel termination points, which serializes host and device tasks. In this paper, we propose a novel CPU-GPU notification method that, combined with persistent GPU kernels, enables non-blocking in-kernel control synchronization of device and host tasks. Using this notification method, we increase the overlap of CPU and GPU execution and thereby parallelism. We present the concept and structure of the proposed notification mechanism together with in-kernel GPU-CPU control synchronization, using halo exchange as an example. We analyze the performance of the halo-exchange pattern with our new notification method, as well as the interference between CPU and GPU operations caused by the execution overlap. Finally, we verify our results with a performance model covering the halo-exchange pattern with the new notification method.
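The core idea of the abstract — a persistent GPU kernel that notifies the host mid-kernel so halo communication can overlap the remaining computation — can be illustrated with a small host-side model. The sketch below is not the paper's implementation: it stands in for it with Python threads and events, where the `device` loop plays the persistent kernel and `halo_ready`/`comm_done` are hypothetical names for the per-iteration notification flags.

```python
import threading

ITERS = 3
# One notification flag pair per iteration: the "device" raises halo_ready,
# the host raises comm_done. In the real mechanism these would be flags in
# host-visible memory written from inside the kernel, not threading.Events.
halo_ready = [threading.Event() for _ in range(ITERS)]
comm_done = [threading.Event() for _ in range(ITERS)]
log = []  # records the interleaving of host and device steps

def device():
    # Stands in for a single persistent GPU kernel that does not terminate
    # between iterations.
    for it in range(ITERS):
        log.append(("gpu", it, "halo-compute"))      # compute boundary cells
        halo_ready[it].set()                         # non-blocking notify host
        log.append(("gpu", it, "interior-compute"))  # overlaps host comm
        comm_done[it].wait()                         # sync only when needed

def host():
    for it in range(ITERS):
        halo_ready[it].wait()                    # observe device notification
        log.append(("cpu", it, "mpi-exchange"))  # halo exchange stand-in
        comm_done[it].set()                      # notify device: halo updated

d, h = threading.Thread(target=device), threading.Thread(target=host)
d.start(); h.start(); d.join(); h.join()
```

The interleaving in `log` shows the claimed overlap: the host's exchange for iteration i runs concurrently with the device's interior compute of iteration i, and neither side blocks until the data dependency is actually reached. In a real CUDA realization the flags would live in host-pinned memory and be written with system-scope release semantics while a host thread polls them.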


Cited By

  • (2024) Understanding GPU Triggering APIs for MPI+X Communication. In Recent Advances in the Message Passing Interface, 39–55. https://doi.org/10.1007/978-3-031-73370-3_3 (online: 25 Sep 2024)

Published In

HPCAsia '24: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region
January 2024
185 pages
ISBN:9798400708893
DOI:10.1145/3635035
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. GPU
  2. Halo-exchange
  3. MPI
  4. Synchronization

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

HPCAsia 2024

Acceptance Rates

Overall Acceptance Rate 69 of 143 submissions, 48%
