DOI: 10.1145/3582514.3582517
research-article
Open access

Harmonic CUDA: Asynchronous Programming on GPUs

Published: 25 February 2023

Abstract

We introduce Harmonic CUDA, a dataflow programming model for GPUs that allows programmers to describe algorithms as a dependency graph of producers and consumers, where data flows continuously through the graph for the duration of the kernel. This makes it easier for programmers to exploit asynchrony, warp specialization, and hardware acceleration. Using Harmonic CUDA, we implement two example applications: matrix multiplication and GraphSage. The matrix multiplication kernel demonstrates how a key kernel can be broken down into more granular building blocks. It reaches a geometric mean of 80% of cuBLAS performance, and up to 92% when small matrices are omitted, and we analyze how performance could be improved in the future. GraphSage reuses the same building blocks as the matrix multiplication kernel and shows how asynchrony and warp specialization can provide significant performance improvements: switching from a bulk-synchronous implementation to a warp-specialized version improves performance by 34%. This paper evaluates the strengths and weaknesses of Harmonic CUDA based on these test cases and suggests future work to improve the programming model.
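
To make the producer/consumer idea concrete, the following is a minimal host-side sketch in standard C++ of a three-stage dataflow graph (load, transform, accumulate) connected by bounded buffers. It is only an analogy running on CPU threads: it is not Harmonic CUDA's API, and the names `Channel` and `run_pipeline` are hypothetical and do not come from the paper. On a GPU, the stages would roughly correspond to specialized warps and the channels to buffers in shared memory.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical bounded channel linking one producer stage to one consumer stage.
template <typename T>
class Channel {
public:
    explicit Channel(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Blocks while the buffer is full, so a fast producer cannot run ahead unboundedly.
    void push(T value) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [&] { return queue_.size() < capacity_; });
        queue_.push(std::move(value));
        not_empty_.notify_one();
    }

    // Signals that no further items will be pushed.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        not_empty_.notify_all();
    }

    // Returns false once the channel is closed and fully drained.
    bool pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [&] { return !queue_.empty() || closed_; });
        if (queue_.empty()) return false;
        out = std::move(queue_.front());
        queue_.pop();
        not_full_.notify_one();
        return true;
    }

private:
    std::size_t capacity_;
    bool closed_ = false;
    std::queue<T> queue_;
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
};

// Three-stage graph: loader -> squarer -> accumulator.
// Returns the sum of i*i for i in 1..n.
long long run_pipeline(int n) {
    Channel<long long> loaded(4);   // small capacity forces the stages to overlap
    Channel<long long> squared(4);
    long long sum = 0;

    std::thread loader([&] {
        for (int i = 1; i <= n; ++i) loaded.push(i);
        loaded.close();
    });
    std::thread squarer([&] {
        long long v;
        while (loaded.pop(v)) squared.push(v * v);
        squared.close();
    });
    std::thread accumulator([&] {
        long long v;
        while (squared.pop(v)) sum += v;
    });

    loader.join();
    squarer.join();
    accumulator.join();
    return sum;  // safe to read: the only writer has been joined
}
```

Calling `run_pipeline(3)` streams the values 1, 2, 3 through the graph and returns 1 + 4 + 9 = 14. The small buffer capacity is a deliberate choice in this sketch: it keeps all three stages active at once rather than letting one stage finish before the next begins, which is the contrast with a bulk-synchronous structure that the abstract draws.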

Published In

PMAM'23: Proceedings of the 14th International Workshop on Programming Models and Applications for Multicores and Manycores
February 2023
73 pages
ISBN: 9798400701153
DOI: 10.1145/3582514
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. GPU
2. CUDA
3. programming model
4. asynchronous
5. GEMM
6. GraphSage

Acceptance Rates

Overall acceptance rate: 53 of 97 submissions, 55%
