research-article

Enhancing Kokkos with OpenACC

Published: 16 October 2024

Abstract

C++ template metaprogramming has emerged as a prominent approach for achieving performance portability in heterogeneous computing. Kokkos represents a notable paradigm in this domain, offering programmers a suite of high-level abstractions for generic programming while deferring much of the device-specific code generation and optimization to the compiler through template specializations. Kokkos furnishes a range of device-specific code specializations across multiple back ends, including CUDA and HIP. Diverging from conventional back ends, the OpenACC implementation presents a high-level, multicompiler, multidevice, and directive-based programming model. This paper presents recent advancements in the OpenACC back end for Kokkos (i.e., KokkACC) and focuses on its integration into the Kokkos ecosystem, exploration of automatic device selection capabilities to enhance productivity, and performance evaluation on modern hardware such as NVIDIA H100 GPUs. The study includes implementation details and a thorough performance assessment across various computational benchmarks, including minibenchmarks (AXPY and DOT product), miniapps (LULESH, MiniFE, and SNAP-LAMMPS), and a scientific kernel based on the lattice Boltzmann method.
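For readers unfamiliar with the two minibenchmarks named in the abstract, the following is a minimal sketch of AXPY and the DOT product written directly with OpenACC directives. It is not the paper's KokkACC implementation and does not use the Kokkos API. The function names `axpy` and `dot`, the use of raw pointers, and the explicit data clauses are assumptions chosen for illustration. A compiler without OpenACC support ignores the pragmas and runs the loops serially on the host.

```cpp
#include <cstddef>

// Hypothetical helper (not from the paper): y[i] = a * x[i] + y[i].
// The data clauses copy x to the device and copy y in and back out.
void axpy(std::size_t n, double a, const double* x, double* y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical helper (not from the paper): returns the sum of x[i] * y[i].
// The reduction clause combines the per-iteration partial sums into `sum`.
double dot(std::size_t n, const double* x, const double* y) {
    double sum = 0.0;
    #pragma acc parallel loop reduction(+:sum) copyin(x[0:n], y[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

In a Kokkos program the same loops would typically be written once against the library's parallel abstractions, with the back end (for example CUDA, HIP, or OpenACC) selected at build time. The directive form above only shows the kind of code an OpenACC-based back end could ultimately express.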



Information & Contributors

Information

Published In

International Journal of High Performance Computing Applications, Volume 38, Issue 5
Sep 2024
165 pages

Publisher

Sage Publications, Inc.

United States

Author Tags

  1. OpenACC
  2. C++ metaprogramming
  3. Kokkos
  4. CUDA
  5. OpenMP target
  6. parallel programming models

Qualifiers

  • Research-article
