research-article

Enhancing Kokkos with OpenACC

Authors:

Pedro Valero-Lara,

Marc Gonzalez-Tallada,

Keita Teranishi,

Jeffrey S. Vetter

The International Journal of High Performance Computing Applications, Volume 38, Issue 5

Pages 409 - 426

https://doi.org/10.1177/10943420241261987

Published: 16 October 2024

Abstract

C++ template metaprogramming has emerged as a prominent approach for achieving performance portability in heterogeneous computing. Kokkos represents a notable paradigm in this domain, offering programmers a suite of high-level abstractions for generic programming while deferring much of the device-specific code generation and optimization to the compiler through template specializations. Kokkos furnishes a range of device-specific code specializations across multiple back ends, including CUDA and HIP. Diverging from conventional back ends, the OpenACC implementation presents a high-level, multicompiler, multidevice, and directive-based programming model. This paper presents recent advancements in the OpenACC back end for Kokkos (i.e., KokkACC) and focuses on its integration into the Kokkos ecosystem, exploration of automatic device selection capabilities to enhance productivity, and performance evaluation on modern hardware such as NVIDIA H100 GPUs. The study includes implementation details and a thorough performance assessment across various computational benchmarks, including minibenchmarks (AXPY and DOT product), miniapps (LULESH, MiniFE, and SNAP-LAMMPS), and a scientific kernel based on the lattice Boltzmann method.
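To make the directive-based model mentioned in the abstract concrete, the following is a minimal sketch of the two minibenchmarks (AXPY and DOT product) written as plain C++ loops annotated with OpenACC directives. It is not the KokkACC implementation and does not use the Kokkos API; the function names `axpy` and `dot` and the choice of data clauses are assumptions made for illustration. A compiler without OpenACC support ignores the pragmas and runs the loops serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the paper): y[i] = alpha * x[i] + y[i].
void axpy(double alpha, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const double* xp = x.data();
    double* yp = y.data();
    // With an OpenACC compiler the loop is offloaded; x is copied in,
    // y is copied in and back out. Other compilers ignore the pragma.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = alpha * xp[i] + yp[i];
    }
}

// Hypothetical helper (not from the paper): returns the sum of x[i] * y[i].
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const double* xp = x.data();
    const double* yp = y.data();
    double sum = 0.0;
    // The reduction clause combines per-thread partial sums into `sum`.
    #pragma acc parallel loop reduction(+:sum) copyin(xp[0:n], yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        sum += xp[i] * yp[i];
    }
    return sum;
}
```

In a Kokkos program the same loops would typically be expressed through the library's parallel dispatch abstractions, with the back end (here, OpenACC) chosen at build time; the sketch above only shows the kind of directive-annotated loop such a back end could map onto.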

Index Terms

Enhancing Kokkos with OpenACC
1. Software and its engineering
  1. Software creation and management
    1. Software development techniques
      1. Reusability
  2. Software notations and tools
    1. Compilers
      1. Runtime environments
      2. Source code generation
    2. General programming languages
      1. Language types
        Parallel programming languages

Index terms have been assigned to the content through auto-classification.


Information

Published In

International Journal of High Performance Computing Applications, Volume 38, Issue 5

Sep 2024

165 pages

© The Author(s) 2024.

Publisher

Sage Publications, Inc.

United States
