extended-abstract

Evaluating data parallelism in C++ using the Parallel Research Kernels

Authors: Jeff R. Hammond, Timothy G. Mattson

IWOCL '19: Proceedings of the International Workshop on OpenCL

Article No.: 14, Pages 1 - 6

https://doi.org/10.1145/3318170.3318192

Published: 13 May 2019

Abstract

The Parallel Research Kernels are a set of simple algorithms that correspond to popular classes of high-performance computing applications. We report on their use to evaluate parallel programming models based upon modern C++.

Index Terms

Evaluating data parallelism in C++ using the Parallel Research Kernels
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Redundancy
  2. Embedded and cyber-physical systems
    1. Embedded systems
    2. Robotics
2. Networks
  1. Network properties
    1. Network reliability

Information & Contributors

Information

Published In

IWOCL '19: Proceedings of the International Workshop on OpenCL

May 2019

102 pages

ISBN: 9781450362306

DOI: 10.1145/3318170

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

In-Cooperation

Khronos Group
Northeastern University
Codeplay Software Ltd.
Intel
The University of Bristol

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 May 2019

Qualifiers

Extended-abstract
Research
Refereed limited

Conference

IWOCL'19: International Workshop on OpenCL

May 13 - 15, 2019

Boston, MA, USA

Acceptance Rates

IWOCL '19 Paper Acceptance Rate 13 of 33 submissions, 39%;

Overall Acceptance Rate 84 of 152 submissions, 55%

Bibliometrics

Total citations: 6
Total downloads: 246
Downloads (last 12 months): 14
Downloads (last 6 weeks): 2

Reflects downloads up to 01 Jan 2025

Cited By

Alpay A, Soproni B, Wünsche H, Heuveline V. (2022). Exploring the possibility of a hipSYCL-based implementation of oneAPI. Proceedings of the 10th International Workshop on OpenCL, 1-12. Online publication date: 10-May-2022
https://dl.acm.org/doi/10.1145/3529538.3530005
Baratta I, Richardson C, Wells G. (2022). Performance analysis of matrix-free conjugate gradient kernels using SYCL. Proceedings of the 10th International Workshop on OpenCL, 1-10. Online publication date: 10-May-2022
https://dl.acm.org/doi/10.1145/3529538.3529993
Peng W, Belikov E. (2022). CAMP: a Synthetic Micro-Benchmark for Assessing Deep Memory Hierarchies. 2022 IEEE/ACM International Workshop on Hierarchical Parallelism for Exascale Computing (HiPar), 28-36. Online publication date: Nov-2022
https://doi.org/10.1109/HiPar56574.2022.00009
Grete P, Glines F, O'Shea B. (2021). K-Athena: A Performance Portable Structured Grid Finite Volume Magnetohydrodynamics Code. IEEE Transactions on Parallel and Distributed Systems 32:1, 85-97. Online publication date: 1-Jan-2021
https://doi.org/10.1109/TPDS.2020.3010016
Lavin P, Young J, Vuduc R, Riedy J, Vose A, Ernst D. (2020). Evaluating Gather and Scatter Performance on CPUs and GPUs. Proceedings of the International Symposium on Memory Systems, 209-222. Online publication date: 28-Sep-2020
https://dl.acm.org/doi/10.1145/3422575.3422794
Holmen J, Peterson B, Berzins M. (2019). An Approach for Indirectly Adopting a Performance Portability Layer in Large Legacy Codes. 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC), 36-49. Online publication date: Nov-2019
https://doi.org/10.1109/P3HPC49587.2019.00009
