extended-abstract

Performance Evaluation of OpenCL Standard Support (and Beyond)

Authors:

Tyler Sorensen,

Sreepathi Pai,

Alastair F. Donaldson

IWOCL '19: Proceedings of the International Workshop on OpenCL

Article No.: 8, Pages 1 - 2

https://doi.org/10.1145/3318170.3318177

Published: 13 May 2019

Abstract

In this talk, we will discuss how support (or lack of it) for various OpenCL (OCL) features affects the performance of graph applications executing on GPU platforms. Given that adoption of OCL features varies widely across vendors, our results can help quantify the performance benefits of these features and potentially motivate their timely adoption.

Our findings are drawn from the experience of developing an OCL backend for IrGL, a state-of-the-art graph application DSL originally developed with a CUDA backend [1]. IrGL allows competitive algorithms for applications such as breadth-first search, PageRank, and single-source shortest path to be written at a high level. A series of optimisations can then be applied by the compiler to generate OCL code. These user-selectable optimisations exercise various features of OCL. At one end of the spectrum, applications compiled without optimisations require only core OCL version 1.1 features. At the other end, one optimisation requires inter-workgroup forward progress guarantees, which are yet to be officially supported by OCL but have been empirically validated and are relied upon, e.g., to achieve global device-wide synchronisation [3]. Other optimisations require OCL features such as fine-grained memory consistency guarantees (added in OCL 2.0) and subgroup primitives (added to core in OCL 2.1).
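To illustrate why device-wide synchronisation depends on forward progress between workgroups, the following is a toy Python model. It is not OpenCL code and is not taken from IrGL or [3]. It assumes a hypothetical scheduler in which at most `occupancy` workgroups are resident at once, resident workgroups are stepped fairly, and a queued workgroup becomes resident only when a slot is freed. The names `barrier_completes`, `num_workgroups` and `occupancy` are illustrative.

```python
def barrier_completes(num_workgroups, occupancy):
    """Toy model of a spin barrier across workgroups.

    Returns True if every workgroup passes the barrier, and False if the
    simulation reaches a state in which no workgroup can make progress.
    """
    arrived = 0  # shared counter standing in for a global atomic

    def workgroup():
        nonlocal arrived
        arrived += 1                      # announce arrival at the barrier
        while arrived < num_workgroups:   # spin until everyone has arrived
            yield

    queued = [workgroup() for _ in range(num_workgroups)]
    resident = []
    finished = 0

    while finished < num_workgroups:
        # Assumed scheduler: fill free slots, never preempt a resident workgroup.
        while queued and len(resident) < occupancy:
            resident.append(queued.pop())

        state_before = (arrived, finished)
        still_resident = []
        for wg in resident:
            try:
                next(wg)                  # run the workgroup for one step
                still_resident.append(wg)
            except StopIteration:
                finished += 1             # workgroup passed the barrier
        resident = still_resident

        if (arrived, finished) == state_before:
            return False                  # nothing changed: the barrier is stuck

    return True


print(barrier_completes(num_workgroups=4, occupancy=4))  # all resident
print(barrier_completes(num_workgroups=4, occupancy=2))  # two never start
```

In this model the barrier completes only when every participating workgroup is resident; otherwise the resident workgroups spin forever waiting for workgroups that are never scheduled. This is the kind of situation that a forward progress guarantee is meant to rule out.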

Our compiler can apply 6 independent optimisations (Table 1), each of which requires an associated minimum version of OCL to be supported. Increased OCL support enables progressively more optimisations: 2 optimisations are supported with OCL 1.x; 1 additional optimisation with OCL 2.0; and a further 2 with OCL 2.1. We use OCL FP to denote OCL 2.1 extended with forward progress guarantees (not officially supported at present); this level enables the final optimisation. We will discuss the OCL features required for each optimisation and the idioms in which the features are used. Use-case discussions of these features (e.g. memory consistency and subgroup primitives) are valuable as there appear to be very few open-source examples: a GitHub search yields only a small number of results.

Our compiler enables us to carry out a large and controlled study, in which the performance benefit of various levels of OCL support can be evaluated. We gather runtime data exhaustively for all combinations of optimisations, across 17 applications, 3 graph inputs, and 6 different GPUs spanning 4 vendors: Nvidia, AMD, Intel and ARM (Table 2).
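As a rough sense of scale, the size of this study can be bounded with a short calculation. The sketch below assumes that each of the 6 optimisations can be toggled independently and that every GPU can run every configuration. The abstract does not state the latter, so the figure should be read as an upper bound rather than the actual number of runs.

```python
from itertools import product

# Counts taken from the abstract.
NUM_OPTIMISATIONS = 6
NUM_APPLICATIONS = 17
NUM_INPUTS = 3
NUM_GPUS = 6

# Assumption: every on/off combination of the optimisations is a valid setting.
opt_settings = list(product([False, True], repeat=NUM_OPTIMISATIONS))

# Upper bound: assumes every GPU supports every optimisation setting.
upper_bound = len(opt_settings) * NUM_APPLICATIONS * NUM_INPUTS * NUM_GPUS

print(len(opt_settings), upper_bound)  # prints: 64 19584
```

GPUs without support for the later OCL levels would contribute fewer configurations, so the real count is likely to be lower.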

We show two notable results in this abstract. Our first result, summarised in Figure 1, shows that all optimisations can be beneficial across a range of GPUs, despite significant architectural differences (e.g. subgroup size, as seen in Table 2). This suggests that previous vendor-specific approaches (e.g. for Nvidia) can be ported to OCL and achieve speedups on a range of devices.

Our second result, summarised in Figure 2, shows that if feature support is limited to OCL 2.0 (or below), the available optimisations (fg, wg and sz256) fail to achieve any speedup in over 70% of the chip/application/input benchmarks. If support for OCL 2.1 is considered (adding the optimisations sg and coop-cv), this number drops to 60%, but the observed speedups are modest, rarely exceeding 2x. Finally, if forward progress guarantees are assumed (adding the oitergb optimisation), speedups are observed in over half of the cases, including impressive speedups of over 14x for AMD and Intel GPUs. This provides compelling evidence that forward progress properties should be considered for adoption in a future OCL version.
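The relationship between OCL support level and available optimisations described above can be written down as a small lookup. This is only a sketch: the optimisation names and their grouping follow this abstract, while the level labels, the `NEWLY_ENABLED` table and the `available_optimisations` helper are illustrative and are not part of IrGL.

```python
# Optimisations that become available at each support level, in increasing
# order of support. "OCL 2.0" here means "OCL 2.0 or below", as in the text.
NEWLY_ENABLED = [
    ("OCL 2.0", ["fg", "wg", "sz256"]),
    ("OCL 2.1", ["sg", "coop-cv"]),
    ("OCL FP", ["oitergb"]),  # OCL 2.1 plus forward progress guarantees
]


def available_optimisations(level):
    """Return all optimisations usable at the given support level."""
    available = []
    for name, opts in NEWLY_ENABLED:
        available.extend(opts)  # each level keeps everything below it
        if name == level:
            return available
    raise ValueError(f"unknown support level: {level!r}")


for level, _ in NEWLY_ENABLED:
    print(level, available_optimisations(level))
```

Each level is cumulative, so only OCL FP exposes all 6 optimisations, including oitergb.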

An extended version of this material can be found in [2, ch. 5].

References

[1] Sreepathi Pai and Keshav Pingali. 2016. A compiler for throughput optimization of graph algorithms on GPUs. In OOPSLA. 1-19.

[2] Tyler Sorensen. 2018. Inter-workgroup Barrier Synchronisation on Graphics Processing Units. Ph.D. Dissertation. Imperial College London. http://www.cs.princeton.edu/~ts20/files/phdthesis.pdf.

[3] Tyler Sorensen, Alastair F. Donaldson, Mark Batty, Ganesh Gopalakrishnan, and Zvonimir Rakamaric. 2016. Portable inter-workgroup barrier synchronisation for GPUs. In OOPSLA. 39-58.

Index Terms

Performance Evaluation of OpenCL Standard Support (and Beyond)
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages

Index terms have been assigned to the content through auto-classification.


Information & Contributors

Information

Published In

IWOCL '19: Proceedings of the International Workshop on OpenCL

May 2019

102 pages

ISBN:9781450362306

DOI:10.1145/3318170

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

In-Cooperation

Khronos: Khronos Group
Northeastern University
Codeplay: Codeplay Software Ltd.
Intel: Intel
The University of Bristol: The University of Bristol

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Extended-abstract
Research
Refereed limited

Conference

IWOCL'19

IWOCL'19: International Workshop on OpenCL

May 13 - 15, 2019

Boston, MA, USA

Acceptance Rates

IWOCL '19 Paper Acceptance Rate 13 of 33 submissions, 39%;

Overall Acceptance Rate 84 of 152 submissions, 55%


Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 0
Total Downloads: 58
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 09 Jan 2025
