[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3318170.3318177acmotherconferencesArticle/Chapter ViewAbstractPublication PagesiwoclConference Proceedingsconference-collections
extended-abstract

Performance Evaluation of OpenCL Standard Support (and Beyond)

Published: 13 May 2019 Publication History

Abstract

In this talk, we will discuss how support (or lack of it) for various OpenCL (OCL) features affects performance of graph applications executing on GPU platforms. Given that adoption of OCL features varies widely across vendors, our results can help quantify the performance benefits and potentially motivate the timely adoption of these OCL features.
Our findings are drawn from the experience of developing an OCL backend for a state-of-the-art graph application DSL, IrGL, originally developed with a CUDA backend [1]. IrGL allows competitive algorithms for applications such as breadth-first-search, page-rank, and single-source-shortest-path to be written at a high level. A series of optimisations can then be applied by the compiler to generate OCL code. These user-selectable optimisations exercise various features of OCL: on one end of the spectrum, applications compiled without optimisations require only core OCL version 1.1 features; on the other end, a certain optimisation requires inter-workgroup forward progress guarantees, which are yet to be officially supported by OCL, but have been empirically validated and are relied upon e.g. to achieve global device-wide synchronisation [3]. Other optimisations require OCL features such as: fine-grained memory consistency guarantees (added in OCL 2.0) and subgroup primitives (added to core in OCL 2.1).
Our compiler can apply 6 independent optimisations (Table 1), each of which requires an associated minimum version of OCL to be supported. Increased OCL support enables more and more optimisations: 2 optimisations are supported with OCL 1.x; 1 additional optimization with OCL 2.0; and a further 2 with OCL 2.1. Using OCL FP to denote v2.1 extended with forward progress guarantees (not officially supported at present), the last optimisation is enabled. We will discuss the OCL features required for each optimisation and the idioms in which the features are used. Use-case discussions of these features (e.g. memory consistency and subgroup primitives) are valuable as there appear to be very few open-source examples: a GitHub search yields only a small number of results.
Our compiler enables us to carry out a large and controlled study, in which the performance benefit of various levels of OCL support can be evaluated. We gather runtime data exhaustively on all combinations across: all optimisations, 17 applications, 3 graph inputs, 6 different GPUs, spanning 4 vendors: Nvidia, AMD, Intel and ARM (Table 2).
We show two notable results in this abstract: our first result, summarised in Figure 1, shows that all optimizations can be beneficial across a range of GPUs, despite significant architectural differences (e.g. subgroup size as seen in Table 2). This provides motivation that previous vendor specific approaches (e.g. for Nvidia) can be ported to OCL and achieve speedups on range of devices.
Our second result, summarised in Figure 2, shows that if feature support is limited to OCL 2.0 (or below), the available optimisations (fg wg sz256) fail to achieve any speedups in over 70% of the chip/application/input benchmarks. If support for OCL 2.1 (adding the optimizations: sg coop-cv) is considered, this number drops to 60% but observed speedups are modest, rarely exceeding 2x. Finally, if forward progress guarantees are assumed (adding the oitergb optimization), speedups are observed in over half of the cases, including impressive speedups of over 14x for AMD and Intel GPUs. This provides compelling evidence for forward progress properties to be considered for adoption for a future OCL version.
An extended version of this material can be found in [2, ch. 5].

References

[1]
Sreepathi Pai and Keshav Pingali. 2016. A compiler for throughput optimization of graph algorithms on GPUs. In OOPSLA. 1--19.
[2]
Tyler Sorensen. 2018. Inter-workgroup Barrier Synchronisation on Graphics Processing Units. Ph.D. Dissertation. Imperial College London. http://www.cs.princeton.edu/~ts20/files/phdthesis.pdf.
[3]
Tyler Sorensen, Alastair F. Donaldson, Mark Batty, Ganesh Gopalakrishnan, and Zvonimir Rakamaric. 2016. Portable inter-workgroup barrier synchronisation for GPUs. In OOPSLA. 39--58.

Index Terms

  1. Performance Evaluation of OpenCL Standard Support (and Beyond)
      Index terms have been assigned to the content through auto-classification.

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Other conferences
      IWOCL '19: Proceedings of the International Workshop on OpenCL
      May 2019
      102 pages
      ISBN:9781450362306
      DOI:10.1145/3318170
      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

      In-Cooperation

      • Khronos: Khronos Group
      • Northeastern University
      • Codeplay: Codeplay Software Ltd.
      • Intel: Intel
      • The University of Bristol: The University of Bristol

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 13 May 2019

      Check for updates

      Qualifiers

      • Extended-abstract
      • Research
      • Refereed limited

      Conference

      IWOCL'19
      IWOCL'19: International Workshop on OpenCL
      May 13 - 15, 2019
      MA, Boston, USA

      Acceptance Rates

      IWOCL '19 Paper Acceptance Rate 13 of 33 submissions, 39%;
      Overall Acceptance Rate 84 of 152 submissions, 55%

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • 0
        Total Citations
      • 58
        Total Downloads
      • Downloads (Last 12 months)0
      • Downloads (Last 6 weeks)0
      Reflects downloads up to 09 Jan 2025

      Other Metrics

      Citations

      View Options

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media