Compiler Optimization of Accelerator Data Transfers

Authors:

Matthew B. Ashcraft,

Alexander Lemon,

David A. Penry,

Quinn Snell

International Journal of Parallel Programming, Volume 47, Issue 1

Pages 39 - 58

https://doi.org/10.1007/s10766-017-0549-3

Published: 01 February 2019

Abstract

Accelerators such as GPUs, FPGAs, and many-core processors can provide significant performance improvements, but their effectiveness depends on the programmer's skill in managing their complex architectures. One area of difficulty is determining which data to transfer to and from the accelerator, and when. Poorly placed data transfers can incur overheads that completely dwarf the benefits of using an accelerator. To know what data to transfer, and when, the programmer must understand the data flow of the transferred memory locations throughout the program and how the accelerator region fits into the program as a whole. We argue that compilers should take on the responsibility of data transfer scheduling, thereby reducing the demands on the programmer and improving program performance and efficiency by reducing the number of bytes transferred. We show that whole-program scheduling of accelerator data transfers automatically eliminates up to 99% of the bytes transferred to and from the accelerator, compared with a baseline that transfers all data involved immediately before and after each kernel execution. The analysis and optimization are language- and accelerator-agnostic, but for our examples and testing they have been implemented in an OpenMP to LLVM-IR to CUDA workflow.
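
The contrast the abstract draws, between transferring everything around each kernel and scheduling transfers across the whole program, can be illustrated with a small hand-written sketch. It uses OpenMP device constructs in C and is not the paper's compiler output: the kernel, the function names (`scale_naive`, `scale_hoisted`, `bytes_naive`, `bytes_hoisted`), and the byte-count model are all assumptions made for illustration. The sketch compiles with or without OpenMP support; without it, the pragmas are ignored and the loops run on the host.

```c
#include <stddef.h>

/* Baseline schedule: the array is copied to the device before every
   kernel launch and copied back after it (hypothetical kernel). */
void scale_naive(double *a, size_t n, int steps) {
    for (int s = 0; s < steps; ++s) {
        #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
        for (size_t i = 0; i < n; ++i)
            a[i] *= 2.0;
    }
}

/* Hoisted schedule: one enclosing data region keeps the array on the
   device for all launches, so it is copied in once and out once.
   The inner map clause finds the data already present and does not
   trigger another transfer. */
void scale_hoisted(double *a, size_t n, int steps) {
    #pragma omp target data map(tofrom: a[0:n])
    {
        for (int s = 0; s < steps; ++s) {
            #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
            for (size_t i = 0; i < n; ++i)
                a[i] *= 2.0;
        }
    }
}

/* Simplified byte-count model (an assumption, not a measurement):
   each tofrom mapping moves the array once in each direction. */
size_t bytes_naive(size_t n, int steps) {
    return 2 * n * sizeof(double) * (size_t)steps;
}

size_t bytes_hoisted(size_t n) {
    return 2 * n * sizeof(double);
}
```

Under this simplified model the hoisted schedule moves a fraction 1/steps of the baseline's bytes, so the saving grows with the number of kernel launches that reuse the same data. A compiler performing this kind of scheduling automatically would need data-flow information to prove that the host does not touch the array between launches, which the sketch simply assumes.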

Cited By

Li F, Wang Y, Wang Z, Ji X, Jiang J, Tang X, Zhang H (2022) CC-RRTMG_SW++: Further optimizing a shortwave radiative transfer scheme on GPU. The Journal of Supercomputing 78:15 (17378-17402). doi:10.1007/s11227-022-04566-5. Online publication date: 1-Oct-2022
https://dl.acm.org/doi/10.1007/s11227-022-04566-5
Bychkov A, Nikolskiy V (2022) Rust Language for GPU Programming. Supercomputing (522-532). doi:10.1007/978-3-031-22941-1_38. Online publication date: 26-Sep-2022
https://dl.acm.org/doi/10.1007/978-3-031-22941-1_38
Barua P, Zhao J, Sarkar V (2020) OmpMemOpt: Optimized Memory Movement for Heterogeneous Computing. Euro-Par 2020: Parallel Processing (200-216). doi:10.1007/978-3-030-57675-2_13. Online publication date: 24-Aug-2020
https://dl.acm.org/doi/10.1007/978-3-030-57675-2_13

Published In

International Journal of Parallel Programming, Volume 47, Issue 1

February 2019

160 pages

ISSN: 0885-7458

Copyright © 2019 Springer Science+Business Media, LLC, part of Springer Nature.

Publisher

Kluwer Academic Publishers, United States
