CUDA-Lite: Reducing GPU Programming Complexity

Sain-Zee Ueng²,
Melvin Lathara²,
Sara S. Baghsorkhi² &
…
Wen-mei W. Hwu²

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5335)

Included in the following conference series:

International Workshop on Languages and Compilers for Parallel Computing


Abstract

The computer industry has transitioned into multi-core and many-core parallel systems. The CUDA programming environment from NVIDIA is an attempt to make programming many-core GPUs more accessible to programmers. However, there are still many burdens placed upon the programmer to maximize performance when using CUDA. One such burden is dealing with the complex memory hierarchy. Efficient and correct usage of the various memories is essential, making a difference of 2-17x in performance. Currently, the task of determining the appropriate memory to use and the coding of data transfer between memories is still left to the programmer. We believe that this task can be better performed by automated tools. We present CUDA-lite, an enhancement to CUDA, as one such tool. We leverage programmer knowledge via annotations to perform transformations and show preliminary results that indicate auto-generated code can have performance comparable to hand coding.



Author information

Authors and Affiliations

Center for Reliable and High-Performance Computing, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, USA
Sain-Zee Ueng, Melvin Lathara, Sara S. Baghsorkhi & Wen-mei W. Hwu


Editor information

Editors and Affiliations

Department of Computing Science, University of Alberta, T6G-2E8, Edmonton, AB, Canada
José Nelson Amaral


About this paper

Cite this paper

Ueng, SZ., Lathara, M., Baghsorkhi, S.S., Hwu, Wm.W. (2008). CUDA-Lite: Reducing GPU Programming Complexity. In: Amaral, J.N. (eds) Languages and Compilers for Parallel Computing. LCPC 2008. Lecture Notes in Computer Science, vol 5335. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89740-8_1


DOI: https://doi.org/10.1007/978-3-540-89740-8_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89739-2
Online ISBN: 978-3-540-89740-8
eBook Packages: Computer Science, Computer Science (R0)
