Abstract
The computer industry has transitioned to multi-core and many-core parallel systems. The CUDA programming environment from NVIDIA is an attempt to make programming many-core GPUs more accessible to programmers. However, the programmer still bears many burdens in maximizing performance with CUDA. One such burden is dealing with the complex memory hierarchy. Efficient and correct use of the various memories is essential and can make a difference of 2x to 17x in performance. Currently, both determining the appropriate memory to use and coding the data transfers between memories are left to the programmer. We believe that this task can be better performed by automated tools. We present CUDA-lite, an enhancement to CUDA, as one such tool. We leverage programmer knowledge via annotations to perform transformations, and we show preliminary results indicating that auto-generated code can achieve performance comparable to hand-coded versions.
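To make the memory-hierarchy burden concrete, the following is a minimal sketch of the kind of manual work the abstract refers to. It is not taken from the paper and does not show CUDA-lite annotations or its generated code. The three-point averaging stencil, the kernel names `stencil_global` and `stencil_shared`, the helper `kernels_agree`, and the tile size of 256 are all assumptions chosen for illustration. The first kernel reads its operands directly from global memory; the second is a hand-written variant that stages a tile and its two halo elements in shared memory before computing.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 256;  // threads per block (illustrative choice)

// Straightforward version: each thread reads its three inputs from global memory.
__global__ void stencil_global(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float left  = (i > 0)     ? in[i - 1] : 0.0f;
    float right = (i < n - 1) ? in[i + 1] : 0.0f;
    out[i] = (left + in[i] + right) / 3.0f;
}

// Hand-tiled version: the block first copies its tile plus one halo element
// on each side into shared memory, then computes from the shared copy.
// Assumes the kernel is launched with exactly TILE threads per block.
__global__ void stencil_shared(const float* in, float* out, int n) {
    __shared__ float tile[TILE + 2];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;  // slot 0 is reserved for the left halo

    tile[t] = (i < n) ? in[i] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (i > 0 && i - 1 < n) ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[t + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;
    __syncthreads();  // all loads must finish before any thread reads neighbors

    if (i < n)
        out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
}

// Runs both kernels on the same input and reports whether the results match.
// Returns false if any CUDA call fails (for example, when no GPU is present).
bool kernels_agree(int n) {
    std::vector<float> h_in(n), h_a(n), h_b(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 17);

    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess &&
              cudaMalloc(&d_a, bytes) == cudaSuccess &&
              cudaMalloc(&d_b, bytes) == cudaSuccess &&
              cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int blocks = (n + TILE - 1) / TILE;
        stencil_global<<<blocks, TILE>>>(d_in, d_a, n);
        stencil_shared<<<blocks, TILE>>>(d_in, d_b, n);
        ok = cudaDeviceSynchronize() == cudaSuccess &&
             cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost) == cudaSuccess &&
             cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);
    if (!ok) return false;

    for (int i = 0; i < n; ++i)
        if (std::fabs(h_a[i] - h_b[i]) > 1e-6f) return false;
    return true;
}

int main() {
    std::printf("%s\n", kernels_agree(1000) ? "kernels agree" : "mismatch or CUDA error");
    return 0;
}
```

Even for this small kernel, the shared-memory variant requires the programmer to size the buffer, handle the halo and boundary cases, and place the barrier correctly. This sketch makes no claim about relative speed; whether staging pays off depends on the access pattern and the hardware.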
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ueng, SZ., Lathara, M., Baghsorkhi, S.S., Hwu, Wm.W. (2008). CUDA-Lite: Reducing GPU Programming Complexity. In: Amaral, J.N. (eds) Languages and Compilers for Parallel Computing. LCPC 2008. Lecture Notes in Computer Science, vol 5335. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89740-8_1
DOI: https://doi.org/10.1007/978-3-540-89740-8_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89739-2
Online ISBN: 978-3-540-89740-8
eBook Packages: Computer Science, Computer Science (R0)