GROPHECY: GPU performance projection from CPU code skeletons

Published: 12 November 2011

Abstract

We propose GROPHECY, a GPU performance projection framework that estimates the performance benefit of GPU acceleration without requiring actual GPU programming or hardware. Users need only skeletonize the pieces of CPU code that are targets for GPU acceleration. Code skeletons are automatically transformed in various ways to mimic tuned GPU codes with characteristics resembling real implementations. The synthesized characteristics are used by an existing analytical model to project GPU performance. The cost and benefit of GPU development can then be estimated from the transformed code skeleton that yields the best projected performance. With GROPHECY, users can leap toward GPU acceleration only when the cost-benefit tradeoff makes sense. The framework is validated using kernel benchmarks and data-parallel codes from legacy scientific applications. The measured performance of manually tuned codes deviates from the projected performance by a geometric mean of 17%.
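
GROPHECY's input is a code skeleton: a reduced form of a CPU loop nest that keeps its loop structure and memory-access behavior while dropping the full computation. The C sketch below is only a hypothetical illustration of what skeletonizing a stencil-style kernel might look like; the kernel, its bounds, and the per-iteration counts in the comments are assumptions made for this example, not GROPHECY's actual skeleton notation.

    /* Original CPU kernel: a 2D 5-point stencil, a typical target for GPU
     * acceleration. (Illustrative example, not taken from the paper.) */
    void stencil(const float *in, float *out, int nx, int ny)
    {
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++)
                out[j * nx + i] = 0.2f * (in[j * nx + i] +
                                          in[j * nx + i - 1] + in[j * nx + i + 1] +
                                          in[(j - 1) * nx + i] + in[(j + 1) * nx + i]);
    }

    /* Hypothetical skeleton of the same kernel: only the loop structure, trip
     * counts, and per-iteration memory and arithmetic characteristics are kept,
     * since these are what an analytical GPU model needs. */
    void stencil_skeleton(int nx, int ny)
    {
        for (int j = 1; j < ny - 1; j++)       /* data-parallel, ny - 2 iterations */
            for (int i = 1; i < nx - 1; i++) { /* data-parallel, nx - 2 iterations */
                /* per iteration: 5 float loads (center + 4 neighbors, unit stride
                 * in i), 1 float store, about 5 floating-point operations */
            }
    }

Transformed variants of such a skeleton, mimicking tuned GPU implementations, are what feed the analytical model that projects GPU performance.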

Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
November 2011
866 pages
ISBN: 9781450307710
DOI: 10.1145/2063384

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

SC '11 paper acceptance rate: 74 of 352 submissions (21%)
Overall acceptance rate: 1,516 of 6,373 submissions (24%)
