GROPHECY: GPU performance projection from CPU code skeletons

Published: 12 November 2011

Abstract

We propose GROPHECY, a GPU performance projection framework that estimates the performance benefit of GPU acceleration without requiring actual GPU programming or hardware. Users need only skeletonize the pieces of CPU code that are targets for GPU acceleration. Code skeletons are automatically transformed in various ways to mimic tuned GPU codes with characteristics resembling real implementations. The synthesized characteristics are used by an existing analytical model to project GPU performance. The cost and benefit of GPU development can then be estimated from the transformed code skeleton that yields the best projected performance. With GROPHECY, users can leap toward GPU acceleration only when the cost-benefit tradeoff makes sense. The framework is validated using kernel benchmarks and data-parallel codes from legacy scientific applications. The measured performance of manually tuned codes deviates from the projected performance by a geometric mean of 17%.
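
GROPHECY's input is a code skeleton: a reduced form of a CPU loop nest that keeps its loop structure and memory-access behavior while dropping the full computation. The C sketch below is only a hypothetical illustration of what skeletonizing a stencil-style kernel might look like; the kernel, its bounds, and the per-iteration counts in the comments are assumptions made for this example, not GROPHECY's actual skeleton notation.

    /* Original CPU kernel: a 2D 5-point stencil, a typical target for GPU
     * acceleration. (Illustrative example, not taken from the paper.) */
    void stencil(const float *in, float *out, int nx, int ny)
    {
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++)
                out[j * nx + i] = 0.2f * (in[j * nx + i] +
                                          in[j * nx + i - 1] + in[j * nx + i + 1] +
                                          in[(j - 1) * nx + i] + in[(j + 1) * nx + i]);
    }

    /* Hypothetical skeleton of the same kernel: only the loop structure, trip
     * counts, and per-iteration memory and arithmetic characteristics are kept,
     * since these are what an analytical GPU model needs. */
    void stencil_skeleton(int nx, int ny)
    {
        for (int j = 1; j < ny - 1; j++)       /* data-parallel, ny - 2 iterations */
            for (int i = 1; i < nx - 1; i++) { /* data-parallel, nx - 2 iterations */
                /* per iteration: 5 float loads (center + 4 neighbors, unit stride
                 * in i), 1 float store, about 5 floating-point operations */
            }
    }

Transformed variants of such a skeleton, mimicking tuned GPU implementations, are what feed the analytical model that projects GPU performance.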

Published In

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
November 2011
866 pages
ISBN: 9781450307710
DOI: 10.1145/2063384

Publisher

Association for Computing Machinery

New York, NY, United States

Acceptance Rates

SC '11 paper acceptance rate: 74 of 352 submissions (21%)
Overall acceptance rate: 1,516 of 6,373 submissions (24%)
