DOI: 10.1145/2749469.2750375
research-article

Flexible software profiling of GPU architectures

Published: 13 June 2015

Abstract

To aid application characterization and architecture design space exploration, researchers and engineers have developed a wide range of tools for CPUs, including simulators, profilers, and binary instrumentation tools. With the advent of GPU computing, GPU manufacturers have developed similar tools leveraging hardware profiling and debugging hooks. To date, these tools are largely limited by the fixed menu of options provided by the tool developer and do not offer the user the flexibility to observe or act on events not in the menu. This paper presents SASSI (NVIDIA assembly code "SASS" Instrumentor), a low-level assembly-language instrumentation tool for GPUs. Like CPU binary instrumentation tools, SASSI allows a user to specify instructions at which to inject user-provided instrumentation code. These facilities allow strategic placement of counters and code into GPU assembly code to collect user-directed, fine-grained statistics at hardware speeds. SASSI instrumentation is inherently parallel, leveraging the concurrency of the underlying hardware. In addition to the details of SASSI, this paper provides four case studies that show how SASSI can be used to characterize applications and explore the architecture design space along the dimensions of instruction control flow, memory systems, value similarity, and resilience.
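The instrumentation model described in the abstract (select instructions, inject a user-provided handler at each one, and let the handler update counters) can be illustrated with a toy model. The following is a minimal sketch under stated assumptions, not SASSI's actual interface: `Instr`, `instrument`, `run`, and `count_memory_ops` are hypothetical names, the opcodes are placeholders rather than real SASS, and the program runs sequentially on the host instead of in parallel on a GPU.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Toy stand-in for one assembly instruction (hypothetical, not real SASS)."""
    opcode: str
    is_memory: bool = False


# A handler receives the instruction and the number of active threads.
Handler = Callable[[Instr, int], None]
Instrumented = List[Tuple[Instr, Optional[Handler]]]


def instrument(program: List[Instr],
               where: Callable[[Instr], bool],
               handler: Handler) -> Instrumented:
    """Attach `handler` to every instruction selected by the `where` predicate."""
    return [(ins, handler if where(ins) else None) for ins in program]


def run(instrumented: Instrumented, active_threads: int) -> None:
    """Walk the program, calling any injected handler before its instruction."""
    for ins, handler in instrumented:
        if handler is not None:
            handler(ins, active_threads)
        # The instruction itself would execute here; this toy model skips it.


counts: Counter = Counter()


def count_memory_ops(ins: Instr, active_threads: int) -> None:
    """User-provided instrumentation: tally memory operations per opcode."""
    counts[ins.opcode] += active_threads


program = [Instr("LD", True), Instr("FADD"), Instr("ST", True), Instr("BRA")]
run(instrument(program, lambda i: i.is_memory, count_memory_ops),
    active_threads=32)
print(dict(counts))  # {'LD': 32, 'ST': 32}
```

Only the two memory instructions receive the handler, so the arithmetic and branch instructions run without any added work. Choosing the injection sites with a predicate is what lets the user collect statistics outside a fixed menu of events.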



Information & Contributors

Information

Published In

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015
768 pages
ISBN: 9781450334020
DOI: 10.1145/2749469
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ISCA '15
Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Cited By

  • (2023) DrGPUM: Guiding Memory Optimization for GPU-Accelerated Applications. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (164-178). DOI: 10.1145/3582016.3582044. Online publication date: 25-Mar-2023
  • (2023) Hybrid PTX Analysis for GPU accelerated CNN inferencing aiding Computer Architecture Design. 2023 Forum on Specification & Design Languages (FDL) (1-8). DOI: 10.1109/FDL59689.2023.10272088. Online publication date: 13-Sep-2023
  • (2022) Performance Estimation of High-Level Dataflow Program on Heterogeneous Platforms by Dynamic Network Execution. Journal of Low Power Electronics and Applications 12:3 (36). DOI: 10.3390/jlpea12030036. Online publication date: 23-Jun-2022
  • (2022) ValueExpert: exploring value patterns in GPU-accelerated applications. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (171-185). DOI: 10.1145/3503222.3507708. Online publication date: 28-Feb-2022
  • (2022) FT-DeepNets: Fault-Tolerant Convolutional Neural Networks with Kernel-based Duplication. 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (1878-1887). DOI: 10.1109/WACV51458.2022.00194. Online publication date: Jan-2022
  • (2022) A Survey of Performance Optimization for Mobile Applications. IEEE Transactions on Software Engineering 48:8 (2879-2904). DOI: 10.1109/TSE.2021.3071193. Online publication date: 1-Aug-2022
  • (2022) Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources. 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO) (228-244). DOI: 10.1109/MICRO56248.2022.00029. Online publication date: Oct-2022
  • (2022) Flexible Binary Instrumentation Framework to Profile Code Running on Intel GPUs. 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (109-120). DOI: 10.1109/ISPASS55109.2022.00011. Online publication date: May-2022
  • (2022) Profiling Intel Graphics Architecture with Long Instruction Traces. 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (1-11). DOI: 10.1109/ISPASS55109.2022.00001. Online publication date: May-2022
  • (2021) Hybrid, scalable, trace-driven performance modeling of GPGPUs. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-15). DOI: 10.1145/3458817.3476221. Online publication date: 14-Nov-2021
