research-article

Flexible software profiling of GPU architectures

Authors:

Mark Stephenson,

Siva Kumar Sastry Hari,

Eiman Ebrahimi,

Daniel R. Johnson,

Stephen W. Keckler

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

Pages 185 - 197

https://doi.org/10.1145/2749469.2750375

Published: 13 June 2015

Abstract

To aid application characterization and architecture design space exploration, researchers and engineers have developed a wide range of tools for CPUs, including simulators, profilers, and binary instrumentation tools. With the advent of GPU computing, GPU manufacturers have developed similar tools leveraging hardware profiling and debugging hooks. To date, these tools are largely limited by the fixed menu of options provided by the tool developer and do not offer the user the flexibility to observe or act on events not in the menu. This paper presents SASSI (NVIDIA assembly code "SASS" Instrumentor), a low-level assembly-language instrumentation tool for GPUs. Like CPU binary instrumentation tools, SASSI allows a user to specify instructions at which to inject user-provided instrumentation code. These facilities allow strategic placement of counters and code into GPU assembly code to collect user-directed, fine-grained statistics at hardware speeds. SASSI instrumentation is inherently parallel, leveraging the concurrency of the underlying hardware. In addition to the details of SASSI, this paper provides four case studies that show how SASSI can be used to characterize applications and explore the architecture design space along the dimensions of instruction control flow, memory systems, value similarity, and resilience.
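As a rough illustration of the idea described in the abstract (user-selected instructions, user-provided handler code, counters collected per thread), here is a minimal sketch in plain Python. It is a sequential toy model, not SASSI's interface: the `Instr` type, the `instrument` and `run` functions, the opcode names, and the handler signature are all hypothetical names chosen for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """A toy instruction; the fields are assumptions made for this sketch."""
    opcode: str
    is_memory: bool = False
    is_branch: bool = False


# Hypothetical handler signature: (thread_id, instruction) -> None
Handler = Callable[[int, Instr], None]


def instrument(
    program: List[Instr],
    where: Callable[[Instr], bool],
    handler: Handler,
) -> List[Tuple[Instr, Optional[Handler]]]:
    """Attach `handler` before every instruction selected by `where`."""
    return [(instr, handler if where(instr) else None) for instr in program]


def run(instrumented: List[Tuple[Instr, Optional[Handler]]], num_threads: int) -> None:
    """Execute the toy program once per thread, calling injected handlers first."""
    for tid in range(num_threads):
        for instr, hook in instrumented:
            if hook is not None:
                hook(tid, instr)
            # The original instruction would execute here; the toy model skips it.


counts: Counter = Counter()


def count_memory_ops(tid: int, instr: Instr) -> None:
    """User-provided handler: tally how often each memory opcode executes."""
    counts[instr.opcode] += 1


program = [
    Instr("LD", is_memory=True),
    Instr("FADD"),
    Instr("ST", is_memory=True),
    Instr("BRA", is_branch=True),
]

instrumented = instrument(program, lambda i: i.is_memory, count_memory_ops)
run(instrumented, num_threads=4)
print(dict(counts))  # {'LD': 4, 'ST': 4}
```

The selection predicate and the handler are kept separate so that the same handler can be reused at different instrumentation sites, for example by selecting on `is_branch` instead of `is_memory`. On real hardware the handlers would run concurrently across threads, which this sequential loop does not model.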

Index Terms

Flexible software profiling of GPU architectures
1. Computing methodologies
  1. Computer graphics
    1. Graphics systems and interfaces
      1. Graphics processors
2. Hardware
  1. Hardware validation
  2. Integrated circuits
    1. Semiconductor memory

Information & Contributors

Information

Published In

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

June 2015

768 pages

ISBN:9781450334020

DOI:10.1145/2749469

General Chair: Debbie Marr (Intel)
Program Chair: David Albonesi (Cornell)

ACM SIGARCH Computer Architecture News Volume 43, Issue 3S
ISCA'15
June 2015
745 pages
ISSN:0163-5964
DOI:10.1145/2872887
Editor: Doug DeGroot

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ISCA '15

ISCA '15: The 42nd Annual International Symposium on Computer Architecture

June 13 - 17, 2015

Portland, Oregon

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 72
Total Downloads: 764
Downloads (Last 12 months): 94
Downloads (Last 6 weeks): 19

Reflects downloads up to 12 Dec 2024

Citations

Cited By

Lin M, Zhou K, Su P, Aamodt T, Jerger N, Swift M (2023). DrGPUM: Guiding Memory Optimization for GPU-Accelerated Applications. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pp. 164-178. DOI: 10.1145/3582016.3582044. Online publication date: 25-Mar-2023
https://dl.acm.org/doi/10.1145/3582016.3582044
Metz C, Plump C, Berger B, Drechsler R (2023). Hybrid PTX Analysis for GPU accelerated CNN inferencing aiding Computer Architecture Design. 2023 Forum on Specification & Design Languages (FDL), pp. 1-8. DOI: 10.1109/FDL59689.2023.10272088. Online publication date: 13-Sep-2023
https://doi.org/10.1109/FDL59689.2023.10272088
Bloch A, Casale-Brunet S, Mattavelli M (2022). Performance Estimation of High-Level Dataflow Program on Heterogeneous Platforms by Dynamic Network Execution. Journal of Low Power Electronics and Applications, 12:3 (36). DOI: 10.3390/jlpea12030036. Online publication date: 23-Jun-2022
https://doi.org/10.3390/jlpea12030036
Zhou K, Hao Y, Mellor-Crummey J, Meng X, Liu X, Falsafi B, Ferdman M, Lu S, Wenisch T (2022). ValueExpert: exploring value patterns in GPU-accelerated applications. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 171-185. DOI: 10.1145/3503222.3507708. Online publication date: 28-Feb-2022
https://dl.acm.org/doi/10.1145/3503222.3507708
Baek I, Chen W, Zhu Z, Samii S, Rajkumar R (2022). FT-DeepNets: Fault-Tolerant Convolutional Neural Networks with Kernel-based Duplication. 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1878-1887. DOI: 10.1109/WACV51458.2022.00194. Online publication date: Jan-2022
https://doi.org/10.1109/WACV51458.2022.00194
Hort M, Kechagia M, Sarro F, Harman M (2022). A Survey of Performance Optimization for Mobile Applications. IEEE Transactions on Software Engineering, 48:8, pp. 2879-2904. DOI: 10.1109/TSE.2021.3071193. Online publication date: 1-Aug-2022
https://doi.org/10.1109/TSE.2021.3071193
Darabi S, Sadrosadati M, Akbarzadeh N, Lindegger J, Hosseini M, Park J, Gomez-Luna J, Mutlu O, Sarbazi-Azad H (2022). Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources. 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 228-244. DOI: 10.1109/MICRO56248.2022.00029. Online publication date: Oct-2022
https://doi.org/10.1109/MICRO56248.2022.00029
Skaletsky A, Levit-Gurevich K, Berezalsky M, Kuznetcova Y, Yakov H (2022). Flexible Binary Instrumentation Framework to Profile Code Running on Intel GPUs. 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 109-120. DOI: 10.1109/ISPASS55109.2022.00011. Online publication date: May-2022
https://doi.org/10.1109/ISPASS55109.2022.00011
Levit-Gurevich K, Skaletsky A, Berezalsky M, Kuznetcova Y, Yakov H (2022). Profiling Intel Graphics Architecture with Long Instruction Traces. 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 1-11. DOI: 10.1109/ISPASS55109.2022.00001. Online publication date: May-2022
https://doi.org/10.1109/ISPASS55109.2022.00001
Arafa Y, Badawy A, ElWazir A, Barai A, Eker A, Chennupati G, Santhi N, Eidenbenz S, de Supinski B, Hall M, Gamblin T (2021). Hybrid, scalable, trace-driven performance modeling of GPGPUs. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. DOI: 10.1145/3458817.3476221. Online publication date: 14-Nov-2021
https://dl.acm.org/doi/10.1145/3458817.3476221
