NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs

Authors:

Mark Stephenson,

Stephen W. Keckler

MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture

Pages 372 - 383

https://doi.org/10.1145/3352460.3358307

Published: 12 October 2019

Abstract

Binary instrumentation frameworks are widely used to implement profilers, performance evaluation, error checking, and bug detection tools. While dynamic binary instrumentation tools such as Pin and DynamoRIO are supported on CPUs, GPU architectures currently have only limited support for similar capabilities through static compile-time tools, which prohibits instrumentation of the dynamically loaded libraries that are the foundation of modern high-performance applications. This work presents NVBit, a fast, dynamic, and portable binary instrumentation framework that allows users to write instrumentation tools in CUDA/C/C++ and selectively apply that functionality to pre-compiled binaries and libraries executing on NVIDIA GPUs. Using dynamic recompilation at the SASS level, NVBit analyzes GPU kernel register requirements to generate efficient ABI-compliant instrumented code without requiring the tool developer to have detailed knowledge of the underlying GPU architecture. NVBit allows basic-block instrumentation, multiple function injections to the same location, inspection of all ISA-visible state, dynamic selection of instrumented or uninstrumented code, permanent modification of register state, source code correlation, and instruction removal. NVBit supports all recent NVIDIA GPU architecture families, including Kepler, Maxwell, Pascal, and Volta, and works on any pre-compiled CUDA, OpenACC, OpenCL, or CUDA-Fortran application.
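
Three of the capabilities listed in the abstract (injecting functions before selected instructions, stacking multiple injections at the same location, and choosing at launch time between instrumented and uninstrumented code) can be illustrated with a small host-side toy model. The sketch below is not the NVBit API and runs no GPU code: `Instr`, `Kernel`, `insert_call`, and `launch` are hypothetical names, and the opcode strings are only illustrative.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List

# Toy model only: every name here is hypothetical and is NOT the NVBit API.


@dataclass
class Instr:
    opcode: str  # illustrative opcode string, e.g. "LDG" or "FFMA"
    hooks: List[Callable[["Instr"], None]] = field(default_factory=list)


@dataclass
class Kernel:
    name: str
    original: List[Instr]
    instrumented: List[Instr] = field(default_factory=list)
    use_instrumented: bool = False


def insert_call(kernel: Kernel, predicate: Callable[[str], bool],
                hook: Callable[[Instr], None]) -> None:
    """Attach `hook` before every instruction whose opcode satisfies `predicate`."""
    if not kernel.instrumented:
        # Build the instrumented copy on first use; the original stays untouched.
        kernel.instrumented = [Instr(i.opcode) for i in kernel.original]
    for instr in kernel.instrumented:
        if predicate(instr.opcode):
            instr.hooks.append(hook)  # several hooks may share one location


def launch(kernel: Kernel) -> List[str]:
    """Run the selected version of the kernel and return the opcodes 'executed'."""
    use_copy = kernel.use_instrumented and kernel.instrumented
    body = kernel.instrumented if use_copy else kernel.original
    executed = []
    for instr in body:
        for hook in instr.hooks:  # injected calls run before the instruction
            hook(instr)
        executed.append(instr.opcode)
    return executed


# Usage: count every opcode and, separately, record memory operations.
counts = Counter()
mem_ops = []
kernel = Kernel("saxpy", [Instr(op) for op in ["LDG", "LDG", "FFMA", "STG", "EXIT"]])
insert_call(kernel, lambda op: True, lambda i: counts.update([i.opcode]))
insert_call(kernel, lambda op: op in {"LDG", "STG"}, lambda i: mem_ops.append(i.opcode))

launch(kernel)                  # uninstrumented version: no hooks fire
kernel.use_instrumented = True
launch(kernel)                  # instrumented version: both hooks fire
```

Keeping the original instruction list intact and instrumenting a copy is what makes the per-launch selection cheap in this toy. The abstract states only that NVBit supports dynamic selection of instrumented or uninstrumented code, not how it implements it.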

Index Terms

NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Dynamic compilers
      2. Just-in-time compilers

Published In

MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture

October 2019

1104 pages

ISBN:9781450369381

DOI:10.1145/3352460

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

MICRO '52

Sponsor:

SIGMICRO

MICRO '52: The 52nd Annual IEEE/ACM International Symposium on Microarchitecture

October 12 - 16, 2019

Columbus, OH, USA

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%
