DOI: 10.1145/3352460.3358307

NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs

Published: 12 October 2019

Abstract

Binary instrumentation frameworks are widely used to implement profilers, performance evaluation tools, error checkers, and bug detectors. While dynamic binary instrumentation tools such as Pin and DynamoRIO are available on CPUs, GPU architectures currently have only limited support for similar capabilities, through static compile-time tools. This static approach prevents instrumentation of dynamically loaded libraries, which are the foundation of modern high-performance applications. This work presents NVBit, a fast, dynamic, and portable binary instrumentation framework that allows users to write instrumentation tools in CUDA/C/C++ and selectively apply that functionality to pre-compiled binaries and libraries executing on NVIDIA GPUs. Using dynamic recompilation at the SASS level, NVBit analyzes GPU kernel register requirements to generate efficient, ABI-compliant instrumented code without requiring the tool developer to have detailed knowledge of the underlying GPU architecture. NVBit allows basic-block instrumentation, multiple function injections at the same location, inspection of all ISA-visible state, dynamic selection of instrumented or uninstrumented code, permanent modification of register state, source code correlation, and instruction removal. NVBit supports all recent NVIDIA GPU architecture families, including Kepler, Maxwell, Pascal, and Volta, and works on any pre-compiled CUDA, OpenACC, OpenCL, or CUDA-Fortran application.
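
Three of the capabilities listed in the abstract (injecting functions before or after an instruction, attaching multiple injections to the same location, and choosing at run time between instrumented and uninstrumented code) can be illustrated with a small host-side model. The sketch below is a toy written in plain C++ under the assumption that a kernel can be treated as a flat list of opcodes. Every name in it (`ToyKernel`, `insert_call`, `enable_instrumented`, `run_demo`) is hypothetical and is not the NVBit API, and nothing here runs on a GPU or touches SASS.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy model; none of these names come from NVBit.
struct Instr {
    std::string opcode;  // e.g. "LDG", "FADD"
};

enum class InsertPoint { Before, After };

using Callback = std::function<void(const Instr&)>;

class ToyKernel {
public:
    explicit ToyKernel(std::vector<Instr> code) : code_(std::move(code)) {}

    std::size_t size() const { return code_.size(); }

    // Attach a callback to instruction idx; several callbacks may share
    // one location and run in insertion order.
    void insert_call(std::size_t idx, InsertPoint where, Callback cb) {
        auto& slot = (where == InsertPoint::Before) ? before_[idx] : after_[idx];
        slot.push_back(std::move(cb));
    }

    // Select, per launch, whether the instrumented version is used.
    void enable_instrumented(bool on) { instrumented_ = on; }

    void launch() const {
        for (std::size_t i = 0; i < code_.size(); ++i) {
            if (instrumented_) run(before_, i);
            // The original instruction would execute here.
            if (instrumented_) run(after_, i);
        }
    }

private:
    using Hooks = std::map<std::size_t, std::vector<Callback>>;

    void run(const Hooks& hooks, std::size_t i) const {
        auto it = hooks.find(i);
        if (it == hooks.end()) return;
        for (const auto& cb : it->second) cb(code_[i]);
    }

    std::vector<Instr> code_;
    Hooks before_, after_;
    bool instrumented_ = false;
};

struct Counts {
    std::uint64_t total = 0;  // all instructions seen
    std::uint64_t loads = 0;  // instructions whose opcode is "LDG"
};

// Example "tool": two independent callbacks injected at every instruction.
Counts run_demo(bool instrumented) {
    std::vector<Instr> code = {{"LDG"}, {"FADD"}, {"LDG"}, {"STG"}};
    ToyKernel k(code);
    Counts c;
    for (std::size_t i = 0; i < k.size(); ++i) {
        k.insert_call(i, InsertPoint::Before, [&c](const Instr&) { ++c.total; });
        k.insert_call(i, InsertPoint::Before, [&c](const Instr& in) {
            if (in.opcode == "LDG") ++c.loads;
        });
    }
    k.enable_instrumented(instrumented);
    k.launch();
    return c;
}
```

Calling `run_demo(true)` runs both callbacks at each of the four instructions, while `run_demo(false)` leaves the counters at zero because the uninstrumented path is selected. The model deliberately ignores register allocation and ABI concerns, which the abstract identifies as the part NVBit handles on the tool developer's behalf.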



Published In

MICRO '52: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture
October 2019
1104 pages
ISBN: 9781450369381
DOI: 10.1145/3352460

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. CUDA
  2. Dynamic binary instrumentation
  3. GPGPU
  4. GPU computing

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

MICRO '52

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%

Cited By

  • (2024) Test Coverage in Microservice Systems: An Automated Approach to E2E and API Test Coverage Metrics. Electronics 13:10 (1913). DOI: 10.3390/electronics13101913. Online publication date: 13-May-2024
  • (2024) Refining HPCToolkit for application performance analysis at exascale. The International Journal of High Performance Computing Applications 38:6 (612-632). DOI: 10.1177/10943420241277839. Online publication date: 30-Aug-2024
  • (2024) A Workflow for the Synthesis of Irregular Memory Access Microbenchmarks. Proceedings of the International Symposium on Memory Systems (219-234). DOI: 10.1145/3695794.3695816. Online publication date: 30-Sep-2024
  • (2024) Indigo3: A Parallel Graph Analytics Benchmark Suite for Exploring Implementation Styles and Common Bugs. ACM Transactions on Parallel Computing 11:3 (1-29). DOI: 10.1145/3665251. Online publication date: 15-May-2024
  • (2024) Snoopie: A Multi-GPU Communication Profiler and Visualizer. Proceedings of the 38th ACM International Conference on Supercomputing (525-536). DOI: 10.1145/3650200.3656597. Online publication date: 30-May-2024
  • (2024) Distributed Graph Neural Network Training: A Survey. ACM Computing Surveys 56:8 (1-39). DOI: 10.1145/3648358. Online publication date: 10-Apr-2024
  • (2024) Amanda: Unified Instrumentation Framework for Deep Neural Networks. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (1-18). DOI: 10.1145/3617232.3624864. Online publication date: 27-Apr-2024
  • (2024) Special Session: Reliability Assessment Recipes for DNN Accelerators. 2024 IEEE 42nd VLSI Test Symposium (VTS) (1-11). DOI: 10.1109/VTS60656.2024.10538707. Online publication date: 22-Apr-2024
  • (2024) ApproxDup: Developing an Approximate Instruction Duplication Mechanism for Efficient SDC Detection in GPGPUs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43:4 (1051-1064). DOI: 10.1109/TCAD.2023.3330821. Online publication date: Apr-2024
  • (2024) HiRace: Accurate and Fast Data Race Checking for GPU Programs. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (1-14). DOI: 10.1109/SC41406.2024.00042. Online publication date: 17-Nov-2024
