by Huang Songlin and Wu Chenshu from the University of Hong Kong.
Neutrino is a Probing-based GPU Kernel Profiler providing eBPF-like user experience for GPU Kernel Profiling, targeting:
- Fine-granularity: Directly works on instructions to offer the finest granularity that can be mapped to particular hardware units.
- Programmability: Extends the programmability of previous tools with user-defined probes that can cooperate with each other.
- Versatility: Supports both value profiling (register values such as memory addresses) and timing profiling (timestamps from the device-side clock).
- Hardware-Independence: Supports both NVIDIA/CUDA and AMD/ROCm, with more platforms to come!
- Ecosystem-Compatibility: Built-in compatibility with PyTorch (and everything on top, like Hugging Face), JAX, Triton, CUTLASS...
The foundations of this project are described in our OSDI '25 publication: Neutrino: Fine-grained GPU Kernel Profiling via Programmable Probing. Please consider citing this work if you use Neutrino! The official documentation contains more installation instructions, tutorials, internals, and the DMAT gallery!
- May 31, 2025: Neutrino's artifact received all badges (Available, Functional, Reproduced) from the OSDI '25 Artifact Evaluation!
The following demos are hosted on Colab; simply click Runtime -> Run All:

| Demo | Colab Link |
|---|---|
| Unveiling the block-scheduling cost of `torch.zeros` | |
| Visualizing FlashAttn-v2 Memory Access | |
| Warp Scheduling and Tailing Effect | |
Neutrino can be installed as a Python package from source. Building is fast (<30 seconds)!
```shell
# Virtual environment is highly recommended!
conda create -y -n <name> python=3.11 && conda activate <name>
git clone https://github.com/open-neutrino/neutrino
cd neutrino && python setup.py install && cd ..
neutrino --help  # test installation
```

Neutrino does not have pre-built wheels, so please DO NOT `pip install neutrino`!
Inspired by eBPF, a probe in Neutrino refers to a tiny sandboxed code snippet that can be attached to a GPU kernel at the assembly level (PTX, GCNAsm, SPIR-V) at runtime. Probes expose a new programmable interface beyond traditional programming and provide a convenient way to gain observability into the black-boxed GPU runtime.
Currently, Neutrino probes support two programming styles:
- A Pythonic tracing DSL, suitable for beginners.
- Direct assembly probes wrapped in TOML, suitable for advanced usage but platform-dependent.
```python
from neutrino import probe, Map
import neutrino.language as nl

CALLBACK = "block_sched.py"  # for trace analysis

# declare maps for persistence
@Map(level="warp", type="array", size=16, cap=1)
class block_sched:
    start: nl.u64
    elapsed: nl.u32
    cuid: nl.u32

# declare probe registers shared across probes
start: nl.u64 = 0    # starting clock
elapsed: nl.u64 = 0  # elapsed time, initialized to 0

# define probes with decorator
@probe(pos="kernel", level="warp", before=True)
def thread_start():
    start = nl.clock()

@probe(pos="kernel", level="warp")
def thread_end():
    elapsed = nl.clock() - start
    block_sched.save(start, elapsed, nl.cuid())
```
```toml
# CUDA PTX Assembly Example
callback = "block_sched.py"

[map.block_sched]
type = "array"
level = "warp"
size = "16"
cap = "1"

[probe.thread_start_thread_end]
position = "kernel"
level = "warp"
register = { u32 = 2, u64 = 3 }
before = """.reg .b64 %PD<3>;
.reg .b32 %P<2>;
mov.u64 %PD0, %clock64;"""
after = """mov.u64 %PD1, %clock64;
sub.u64 %PD1, %PD1, %PD0;
cvt.u32.u64 %P0, %PD1;
mov.u32 %P1, %smid;
SAVE [ block_sched ] {%PD0, %P0, %P1};"""
```
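Both probe variants above name a callback script (`block_sched.py`) for posterior trace analysis. The source does not show the trace-reading API, so the sketch below only illustrates the kind of analysis such a callback might perform, assuming the saved records have already been decoded into `(start, elapsed, cuid)` tuples (the decoding step and the `summarize` helper are hypothetical, not Neutrino's actual API):

```python
# Hypothetical sketch: aggregate per-SM block-scheduling statistics from
# decoded (start, elapsed, cuid) records saved by the block_sched map.
from collections import defaultdict

def summarize(records):
    """records: iterable of (start_clock, elapsed_cycles, sm_id) tuples."""
    per_sm = defaultdict(list)
    for start, elapsed, cuid in records:
        per_sm[cuid].append((start, elapsed))
    stats = {}
    for sm, entries in per_sm.items():
        starts = [s for s, _ in entries]
        elapsed = [e for _, e in entries]
        stats[sm] = {
            "blocks": len(entries),              # blocks scheduled on this SM
            "first_start": min(starts),          # earliest scheduling clock
            "avg_elapsed": sum(elapsed) / len(elapsed),
        }
    return stats

# Synthetic records: two blocks on SM 0, one on SM 1
demo = [(100, 50, 0), (120, 70, 0), (110, 40, 1)]
stats = summarize(demo)
```

Such a callback could then, for example, reveal load imbalance across SMs by comparing `blocks` and `first_start` between SM IDs.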
The interface of `@neutrino.probe` is inspired by Triton: the decorated contents are compiled, rather than executed, into platform-specific assemblies like the PTX example above. Probes of the same `level` and `pos` will be merged. The formulation (and the name) of `@neutrino.Map` is inspired by the eBPF Map. With structured definitions, Neutrino can provide safe (no illegal memory access) and efficient (race-free, no atomics) persistence.
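The race-free claim can be illustrated in plain Python: if each warp owns a dedicated slot derived from its ID, warps never contend for the same location, so no atomics are needed. This is only a conceptual sketch of the idea under that assumption, not Neutrino's actual implementation:

```python
# Conceptual sketch (not Neutrino's implementation): a fixed-size array
# map where each warp writes only to its own slot, making save()
# race-free without atomics.
class ArrayMap:
    def __init__(self, size, cap, fields):
        self.cap = cap            # max records per warp slot
        self.fields = fields      # declared field names
        self.data = [[] for _ in range(size)]  # one slot per warp

    def save(self, warp_id, *values):
        assert len(values) == len(self.fields)
        slot = self.data[warp_id]
        if len(slot) < self.cap:  # drop writes beyond capacity
            slot.append(dict(zip(self.fields, values)))

block_sched = ArrayMap(size=16, cap=1, fields=("start", "elapsed", "cuid"))
block_sched.save(3, 100, 60, 0)  # warp 3 records its timing
block_sched.save(3, 999, 99, 0)  # ignored: cap=1 already reached
print(block_sched.data[3])       # [{'start': 100, 'elapsed': 60, 'cuid': 0}]
```

Because the slot index is a pure function of the warp ID, bounds are known at declaration time, which is also what rules out illegal memory accesses.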
To simplify development, Neutrino also provides some helper functions / operands:
- `nl.clock()` / `nl.time()`: for reading the device-side clock and timer.
- `nl.addr` / `nl.out` / `nl.in1` / `nl.in2` / `nl.in3`: for reading register values.
- `Map.save()`: for persisting values for posterior analysis.
More information can be found in our documentation. If you have more platforms or workloads that need support, please raise an issue to let us know!
`neutrino` is designed to operate in the following workflow:
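As a rough mental model of that workflow (hook the runtime, rewrite the kernel assembly to embed probes, then launch the instrumented kernel), the sketch below injects `before`/`after` snippets around a kernel body with simple string manipulation. Real PTX/GCNAsm rewriting in Neutrino is considerably more involved; the `instrument` helper here is purely illustrative:

```python
# Simplified mental model (not the real implementation): splice probe
# snippets at the entry and exit of a kernel's assembly body.
def instrument(asm: str, before: str, after: str) -> str:
    """Insert `before` right after the opening '{' of the kernel body
    and `after` right before the final 'ret;'."""
    head, _, body = asm.partition("{")
    body, _, tail = body.rpartition("ret;")
    return head + "{\n" + before + "\n" + body + after + "\nret;" + tail

kernel = """.visible .entry demo() {
  // original kernel body
  ret;
}"""
probed = instrument(kernel,
                    "// probe: read %clock64",
                    "// probe: save elapsed")
```

After instrumentation, `probed` carries the entry probe before the original body and the exit probe before `ret;`, which is the shape both the DSL and TOML examples above describe.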
The source code is organized in the following structure:
```
neutrino
├── language            # DSL and Compiler, Still in Testing
│   ├── __init__.py     # DSL Primitives
│   ├── compiler.py     # Exported Compiler API
│   ├── frontend.py     # Parser and AST Transformer
│   ├── gcn.py          # AMD ROCm Codegen Backend
│   └── ptx.py          # CUDA PTX Codegen Backend
├── probe               # Probe Engine
│   ├── __init__.py     # Common Definitions and Utilities
│   ├── cuda.py         # CUDA PTX Impl
│   └── hip.py          # AMD ROCm Impl
├── src                 # Hook Driver
│   ├── common.h        # Platform-agnostic Definitions (GNU-only)
│   ├── cuda.c          # CUDA Impl (NVIDIA-related)
│   ├── hip.c           # ROCm Impl (AMD-related)
│   ├── preload.c       # Injector via LD_PRELOAD
│   ├── parse.py        # Generates Unhook API (NVIDIA/AMD)
│   ├── sha1.h          # Third-party header-only library
│   └── uthash.h        # Third-party header-only library
├── build.py            # Builder for the driver in src/
├── cli.py              # Command Line Interface Entry
├── common.py           # Common Internal API, not for user import
└── __init__.py         # Common definitions for user import, e.g. probe, Map
```
The overall structure is clean and approachable; we welcome developers to hack the system for their needs. Raise an issue if you need help.
- How to write my probe? Check the Probe Writing Guide.
- How are probes executed? Check the Probe Execution Model.
- How to read the neutrino trace? Check the Trace File Structure.
- How does Neutrino work and how can it be extended? Check the Reference and Internals.
- How good is Neutrino? Check the Utilities and Extensions.
If you use Neutrino in your research, please cite the paper below. We also welcome you to send us a link to your paper.
```bibtex
@inproceedings{huang2025neutrino,
  author    = {Songlin Huang and Chenshu Wu},
  title     = {Neutrino: Fine-grained GPU Kernel Profiling via Programmable Probing},
  booktitle = {19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25)},
  year      = {2025},
  url       = {https://www.usenix.org/conference/osdi25/presentation/huang-songlin},
  publisher = {USENIX Association},
}
```