[RFC] Generalized Per-Operator Device Capability Registry for PyTorch Operator Testing #154017
Labels: module: cpu, module: cuda, module: PrivateUse1, module: rocm, module: testing, module: xpu, rocm, triaged
🚀 The feature, motivation and pitch
RFC: Generalized Per-Operator Device Capability Registry for PyTorch Operator Testing
Summary
This RFC proposes the introduction of a Generalized Device Capability Registry to centrally declare and manage per-operator, per-device support information in PyTorch’s testing framework. It aims to replace manual, scattered constructs like `dtypesIfCUDA`, `skips`, and `xfail` in `OpInfo` declarations with a structured, extensible, and declarative registry.
Motivation
PyTorch's operator testing suite (`common_methods_invocations.py`) uses constructs like `dtypesIfCUDA`, `skips`, and backend-specific decorators directly inside each `OpInfo`. This approach leads to metadata that is hard to scale, duplicated across operators, and entangled with backend quirks (see Rationale below). This RFC proposes a scalable, maintainable abstraction for managing such variations across device backends.
Explanation
Problem
Currently, every operator test in PyTorch manually encodes its device-specific capabilities:
https://github.com/pytorch/pytorch/blob/a636a92ee9f9d31c1ee34416afabdc70da83f75c/torch/testing/_internal/common_methods_invocations.py#L12154
This approach does not scale as we introduce more ops, devices, dtypes, and constraints.
Proposed Solution
Introduce a device capability registry that allows per-operator support declarations in one centralized location. For example:
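A minimal sketch of what such centralized declarations might look like (the module path, `DEVICE_CAPABILITY_REGISTRY`, and `register_op_capability` are illustrative names, not existing PyTorch API):

```python
# Hypothetical central registry, e.g. torch/testing/_internal/device_capability.py
import torch

# operator name -> device type -> capability metadata
DEVICE_CAPABILITY_REGISTRY = {}

def register_op_capability(op_name, device_type, *, dtypes, skips=(), xfails=()):
    """Declare supported dtypes and known skips/xfails for an op on a device."""
    DEVICE_CAPABILITY_REGISTRY.setdefault(op_name, {})[device_type] = {
        "dtypes": set(dtypes),
        "skips": tuple(skips),
        "xfails": tuple(xfails),
    }

# Per-operator, per-device declarations live here instead of inside each OpInfo.
register_op_capability(
    "nn.functional.conv2d", "cuda",
    dtypes=(torch.float16, torch.float32, torch.float64),
)
register_op_capability(
    "nn.functional.conv2d", "cpu",
    dtypes=(torch.float32, torch.float64),
    xfails=("test_noncontiguous_samples",),
)
```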
Operators (via `OpInfo`) can query this at runtime:
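Continuing the sketch above (same hypothetical names), the query could be a simple lookup that replaces hard-coded `dtypesIfCUDA`-style fields:

```python
def supported_dtypes(op_name, device_type, default=frozenset()):
    """Return the dtypes registered for an op on a device, or a fallback."""
    entry = DEVICE_CAPABILITY_REGISTRY.get(op_name, {}).get(device_type)
    return entry["dtypes"] if entry is not None else set(default)

# What dtypesIfCUDA used to hard-code becomes a query against the registry:
cuda_dtypes = supported_dtypes("nn.functional.conv2d", "cuda")
cpu_dtypes = supported_dtypes("nn.functional.conv2d", "cpu")
```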
This separates core operator logic from backend variability and supports dynamic test generation, better maintainability, and easier backend integration.
Introduce New Class: `DeviceCapability`
Sample class for illustration. We should add all capabilities (e.g., dynamic shapes, memory layout, etc.):
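One possible shape for the class; the fields below are illustrative only and would be extended with the full set of capabilities:

```python
from dataclasses import dataclass

import torch

@dataclass(frozen=True)
class DeviceCapability:
    """Describes what a device backend supports, for test-generation purposes."""
    device_type: str                                      # e.g. "cuda", "cpu", "xpu"
    supported_dtypes: frozenset = frozenset()             # dtypes the backend handles
    supports_dynamic_shapes: bool = False
    supported_memory_formats: tuple = (torch.contiguous_format,)
    notes: str = ""                                       # free-form backend quirks
```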
Devices will register themselves so that `OpInfo` can reference the capability:
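For illustration, registration could be a one-time declaration per backend, reusing the `DeviceCapability` sketch above (`DEVICE_REGISTRY` and `register_device_capability` are assumed names):

```python
# Hypothetical per-backend registration, reusing the DeviceCapability sketch above.
DEVICE_REGISTRY = {}

def register_device_capability(cap):
    """Make a backend's declared capabilities discoverable by device type."""
    DEVICE_REGISTRY[cap.device_type] = cap

register_device_capability(DeviceCapability(
    device_type="cuda",
    supported_dtypes=frozenset({torch.float16, torch.bfloat16, torch.float32}),
    supports_dynamic_shapes=True,
))
register_device_capability(DeviceCapability(
    device_type="cpu",
    supported_dtypes=frozenset({torch.float32, torch.float64}),
))
```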
Integration with `OpInfo`
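Integration could then amount to resolving per-device dtypes from the registry rather than from per-`OpInfo` overrides. The subclass below is only a sketch against the registry assumed earlier; today's `OpInfo` does not expose such a hook:

```python
from torch.testing._internal.common_methods_invocations import OpInfo

class RegistryAwareOpInfo(OpInfo):
    """OpInfo variant that derives per-device dtypes from the capability registry
    (DEVICE_CAPABILITY_REGISTRY from the sketch above) instead of dtypesIfCUDA."""

    def supported_dtypes_for(self, device_type):
        entry = DEVICE_CAPABILITY_REGISTRY.get(self.name, {}).get(device_type)
        # Fall back to the op's declared default dtypes when nothing is registered.
        return entry["dtypes"] if entry is not None else set(self.dtypes)
```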
Rationale and conclusion
The current approach has certain limitations, such as:
- It does not scale to more than a few backends
- It is hard-coded and duplicated in each `OpInfo`
- It mixes backend quirks with operator metadata
This proposal introduces a structured, centralized mechanism to model device-specific operator support in PyTorch. It improves maintainability, supports backend extensibility, and encourages a more declarative and introspectable test infrastructure.
Alternatives
Alternatives considered
- Dynamic test filtering: harder to trace and debug.
- JSON/YAML device matrix: brittle and disconnected from test code.
Additional context
This builds upon the RFC introduced for device abstraction in the PyTorch frontend, pytorch/rfcs#66.
cc @ptrblck @msaroufim @eqy @jerryzh168 @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @NmomoN @mengpenghui @fwenguang @cdzhan @1274085042 @PHLens @albanD @gujinghui @EikanWang @fengyuan14 @guangyey