```python
from torchao import quantize_
from torchao.quantization.quant_api import Int8DynamicActivationInt8WeightConfig
import torch
from torch import nn

linear = nn.Linear(1024, 1024, device="cuda", dtype=torch.bfloat16)
quantize_(linear, Int8DynamicActivationInt8WeightConfig())
linear.compile()

x = torch.randn(1, 1024, device="cuda", dtype=torch.bfloat16)
with torch.no_grad():
    linear(x)
```
```
File /tmp/torchinductor_thien/h5/ch5dmihigr4vvkco4sbovpymym3e3c5ektb2qdmaja76c4k5dorl.py:198, in call(args)
    196 buf3 = empty_strided_cuda((1, 1024), (1024, 1), torch.int32)
    197 # Topologically Sorted Source Nodes: [data, linear], Original ATen: [aten.reciprocal, aten.mul, aten.add, aten.clamp, aten._to_copy, aten.view, aten._int_mm]
--> 198 extern_kernels._int_mm(buf2, reinterpret_tensor(arg1_1, (1024, 1024), (1, 1024), 0), out=buf3)
    199 del arg1_1
    200 del buf2

RuntimeError: self.size(0) needs to be greater than 16, but got 1
```
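Until inductor emits a proper fallback, one possible workaround (a sketch of my own, not a torchao API) is to pad the activation's batch dimension past the `M > 16` requirement and slice the result back:

```python
import torch
from torch import nn

M_MIN = 17  # assumption: cuBLAS int8 mm requires self.size(0) > 16

def call_padded(linear, x):
    # Hypothetical helper: pad the batch dim up to M_MIN with zero rows,
    # run the (quantized, compiled) linear, then slice off the padding.
    m = x.shape[0]
    if m >= M_MIN:
        return linear(x)
    pad = torch.zeros(M_MIN - m, x.shape[1], device=x.device, dtype=x.dtype)
    return linear(torch.cat([x, pad], dim=0))[:m]

# Illustrated on CPU with an unquantized layer; the quantized CUDA
# path from the repro above is the real target.
layer = nn.Linear(1024, 1024)
x = torch.randn(1, 1024)
out = call_padded(layer, x)
```

The extra rows are dead work, but each output row of a linear layer depends only on its own input row, so the sliced result matches the unpadded call.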
This looks more like an inductor issue:
- It doesn't fuse the int mm with the scaling, even though torchao sets the inductor flag (ao/torchao/quantization/utils.py, line 673 in 6243040).
- IIUC, cuBLAS int8 mm only works with M > 16 (as the error suggests), so inductor should codegen a fallback and NOT call cuBLAS when M <= 16.
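For reference, the semantics a codegen fallback would need are just an int8 × int8 → int32 matmul with no shape restriction. A minimal sketch (the helper name is mine, not inductor's):

```python
import torch

def int_mm_reference(a_int8, b_int8):
    # int8 x int8 matmul with int32 accumulation, valid for any M --
    # what a fallback would compute when cuBLAS can't be used.
    return a_int8.to(torch.int32) @ b_int8.to(torch.int32)

# M = 1, the shape that trips cuBLAS in the traceback above
a = torch.randint(-128, 128, (1, 1024), dtype=torch.int8)
b = torch.randint(-128, 128, (1024, 1024), dtype=torch.int8)
out = int_mm_reference(a, b)
```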
```
torch==2.7.1+cu128
torchao==0.12.0.dev20250614+cu128
```
Maybe it's because of the 5090 (sm120)?