```python
from torchao import quantize_
from torchao.quantization.quant_api import Int8DynamicActivationInt8WeightConfig
import torch
from torch import nn

linear = nn.Linear(1024, 1024, device="cuda", dtype=torch.bfloat16)
quantize_(linear, Int8DynamicActivationInt8WeightConfig())
linear.compile()

x = torch.randn(1, 1024, device="cuda", dtype=torch.bfloat16)
with torch.no_grad():
    linear(x)
```
```
File /tmp/torchinductor_thien/h5/ch5dmihigr4vvkco4sbovpymym3e3c5ektb2qdmaja76c4k5dorl.py:198, in call(args)
    196 buf3 = empty_strided_cuda((1, 1024), (1024, 1), torch.int32)
    197 # Topologically Sorted Source Nodes: [data, linear], Original ATen: [aten.reciprocal, aten.mul, aten.add, aten.clamp, aten._to_copy, aten.view, aten._int_mm]
--> 198 extern_kernels._int_mm(buf2, reinterpret_tensor(arg1_1, (1024, 1024), (1, 1024), 0), out=buf3)
    199 del arg1_1
    200 del buf2

RuntimeError: self.size(0) needs to be greater than 16, but got 1
```
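Until inductor emits a proper fallback, one possible workaround (a sketch of my own, not a torchao API) is to pad the activation's batch dimension past the `M > 16` requirement and slice the result back:

```python
import torch
from torch import nn

M_MIN = 17  # assumption: cuBLAS int8 mm requires self.size(0) > 16

def call_padded(linear, x):
    # Hypothetical helper: pad the batch dim up to M_MIN with zero rows,
    # run the (quantized, compiled) linear, then slice off the padding.
    m = x.shape[0]
    if m >= M_MIN:
        return linear(x)
    pad = torch.zeros(M_MIN - m, x.shape[1], device=x.device, dtype=x.dtype)
    return linear(torch.cat([x, pad], dim=0))[:m]

# Illustrated on CPU with an unquantized layer; the quantized CUDA
# path from the repro above is the real target.
layer = nn.Linear(1024, 1024)
x = torch.randn(1, 1024)
out = call_padded(layer, x)
```

The extra rows are dead work, but each output row of a linear layer depends only on its own input row, so the sliced result matches the unpadded call.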
This looks more like an inductor issue:
- It doesn't fuse the int mm with the scaling, even though torchao sets the inductor flag (ao/torchao/quantization/utils.py, line 673 in 6243040).
- IIUC, cuBLAS int8 mm only works with M > 16 (as the error suggests), so inductor should codegen a fallback and NOT call cuBLAS when M <= 16.
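For reference, the semantics a codegen fallback would need are just an int8 × int8 → int32 matmul with no shape restriction. A minimal sketch (the helper name is mine, not inductor's):

```python
import torch

def int_mm_reference(a_int8, b_int8):
    # int8 x int8 matmul with int32 accumulation, valid for any M --
    # what a fallback would compute when cuBLAS can't be used.
    return a_int8.to(torch.int32) @ b_int8.to(torch.int32)

# M = 1, the shape that trips cuBLAS in the traceback above
a = torch.randint(-128, 128, (1, 1024), dtype=torch.int8)
b = torch.randint(-128, 128, (1024, 1024), dtype=torch.int8)
out = int_mm_reference(a, b)
```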
```
torch==2.7.1+cu128
torchao==0.12.0.dev20250614+cu128
```
Maybe it's because of the 5090 (sm120)?