bug: int8 w8a8 doesn't work on 5090 · Issue #2376 · pytorch/ao · GitHub
bug: int8 w8a8 doesn't work on 5090 #2376
Open
@gau-nernst

Description

```python
import torch
from torch import nn

from torchao import quantize_
from torchao.quantization.quant_api import Int8DynamicActivationInt8WeightConfig

# int8 dynamic activation + int8 weight (w8a8) quantization
linear = nn.Linear(1024, 1024, device="cuda", dtype=torch.bfloat16)
quantize_(linear, Int8DynamicActivationInt8WeightConfig())
linear.compile()

# batch size 1 -> M=1 for the underlying int8 matmul
x = torch.randn(1, 1024, device="cuda", dtype=torch.bfloat16)
with torch.no_grad():
    linear(x)
```
```
File /tmp/torchinductor_thien/h5/ch5dmihigr4vvkco4sbovpymym3e3c5ektb2qdmaja76c4k5dorl.py:198, in call(args)
    196 buf3 = empty_strided_cuda((1, 1024), (1024, 1), torch.int32)
    197 # Topologically Sorted Source Nodes: [data, linear], Original ATen: [aten.reciprocal, aten.mul, aten.add, aten.clamp, aten._to_copy, aten.view, aten._int_mm]
--> 198 extern_kernels._int_mm(buf2, reinterpret_tensor(arg1_1, (1024, 1024), (1, 1024), 0), out=buf3)
    199 del arg1_1
    200 del buf2

RuntimeError: self.size(0) needs to be greater than 16, but got 1
```

This looks more like an inductor issue:

1. Inductor doesn't fuse the int8 matmul with the scaling mul, even though torchao sets the relevant inductor flag (`torch._inductor.config.force_fuse_int_mm_with_mul`) via `recommended_inductor_config_setter()`.
2. IIUC, the cuBLAS int8 matmul only works with M > 16 (as the error suggests) -> inductor should codegen a fallback that does NOT use cuBLAS when M <= 16. A minimal eager-mode check and a possible workaround are sketched right after this list.
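
For reference, the restriction reproduces in eager mode too, and zero-padding M past 16 is one hypothetical fallback (the pad/slice here is illustrative, not what inductor would actually emit):

```python
import torch
import torch.nn.functional as F

a = torch.randint(-128, 127, (1, 1024), dtype=torch.int8, device="cuda")
b = torch.randint(-128, 127, (1024, 1024), dtype=torch.int8, device="cuda")

# Eager mode hits the same cuBLAS restriction:
try:
    torch._int_mm(a, b)
except RuntimeError as e:
    print(e)  # self.size(0) needs to be greater than 16, but got 1

# Hypothetical workaround: zero-pad M up to 17, then slice the result.
a_padded = F.pad(a, (0, 0, 0, 17 - a.size(0)))  # (17, 1024)
out = torch._int_mm(a_padded, b)[: a.size(0)]   # (1, 1024), int32
```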
```
torch==2.7.1+cu128
torchao==0.12.0.dev20250614+cu128
```

Maybe it's because of the 5090 (sm120)?
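
To help narrow that down, here is a torchao-free sketch, assuming inductor lowers this pattern to the same `extern_kernels._int_mm` call as in the traceback above; if it fails the same way on other GPUs, the lowering (not sm120) is at fault:

```python
import torch

def scaled_int_mm(a_int8, b_int8, scale):
    # same shape of computation as the w8a8 path: int8 matmul + scaling mul
    return torch._int_mm(a_int8, b_int8).to(torch.bfloat16) * scale

fn = torch.compile(scaled_int_mm)
a = torch.randint(-128, 127, (1, 1024), dtype=torch.int8, device="cuda")
b = torch.randint(-128, 127, (1024, 1024), dtype=torch.int8, device="cuda")
scale = torch.rand(1024, device="cuda", dtype=torch.bfloat16)
fn(a, b, scale)  # expected to raise the same M > 16 RuntimeError
```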
