[ROCm] Incorrect number of arguments passed to kernel #140800


Closed

jataylo opened this issue Nov 15, 2024 · 5 comments
Labels
module: rocm (AMD GPU support for PyTorch), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

jataylo (Collaborator) commented Nov 15, 2024

🐛 Describe the bug

When using ROCm-specific tuning parameters for custom Triton kernels under torch.compile, we run into an exception.

This seems related to PR #137236, but simply adding `matrix_instr_nonkdim`, `waves_per_eu` and `kpack` to `SPECIAL_CONFIG_NAMES` does not seem to be enough to fix it, as `kwargs` is empty at that point.

Reproducer:

import torch
import triton
from triton import language as tl

@triton.autotune(
    configs=[
        # matrix_instr_nonkdim is an AMD-specific tuning knob supplied via
        # triton.Config kwargs; it is not a parameter of the kernel itself.
        triton.Config({"BLOCK_SIZE": 4, "matrix_instr_nonkdim": 32}, num_stages=3, num_warps=8),
    ],
    key=[],
)
@triton.jit
def add_kernel_autotuned(
    in_ptr0,
    in_ptr1,
    out_ptr,
    n_elements,
    BLOCK_SIZE: "tl.constexpr",
):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(in_ptr0 + offsets, mask=mask)
    y = tl.load(in_ptr1 + offsets, mask=mask)
    output = x + y
    tl.store(out_ptr + offsets, output, mask=mask)

@torch.compile(fullgraph=True)
def add_fn(x, y):
    output = torch.zeros_like(x)
    n_elements = output.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel_autotuned[grid](x, y, output, n_elements)
    return output

x = torch.randn(4, device="cuda")
y = torch.randn(4, device="cuda")
out = add_fn(x, y)
print(f"Vector addition of\nX:\t{x}\nY:\t{y}\nis equal to\n{out}")

Traceback:

W1115 11:37:10.193439 313 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:595] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
W1115 11:37:10.193439 313 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:595] [0/0] Traceback (most recent call last):
W1115 11:37:10.193439 313 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:595] [0/0]   File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 574, in identify_mutated_tensors
W1115 11:37:10.193439 313 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:595] [0/0]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
W1115 11:37:10.193439 313 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:595] [0/0]   File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 202, in generate_ttir
W1115 11:37:10.193439 313 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:595] [0/0]     raise ValueError("Incorrect number of arguments passed to kernel")
W1115 11:37:10.193439 313 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:595] [0/0] ValueError: Incorrect number of arguments passed to kernel
{'in_ptr0': FakeTensor(..., device='cuda:0', size=(4,)), 'in_ptr1': FakeTensor(..., device='cuda:0', size=(4,)), 'out_ptr': FakeTensor(..., device='cuda:0', size=(4,)), 'n_elements': 4, 'BLOCK_SIZE': 4, 'matrix_instr_nonkdim': 32}
['in_ptr0', 'in_ptr1', 'out_ptr', 'n_elements', 'BLOCK_SIZE']

Versions

Tip of tree

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @hongxiayang @naromero77amd

pytorch-bot bot added the module: rocm (AMD GPU support for PyTorch) label Nov 15, 2024
jataylo (Collaborator, Author) commented Nov 15, 2024

cc'ing @aakhundov in case anything jumps to mind on this one. Is this condition https://github.com/pytorch/pytorch/blob/main/torch/_higher_order_ops/triton_kernel_wrap.py#L199 even valid? If we support supplying implicit args via kwargs, then we shouldn't expect `kwargs` and `kernel.arg_names` to match.
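For reference, the condition at that line (quoted again further down in this thread):

if len(kwargs) != len(kernel.arg_names):
    raise ValueError("Incorrect number of arguments passed to kernel")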

cpuhrsch added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Nov 16, 2024
aakhundov (Contributor) commented

Thanks for flagging, @jataylo, I'll take a look. Is the intended use case to pass the AMD-specific args like `matrix_instr_nonkdim`, `waves_per_eu` and `kpack` as part of `triton.Config`'s kwargs dict (first argument) without having them in the kernel parameters?

jataylo (Collaborator, Author) commented Nov 18, 2024

@aakhundov Correct, we pass the AMD-specific args via kwargs, either as implemented in this reproducer or through the kwargs of the kernel launch, e.g.:

extra_kargs = {"waves_per_eu": 4, "matrix_instr_nonkdim": 16, "kpack": 2}
 
_fwd_kernel[grid](
    q_extend,
    k_extend,
    ...,
    ...,
    **extra_kargs,
)
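For reference, a self-contained version of that second case (a minimal sketch based on the reproducer above; `add_kernel` here is hypothetical, just the autotuned kernel without its decorator, so the AMD-specific options are supplied at the launch site instead):

import torch
import triton
from triton import language as tl

@triton.jit
def add_kernel(in_ptr0, in_ptr1, out_ptr, n_elements, BLOCK_SIZE: "tl.constexpr"):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(in_ptr0 + offsets, mask=mask)
    y = tl.load(in_ptr1 + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

@torch.compile(fullgraph=True)
def add_fn(x, y):
    output = torch.zeros_like(x)
    n_elements = output.numel()
    grid = (triton.cdiv(n_elements, 4),)
    # AMD-specific options passed as extra launch kwargs; they are not
    # parameters of add_kernel, which trips the same length check.
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=4,
                     waves_per_eu=4, matrix_instr_nonkdim=16, kpack=2)
    return output

x = torch.randn(4, device="cuda")
y = torch.randn(4, device="cuda")
out = add_fn(x, y)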

Both cases run into this issue. I couldn't see a solid way to resolve this in the code without challenging this assumption:

if len(kwargs) != len(kernel.arg_names):
    raise ValueError("Incorrect number of arguments passed to kernel")
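For illustration, one way to relax that assumption (a hedged sketch with hypothetical names, not the actual fix that landed) is to filter out known backend-specific launch options before the comparison, the same way native Triton tolerates such excess kwargs:

# Illustrative sketch only; the set and helper name are hypothetical.
# AMD backend launch options may appear in kwargs without being
# parameters of the kernel.
AMD_LAUNCH_OPTIONS = {"matrix_instr_nonkdim", "waves_per_eu", "kpack"}

def check_kernel_args(kernel, kwargs):
    # Drop backend-only launch options before comparing against the
    # kernel's declared parameter list.
    filtered = {k: v for k, v in kwargs.items() if k not in AMD_LAUNCH_OPTIONS}
    if len(filtered) != len(kernel.arg_names):
        raise ValueError("Incorrect number of arguments passed to kernel")
    return filtered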

aakhundov added a commit that referenced this issue Nov 19, 2024

Fixes #140800.

On AMD, backend-specific args like `matrix_instr_nonkdim`, `waves_per_eu` and `kpack` are passed either directly to the kernel or via `triton.Config`, whereas they don't exist as kernel parameters. Native Triton code handles those excess args [here](https://github.com/triton-lang/triton/blob/a6bb57d6285e723c58e87dd7cba263db6efff789/python/triton/runtime/jit.py#L594-L596). In this PR, we add similar handling to the TTIR analysis code to avoid bailing out.

ghstack-source-id: 084f3e3
Pull Request resolved: #141062
aakhundov (Contributor) commented

@jataylo #141062 should fix the error showing up here.

aakhundov added a commit that referenced this issue Nov 19, 2024

ghstack-source-id: 70037b8
Pull Request resolved: #141062
cyyever pushed a commit to cyyever/pytorch that referenced this issue Nov 20, 2024 (pytorch#141062, approved by https://github.com/oulgen)
jataylo (Collaborator, Author) commented Nov 20, 2024

Perfect, thank you for investigating this @aakhundov!

youssef62 pushed a commit to youssef62/pytorch that referenced this issue Nov 23, 2024 (pytorch#141062)
Ryo-not-rio pushed a commit to Ryo-not-rio/pytorch that referenced this issue Dec 2, 2024 (pytorch#141062)
jataylo pushed a commit to ROCm/pytorch that referenced this issue Dec 4, 2024 (pytorch#141062, cherry picked from commit b740a1b)
pobin6 pushed a commit to pobin6/pytorch that referenced this issue Dec 5, 2024 (pytorch#141062)
pruthvistony pushed a commit to ROCm/pytorch that referenced this issue Jan 31, 2025 (#1768, cherry picked from commit b740a1b; co-authored by Adnan Akhundov <aakhundov@fb.com>)
Labels
module: rocm, triaged

Projects
Status: Done

Development
No branches or pull requests

3 participants