8000 [Codegen][GPU] Add placeholder op for buffer casts on tensors by qedawkins · Pull Request #20589 · iree-org/iree · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[Codegen][GPU] Add placeholder op for buffer casts on tensors #20589

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 30, 2025

Conversation

qedawkins
Copy link
Contributor

Introduces iree_gpu.buffer_resource_cast to represent where to insert amdgpu.fat_raw_buffer_cast ops when bufferizing. It also supports taking a cache_swizzle stride. To avoid colliding the the existing plumbing of amdgpu buffer resources, there is a pass that walks producers of these cast ops and either drops the casts or drops the annotation on the binding.

Right now the verification for cache_swizzle is simply checking for a chain of single-use producers, but in the future we can add more involved verification that checks that all users specify the same cache swizzle value.

Copy link
Contributor
@krzysz00 krzysz00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm missing something: could you explain why we need this and also why it isn't folded into the "can we use buffers for this" checking pass?

@qedawkins
Copy link
Contributor Author

Four main reasons for the op:

  1. We need a way to represent casts other than on the binding. For example, we can do the cast inside a hot loop to potentially reduce VGPR pressure and turn VALU operations into SALU ops.
  2. Setting the value to use for cache swizzle is easier to do in tensor land before we've tiled and distributed than trying to reverse engineer it during bufferization.
  3. The op is marginally more future proof against future patterns on the HAL/Stream side that could collapse a binding into a single one with multiple offsets. We'd still need some kind of "no-alias" equivalent info though.
  4. Admittedly specific to my use case, but I want to be able to set a cache swizzle value from hand-written IR that propagates up to the buffer resource cast. The only way I could think to do that was with an op that represents the cast but folds away if there are any unexpected producers.

I might be able to think of more reasons, but I'm quite sure we want the op based on what I've seen Triton do with buffer instructions.

why it isn't folded into the "can we use buffers for this" checking pass?

Good q, I wasn't quite sure how best to do that yet, and because I was hand coding the op in my use case I kind of cheated with a separate pass. Rolling the DropResourceCasts pass into the checking pass makes sense after I just reviewed the code there.

@qedawkins
Copy link
Contributor Author

@krzysz00 still missing some more tests, but this should be what we discussed. Lot's of room for improvement in the future too, but PTAL whenever you have a moment.

Copy link
Contributor
@krzysz00 krzysz00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks reasonable - I'm much happier with the design here

Copy link
Contributor
@krzysz00 krzysz00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - is there a followup patch where we start using this more?

if |input| bufferizes to `storage_buffer` memory space. If |input| resolves
to any other memory space this op is silently dropped and has no effect.

If |cache_swizzle_stride| is present, there is verification before
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why're you using |param|? Is that a style thing I'm unaware of?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's pretty common through the rest of the codebase, e.g.

  • // |data| will be freed with |allocator| when the buffer is deinitialized.
    // If the data is not owned then iree_allocator_null can be used to no-op the
    // free.
    //
    // |access| can be used to control who (guest, host, etc) and how (read/write)
    // the buffer may be accessed. If the allocation being wrapped has its own
    // access requirements (read-only, etc) the caller must specify those flags.
  • Aligns |value| up to the given power-of-two |alignment| if required.

but isn't formal style as far as I am aware, I just like it over backticks or quotes.

@qedawkins
Copy link
Contributor Author

is there a followup patch where we start using this more?

Implemented? No :P

Yes in the semi near future though. I need to work out/benchmark how to pick the right cache swizzle value before we get benefit of on-by-default, but the hope is it should be easy to generate after operand promotion (maybe earlier before workgroup tiling).

Introduces iree_gpu.buffer_resource_cast to represent where to insert
amdgpu.fat_raw_buffer_cast ops when bufferizing. It also supports taking
a cache_swizzle stride. To avoid colliding the the existing plumbing of
amdgpu buffer resources, there is a pass that walks producers of these
cast ops and either drops the casts or drops the annotation on the
binding.

Right now the verification for cache_swizzle is simply checking for
a chain of single-use producers, but in the future we can add more
involved verification that checks that all users specify the same cache
swizzle value.
@qedawkins
Copy link
Contributor Author

All non-bazel tests passed last night. Going to merge through remaining ONNX tests.

@qedawkins qedawkins merged commit 6b7acc0 into iree-org:main Apr 30, 2025
41 checks passed
@qedawkins qedawkins deleted the cache_swizzle branch April 30, 2025 16:42
nirvedhmeshram pushed a commit that referenced this pull request May 6, 2025
Introduces iree_gpu.buffer_resource_cast to represent where to insert
amdgpu.fat_raw_buffer_cast ops when bufferizing. It also supports taking
a cache_swizzle stride. To avoid colliding the the existing plumbing of
amdgpu buffer resources, there is a pass that walks producers of these
cast ops and either drops the casts or drops the annotation on the
binding.

Right now the verification for cache_swizzle is simply checking for a
chain of single-use producers, but in the future we can add more
involved verification that checks that all users specify the same cache
swizzle value.
KyleHerndon pushed a commit to KyleHerndon/iree that referenced this pull request May 7, 2025
…rg#20589)

Introduces iree_gpu.buffer_resource_cast to represent where to insert
amdgpu.fat_raw_buffer_cast ops when bufferizing. It also supports taking
a cache_swizzle stride. To avoid colliding the the existing plumbing of
amdgpu buffer resources, there is a pass that walks producers of these
cast ops and either drops the casts or drops the annotation on the
binding.

Right now the verification for cache_swizzle is simply checking for a
chain of single-use producers, but in the future we can add more
involved verification that checks that all users specify the same cache
swizzle value.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0