[Codegen][GPU] Add placeholder op for buffer casts on tensors #20589

qedawkins · 2025-04-21T19:56:07Z

Introduces iree_gpu.buffer_resource_cast to represent where to insert amdgpu.fat_raw_buffer_cast ops when bufferizing. It also supports taking a cache_swizzle stride. To avoid colliding with the existing plumbing of amdgpu buffer resources, there is a pass that walks producers of these cast ops and either drops the casts or drops the annotation on the binding.

Right now the verification for cache_swizzle is simply checking for a chain of single-use producers, but in the future we can add more involved verification that checks that all users specify the same cache swizzle value.
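To make the "chain of single-use producers" check concrete, here is a minimal sketch in Python. It is a toy model, not the IREE implementation: the `Op` class, its fields, and `has_single_use_producer_chain` are hypothetical names invented for illustration, and each op is assumed to have at most one operand.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Op:
    """Toy stand-in for an IR op with one result (hypothetical, not an MLIR class)."""
    name: str
    operand: Optional["Op"] = None            # producer of this op's input, if any
    users: List["Op"] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Register this op as a user of its producer.
        if self.operand is not None:
            self.operand.users.append(self)


def has_single_use_producer_chain(cast: Op) -> bool:
    """Return True if every producer above `cast` has exactly one user."""
    producer = cast.operand
    while producer is not None:
        if len(producer.users) != 1:
            return False
        producer = producer.operand
    return True


# A straight-line chain passes the check.
load = Op("load")
slice_op = Op("slice", load)
cast = Op("buffer_resource_cast", slice_op)
print(has_single_use_producer_chain(cast))
```

Once any producer in the chain gains a second user, the check fails. That is the case the description says a future, more involved verification could handle by comparing the cache swizzle values requested by all users.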

krzysz00

I'm missing something: could you explain why we need this and also why it isn't folded into the "can we use buffers for this" checking pass?

qedawkins · 2025-04-23T23:53:29Z

Four main reasons for the op:

1. We need a way to represent casts other than on the binding. For example, we can do the cast inside a hot loop to potentially reduce VGPR pressure and turn VALU operations into SALU ops.
2. Setting the value to use for cache swizzle is easier to do in tensor land before we've tiled and distributed than trying to reverse engineer it during bufferization.
3. The op is marginally more future proof against future patterns on the HAL/Stream side that could collapse a binding into a single one with multiple offsets. We'd still need some kind of "no-alias" equivalent info though.
4. Admittedly specific to my use case, but I want to be able to set a cache swizzle value from hand-written IR that propagates up to the buffer resource cast. The only way I could think to do that was with an op that represents the cast but folds away if there are any unexpected producers.
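The "folds away if there are any unexpected producers" behavior, together with the drop-the-cast or drop-the-binding-annotation choice from the PR description, can be sketched as a small decision function. This is an illustrative Python model under stated assumptions: the op-kind strings, the `EXPECTED_PRODUCERS` allow-list, and `resolve_cast` are all hypothetical, and the mapping from condition to outcome is one reading of the description rather than the actual pass logic.

```python
from typing import List, Set

# Hypothetical allow-list of producer kinds the walk may look through.
EXPECTED_PRODUCERS: Set[str] = {"extract_slice", "expand_shape", "collapse_shape"}


def resolve_cast(producer_chain: List[str]) -> str:
    """Decide what happens to one cast, given its producers listed nearest first.

    Assumed reading: if the walk reaches the binding through expected producers
    only, the cast wins and the binding's own annotation is dropped so the two
    mechanisms do not collide. Otherwise the cast folds away.
    """
    for kind in producer_chain:
        if kind == "binding":
            return "drop_binding_annotation"
        if kind not in EXPECTED_PRODUCERS:
            return "drop_cast"
    # The walk ran out of producers without ever reaching a binding.
    return "drop_cast"


print(resolve_cast(["extract_slice", "binding"]))
print(resolve_cast(["matmul", "binding"]))
```

Making the cast fold away rather than fail keeps hand-written IR valid even when the producers are not what the cast expected.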

I might be able to think of more reasons, but I'm quite sure we want the op based on what I've seen Triton do with buffer instructions.

> why it isn't folded into the "can we use buffers for this" checking pass?

Good q, I wasn't quite sure how best to do that yet, and because I was hand coding the op in my use case I kind of cheated with a separate pass. Rolling the DropResourceCasts pass into the checking pass makes sense after I just reviewed the code there.

qedawkins · 2025-04-28T23:19:35Z

@krzysz00 still missing some more tests, but this should be what we discussed. Lots of room for improvement in the future too, but PTAL whenever you have a moment.

krzysz00

Overall this looks reasonable - I'm much happier with the design here

compiler/src/iree/compiler/Codegen/Common/GPU/GPUBubbleResourceCasts.cpp

krzysz00

LGTM - is there a followup patch where we start using this more?

krzysz00 · 2025-04-29T21:32:58Z

compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUOps.td

+    if |input| bufferizes to `storage_buffer` memory space. If |input| resolves
+    to any other memory space this op is silently dropped and has no effect.
+
+    If |cache_swizzle_stride| is present, there is verification before


Why're you using |param|? Is that a style thing I'm unaware of?

It's pretty common through the rest of the codebase, e.g.

iree/runtime/src/iree/vm/buffer.h

Lines 76 to 82 in 72f1157

// |data| will be freed with |allocator| when the buffer is deinitialized.
// If the data is not owned then iree_allocator_null can be used to no-op the
// free.
//
// |access| can be used to control who (guest, host, etc) and how (read/write)
// the buffer may be accessed. If the allocation being wrapped has its own
// access requirements (read-only, etc) the caller must specify those flags.

iree/compiler/src/iree/compiler/Dialect/Util/IR/UtilOps.td

Line 345 in 72f1157

Aligns |value| up to the given power-of-two |alignment| if required.

but isn't formal style as far as I am aware, I just like it over backticks or quotes.

qedawkins · 2025-04-29T22:22:04Z

> is there a followup patch where we start using this more?

Implemented? No :P

Yes, in the semi-near future though. I need to work out/benchmark how to pick the right cache swizzle value before we get the benefit of enabling it by default, but the hope is it should be easy to generate after operand promotion (maybe earlier, before workgroup tiling).

qedawkins · 2025-04-30T16:42:00Z

All non-bazel tests passed last night. Going to merge through remaining ONNX tests.

qedawkins requested review from krzysz00 and kuhar April 21, 2025 19:56

qedawkins requested review from MaheshRavishankar, Groverkss, antiagainst and hanhanW as code owners April 21, 2025 19:56

krzysz00 reviewed Apr 22, 2025

qedawkins force-pushed the cache_swizzle branch from 9037460 to a79751b Compare April 28, 2025 23:18

krzysz00 reviewed Apr 28, 2025

qedawkins force-pushed the cache_swizzle branch from a79751b to 4c9081b Compare April 29, 2025 19:27

qedawkins requested a review from krzysz00 April 29, 2025 19:27

qedawkins mentioned this pull request Apr 29, 2025

[Codegen][AMDGPU] Add pingpong to default gfx942 tuning #20678

Merged

krzysz00 approved these changes Apr 29, 2025

qedawkins added 5 commits April 30, 2025 10:03

Rebase resource cast PR

04156bc

Make casts bubble optimistically instead

5451355

add more tests

b938e2e

Fix bazel

34f9f7b

qedawkins force-pushed the cache_swizzle branch from 8cd8e0c to 34f9f7b Compare April 30, 2025 14:09

qedawkins added 2 commits April 30, 2025 11:01

wrong TU

618e647

missing dep

8369b00

qedawkins merged commit 6b7acc0 into iree-org:main Apr 30, 2025
41 checks passed

qedawkins deleted the cache_swizzle branch April 30, 2025 16:42

ScottTodd mentioned this pull request May 5, 2025

Release tracker - 3.4.0 (2025-05-05) #20361

Closed

6 tasks
