[Codegen][ROCDL] Use buffer fat pointers where possible #19918
Conversation
Ok, so I ran the UNet kernel gemm benchmarks from iree-kernel-benchmark on sharkmi300x-3. I'm not sure I believe these numbers given how wide the variance is; 3% ± 8% is ... not bad, I guess? I'll take a look at some assembly diffs and see if anything jumps out at me here.
Raw TFlops Numbers
Peeking at the anonymized assembly diff on [...]:

(That's one of the ones that performed mildly worse, I realized afterwards.) Similarly, the diffs for [...].

In conclusion: ... On second thought, my rocMLIR "going to global loads sometimes helps" numbers were from an index=i32-where-safe universe, so maybe part of what we're seeing is just the effect of needing fewer registers.
@krzysz00 how should I interpret those values in #19918 (comment)? Are negative values an improvement or a regression? What's the unit? How repeatable/noisy is each configuration?
@kuhar Right, I left off that detail. The unit's % increase in TFlops, since I went with [...]. I don't know how repeatable these are; I'd assumed that the iree-kernel-benchmark scripts do something reasonable to smooth out the worst of the noise. (The two TFlops columns and the name column are straight out of the iree-kernel-benchmark CSVs.)
Ah, makes sense. Looks promising!
It does but we don't report what the noise levels are. Could you see if this helps with SDXL (unet)? The reported number should be much more stable and meaningful than these micro benchmarks. |
Ok, after fixing some bugs that I'll need to toss upstream, we have a correct SDXL unet run, done as close to what the CI does as I can manage, and I'm pleased to report a 2 ms improvement from this PR.

Commands used to produce these [get to ~/test_unet or some other scratch directory]:
pytest ../iree/experimental/regression_suite/shark-test-suite-models/sdxl/test_unet.py
iree-benchmark-module --device=hip --device_allocator=caching --module=sdxl_unet_fp16_vmfbs/model.rocm_gfx942.vmfb --parameters=model=artifacts/sdxl_unet_fp16/real_weights.irpa --function=run_forward --input=1x4x128x128xf16 --input=2x64x2048xf16 --input=2x1280xf16 --input=2x6xf16 --input=1xf16 --input=1xi64 --benchmark_repetitions=10 --benchmark_min_warmup_time=3.0

I'm leaving out the individual 10 runs since, for both benchmarks, there wasn't much more than 0.1-0.2 ms of drift between the runs.

My branch:
Main:
I'll note I ran the benchmark commands twice and saw consistent results.

In conclusion ... using the buffer ops is something we'll likely want to productionize, since they save us from some arithmetic. I suspect some of this is the benefit of forcing the memref indexing logic into i32, but ... hey, I'll take alternate routes to the goal.
To give the stats another way, this PR makes UNet about 1.027x faster, which isn't as good as the per-kernel numbers implied but is still nice.
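For reference, the arithmetic behind that figure (a back-of-envelope check from the two numbers above, not a separate measurement):

$$ \text{speedup} \approx \frac{t_\text{main}}{t_\text{main} - 2\ \text{ms}} \approx 1.027 \;\Longrightarrow\; t_\text{main} \approx 76\ \text{ms}, \quad t_\text{branch} \approx 74\ \text{ms}. $$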
Note from codegen pre-sync: get this landed under a flag.
@manupak I'm pretty sure it gets called decently early after bufferization; I can't remember the exact spot, but I know you can grep for it, and it does happen reliably.
Ok yea, I saw the effects of it after my experiments.
matchAndRewrite(amdgpu::FatRawBufferCastOp op, OpAdaptor adaptor,
                ConversionPatternRewriter &rewriter) const override {
Type newTy = getTypeConverter()->convertType(op.getType()); | ||
if (!newTy) |
Nit: Add braces { } here.
op, newTy, adaptor.getSource(), adaptor.getValidBytes(),
adaptor.getCacheSwizzleStride(), adaptor.getBoundsCheck(),
adaptor.getResetOffset());
LLVM_DEBUG(llvm::dbgs() << "Bf16Emulation: new op: " << newOp << "\n"); |
I don't know if we want to add such debug prints... The dialect conversion already does a pretty good job of debug output.
Existing practice in the file, but I'm good to delete this
@@ -912,6 +913,24 @@ std::optional<SmallVector<int64_t>> getMmaNativeVectorSize(Operation *op) {
  return std::nullopt;
}

bool hasGlobalMemoryAddressSpace(MemRefType memrefType) { |
This is crying for an attribute interface...
Just minor issues, but overall looks good to me.
Awesome! It looks good to me. Another pair of eyes might be worth it, but good to go from me.
Note: depends on the next integrate, and then I might need to fix a Bazel issue.
Blocked by llvm/llvm-project#131386
This commit enables using the `#amdgpu.address_space<fat_raw_buffer>` address space for memref subspans in our kernels in cases where it's safe to do so. This both gains us immediate performance improvements (-2 ms on unet) and enables further performance improvements on MI-300 through the use of the cache swizzling mechanism.

Note: this PR currently depends on a bunch of open upstream work.

# Context

AMD GPUs, on top of having normal `global_{load,store,...}_T` instructions, which take a 64-bit pointer, also have `buffer_*` instructions, which use the texture unit to enable more complex logic.

These instructions, for our purposes, take a "buffer resource" (a base pointer into global memory + flags) and an offset. The resource takes 4 **scalar** registers (and thus must be wave-uniform) and the offset is 1 **vector** register, which is added to the base pointer to form the effective address of the buffer access.

This means that we can use these buffer operations if we can be assured that the non-uniform part of our accesses never exceeds 2 GB from the start of whatever we decide the base pointer is.

IREE will often pass in N copies of the same pointer and then take global offsets from that pointer to locate the logical bindings/tensors within the model's data. Fortunately, these offsets are globally uniform.

Therefore, we can conservatively construct buffer descriptors by finding the start of a given bound tensor/memref (which is always in a contiguous layout) and using that as a base pointer. Then, if we know that the memref is 2 GB or smaller, we can use the buffer instructions to access it.

# How this patch works

1. Before bufferization, we run `iree-rocdl-configure-buffer-instructions` / `ROCDLConfigureBufferInstructions`, which annotates `hal.interface.binding.subspan` operations with the key `iree_gpu.use_rocdl_buffer_instructions` as a discardable attribute on the subspans *if* we can prove that said subspans are safe to access with buffer instructions.
2. During bufferization, if `iree_gpu.use_rocdl_buffer_instructions` is set, then we use the `#amdgpu.address_space<fat_raw_buffer>` memory space instead of the `#hal.descriptor_type<*>` one. This required refactoring a few passes that were using `#hal.descriptor_type` as a synonym for "global memory" in tests, but that was a fairly small change.
3. Finally, in `iree-codegen-convert-hal-descriptor-type-to-gpu-address-space`, we implement the buffer fat pointer ABI by converting these buffer bindings to normal global bindings and then adding an `amdgpu.fat_raw_buffer_cast` with the `resetOffset` option set to fold in the global byte offset. (I'm not entirely happy with this part, and maybe we should move it to bufferization.) A rough IR sketch of this flow is included below.

So it's recorded: +~3% on SDXL on MI-300 from this.

Signed-off-by: Elias Joseph <eljoseph@amd.com>
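Restating the addressing constraint from the Context section as a rough formula (this is just a rephrasing of the prose above, with symbolic names introduced for illustration): a buffer access computes roughly

$$ \text{addr} = \text{base}_{\text{resource (SGPRs)}} + \text{offset}_{\text{VGPR}}, \qquad \text{offset}_{\text{VGPR}} < 2^{31}\ \text{bytes} = 2\ \text{GB}, $$

so a binding is only eligible when its entire contiguous extent, and therefore any per-lane offset into it, can be bounded by 2 GB.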
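To make that flow concrete, here is a rough, hand-written IR sketch. The binding shape, the subspan operands, and the attribute placement are simplified stand-ins (not copied from this PR's tests), and the `amdgpu.fat_raw_buffer_cast` form follows my reading of the upstream AMDGPU dialect, so treat the syntax as approximate:

```mlir
// (1) iree-rocdl-configure-buffer-instructions tags a subspan it can prove
//     stays within 2 GB (operands and layout elided for brevity):
//
//   %binding = hal.interface.binding.subspan ... offset(%off)
//       {iree_gpu.use_rocdl_buffer_instructions}
//       : memref<4096xf16, #hal.descriptor_type<storage_buffer>>
//
// (2) Bufferization gives the tagged binding the fat_raw_buffer address space
//     instead of #hal.descriptor_type<storage_buffer>.
//
// (3) The descriptor-type conversion pass then implements the ABI: a plain
//     global binding plus a cast, where resetOffset folds the binding's byte
//     offset into the buffer descriptor's base pointer.
func.func @buffer_binding_sketch(%global: memref<4096xf16, strided<[1], offset: ?>>)
    -> memref<4096xf16, #amdgpu.address_space<fat_raw_buffer>> {
  %buffer = amdgpu.fat_raw_buffer_cast %global resetOffset
      : memref<4096xf16, strided<[1], offset: ?>>
        to memref<4096xf16, #amdgpu.address_space<fat_raw_buffer>>
  return %buffer : memref<4096xf16, #amdgpu.address_space<fat_raw_buffer>>
}
```

Loads and stores through `%buffer` are then what ultimately let the backend select `buffer_*` instructions instead of `global_*` ones for these accesses.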