[Codegen][ROCDL] Use buffer fat pointers where possible by krzysz00 · Pull Request #19918 · iree-org/iree · GitHub

[Codegen][ROCDL] Use buffer fat pointers where possible #19918


Merged: 5 commits into iree-org:main on Mar 19, 2025

Conversation

@krzysz00 (Contributor) commented Feb 5, 2025

This commit enables using the #amdgpu.address_space<fat_raw_buffer> address space for memref subspans in our kernels in cases where it's safe to do so. This both gains us immediate performance improvements (-2ms on unet) and enables further performance improvements on MI-300 through the use of the cache swizzling mechanism.

Note: this PR currently depends on a bunch of open upstream work.

Context

AMD GPUs, on top of having normal global_{load,store,...}_T instructions, which take a 64-bit pointer, also have buffer_* instructions, which use the texture unit to enable more complex logic.

These instructions, for our purposes, take a "buffer resource" (a base pointer into global memory + flags) and an offset. The resource takes 4 scalar registers (and thus must be wave-uniform) and the offset is 1 vector register, which is added to the base pointer to form the effective address of the buffer access.

This means that we can use these buffer operations if we can be assured that the non-uniform part of our accesses never exceeds 2 GB from the start of whatever we decide the base pointer is.
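To make that constraint concrete, here is a rough C++ sketch of the addressing model; the struct and function here are invented for this description and are not IREE, LLVM, or hardware code, just an illustration:

```cpp
#include <cstdint>
#include <optional>

// The "buffer resource": four scalar registers' worth of state, shared by
// the whole wave, so everything in it must be wave-uniform.
struct BufferResource {
  uint64_t basePointer; // base address in global memory
  uint32_t numRecords;  // bytes reachable through this descriptor
  uint32_t flags;       // cache/swizzle/format configuration
};

// Each lane supplies a 32-bit byte offset in a vector register; that 32-bit
// width is where the ~2 GB reach limit comes from. The hardware bounds-checks
// the offset against numRecords: out-of-bounds loads read 0 and
// out-of-bounds stores are dropped.
std::optional<uint64_t> effectiveAddress(const BufferResource &rsrc,
                                         uint32_t laneOffsetBytes) {
  if (laneOffsetBytes >= rsrc.numRecords)
    return std::nullopt;
  return rsrc.basePointer + laneOffsetBytes;
}
```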

IREE will often pass in N copies of the same pointer and then take global offsets from that pointer to locate the logical bindings/tensors within the model's data. Fortunately, these offsets are globally uniform.

Therefore, we can conservatively construct buffer descriptors by finding the start of a given bound tensor/memref (which is always in a contiguous layout) and using that as a base pointer. Then, if we know that the memref is 2 GB or smaller, we can use the buffer instructions to access it.
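As a rough illustration of the condition being proven (this is not the actual pass logic, which works on MLIR memref types; it just shows the shape of the check for a statically-shaped, contiguous binding):

```cpp
#include <cstdint>
#include <vector>

// Sketch of the "2 GB or smaller" condition on a contiguous, statically
// shaped binding. The real analysis has to handle dynamic shapes, element
// types, and layouts more carefully; this only shows the idea.
constexpr uint64_t kBufferReachBytes = uint64_t(1) << 31; // 2 GB

bool fitsInBufferReach(const std::vector<int64_t> &shape,
                       int64_t elementBytes) {
  uint64_t totalBytes = static_cast<uint64_t>(elementBytes);
  for (int64_t dim : shape) {
    if (dim < 0)
      return false; // dynamic dimension: size can't be bounded statically
    totalBytes *= static_cast<uint64_t>(dim);
  }
  return totalBytes <= kBufferReachBytes;
}

// e.g. a 2048x1280 f16 binding is ~5 MB, comfortably within reach:
//   fitsInBufferReach({2048, 1280}, /*elementBytes=*/2) == true
```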

How this patch works

  1. Before bufferization, we run iree-rocdl-configure-buffer-instructions / ROCDLConfigureBufferInstructions, which annotates `hal.interface.binding.subspan` operations with the key `iree_gpu.use_rocdl_buffer_instructions` as a discardable attribute on the subspans if we can prove that said subspans are safe to access through buffer instructions.
  2. During bufferization, if `iree_gpu.use_rocdl_buffer_instructions` is set, then we use the `#amdgpu.address_space<fat_raw_buffer>` memory space instead of the `#hal.descriptor_type<*>` one. This required refactoring a few passes that were using `#hal.descriptor_type` as a synonym for "global memory" in tests, but that was a fairly small change.
  3. Finally, in `iree-codegen-convert-hal-descriptor-type-to-gpu-address-space`, we implement the buffer fat pointer ABI by converting these buffer bindings to normal global bindings and then adding an `amdgpu.fat_raw_buffer_cast` with the `resetOffset` option set to fold in the global byte offset; a conceptual sketch of what that folding does follows the list. (I'm not entirely happy with this part and maybe we should move it to bufferization.)
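For intuition on step 3, here is a conceptual sketch of what folding the global byte offset via `resetOffset` amounts to. The names are invented for illustration; the real lowering happens on MLIR memrefs and the `amdgpu.fat_raw_buffer_cast` op, not on structs like these:

```cpp
#include <cstdint>

struct BufferDescriptor {
  uint64_t basePointer; // wave-uniform base, lives in scalar registers
  uint32_t numRecords;  // bytes reachable through this descriptor
};

// A global binding is "base pointer + globally uniform byte offset". With
// resetOffset, that uniform offset is folded into the descriptor's base, so
// every remaining (possibly non-uniform) index is a small 32-bit offset from
// the start of the bound memref rather than from the start of the allocation.
BufferDescriptor castWithResetOffset(uint64_t bindingBasePointer,
                                     uint64_t uniformByteOffset,
                                     uint32_t memrefSizeInBytes) {
  return BufferDescriptor{bindingBasePointer + uniformByteOffset,
                          memrefSizeInBytes};
}
```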

So that it's recorded: roughly +3% on SDXL on MI-300 from this.

@krzysz00 krzysz00 changed the title [Very unmergable] Buffer fat pointer experiments [Very unmergeable] Buffer fat pointer experiments Feb 5, 2025
@krzysz00 (Contributor Author) commented Feb 5, 2025

Ok, so I ran the UNet kernel gemm benchmarks from iree-kernel-benchmark on sharkmi300x-3. I'm not sure I believe these numbers given how wide the variance is; 3% ± 8% is ... not bad, I guess?

I'll take a look at some assembly diffs, see if anything jumps out at me here.

Stat Value
Average 3.036367
Std. Dev. 8.519886
Min -21.7395
Max 38.63633
25%ile -1.08869
Median 1.951279
75%ile 4.742739
Raw TFlops Numbers
Problem Changed Tflops Main Tflops % Inc
gemm_2048_10240_1280_f16_f32 298.2616 296.6138 0.555537
gemm_2048_1280_1280_f16_f32 94.5195 87.1544 8.450635
gemm_2048_1280_5120_f16_f32 168.8273 166.7301 1.257841
gemm_128_1280_2048_f16_f32 9.7259 12.4276 -21.7395
gemm_8192_640_640_f16_f32 100.1625 93.2068 7.462653
gemm_8192_640_2560_f16_f32 194.5184 197.379 -1.44929
gemm_8192_5120_640_f16_f32 299.9279 298.2616 0.558671
gemm_64_640_2048_f16_f32 2.6214 2.6631 -1.56584
gemm_64_1280_2048_f16_f32 5.412 5.084 6.451613
gemm_1024_1280_1280_f16_f32 51.6222 51.6222 0
gemm_1024_1280_5120_f16_f32 104.0448 110.9237 -6.20147
gemm_1024_10240_1280_f16_f32 225.576 223.6962 0.840336
gemm_4096_640_640_f16_f32 54.1201 54.1201 0
gemm_4096_640_2560_f16_f32 130.3085 126.6205 2.912641
gemm_4096_5120_640_f16_f32 227.4877 221.8475 2.542377
gemm_2048_10240_1280_f16_f32_tB 367.7198 353.2045 4.109602
gemm_2048_1280_1280_f16_f32_tB 95.8698 94.5195 1.428594
gemm_2048_1280_5120_f16_f32_tB 175.448 175.448 0
gemm_128_1280_2048_f16_f32_tB 12.4276 11.9837 3.704198
gemm_8192_640_640_f16_f32_tB 98.6895 110.0145 -10.2941
gemm_8192_640_2560_f16_f32_tB 204.9126 211.3665 -3.05342
gemm_8192_5120_640_f16_f32_tB 365.2183 350.896 4.081637
gemm_64_640_2048_f16_f32_tB 2.8436 2.8926 -1.69398
gemm_64_1280_2048_f16_f32_tB 6.2138 5.5924 11.11151
gemm_1024_1280_1280_f16_f32_tB 54.1201 59.9186 -9.6773
gemm_1024_1280_5120_f16_f32_tB 138.3688 122.0161 13.40208
gemm_1024_10240_1280_f16_f32_tB 298.2616 250.8743 18.88886
gemm_4096_640_640_f16_f32_tB 76.2601 55.0073 38.63633
gemm_4096_640_2560_f16_f32_tB 150.8064 134.2177 12.35955
gemm_4096_5120_640_f16_f32_tB 305.0403 250.8743 21.59089
gemm_2048_10240_1280_bf16_f32 305.0403 293.3721 3.97727
gemm_2048_1280_1280_bf16_f32 94.5195 84.9479 11.26761
gemm_2048_1280_5120_bf16_f32 168.8273 161.7081 4.402501
gemm_128_1280_2048_bf16_f32 11.7735 12.662 -7.01706
gemm_8192_640_640_bf16_f32 91.93 94.5195 -2.73965
gemm_8192_640_2560_bf16_f32 197.379 190.3798 3.67644
gemm_8192_5120_640_bf16_f32 299.9279 290.2005 3.351958
gemm_64_640_2048_bf16_f32 2.6214 2.5041 4.684318
gemm_64_1280_2048_bf16_f32 5.7852 4.9345 17.23984
gemm_1024_1280_1280_bf16_f32 51.6222 50.0812 3.077003
gemm_1024_1280_5120_bf16_f32 110.0145 104.8576 4.918003
gemm_1024_10240_1280_bf16_f32 225.576 220.0291 2.520985
gemm_4096_640_640_bf16_f32 54.1201 56.8719 -4.83859
gemm_4096_640_2560_bf16_f32 130.3085 127.8264 1.941774
gemm_4096_5120_640_bf16_f32 225.576 225.576 0
gemm_2048_10240_1280_bf16_f32_tB 355.5437 360.316 -1.32448
gemm_2048_1280_1280_bf16_f32_tB 100.1625 98.6895 1.49256
gemm_2048_1280_5120_bf16_f32_tB 189.0391 176.6023 7.042264
gemm_128_1280_2048_bf16_f32_tB 11.1848 11.5705 -3.33348
gemm_8192_640_640_bf16_f32_tB 111.8481 106.522 5
gemm_8192_640_2560_bf16_f32_tB 206.4888 209.7152 -1.53847
gemm_8192_5120_640_bf16_f32_tB 350.896 344.148 1.960784
gemm_64_640_2048_bf16_f32_tB 2.6631 2.7962 -4.76003
gemm_64_1280_2048_bf16_f32_tB 5.7852 5.6872 1.723168
gemm_1024_1280_1280_bf16_f32_tB 53.261 51.6222 3.174603
gemm_1024_1280_5120_bf16_f32_tB 123.1355 118.7768 3.669656
gemm_1024_10240_1280_bf16_f32_tB 255.6528 253.241 0.952373
gemm_4096_640_640_bf16_f32_tB 67.1089 56.8719 18.0001
gemm_4096_640_2560_bf16_f32_tB 135.5735 136.9569 -1.0101
gemm_4096_5120_640_bf16_f32_tB 253.241 253.241 0

@krzysz00 (Contributor Author) commented Feb 5, 2025

Peeking at the anonymized assembly diff on gemm_64_640_2048_f16_f32, I'm mainly seeing

  • The changes from global_* to buffer_* I expected
  • Less or different math, as there's a touch less 64-bit arithmetic running around (though this is the sort of thing I'd expect to also clean up with aggressive use of nsw, nusw, and other compiler hints, even without buffers)
  • Somewhat different instruction schedules (the f16=>f32 conversions moved up a bit, some waitcnts were collapsed into a wait-to-0)

(That's one of the ones that performed mildly worse, I realized afterwards)

Similarly, the diff for gemm_1024_10240_1280_f16_f32_tB - one of the big improvements - mainly shows arithmetic changes, the expected instruction swaps, and some knock-on scheduling effects.

In conclusion:
a) Huh?
b) We should keep going

... On second thought, my rocMLIR "going to global loads sometimes helps" numbers were from an index=i32-where-safe universe, so maybe part of what we're seeing here is just the effect of needing fewer registers.

@kuhar (Member) commented Feb 5, 2025

@krzysz00 how should we interpret the values in #19918 (comment)? Are negative values an improvement or a regression? What's the unit? How repeatable/noisy is each configuration?

@krzysz00 (Contributor Author) commented Feb 5, 2025

@kuhar Right, I left out the details. The unit is % increase in TFlops, since I went with (tflops_my_change - tflops_main) / tflops_main * 100, so positive is better.
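As a sanity check of that formula against one row of the table above (plain C++, just arithmetic):

```cpp
#include <cstdio>

// % increase = (tflops_my_change - tflops_main) / tflops_main * 100,
// checked against the gemm_2048_1280_1280_f16_f32 row above.
int main() {
  double changed = 94.5195, baseline = 87.1544;
  std::printf("%.6f\n", (changed - baseline) / baseline * 100); // ~8.450635
  return 0;
}
```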

I don't know how repeatable these are; I'd assumed that the iree-kernel-benchmark scripts do something reasonable to smooth out the worst of the noise.

(The two TFlops columns and the name column are straight out of iree-kernel-benchmark CSVs)

@kuhar (Member) commented Feb 5, 2025

Ah, makes sense. Looks promising!

> I'd assumed that the iree-kernel-benchmark scripts do something reasonable to smooth out the worst of the noise

It does, but we don't report what the noise levels are.

Could you see if this helps with SDXL (unet)? The reported number should be much more stable and meaningful than these micro benchmarks.

@krzysz00 (Contributor Author) commented Feb 6, 2025

Ok, after fixing some bugs that I'll need to toss upstream, we have a correct SDXL unet run, done as close to what the CI does as I can manage, and I'm pleased to report a 2 ms improvement from this PR.

Commands used to produce these (run from ~/test_unet or some other scratch directory):
pytest ../iree/experimental/regression_suite/shark-test-suite-models/sdxl/test_unet.py
iree-benchmark-module --device=hip --device_allocator=caching --module=sdxl_unet_fp16_vmfbs/model.rocm_gfx942.vmfb --parameters=model=artifacts/sdxl_unet_fp16/real_weights.irpa --function=run_forward --input=1x4x128x128xf16 --input=2x64x2048xf16 --input=2x1280xf16 --input=2x6xf16 --input=1xf16 --input=1xi64 --benchmark_repetitions=10 --benchmark_min_warmup_time=3.0
These were run for this PR and for a checkout of eb19497, the commit on `main` this branch started from.

I'm leaving out the individual 10 runs since, for both benchmarks, there wasn't more than 0.1-0.2 ms of drift between the runs.

My branch:

BM_run_forward/process_time/real_time_mean         70.7 ms         71.7 ms           10 items_per_second=14.138/s
BM_run_forward/process_time/real_time_median       70.7 ms         71.7 ms           10 items_per_second=14.1368/s
BM_run_forward/process_time/real_time_stddev      0.107 ms        0.115 ms           10 items_per_second=0.0214142/s
BM_run_forward/process_time/real_time_cv           0.15 %          0.16 %            10 items_per_second=0.15%

Main:

BM_run_forward/process_time/real_time_mean         72.6 ms         74.3 ms           10 items_per_second=13.777/s
BM_run_forward/process_time/real_time_median       72.6 ms         74.4 ms           10 items_per_second=13.776/s
BM_run_forward/process_time/real_time_stddev      0.097 ms        0.099 ms           10 items_per_second=0.0184087/s
BM_run_forward/process_time/real_time_cv           0.13 %          0.13 %            10 items_per_second=0.13%

I'll note that I ran the benchmark commands twice and saw consistent results.

In conclusion ... using the buffer ops is something we'll likely want to productionize, since they save us some arithmetic.

I suspect some of this is the benefit of forcing the memref indexing logic into i32, but ... hey, I'll take alternate routes to the goal.

@krzysz00 (Contributor Author) commented Feb 6, 2025

To give the stats another way, this PR makes UNet about 1.027x faster (72.6 ms / 70.7 ms), which isn't as good as the per-kernel numbers implied but is still nice.

@krzysz00 (Contributor Author) commented Feb 6, 2025

Note from codegen pre-sync: get this landed under a flag

@manupak (Contributor) commented Feb 11, 2025

@krzysz00 @kuhar where do we call the pass createEraseHALDescriptorTypeFromMemRefPass in any of our GPU pass pipelines?

@krzysz00 (Contributor Author) commented:

@manupak I'm pretty sure it gets called decently early after bufferization - I can't remember the exact spot, but I know you can grep for it and it does happen reliably.

@manupak (Contributor) commented Feb 12, 2025

Ok yeah, I saw the effects of it after my experiments.
(It's just that I could not see it in print-after-all.)

@krzysz00 krzysz00 changed the title [Very unmergeable] Buffer fat pointer experiments [Codegen][ROCDL] Use buffer fat pointers where possible Feb 17, 2025
@krzysz00 krzysz00 marked this pull request as ready for review February 17, 2025 19:06
matchAndRewrite(amdgpu::FatRawBufferCastOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const override {
Type newTy = getTypeConverter()->convertType(op.getType());
if (!newTy)
Reviewer comment (Contributor):
Nit: Add { }

op, newTy, adaptor.getSource(), adaptor.getValidBytes(),
adaptor.getCacheSwizzleStride(), adaptor.getBoundsCheck(),
adaptor.getResetOffset());
LLVM_DEBUG(llvm::dbgs() << "Bf16Emulation: new op: " << newOp << "\n");
Reviewer comment (Contributor):
I don't know if we want to add such debug prints... The dialect conversion already does a pretty good job of debug output.

@krzysz00 (Contributor Author) replied:
Existing practice in the file, but I'm good to delete this

@@ -912,6 +913,24 @@ std::optional<SmallVector<int64_t>> getMmaNativeVectorSize(Operation *op) {
return std::nullopt;
}

bool hasGlobalMemoryAddressSpace(MemRefType memrefType) {
Reviewer comment (Contributor):
This is crying for an attribute interface...

@MaheshRavishankar (Contributor) left a comment:
Just minor issues, but overall looks good to me.

@MaheshRavishankar (Contributor) left a comment:
Awesome! It looks good to me. Another pair of eyes might be worth it, but good to go from me.

@krzysz00 (Contributor Author) commented Mar 6, 2025

Note: this depends on the next integrate, and then I might need to fix a Bazel issue.

@krzysz00 krzysz00 force-pushed the buffer-experiments branch from 541edef to f776277 Compare March 11, 2025 20:29
@krzysz00 (Contributor Author) commented:
Blocked by llvm/llvm-project#131386

@krzysz00 krzysz00 force-pushed the buffer-experiments branch from 85fdd6f to 8c5206b Compare March 18, 2025 18:41
@krzysz00 krzysz00 merged commit 2dd6e83 into iree-org:main Mar 19, 2025
57 of 61 checks passed
Eliasj42 pushed a commit that referenced this pull request Mar 24, 2025