[Codegen][ROCDL] Use buffer fat pointers where possible by krzysz00 · Pull Request #19918 · iree-org/iree · GitHub

[Codegen][ROCDL] Use buffer fat pointers where possible #19918


Merged: 5 commits into iree-org:main on Mar 19, 2025

Conversation

@krzysz00 (Contributor) commented Feb 5, 2025

This commit enables using the #amdgpu.address_space<fat_raw_buffer> address space for memref subspans in our kernels in cases where it's safe to do so. This both gains us immediate performance improvements (-2ms on unet) and enables further performance improvements on MI-300 through the use of the cache swizzling mechanism.

Note: this PR currently depends on a bunch of open upstream work.

Context

AMD GPUs, on top of having normal global_{load,store,...}_T instructions, which take a 64-bit pointer, also have buffer_* instructions, which use the texture unit to enable more complex logic.

These instructions, for our purposes, take a "buffer resource" (a base pointer into global memory + flags) and an offset. The resource takes 4 scalar registers (and thus must be wave-uniform) and the offset is 1 vector register, which is added to the base pointer to form the effective address of the buffer access.

This means that we can use these buffer operations if we can be assured that the non-uniform part of our accesses never exceeds 2 GB from the start of whatever we decide the base pointer is.
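To make that constraint concrete, here is a rough C++ sketch of the addressing model; the struct and function here are invented for this description and are not IREE, LLVM, or hardware code, just an illustration:

```cpp
#include <cstdint>
#include <optional>

// The "buffer resource": four scalar registers' worth of state, shared by
// the whole wave, so everything in it must be wave-uniform.
struct BufferResource {
  uint64_t basePointer; // base address in global memory
  uint32_t numRecords;  // bytes reachable through this descriptor
  uint32_t flags;       // cache/swizzle/format configuration
};

// Each lane supplies a 32-bit byte offset in a vector register; that 32-bit
// width is where the ~2 GB reach limit comes from. The hardware bounds-checks
// the offset against numRecords: out-of-bounds loads read 0 and
// out-of-bounds stores are dropped.
std::optional<uint64_t> effectiveAddress(const BufferResource &rsrc,
                                         uint32_t laneOffsetBytes) {
  if (laneOffsetBytes >= rsrc.numRecords)
    return std::nullopt;
  return rsrc.basePointer + laneOffsetBytes;
}
```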

IREE will often pass in N copies of the same pointer and then take global offsets from that pointer to locate the logical bindings/tensors within the model's data. Fortunately, these offsets are globally uniform.

Therefore, we can conservatively construct buffer descriptors by finding the start of a given bound tensor/memref (which is always in a contiguous layout) and using that as a base pointer. Then, if we know that the memref is 2 GB or smaller, we can use the buffer instructions to access it.
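As a rough illustration of the condition being proven (this is not the actual pass logic, which works on MLIR memref types; it just shows the shape of the check for a statically-shaped, contiguous binding):

```cpp
#include <cstdint>
#include <vector>

// Sketch of the "2 GB or smaller" condition on a contiguous, statically
// shaped binding. The real analysis has to handle dynamic shapes, element
// types, and layouts more carefully; this only shows the idea.
constexpr uint64_t kBufferReachBytes = uint64_t(1) << 31; // 2 GB

bool fitsInBufferReach(const std::vector<int64_t> &shape,
                       int64_t elementBytes) {
  uint64_t totalBytes = static_cast<uint64_t>(elementBytes);
  for (int64_t dim : shape) {
    if (dim < 0)
      return false; // dynamic dimension: size can't be bounded statically
    totalBytes *= static_cast<uint64_t>(dim);
  }
  return totalBytes <= kBufferReachBytes;
}

// e.g. a 2048x1280 f16 binding is ~5 MB, comfortably within reach:
//   fitsInBufferReach({2048, 1280}, /*elementBytes=*/2) == true
```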

How this patch works

  1. Before bufferization, we run iree-rocdl-configure-buffer-instructions / ROCDLConfigureBufferInstructions, which annotates `hal.interface.binding.subspan` operations with the key `iree_gpu.use_rocdl_buffer_instructions` as a discardable attribute on the subspans if we can prove that said subspans are safe to access through buffer instructions.
  2. During bufferization, if `iree_gpu.use_rocdl_buffer_instructions` is set, then we use the `#amdgpu.address_space<fat_raw_buffer>` memory space instead of the `#hal.descriptor_type<*>` one. This required refactoring a few passes that were using `#hal.descriptor_type` as a synonym for "global memory" in tests, but that was a fairly small change.
  3. Finally, in `iree-codegen-convert-hal-descriptor-type-to-gpu-address-space`, we implement the buffer fat pointer ABI by converting these buffer bindings to normal global bindings and then adding an `amdgpu.fat_raw_buffer_cast` with the `resetOffset` option set to fold in the global byte offset; a conceptual sketch of what that folding does follows the list. (I'm not entirely happy with this part and maybe we should move it to bufferization.)
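For intuition on step 3, here is a conceptual sketch of what folding the global byte offset via `resetOffset` amounts to. The names are invented for illustration; the real lowering happens on MLIR memrefs and the `amdgpu.fat_raw_buffer_cast` op, not on structs like these:

```cpp
#include <cstdint>

struct BufferDescriptor {
  uint64_t basePointer; // wave-uniform base, lives in scalar registers
  uint32_t numRecords;  // bytes reachable through this descriptor
};

// A global binding is "base pointer + globally uniform byte offset". With
// resetOffset, that uniform offset is folded into the descriptor's base, so
// every remaining (possibly non-uniform) index is a small 32-bit offset from
// the start of the bound memref rather than from the start of the allocation.
BufferDescriptor castWithResetOffset(uint64_t bindingBasePointer,
                                     uint64_t uniformByteOffset,
                                     uint32_t memrefSizeInBytes) {
  return BufferDescriptor{bindingBasePointer + uniformByteOffset,
                          memrefSizeInBytes};
}
```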

So that it's recorded: roughly +3% on SDXL on MI-300 from this.

@krzysz00 krzysz00 changed the title [Very unmergable] Buffer fat pointer experiments [Very unmergeable] Buffer fat pointer experiments Feb 5, 2025
@krzysz00 (Contributor Author) commented Feb 5, 2025

Ok, so I ran the UNet kernel gemm benchmarks from iree-kernel-benchmark on sharkmi300x-3. I'm not sure I believe these numbers given how wide the variance is; 3% ± 8% is ... not bad, I guess?

I'll take a look at some assembly diffs, see if anything jumps out at me here.

Stat Value
Average 3.036367
Std. Dev. 8.519886
Min -21.7395
Max 38.63633
25%ile -1.08869
Median 1.951279
75%ile 4.742739
Raw TFlops Numbers
Problem Changed Tflops Main Tflops % Inc
gemm_2048_10240_1280_f16_f32 298.2616 296.6138 0.555537
gemm_2048_1280_1280_f16_f32 94.5195 87.1544 8.450635
gemm_2048_1280_5120_f16_f32 168.8273 166.7301 1.257841
gemm_128_1280_2048_f16_f32 9.7259 12.4276 -21.7395
gemm_8192_640_640_f16_f32 100.1625 93.2068 7.462653
gemm_8192_640_2560_f16_f32 194.5184 197.379 -1.44929
gemm_8192_5120_640_f16_f32 299.9279 298.2616 0.558671
gemm_64_640_2048_f16_f32 2.6214 2.6631 -1.56584
gemm_64_1280_2048_f16_f32 5.412 5.084 6.451613
gemm_1024_1280_1280_f16_f32 51.6222 51.6222 0
gemm_1024_1280_5120_f16_f32 104.0448 110.9237 -6.20147
gemm_1024_10240_1280_f16_f32 225.576 223.6962 0.840336
gemm_4096_640_640_f16_f32 54.1201 54.1201 0
gemm_4096_640_2560_f16_f32 130.3085 126.6205 2.912641
gemm_4096_5120_640_f16_f32 227.4877 221.8475 2.542377
gemm_2048_10240_1280_f16_f32_tB 367.7198 353.2045 4.109602
gemm_2048_1280_1280_f16_f32_tB 95.8698 94.5195 1.428594
gemm_2048_1280_5120_f16_f32_tB 175.448 175.448 0
gemm_128_1280_2048_f16_f32_tB 12.4276 11.9837 3.704198
gemm_8192_640_640_f16_f32_tB 98.6895 110.0145 -10.2941
gemm_8192_640_2560_f16_f32_tB 204.9126 211.3665 -3.05342
gemm_8192_5120_640_f16_f32_tB 365.2183 350.896 4.081637
gemm_64_640_2048_f16_f32_tB 2.8436 2.8926 -1.69398
gemm_64_1280_2048_f16_f32_tB 6.2138 5.5924 11.11151
gemm_1024_1280_1280_f16_f32_tB 54.1201 59.9186 -9.6773
gemm_1024_1280_5120_f16_f32_tB 138.3688 122.0161 13.40208
gemm_1024_10240_1280_f16_f32_tB 298.2616 250.8743 18.88886
gemm_4096_640_640_f16_f32_tB 76.2601 55.0073 38.63633
gemm_4096_640_2560_f16_f32_tB 150.8064 134.2177 12.35955
gemm_4096_5120_640_f16_f32_tB 305.0403 250.8743 21.59089
gemm_2048_10240_1280_bf16_f32 305.0403 293.3721 3.97727
gemm_2048_1280_1280_bf16_f32 94.5195 84.9479 11.26761
gemm_2048_1280_5120_bf16_f32 168.8273 161.7081 4.402501
gemm_128_1280_2048_bf16_f32 11.7735 12.662 -7.01706
gemm_8192_640_640_bf16_f32 91.93 94.5195 -2.73965
gemm_8192_640_2560_bf16_f32 197.379 190.3798 3.67644
gemm_8192_5120_640_bf16_f32 299.9279 290.2005 3.351958
gemm_64_640_2048_bf16_f32 2.6214 2.5041 4.684318
gemm_64_1280_2048_bf16_f32 5.7852 4.9345 17.23984
gemm_1024_1280_1280_bf16_f32 51.6222 50.0812 3.077003
gemm_1024_1280_5120_bf16_f32 110.0145 104.8576 4.918003
gemm_1024_10240_1280_bf16_f32 225.576 220.0291 2.520985
gemm_4096_640_640_bf16_f32 54.1201 56.8719 -4.83859
gemm_4096_640_2560_bf16_f32 130.3085 127.8264 1.941774
gemm_4096_5120_640_bf16_f32 225.576 225.576 0
gemm_2048_10240_1280_bf16_f32_tB 355.5437 360.316 -1.32448
gemm_2048_1280_1280_bf16_f32_tB 100.1625 98.6895 1.49256
gemm_2048_1280_5120_bf16_f32_tB 189.0391 176.6023 7.042264
gemm_128_1280_2048_bf16_f32_tB 11.1848 11.5705 -3.33348
gemm_8192_640_640_bf16_f32_tB 111.8481 106.522 5
gemm_8192_640_2560_bf16_f32_tB 206.4888 209.7152 -1.53847
gemm_8192_5120_640_bf16_f32_tB 350.896 344.148 1.960784
gemm_64_640_2048_bf16_f32_tB 2.6631 2.7962 -4.76003
gemm_64_1280_2048_bf16_f32_tB 5.7852 5.6872 1.723168
gemm_1024_1280_1280_bf16_f32_tB 53.261 51.6222 3.174603
gemm_1024_1280_5120_bf16_f32_tB 123.1355 118.7768 3.669656
gemm_1024_10240_1280_bf16_f32_tB 255.6528 253.241 0.952373
gemm_4096_640_640_bf16_f32_tB 67.1089 56.8719 18.0001
gemm_4096_640_2560_bf16_f32_tB 135.5735 136.9569 -1.0101
gemm_4096_5120_640_bf16_f32_tB 253.241 253.241 0

@krzysz00 (Contributor Author) commented Feb 5, 2025

Peeking at the anonymized assembly diff on gemm_64_640_2048_f16_f32, I'm mainly seeing

  • The changes from global_* to buffer_* I expected
  • Less or different math, as there's a touch less 64-bit arithmetic running around (though this is the sort of thing I'd expect to also clean up with aggressive use of nsw, nusw, and other compiler hints, even without buffers)
  • Somewhat different instruction schedules (the f16=>f32 conversions moved up a bit, some waitcnts were collapsed into a wait-to-0)

(That's one of the ones that performed mildly worse, I realized afterwards)

Similarly, the diff for gemm_1024_10240_1280_f16_f32_tB - one of the big improvements - mainly shows arithmetic changes, the expected instruction swaps, and some knock-on scheduling effects.

In conclusion:
a) Huh?
b) We should keep going

... On second thought, my rocMLIR "going to global loads sometimes helps" numbers were from an index=i32-where-safe universe, so maybe part of what we're seeing here is just the effect of needing fewer registers.

@kuhar (Member) commented Feb 5, 2025

@krzysz00 how should we interpret the values in #19918 (comment)? Are negative values an improvement or a regression? What's the unit? How repeatable/noisy is each configuration?

@krzysz00 (Contributor Author) commented Feb 5, 2025

@kuhar Right, I left out the details. The unit is % increase in TFlops, since I went with (tflops_my_change - tflops_main) / tflops_main * 100, so positive is better.
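As a sanity check of that formula against one row of the table above (plain C++, just arithmetic):

```cpp
#include <cstdio>

// % increase = (tflops_my_change - tflops_main) / tflops_main * 100,
// checked against the gemm_2048_1280_1280_f16_f32 row above.
int main() {
  double changed = 94.5195, baseline = 87.1544;
  std::printf("%.6f\n", (changed - baseline) / baseline * 100); // ~8.450635
  return 0;
}
```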

I don't know how repeatable these are; I'd assumed that the iree-kernel-benchmark scripts do something reasonable to smooth out the worst of the noise.

(The two TFlops columns and the name column are straight out of iree-kernel-benchmark CSVs)

@kuhar (Member) commented Feb 5, 2025

Ah, makes sense. Looks promising!

> I'd assumed that the iree-kernel-benchmark scripts do something reasonable to smooth out the worst of the noise

It does, but we don't report what the noise levels are.

Could you see if this helps with SDXL (unet)? The reported number should be much more stable and meaningful than these micro benchmarks.

@krzysz00 (Contributor Author) commented Feb 6, 2025

Ok, after fixing some bugs that I'll need to toss upstream, we have a correct SDXL unet run, done as close to what the CI does as I can manage, and I'm pleased to report a 2 ms improvement from this PR.

Commands used to produce these (run from ~/test_unet or some other scratch directory):
pytest ../iree/experimental/regression_suite/shark-test-suite-models/sdxl/test_unet.py
iree-benchmark-module --device=hip --device_allocator=caching --module=sdxl_unet_fp16_vmfbs/model.rocm_gfx942.vmfb --parameters=model=artifacts/sdxl_unet_fp16/real_weights.irpa --function=run_forward --input=1x4x128x128xf16 --input=2x64x2048xf16 --input=2x1280xf16 --input=2x6xf16 --input=1xf16 --input=1xi64 --benchmark_repetitions=10 --benchmark_min_warmup_time=3.0
These were run for this PR and for a checkout of eb19497, the commit on `main` this branch started from.

I'm leaving out the individual 10 runs since, for both benchmarks, there wasn't more than 0.1-0.2 ms of drift between the runs.

My branch:

BM_run_forward/process_time/real_time_mean         70.7 ms         71.7 ms           10 items_per_second=14.138/s
BM_run_forward/process_time/real_time_median       70.7 ms         71.7 ms           10 items_per_second=14.1368/s
BM_run_forward/process_time/real_time_stddev      0.107 ms        0.115 ms           10 items_per_second=0.0214142/s
BM_run_forward/process_time/real_time_cv           0.15 %          0.16 %            10 items_per_second=0.15%

Main:

BM_run_forward/process_time/real_time_mean         72.6 ms         74.3 ms           10 items_per_second=13.777/s
BM_run_forward/process_time/real_time_median       72.6 ms         74.4 ms           10 items_per_second=13.776/s
BM_run_forward/process_time/real_time_stddev      0.097 ms        0.099 ms           10 items_per_second=0.0184087/s
BM_run_forward/process_time/real_time_cv           0.13 %          0.13 %            10 items_per_second=0.13%

I'll note that I ran the benchmark commands twice and saw consistent results.

In conclusion ... using the buffer ops is something we'll likely want to productionize, since they save us some arithmetic.

I suspect some of this is the benefit of forcing the memref indexing logic into i32, but ... hey, I'll take alternate routes to the goal.

@krzysz00 (Contributor Author) commented Feb 6, 2025

To give the stats another way, this PR makes UNet about 1.027x faster (72.6 ms / 70.7 ms), which isn't as good as the per-kernel numbers implied but is still nice.

@krzysz00 (Contributor Author) commented Feb 6, 2025

Note from codegen pre-sync: get this landed under a flag

@manupak (Contributor) commented Feb 11, 2025

@krzysz00 @kuhar where do we call the pass createEraseHALDescriptorTypeFromMemRefPass in any of our GPU pass pipelines?

@krzysz00 (Contributor Author) commented:

@manupak I'm pretty sure it gets called decently early after bufferization - I can't remember the exact spot, but I know you can grep for it and it does happen reliably.

@manupak (Contributor) commented Feb 12, 2025

Ok yeah, I saw the effects of it after my experiments.
(It's just that I could not see it in print-after-all.)

@krzysz00 krzysz00 changed the title [Very unmergeable] Buffer fat pointer experiments [Codegen][ROCDL] Use buffer fat pointers where possible Feb 17, 2025
@krzysz00 krzysz00 marked this pull request as ready for review February 17, 2025 19:06
matchAndRewrite(amdgpu::FatRawBufferCastOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const override {
Type newTy = getTypeConverter()->convertType(op.getType());
if (!newTy)
Reviewer comment (Contributor):
Nit: Add { }

op, newTy, adaptor.getSource(), adaptor.getValidBytes(),
adaptor.getCacheSwizzleStride(), adaptor.getBoundsCheck(),
adaptor.getResetOffset());
LLVM_DEBUG(llvm::dbgs() << "Bf16Emulation: new op: " << newOp << "\n");
Reviewer comment (Contributor):
I don't know if we want to add such debug prints... The dialect conversion already does a pretty good job of debug output.

@krzysz00 (Contributor Author) replied:
Existing practice in the file, but I'm good to delete this

@@ -912,6 +913,24 @@ std::optional<SmallVector<int64_t>> getMmaNativeVectorSize(Operation *op) {
return std::nullopt;
}

bool hasGlobalMemoryAddressSpace(MemRefType memrefType) {
Reviewer comment (Contributor):
This is crying for an attribute interface...

@MaheshRavishankar (Contributor) left a comment:
Just minor issues, but overall looks good to me.

@MaheshRavishankar (Contributor) left a comment:
Awesome! It looks good to me. Another pair of eyes might be worth it, but good to go from me.

@krzysz00 (Contributor Author) commented Mar 6, 2025

Note: this depends on the next integrate, and then I might need to fix a Bazel issue.

@krzysz00 krzysz00 force-pushed the buffer-experiments branch from 541edef to f776277 Compare March 11, 2025 20:29
@krzysz00 (Contributor Author) commented:
Blocked by llvm/llvm-project#131386

@krzysz00 krzysz00 force-pushed the buffer-experiments branch from 85fdd6f to 8c5206b Compare March 18, 2025 18:41
@krzysz00 krzysz00 merged commit 2dd6e83 into iree-org:main Mar 19, 2025
57 of 61 checks passed
Eliasj42 pushed a commit that referenced this pull request Mar 24, 2025