Lower linalg.copy to direct global load #20568

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open · lialan wants to merge 28 commits into main from users/lialan/global_load_lds
Conversation

@lialan (Contributor) commented on Apr 17, 2025

Summary

This PR lays the foundation for using the global_load_lds instruction to load values directly from global memory to LDS. The pipeline is as follows:

  • Only linalg.copy ops emitted by GPUPromoteMatmulOperands are converted. When the pass sees fit, it attaches the #iree_gpu.use_global_load_dma attribute to the linalg.copy to tag it through the pipeline (see the sketch after this list).
  • A tagged linalg.copy is not decomposed or tiled until bufferization.
  • After distribution to threads and bufferization, the tagged linalg.copy is lowered to a sequence of code built around the subgroup-coalesced load op iree_gpu.global_load_dma.
  • iree_gpu.global_load_dma is mapped to the amdgpu.gather_to_lds op, which in turn lowers to the corresponding ROCDL op.
  • The pass that pads shared memory to reduce bank conflicts is disabled, because the destination workgroup memory has to be contiguous.
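
As a rough illustration (a hypothetical sketch only: SSA names and shapes are invented, and the exact IR produced by the pass may differ), the tagged copy at the tensor level could look like:

// Hypothetical sketch: an operand copy produced by GPUPromoteMatmulOperands,
// tagged so that generic decomposition/tiling leaves it alone until bufferization.
%empty = tensor.empty() : tensor<64x128xi8>
%copy = linalg.copy {lowering_config = #iree_gpu.use_global_load_dma}
    ins(%promoted_operand : tensor<64x128xi8>)
    outs(%empty : tensor<64x128xi8>) -> tensor<64x128xi8>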

Lowering linalg.copy

After bufferization and distribution to threads, the tagged linalg.copy still exists in the IR:

linalg.copy {lowering_config = #iree_gpu.use_global_load_dma}
  ins(%subview_12 : memref<64x128xi8, strided<[256, 1], offset: ?>, #amdgpu.address_space<fat_raw_buffer>>)
  outs(%alloc_4 : memref<64x128xi8, #gpu.address_space<workgroup>>)

Note that this linalg.copy is kept in the per-thread code. The op itself is then converted into a for loop in which each subgroup of threads loads coalesced chunks of values. For example, assume there are N subgroups loading from tensor<a x b x c>:

  • The i-th subgroup will load a sub-tensor of size [a/N, b, c], so each slice is contiguous.
    • At the moment, assume row-major layout and tile only the outermost dim.
    • The reason we currently only handle linalg.copy ops emitted by GPUPromoteMatmulOperands is that we know their destination is allocated contiguously.
    • TODO: expand to arbitrary memref slices.
  • Given gpu.subgroup_id and gpu.lane_id, each thread calculates the contiguous data chunk that its subgroup is responsible for loading:
    • the chunk indices are the delinearized indices of the input tensor, ranging from:
      • affine.delinearize_index[gpu.subgroup_id * (num_elems_of(tensor) / num_subgroups)], to
      • affine.delinearize_index[(gpu.subgroup_id + 1) * (num_elems_of(tensor) / num_subgroups) - 1]
  • Assume each subgroup loads n values from the linearized index range [N_f, N_b]; then the thread with lane id i loads, for iter = 0 to n: N_f + subgroup_size * iter + (i - 1).
    Then it will be converted to something like the following (in the example, assume workgroup size = 256, subgroup_size = 64, loading 64x128xi8):
scf.for %indvar = %c0 to %c32 step %c1 {
  // thread-specific gathering address from the global address
  %17 = affine.apply affine_map<()[s0, s1, s2] -> (s0 + s1 * 2048 + s2 * 64)>()[%lane_id, %subgroup_id, %indvar]
  %18:2 = affine.delinearize_index %17 into (128, 64) : index, index
  // this iteration's base storing index
  %19 = affine.apply affine_map<()[s0, s1] -> (s0 * 2048 + s1 * 64)>()[%subgroup_id, %indvar]
  %20:2 = affine.delinearize_index %19 into (128, 64) : index, index
  iree_gpu.global_load_dma %subview_13[%18#0, %18#1] -> %alloc_5[%20#0, %20#1] : memref<128x64xi8, strided<[256, 1], offset: ?>, #amdgpu.address_space<fat_raw_buffer>> -> memref<128x64xi8, #gpu.address_space<workgroup>>
}
// if there are residual elements (subgroup_copy_region_size % subgroup_size != 0), copy them here
gpu.barrier
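
For concreteness, the constants above follow from the example sizes: a 256-thread workgroup with subgroup_size = 64 gives 4 subgroups; the copied buffer holds 128 * 64 = 8192 i8 elements, so each subgroup copies 8192 / 4 = 2048 contiguous elements (the s1 * 2048 term in the affine maps), and each lane issues 2048 / 64 = 32 loads of one element (the %c32 loop bound, with the subgroup advancing by 64 elements per iteration via the s2 * 64 term).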

Dependent PRs:

@krzysz00 (Contributor) commented:

Side note, I'm still poking at getting the buffer fat pointer to LDS intrinsic set up - it's caught up in bikeshedding on the compiler team

@lialan force-pushed the users/lialan/global_load_lds branch from e4ce145 to 335319d on April 22, 2025
@krzysz00 (Contributor) left a comment:

High-level observation: at the point this is being called, shouldn't we know the subgroup size, so that we don't need the subgroup_id op?

Like, you can just look at the workgroup sizes to see which subgroup you're in
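
A minimal sketch of that alternative (illustrative only; it assumes a 1-D workgroup along x and a statically known subgroup size of 64):

// Hypothetical: derive the subgroup and lane ids from the flat thread id and a
// known subgroup size instead of emitting gpu.subgroup_id / gpu.lane_id.
%tid = gpu.thread_id x
%subgroup_id = affine.apply affine_map<()[s0] -> (s0 floordiv 64)>()[%tid]
%lane_id = affine.apply affine_map<()[s0] -> (s0 mod 64)>()[%tid]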

//===----------------------------------------------------------------------===//

SmallVector<int64_t>
UseGlobalLoadDMAAttr::getStaticTilingLevelSizes(unsigned level,
Reviewer comment (Contributor):

Why's there an unsigned in here?

Reply (Contributor):

This is how the LoweringConfigAttrInterface exposes tiling levels. It's up to the backend + lowering config to interpret the level consistently.

@lialan force-pushed the users/lialan/global_load_lds branch from bbe9565 to f6c7290 on April 28, 2025
@lialan lialan changed the title Implement iree_gpu.global_load_dma op Lower linalg.copy to direct global load Apr 28, 2025
@lialan force-pushed the users/lialan/global_load_lds branch 5 times, most recently from 275e72b to 97161a8, on May 1, 2025
@krzysz00 (Contributor) left a comment:

A few notes, some of which I apparently failed to submit on Friday x.x

@lialan force-pushed the users/lialan/global_load_lds branch 3 times, most recently from ee462fd to c2b31d4, on May 13, 2025
@lialan (Contributor, Author) commented on May 13, 2025

Getting this error in CI:

iree/runtime/src/iree/hal/drivers/hip/native_executable.c:358: FAILED_PRECONDITION;
HIP driver error 'hipErrorSharedObjectInitFailed' (303): shared object initialization failed;
mismatched target chip? missing/wrong bitcode directory?;
while invoking native function hal.executable.create; while calling import; 

Needs to wait until llvm/llvm-project#137425 is integrated.

fixed.

@lialan force-pushed the users/lialan/global_load_lds branch 2 times, most recently from 8152e33 to f7a8bda, on May 14, 2025
@lialan lialan marked this pull request as ready for review May 14, 2025 19:31
@lialan force-pushed the users/lialan/global_load_lds branch 2 times, most recently from e48e8b2 to 0b6d30c, on May 19, 2025
@hanhanW (Contributor) left a comment:

Thanks, @lialan. It looks better! Here is my final round of the review. (I can skim through the code again after you address the comments.)

@krzysz00 (Contributor) left a comment:

Some notes

return numElements;
}

static bool distributeLinalgCopyToThreads(RewriterBase &rewriter,
Reviewer comment (Contributor):

Also, this might want to be a LogicalResult?

@lialan force-pushed the users/lialan/global_load_lds branch from c66f276 to c5206f1 on May 28, 2025
@lialan force-pushed the users/lialan/global_load_lds branch from c5206f1 to cb3bad7 on May 28, 2025
@lialan lialan requested a review from qedawkins May 28, 2025 02:37