Lower linalg.copy to direct global load #20568
Side note, I'm still poking at getting the buffer fat pointer to LDS intrinsic set up - it's caught up in bikeshedding on the compiler team.
High-level observation: at the point this is being called, shouldn't we know the subgroup size, so that we don't need the subgroup_id op?
Like, you can just look at the workgroup sizes to see which subgroup you're in
//===----------------------------------------------------------------------===//

SmallVector<int64_t>
UseGlobalLoadDMAAttr::getStaticTilingLevelSizes(unsigned level,
Why's there an unsigned in here?
This is how the LoweringConfigAttrInterface exposes tiling levels. It's up to the backend + lowering config to interpret the level consistently.
A few notes, some of which I apparently failed to submit on Friday x.x
fixed.
Thanks, @lialan. It looks better! Here is my final round of the review. (I can skim through the code again after you address the comments.)
Some notes
  return numElements;
}

static bool distributeLinalgCopyToThreads(RewriterBase &rewriter,
Also, this might want to be a LogicalResult?
This reverts commit d7a09a9.
Summary
This PR lays the foundation for using the global_load_lds instruction to load values from global memory to LDS. The pipeline is as follows:
- linalg.copy is emitted in GPUPromoteMatmulOperands. When it sees fit, the pass attaches a different attribute (#iree_gpu.use_global_load_dma) to the linalg.copy to tag it along the pipeline.
- The tagged linalg.copy will not be decomposed/tiled until bufferization.
- The tagged linalg.copy will then be lowered to a sequence of code responsible for subgroup-coalesced loading, using the op iree_gpu.global_load_dma.
- iree_gpu.global_load_dma will be mapped to the amdgpu.gather_to_lds op, which will be mapped to the corresponding rocdl op.

Lowering linalg.copy
After bufferization and distribution to threads, the tagged linalg.copy still exists in the IR.

Note that this linalg.copy is kept in the thread's code. The op itself is then converted into a for loop, in which each subgroup of threads loads a coalesced chunk of values. For example, assume there are N subgroups loading from tensor<a x b x c>:
- The i-th subgroup loads a subtensor of size [a/N, b, c], so each slice is consecutive.
- One benefit of handling only the linalg.copy emitted by GPUPromoteMatmulOperands is that we know the destination is allocated contiguously.
- Using gpu.subgroup_id and gpu.lane_id, each thread calculates the consecutive data chunk that its subgroup is responsible for loading: from affine.delinearize_index[gpu.subgroup_id * (num_elems_of(tensor) / num_subgroups)] to affine.delinearize_index[(gpu.subgroup_id + 1) * (num_elems_of(tensor) / num_subgroups) - 1].
- If a subgroup loads n values from the linearized index range [N_f, N_b], then the thread with lane id i will try to load: iter = 0 to n : N_f + subgroup_size * iter + (i - 1)
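The chunk and lane arithmetic in the list above can be sketched in plain C++, outside of MLIR. This is an illustrative sketch under stated assumptions, not the code in GPULowerToGlobalLoads.cpp: the names ChunkRange, subgroupChunk, and laneIndices are hypothetical, the element count is assumed to divide evenly among subgroups, each lane is assumed to load one element per iteration, and lane ids are taken as 0-based (so the offset is N_f + subgroup_size * iter + lane, which agrees with the formula above if i is counted from 1). The range is half-open, so its end is one past the inclusive upper bound written above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Half-open range [begin, end) of linearized element indices.
struct ChunkRange {
  int64_t begin;
  int64_t end;
};

// Contiguous chunk of the linearized tensor owned by one subgroup.
// Assumption: numElements is evenly divisible by numSubgroups.
ChunkRange subgroupChunk(int64_t numElements, int64_t numSubgroups,
                         int64_t subgroupId) {
  assert(numSubgroups > 0 && numElements % numSubgroups == 0);
  assert(subgroupId >= 0 && subgroupId < numSubgroups);
  int64_t perSubgroup = numElements / numSubgroups;
  return {subgroupId * perSubgroup, (subgroupId + 1) * perSubgroup};
}

// Linearized indices that one lane loads from its subgroup's chunk.
// Assumption: lane ids are 0-based and each lane loads one element per
// iteration, so neighbouring lanes touch neighbouring elements (coalesced).
std::vector<int64_t> laneIndices(ChunkRange chunk, int64_t subgroupSize,
                                 int64_t laneId) {
  assert(subgroupSize > 0 && laneId >= 0 && laneId < subgroupSize);
  std::vector<int64_t> indices;
  for (int64_t idx = chunk.begin + laneId; idx < chunk.end;
       idx += subgroupSize) {
    indices.push_back(idx);
  }
  return indices;
}
```

With workgroup size 256, subgroup size 64, and a 64x128xi8 source, there are 4 subgroups of 2048 elements each. subgroupChunk(64 * 128, 4, 1) yields [2048, 4096), and lane 3 of that subgroup loads 32 elements starting at index 2051 with a stride of 64.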
Then it will be converted to something like the following (in the example, assume workgroup size = 256, subgroup_size = 64, loading 64x128xi8):

Dependent PRs: