[NFC][AMDGPU] Refactor getSingleSubgroupLayout() for MFMAs #20561
Conversation
Currently, all the MFMA intrinsics have the same two accumulator layouts copy-pasted (depending on whether they have a 16x16 or 32x32 output). These have been factored out into constants, which makes the code cleaner and lets more MFMAs be added without further copy-pasting.

Similarly, the layout for the left- and right-hand side inputs to an MFMA (at least for the non-blocked ones, which are all we use) follows a simple pattern that is a function of the K dimension and of whether the non-reduction dimension is 16 or 32. These patterns have been factored out into small lambdas to make things clearer and prevent excessive copy-paste.
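For a concrete picture, here is a minimal, self-contained sketch of the refactoring described above. The struct definition, the constant names kMFMAAcc16x16 and kMFMAAcc32x32, and most of the layout numbers are assumptions made for illustration; only the lambda signatures, asserts, and element expressions come from the snippets quoted in the review below.

#include <array>
#include <cassert>
#include <cstdint>
#include <cstdio>

// Assumed simplified stand-in for IREE's MMASingleSubgroupLayout.
struct MMASingleSubgroupLayout {
  std::array<int64_t, 2> outer;
  std::array<int64_t, 2> thread;
  std::array<int64_t, 2> tstrides;
  std::array<int64_t, 2> element;
};

// Accumulator layouts shared by the 16x16 and 32x32 MFMAs, factored out once
// instead of being copy-pasted per intrinsic (exact values are illustrative).
static const MMASingleSubgroupLayout kMFMAAcc16x16 = {
    /*outer=*/{1, 1}, /*thread=*/{4, 16}, /*tstrides=*/{16, 1},
    /*element=*/{4, 1}};
static const MMASingleSubgroupLayout kMFMAAcc32x32 = {
    /*outer=*/{4, 1}, /*thread=*/{2, 32}, /*tstrides=*/{32, 1},
    /*element=*/{4, 1}};

// LHS/RHS layouts as a function of K, mirroring the lambdas quoted in the
// review below; only the element values are taken from the diff, the rest is
// a guess at the non-blocked MFMA pattern.
MMASingleSubgroupLayout mfmaLhs16xK(int64_t k) {
  assert(k % 4 == 0 && "doesn't support blocked MFMAs");
  return {/*outer=*/{1, 1}, /*thread=*/{16, 4}, /*tstrides=*/{1, 16},
          /*element=*/{1, k / 4}};
}

MMASingleSubgroupLayout mfmaRhsKx16(int64_t k) {
  assert(k % 4 == 0 && "doesn't support blocked MFMAs");
  return {/*outer=*/{1, 1}, /*thread=*/{4, 16}, /*tstrides=*/{16, 1},
          /*element=*/{k / 4, 1}};
}

int main() {
  // For a 16x16x16 intrinsic, each of the 4 K-threads holds 16 / 4 = 4
  // contiguous LHS elements.
  MMASingleSubgroupLayout lhs = mfmaLhs16xK(16);
  std::printf("LHS element tile: %lld x %lld\n",
              static_cast<long long>(lhs.element[0]),
              static_cast<long long>(lhs.element[1]));
  return 0;
}

The idea is that getSingleSubgroupLayout() can then return one of the shared accumulator constants for the Acc fragment and call the appropriate lambda for the Lhs/Rhs fragments, instead of spelling each layout out inline.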
compiler/src/iree/compiler/Codegen/Dialect/GPU/IR/IREEGPUAttrs.cpp
/*element=*/{1, k / 4}};
};
auto mfmaRhsKx16 = [](int64_t k) -> MMASingleSubgroupLayout {
assert(k % 4 == 0 && "doesn't support blockef MFMAs");
typo: blockef
/*element=*/{1, k / 2}};
};
auto mfmaRhsKx32 = [](int64_t k) -> MMASingleSubgroupLayout {
assert(k % 2 == 0 && "doesn't support blockef MFMAs");
same
@@ -162,83 +162,96 @@ getUnsupportedMNKShape(MMAIntrinsic intrinsic) {

MMASingleSubgroupLayout getSingleSubgroupLayout(MMAIntrinsic intrinsic,
                                                MMAFragment fragment) {
  auto mfmaLhs16xK = [](int64_t k) -> MMASingleSubgroupLayout {
Wouldn't it be useful if this didn't have a hidden 4 in the implementation that the name mfmaLhs16xK doesn't convey? How about mfmaLhs16x4xE(int e), where the caller passes for e what is currently k / 4? That would also remove the need for the assert.
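For clarity, a sketch of what that suggestion might look like, reusing the MMASingleSubgroupLayout stand-in from the sketch above. The name mfmaLhs16x4xE and its body are the reviewer's hypothetical, not code from the PR.

// Hypothetical alternative: take the per-thread element count e (= k / 4)
// directly, so there is no hidden division and no assert is needed.
MMASingleSubgroupLayout mfmaLhs16x4xE(int64_t e) {
  return {/*outer=*/{1, 1}, /*thread=*/{16, 4}, /*tstrides=*/{1, 16},
          /*element=*/{1, e}};
}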
The assert is there because there are some wonderful old intrinsics like mfma_f32_32x32x1_f32, which perform batched matmul over multiple "blocks". I figured that just using the K dimension would make it easiest to spot-check that the value being passed is correct.
The fact that there's a division by 4 is an implementation detail: logically, you're always passing K elements to the MFMA, and the layout tells you how they're distributed.
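A quick spot-check of that argument, using the lambda sketch from earlier (the intrinsic names are real MFMA intrinsics; the arithmetic is illustrative):

// mfma_f32_16x16x16_f16: K = 16, non-blocked, so each of the 4 K-threads
// holds 16 / 4 = 4 contiguous LHS elements.
MMASingleSubgroupLayout lhs = mfmaLhs16xK(16);  // element = {1, 4}

// mfma_f32_32x32x1_f32 (the blocked intrinsic mentioned above) has K = 1 per
// block; neither 1 % 4 == 0 nor 1 % 2 == 0 holds, so the corresponding assert
// fires instead of silently producing a bogus layout.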