[SYCL][CUDA][MATRIX][DOC] Tensorcore Matrix extension proposal #4695
Conversation
This proposal is based on the existing AMX proposal, making small adaptations to accommodate the Nvidia Tensorcore hardware. The intention is that the proposal should be compatible with both architectures (AMX and Tensorcore). Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
We should find a way to blend this with the existing matrix doc.
Note that the existing one is out of date.
I will update it with the changes we recently made to enable DPAS of Intel GPUs. This is not really an AMX proposal but a unified matrix proposal. The implementation now supports both AMX and DPAS plus the one you are adding here for Nvidia tensor cores.
We should have one file that has the interface. Then we can add sections (or new files) that talk about implementation restrictions for each hardware (AMX, DPAS, TensorCores).
This sounds fine. It could be useful to keep header files separate for the time being to allow for experimental changes that only affect one backend. However I don't think that any functional changes to the joint_matrix, joint_matrix_load, joint_matrix_store, joint_matrix_mad interfaces will be necessary on the Tensorcore side as more cases are implemented, so a single header should be fine from the Tensorcore side. We can wait for #4707 to be merged before adding any missing changes that are included in the doc from this PR. |
Great work! Minor suggestion for when it is moved.
A(double, 8x4, row_major/col_major), B(double, 4x8, row_major/col_major), C(double, 8x8, row_major/col_major)
In order to deal with different cases we use partial specialization of the various template functions introduced by the extension. LLVM builtins are available for all possible matrix shapes, and runtime implementations covering these cases will be progressively added.
Since the tensorcore side of the matrix extension is likely to stay somewhat restrictive in matrix type/shape/layout combinations, I think it would make sense to make the overview of currently supported type/shape/layout combinations more extendable. Maybe a table for `joint_matrix_store`, `joint_matrix_load`, and `joint_matrix_mad`?
Yeah this is a good idea thanks! I'll add it when I merge the changes from this proposal doc into the existing proposal.
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
Added nvidia compilation instructions. Other clarifications added on distinction between Nvidia and AMX use cases. Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
}
```

This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS.
This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS or to the matrix "fragments" for Nvidia Tensorcore. Note that `Layout` is not included as an argument since it may be determined from the joint_matrix argument.
Is it also acceptable for the other backends to determine the matrix layout from `joint_matrix`, instead of having an extra `matrix_layout Layout` argument to `joint_matrix_load`/`joint_matrix_store`?
+This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS/Nvidia tensorcores
fragment is not the hardware name, right? Can you just refer to them as registers?
If not, you can still merge things as something like this:
+This function loads data from memory to the 2d tiles/registers/fragments of Intel AMX/DPAS/Nvidia tensorcores, respectively.
The comment "Note that `Layout` is not included as an argument since it may be determined from the joint_matrix argument." should be added in the tensorcores subsection that adds the use argument.
@dkhaldi I've added matrix_use to the proposal, as well as some minimal Nvidia specific information where necessary. I've removed the separate Nvidia based matrix proposal document.
These are a lot of comments :)
If anything is not clear, we can have a meeting to go over the comments
@@ -33,11 +33,11 @@ SYCL specification refer to that revision.

**_NOTE:_** _This document describes the current design and API for the matrix
extension to {dpcpp}. This is an initial experimental version to try out functionality
and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX) and DPAS. We are going to work with the community on incrementally improving
and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX), DPAS, and Nvidia Tensorcore. We are going to work with the community on incrementally improving
NVIDIA®
size_t Rows = sycl::dynamic_extent,
size_t Cols = sycl::dynamic_extent,
matrix_layout Layout = matrix_layout::row_major,
typename Group, typename Cond = void>
While this "Cond" argument was handy for the implementation, it should not be part of the interface to the user.
```c++
namespace sycl::ext::oneapi::experimental::matrix {
template <typename T, size_t Rows=sycl::dynamic_extent, size_t Cols=sycl::dynamic_extent,
matrix_layout Layout = matrix_layout::row_major, typename Group = sub_group>
template <typename T, matrix_use MT,
I suggest you revert back to the original definition.
Add a new subsection here called something like: "Additional "use" argument for the tensorcore case". Put there this new definition along with the reason behind the addition of this argument. You can also specify that the long term plan is to incorporate the use argument for other TPUs (AMX and DPAS) as well.
};
}
```

The final optional conditional argument can be used to remove cases for template parameter values which are incompatible with a particular backend.
For example, the Nvidia Tensorcore backend does not allow usage of `matrix_layout::packed_a` or `matrix_layout::packed_b`.
Remove this, as this should be part of the implementation code, not the spec.
IMPORTANT: In the current implementation, only the subgroup scope is supported
IMPORTANT: In the current implementation, only the subgroup scope is supported. For Nvidia Tensorcore only the subgroup scope is supported.
Since this is the case for the three current use cases, change that to something like:
In the current implementations of Intel AMX, Intel DPAS, and Nvidia tensorcores, only the subgroup scope is supported.
@@ -235,6 +265,16 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
}).wait();
```

## Nvidia Compilation instructions
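For orientation, DPC++ builds targeting the CUDA backend typically pass an NVPTX target triple and a minimum SM architecture for tensor core support. The exact flag spellings below are a sketch only and may differ between compiler versions; they should be checked against the current Getting Started documentation:

```shell
# Sketch of a DPC++ compile line for the CUDA backend; the flags are an
# assumption here and should be verified against current documentation.
SYCL_FLAGS="-fsycl -fsycl-targets=nvptx64-nvidia-cuda"
ARCH_FLAGS="-Xsycl-target-backend --cuda-gpu-arch=sm_80"
echo "clang++ $SYCL_FLAGS $ARCH_FLAGS matrix_example.cpp -o matrix_example"
```

Double-precision tensor core operations, as used in the m8n8k4 double case, generally require sm_80 or later.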
Please add this section as an IMPORTANT note under the example above.
- Change the names of the default sizes in the query from defaultM, defaultN, defaultK to M, N, K
- Change the type of `scope` in the query interface to be able to return more than one value. This will be useful in the event we support other scopes, like workgroup, besides subgroups
- Add a more realistic and complete example that shows the value of the general query
- Clarify USM compatibility
Tensorcores:
- Clarify USM compatibility
- Add support for other combinations, the query interface, and consider how the future looking API can be added here.
@@ -583,13 +621,15 @@ We did not utilize this extension for this matrix API version because sub-group
- Ronan Keryell: "It would be interesting to investigate whether providing also member functions would simplify the API. Provide both so it is possible to use the best one for each use case, while waiting for https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax to land into C++?"

- In the future looking APIs, `get_wi_slice` (that is currently under design) returns an owned object. Should this return a view object to make sure the original matrix C is changed after its slices are modified?

- multi_ptr can be constructed from T* since https://github.com/intel/llvm/pull/1183. However currently this cannot be used with USM for all cases.
It is expected that eventually the `joint_matrix_load` and `joint_matrix_store` interfaces will be fully compatible with USM.

## TODO List
Add the matrix_use TODO:
AMX and DPAS:
- Add a `matrix_use` parameter to the matrix to distinguish between matrix A, B, and the matrix accumulator. This is necessary for supporting VNNI and transpose transforms.
@@ -599,4 +639,5 @@ We did not utilize this extension for this matrix API version because sub-group
|Rev |Date |Author |Changes
|1 |2021-04-13 |Dounia Khaldi |Initial public working draft.
|2 |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS
|3 |2021-11-08 |Jack Kirk |Added matrix_use
Initial AOT use case on Nvidia tensorcores
Thanks. Regarding USM, I am surprised that it is fully working for the other backends, since I thought there was a general issue casting multi_ptr to T*. Here are my notes on testing USM with the Nvidia matrix backend: multi_ptr can be constructed from T* since #1183. However even this case is not reliable and requires more testing.
Thanks for all the comments. Hopefully I have addressed them now.
Unnecessary template parameters are removed. Although only one "shape" is currently supported, m8n8k128, the shape template parameters are still used so that future shapes supported by the hardware can be implemented. Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
I recently watched "SIMD in C++20 - EVE of a new Era" by Joël Falcou (CPPP 2021), about https://github.com/jfalcou/eve by @jfalcou, and there are a lot of interesting design ideas which might be useful in some SYCL SIMD 1D & 2D extensions.
This proposal is based on the existing AMX proposal, making small adaptation to accommodate the Nvidia Tensorcore hardware.
The intention is that this proposal should be compatible with both architectures (AMX and Tensorcore).
The corresponding implementation is #4696.
Integration tests are here: intel/llvm-test-suite#760
Signed-off-by: JackAKirk jack.kirk@codeplay.com