[SYCL][Doc][matrix] Update the experimental matrix interface that mat… #4707
Conversation
…ches the new AMX/DPAS JIT implementation

This includes:
- Minor changes to the joint_matrix interface
- Added the query interface
- Updated implementation information: added JIT support for both DPAS and AMX
```diff
 template <typename Group, typename T, size_t NumRows, size_t NumCols,
-          matrix_layout Layout,
+          matrix_layout L,
           access::address_space Space>
-void joint_matrix_load(Group sg, joint_matrix<Group, T, NumRows, NumCols, Layout> &res,
-                       multi_ptr<T, Space> src, size_t stride, matrix_layout layout = matrix_layout::row_major);
+void joint_matrix_load(Group sg, joint_matrix<T, NumRows, NumCols, L, Group> &res,
+                       multi_ptr<T, Space> src, size_t stride, matrix_layout memL);
```
What's the motivation for changing these names? I think `Layout` and `layout` were much better names.
The rationale was to distinguish between the matrix layout and the memory layout. I will change them to `Layout` and `MemLayout`.
A(int8, any-size, row_major), B(int8, any-size, packed_b), C(int32, any-size, row_major)

=== Valid or not Example:
I found this subsection title a little confusing. I'm nit-picking, but I don't see any example of querying here, and it's not immediately obvious from this example where the assertion happens.
Thanks, I will improve the example.
#### Option3: Restrictive conversion in the interface from SIMD to SIMT
Nvidia wmma interface added a new member to `fragment` class to designate the WI owned part of the matrix.
While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. This puts restriction on the user to implement new operations like sum of rows of a matrix for quantized algorithms.

#### proposal: Explicit conversion in the interface from SIMD to SIMT
We introduce a new function `get_wi_slice` that provides any portion of the matrix that the user wants but in a SIMT array object:
Suggested change:
```diff
-#### Option3: Restrictive conversion in the interface from SIMD to SIMT
+#### Option3: Restrictive conversion in the interface from SIMD to SPMD
 Nvidia wmma interface added a new member to `fragment` class to designate the WI owned part of the matrix.
 While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. This puts restriction on the user to implement new operations like sum of rows of a matrix for quantized algorithms.
-#### proposal: Explicit conversion in the interface from SIMD to SIMT
+#### proposal: Explicit conversion in the interface from SIMD to SPMD
-We introduce a new function `get_wi_slice` that provides any portion of the matrix that the user wants but in a SIMT array object:
+We introduce a new function `get_wi_slice` that provides any portion of the matrix that the user wants but in a SPMD array object:
```
SYCL is a Single Program Multiple Data (SPMD) model.
Thanks, I will make the change
@@ -113,58 +115,58 @@ We define three new functions needed to perform the main and common operations o

The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data are being read/written in a row major (`row_major`) or column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed_a`, `packed_b`). `stride` describes the number of elements between consecutive rows for the row major and packed layouts, and between consecutive columns for the column major layout.

```diff
-Note that for getting maximum performance on AMX, prepacking data in the memory is necessary. If users did not specify the packed layouts (`packed_a` in column major case, `packed_b` in row major case), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below.
+Note that for getting maximum performance on AMX and DPAS, prepacking data in the memory is necessary. If users did not specify the packed layouts (`packed_a` in column major case, `packed_b` in row major case), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below.
```
It is implied in various places that `packed_a` should only be used with matrix A and `packed_b` with matrix B. Here it states "`packed_a` in column major case, `packed_b` in row major case". This gives the impression that matrix A must always be indexed column major and matrix B row major for AMX and DPAS (at least when they are `packed`). If what I wrote is correct it would be clearer to state this explicitly, or otherwise clarify the situation.
By column major or row major, I am referring to matrix C. There are two cases:
- A row major, B row major, C row major: in this case B has to be VNNIed (`packed_b`).
- A col major, B col major, C col major: in this case A has to be VNNIed (`packed_a`).

I will add clarification.
BTW, it is at this place where we should add a subsection noting that for Nvidia tensorcores, an additional argument is added to specify the use of the argument, and that packed is not needed for the Nvidia tensorcores implementation.
Thanks that's clear.
#### Option3: Restrictive conversion in the interface from SIMD to SIMT
Nvidia wmma interface added a new member to `fragment` class to designate the WI owned part of the matrix.
While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. This puts restriction on the user to implement new operations like sum of rows of a matrix for quantized algorithms.
Suggested change:
```diff
-While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. This puts restriction on the user to implement new operations like sum of rows of a matrix for quantized algorithms.
+While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. However, using the `mma` ptx instructions, as opposed to the `wmma` ptx instructions, the mapping is known.
```
Thanks, I will add this change.
- Construct the matrices using a default shape if user does not provide a combination

- General query interface for sizes, types, static/dynamic, scope. This is needed to void padding by the user, for tuning, and efficient code generation if used by a library.
I don't understand what "void padding" means in this context. I understand that the user will be able to work out whether padding of their matrix will be necessary once they are aware of the possible submatrix sizes. Did you mean something related to this? More detail would be useful.
First, that was a typo :) --> avoid padding.
I provide an example in the General query example section:
```c++
M = 1500; // with msize = 8 and msize = 4,
// M can be broken up into 125 sequences of 8-sized ops
// and the remaining 500 using 125 sequences of 4-sized ops
```
In this case no padding is needed.
I will fix the typo. Let me know if I need to add something to make it more clear.
I see. It makes sense.
- At compile time, inform the user whether a specific combination is valid or not.

- Construct the matrices using a default shape if user does not provide a combination
For people who aren't aware, perhaps clarify that this corresponds to the case where the user provides the sizes of large "tile" matrices but doesn't specify the sizes of the corresponding submatrices of the "tiles": "Construct the matrices" refers to constructing the submatrices of the matrices whose size the user provided.
sounds good
…pecifically:
- John: improve the naming for layouts arguments, fix valid or not example, replace SIMT with SPMD
- Jack: clarify packed_a/b layouts, add a comment about ptx 'mma', fix a typo (void), clarify the default sizes query
- Greg: add layout to the parameter list of the joint_matrix type alias in the query, adding types alias, explain difference between msize and max_msize, remove mt alias as it can confuse the reader, fix indentation, fix the general query example
- Add changing the default sizes in the query to M, N, K to the todo list
- Add missing layout to the alias matrices in query
- Add comment about the combinations array in the size-only query case
- Add the combinations array to the validation case as well, for consistency
- Add a table that explains each of the query class members and type aliases
- Adjust the order of types and sizes in the combination type
- Construct the matrices using a default shape if user does not provide a combination. This corresponds to the case where the user provides the sizes of large `tile` matrices but does not specify the sizes of the corresponding submatrices of the `tiles`. In this case, the query will construct these submatrices of the matrices whose size the user provided.

- General query interface for sizes, types, static/dynamic, scope. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant is applied and only the maximum number is returned. The DPAS implementation, on the other hand, supports a discrete list of numbers, so the `msize, nsize, ksize` variant is applied.
The description here about the meaning of the member variables / type aliases can be removed now that you have the table. In this introduction section, it would be better to focus on:
- The mechanics of how the user specializes `tpu_params` in each of its three modes.
- Explain why each mode is useful.
Example where each WI gets 1 column:
```c++
marray<T, msize> wi_C = get_wi_slice(C, 0, wi_idx, msize, 1, matrix_layout::row_major);
```
If I understand it correctly, `wi_C` seems to be an independent copy of elements of the matrix `C`. And unlike options 1 and 2, whatever we do with `wi_C`, matrix `C` doesn't change. @dkhaldi is it intentional? Maybe we should return a reference or `std::span`?
Yes, this is intentional. The idea is that these post operations will be done using these slices in an owned fashion (independent set of registers).
Having said that, I see why it can be useful to return a reference or a view. I will keep your comment in the open questions list. We will discuss it.
- Greg: Add an additional column to the table to specify for each member, in which forms they are defined
- Greg: Clarify the definition of the default values form
- Alexey S.: add an open question about the returned type of get_wi_slice: owned vs. view object
…listic general query example
We agreed that the remaining issues with the query API can be addressed later since this is an experimental extension. Those issues are captured in the TODO section.
Ping, can we merge this PR?