[SYCL][Doc][matrix] Update the experimental matrix interface that mat… by dkhaldi · Pull Request #4707 · intel/llvm · GitHub


Merged
merged 6 commits into sycl from matrix-v2 on Nov 8, 2021

Conversation

Contributor
@dkhaldi dkhaldi commented Oct 5, 2021

…ches the new AMX/DPAS JIT implementation

This includes:

  • Minor changes to the joint_matrix interface
  • Added the query interface
  • Updated implementation information: added JIT support for both DPAS and AMX

Comment on lines 131 to 135
```diff
 template <typename Group, typename T, size_t NumRows, size_t NumCols,
-          matrix_layout Layout,
+          matrix_layout L,
           access::address_space Space>
-void joint_matrix_load(Group sg, joint_matrix<Group, T, NumRows, NumCols, Layout> &res,
-                       multi_ptr<T, Space> src, size_t stride, matrix_layout layout = matrix_layout::row_major);
+void joint_matrix_load(Group sg, joint_matrix<T, NumRows, NumCols, L, Group> &res,
+                       multi_ptr<T, Space> src, size_t stride, matrix_layout memL);
```
Contributor

What's the motivation for changing these names? I think Layout and layout were much better names.

Contributor Author

The rationale was to distinguish between the matrix layout and the memory layout. I will change them to `Layout` and `MemLayout`.
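To illustrate the distinction the author is drawing, here is a plain C++ sketch (hypothetical, not the SYCL extension itself): the matrix's own layout is a compile-time template parameter, while the layout of the source data in memory is a runtime argument to the load.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative only: mirrors the Layout / MemLayout split discussed above.
enum class matrix_layout { row_major, col_major };

template <typename T, std::size_t NumRows, std::size_t NumCols, matrix_layout Layout>
struct joint_matrix_sketch {
  T data[NumRows * NumCols]; // stored row-major in this sketch
};

// Load from memory whose layout (mem_layout) may differ from the
// matrix's own compile-time Layout parameter.
template <typename T, std::size_t R, std::size_t C, matrix_layout L>
void load_sketch(joint_matrix_sketch<T, R, C, L> &res, const T *src,
                 std::size_t stride, matrix_layout mem_layout) {
  for (std::size_t r = 0; r < R; ++r)
    for (std::size_t c = 0; c < C; ++c)
      res.data[r * C + c] = (mem_layout == matrix_layout::row_major)
                                ? src[r * stride + c]  // row-major source
                                : src[c * stride + r]; // column-major source
}
```

Loading the same 2x2 matrix from a row-major or a column-major buffer yields identical contents in the sketch's storage, which is the point of keeping the memory layout a separate runtime argument.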


A(int8, any-size, row_major), B(int8, any-size, packed_b), C(int32, any-size, row_major)
=== Valid or not Example:
Contributor

I found this subsection title a little confusing. I'm nit-picking, but I don't see any example of querying here, and it's not immediately obvious from this example where the assertion happens.

Contributor Author

Thanks, I will improve the example.

Comment on lines 504 to 509
#### Option3: Restrictive conversion in the interface from SIMD to SIMT
Nvidia wmma interface added a new member to `fragment` class to designate the WI owned part of the matrix.
While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. This puts restriction on the user to implement new operations like sum of rows of a matrix for quantized algorithms.

#### proposal: Explicit conversion in the interface from SIMD to SIMT
We introduce a new function `get_wi_slice` that provides any portion of the matrix that the user wants but in a SIMT array object:.
Contributor

Suggested change
```diff
-#### Option3: Restrictive conversion in the interface from SIMD to SIMT
+#### Option3: Restrictive conversion in the interface from SIMD to SPMD
 Nvidia wmma interface added a new member to `fragment` class to designate the WI owned part of the matrix.
 While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. This puts restriction on the user to implement new operations like sum of rows of a matrix for quantized algorithms.
-#### proposal: Explicit conversion in the interface from SIMD to SIMT
+#### proposal: Explicit conversion in the interface from SIMD to SPMD
-We introduce a new function `get_wi_slice` that provides any portion of the matrix that the user wants but in a SIMT array object:.
+We introduce a new function `get_wi_slice` that provides any portion of the matrix that the user wants but in a SPMD array object:.
```

SYCL is a Single Program Multiple Data (SPMD) model.

Contributor Author

Thanks, I will make the change

@@ -113,58 +115,58 @@ We define three new functions needed to perform the main and common operations o

The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data are being read/written in a row-major (`row_major`) or column-major (`column_major`) fashion, or whether the data have already been transformed into VNNI format (`packed_a`, `packed_b`). `stride` describes the number of elements between consecutive rows for the row-major and packed layouts, and between consecutive columns for the column-major layout.

```diff
-Note that for getting maximum performance on AMX, prepacking data in the memory is necessary. If users did not specify the packed layouts (`packed_a` in column major case, `packed_b` in row major case), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below.
+Note that for getting maximum performance on AMX and DPAS, prepacking data in the memory is necessary. If users did not specify the packed layouts (`packed_a` in column major case, `packed_b` in row major case), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below.
```
Contributor
@JackAKirk Oct 6, 2021

It is implied in various places that packed_a should only be used with matrix A and packed_b with matrix B. Here it states "packed_a in column major case, packed_b in row major case". This gives the impression that matrix A must always be indexed column major and matrix B row_major for AMX and DPAS (at least when they are packed). If what I wrote is correct, it would be clearer to state this explicitly, or otherwise clarify the situation.

Contributor Author

By column major or row major, I am referring to matrix C. There are two cases:
- A row major, B row major, C row major: in this case B has to be VNNIed (`packed_b`).
- A col major, B col major, C col major: in this case A has to be VNNIed (`packed_a`).
I will add a clarification.
BTW, it is at this place where we should add a subsection noting that for Nvidia Tensor Cores, an additional argument is added to specify the use of the matrix argument, and that packed is not needed for the Nvidia Tensor Cores implementation.

Contributor

Thanks that's clear.
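The VNNI prepacking discussed in this thread can be sketched in plain C++ (a hypothetical helper, not part of the extension): for int8 data, groups of 4 consecutive rows of a row-major B[K][N] are interleaved so that the 4 K-elements feeding one 32-bit dot-product lane become contiguous in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical VNNI (packed_b) transform for int8: row-major B[K][N]
// becomes blocks of shape [K/4][N][4], with the 4 K-values for each
// column stored contiguously.
std::vector<std::int8_t> vnni_pack_b(const std::vector<std::int8_t> &B,
                                     std::size_t K, std::size_t N) {
  constexpr std::size_t factor = 4; // 4 int8 values per 32-bit lane
  assert(K % factor == 0);
  std::vector<std::int8_t> packed(K * N);
  for (std::size_t k = 0; k < K; ++k)
    for (std::size_t n = 0; n < N; ++n)
      // row-major (k, n) maps to block k/4, column n, lane k%4
      packed[(k / factor) * N * factor + n * factor + (k % factor)] =
          B[k * N + n];
  return packed;
}
```

For bf16 data the interleave factor would be 2 instead of 4; the shape of the transform is otherwise the same.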


#### Option3: Restrictive conversion in the interface from SIMD to SIMT
Nvidia wmma interface added a new member to `fragment` class to designate the WI owned part of the matrix.
While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. This puts restriction on the user to implement new operations like sum of rows of a matrix for quantized algorithms.
Contributor

Suggested change
```diff
-While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. This puts restriction on the user to implement new operations like sum of rows of a matrix for quantized algorithms.
+While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. However using the 'mma' ptx instructions as opposed to the 'wmma' ptx instructions the mapping is known.
```

Contributor Author

Thanks, I will add this change.


- Construct the matrices using a default shape if user does not provide a combination

- General query interface for sizes, types, static/dynamic, scope. This is needed to void padding by the user, for tuning, and efficient code generation if used by a library.
Contributor

I don't understand what "void padding" means in this context. I understand that the user will be able to work out whether padding of their matrix will be necessary once they are aware of the possible submatrix sizes - Did you mean something related to this? More detail would be useful.

Contributor Author

First, that was a typo :) --> avoid padding.
I provide an example in the General query example section:

```c++
M = 1500; // with msize = 8 and msize = 4,
// M can be broken up to 125 sequence of 8-sized ops and remaining 500 using 125 sequence of 4-sized ops
```

In this case no padding is needed.
I will fix the typo. Let me know if I need to add something to make it more clear.

Contributor

I see. It makes sense.
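The padding-avoidance argument above can be sketched with a hypothetical helper (not part of the query interface) that greedily covers M with the two tile heights the query reported:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Cover M rows with tiles of the two supported heights (e.g. 8 and 4
// in the author's example). When M is a multiple of the smaller size,
// the whole extent is covered and no padding row is needed.
std::vector<std::size_t> decompose_m(std::size_t M, std::size_t big,
                                     std::size_t small) {
  std::vector<std::size_t> tiles;
  for (; M >= big; M -= big) tiles.push_back(big);      // as many big tiles as fit
  for (; M >= small; M -= small) tiles.push_back(small); // fill the rest
  return tiles; // any leftover M < small would require padding
}
```

With M = 1500 and sizes 8 and 4, the tiles sum to exactly 1500, matching the "no padding is needed" claim.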


- At compile time, inform the user whether a specific combination is valid or not.

- Construct the matrices using a default shape if user does not provide a combination
Contributor

For people who aren't aware, perhaps clarify that this corresponds to the case where the user provides the sizes of large "tile" matrices but doesn't specify the sizes of the corresponding submatrices of the "tiles": "Construct the matrices" refers to constructing the submatrices of the matrices whose size the user provided.

Contributor Author

sounds good

…pecifically:

- John: improve the naming for layouts arguments, fix valid or not example, replace SIMT with SPMD
- Jack: clarify packed_a/b layouts, add a comment about ptx 'mma' , fix a typo (void), clarify the default sizes query
- Greg: add layout to the parameter list of the joint_matrix type alias in the query, add type aliases, explain difference between msize and max_msize, remove mt alias as it can confuse the reader, fix indentation, fix the general query example.
JackAKirk previously approved these changes Oct 8, 2021
	- Add changing the default sizes in the query to M, N, K to the todo list
	- Add missing layout to the alias matrices in query
	- Add comment about the combinations array in the size-only query case
	- Add the combinations array to the validation case as well, for consistency
	- Add a table that explains each of the query class members and type aliases
	- Adjust the order of types and sizes in the combination type

- Construct the matrices using a default shape if user does not provide a combination. This corresponds to the case where the user provides the sizes of large `tile` matrices but does not specify the sizes of the corresponding submatrices of the `tiles`. In this case, the query will construct these submatrices of the matrices whose size the user provided.

- General query interface for sizes, types, static/dynamic, scope. This is needed to avoid padding by the user, for tuning, and for efficient code generation if used by a library. The general query returns an array `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant applies and only the maximum number is returned. The DPAS implementation, on the other hand, supports a discrete list of numbers, so the `msize, nsize, ksize` variant applies.
Contributor

The description here about the meaning of the member variables / type aliases can be removed now that you have the table. In this introduction section, it would be better to focus on:

  • The mechanics of how the user specializes tpu_params in each of its three modes.
  • Explain why each mode is useful.
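A rough sketch of what a `combination` entry could look like, based on the prose above (field names follow the text; the actual experimental header may differ, and the numeric values below are purely illustrative):

```cpp
#include <cassert>
#include <cstddef>

// Illustrative subset of element types a query could report.
enum class matrix_type { sint8, sint32, bf16, fp32 };

// One entry of the hypothetical `combinations` array. Per the text,
// exactly one of the two size groups is meaningful for a given TPU.
struct combination {
  std::size_t max_msize, max_nsize, max_ksize; // continuous sizes (e.g. Intel AMX)
  std::size_t msize, nsize, ksize;             // discrete sizes (e.g. DPAS)
  matrix_type atype, btype, ctype;             // element types of A, B, and C
};
```

A library could iterate such an array at compile time to pick a supported shape, or to report that a requested combination is invalid.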


Example where each WI gets 1 column:
```c++
marray<T,msize> wi_C = get_wi_slice(C, 0, wi_idx, msize, 1, matrix_layout::row_major);
```
Contributor

If I understand it correctly, wi_C seems to be an independent copy of elements of the matrix C. And unlike options 1 and 2, whatever we do with wi_C, matrix C doesn't change. @dkhaldi is it intentional? Maybe we should return a reference or std::span?

Contributor Author

Yes, this is intentional. The idea is that these post operations will be done using these slices in an owned fashion (independent set of registers).
Having said that, I see why it can be useful to return a reference or a view. I will keep your comment in the open questions list. We will discuss it.
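The owned-copy vs. view distinction in this open question can be illustrated with a plain C++ analogue (hypothetical types, not the extension's API): `get_wi_slice` as proposed behaves like the copy below, while the `std::span`-style alternative would alias the matrix's storage.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Owned copy: an independent set of values, writes do not touch C.
std::vector<float> slice_copy(const std::vector<float> &C, std::size_t off,
                              std::size_t n) {
  return std::vector<float>(C.begin() + off, C.begin() + off + n);
}

// Minimal view type: aliases C's storage, writes go through to C.
struct slice_view {
  float *data;
  std::size_t len;
};

slice_view make_view(std::vector<float> &C, std::size_t off, std::size_t n) {
  return {C.data() + off, n};
}
```

The copy matches the "independent set of registers" intent stated above; the view would be needed if post-operations are expected to mutate the joint matrix in place.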

- Greg: Add an additional column to the table to specify for each member, in which forms they are defined
- Greg: Clarify the definition of the default values form
- Alexey S.: add an open question about the returned type of get_wi_slice: owned vs. view object
Contributor
@gmlueck left a comment

We agreed that the remaining issues with the query API can be addressed later since this is an experimental extension. Those issues are captured in the TODO section.

Contributor Author
@dkhaldi commented Nov 3, 2021

Ping, can we merge this PR?

@dm-vodopyanov dm-vodopyanov merged commit 6495575 into sycl Nov 8, 2021
@bader bader deleted the matrix-v2 branch November 15, 2021 20:11