[SYCL][Doc][matrix] Update the experimental matrix interface that mat… #4707
Conversation
…ches the new AMX/DPAS JIT implementation

This includes:
- Minor changes to the joint_matrix interface
- Added the query interface
- Updated implementation information: added JIT support for both DPAS and AMX
```diff
 template <typename Group, typename T, size_t NumRows, size_t NumCols,
-          matrix_layout Layout,
+          matrix_layout L,
           access::address_space Space>
-void joint_matrix_load(Group sg, joint_matrix<Group, T, NumRows, NumCols, Layout> &res,
-                       multi_ptr<T, Space> src, size_t stride, matrix_layout layout = matrix_layout::row_major);
+void joint_matrix_load(Group sg, joint_matrix<T, NumRows, NumCols, L, Group> &res,
+                       multi_ptr<T, Space> src, size_t stride, matrix_layout memL);
```
What's the motivation for changing these names? I think `Layout` and `layout` were much better names.
The rationale was to distinguish between the matrix layout and the memory layout. I will change them to `Layout` and `MemLayout`.
A(int8, any-size, row_major), B(int8, any-size, packed_b), C(int32, any-size, row_major)

=== Valid or not Example:
I found this subsection title a little confusing. I'm nit-picking, but I don't see any example of querying here, and it's not immediately obvious from this example where the assertion happens.
Thanks, I will improve the example.
#### Option3: Restrictive conversion in the interface from SIMD to SIMT
Nvidia wmma interface added a new member to `fragment` class to designate the WI owned part of the matrix.
While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. This puts restriction on the user to implement new operations like sum of rows of a matrix for quantized algorithms.

#### proposal: Explicit conversion in the interface from SIMD to SIMT
We introduce a new function `get_wi_slice` that provides any portion of the matrix that the user wants but in a SIMT array object:
Suggested change:
```diff
-#### Option3: Restrictive conversion in the interface from SIMD to SIMT
+#### Option3: Restrictive conversion in the interface from SIMD to SPMD
 Nvidia wmma interface added a new member to `fragment` class to designate the WI owned part of the matrix.
 While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. This puts restriction on the user to implement new operations like sum of rows of a matrix for quantized algorithms.
-#### proposal: Explicit conversion in the interface from SIMD to SIMT
+#### proposal: Explicit conversion in the interface from SIMD to SPMD
-We introduce a new function `get_wi_slice` that provides any portion of the matrix that the user wants but in a SIMT array object:
+We introduce a new function `get_wi_slice` that provides any portion of the matrix that the user wants but in a SPMD array object:
```
SYCL is a Single Program Multiple Data (SPMD) model.
Thanks, I will make the change
@@ -113,58 +115,58 @@ We define three new functions needed to perform the main and common operations o

The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data are being read/written in a row major (`row_major`) or column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed_a`, `packed_b`). `stride` describes the number of elements between consecutive rows for the row major and packed layouts, and between consecutive columns for the column major layout.

```diff
-Note that for getting maximum performance on AMX, prepacking data in the memory is necessary. If users did not specify the packed layouts (`packed_a` in column major case, `packed_b` in row major case), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below.
+Note that for getting maximum performance on AMX and DPAS, prepacking data in the memory is necessary. If users did not specify the packed layouts (`packed_a` in column major case, `packed_b` in row major case), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below.
```
It is implied in various places that `packed_a` should only be used with matrix A and `packed_b` with matrix B. Here it states "`packed_a` in column major case, `packed_b` in row major case". This gives the impression that matrix A must always be indexed column major and matrix B row major for AMX and DPAS (at least when they are `packed`). If what I wrote is correct it would be clearer to state this explicitly, or otherwise clarify the situation.
By column major or row major, I am referring to matrix C. There are two cases:
- A row major, B row major, C row major: in this case B has to be VNNIed (`packed_b`).
- A col major, B col major, C col major: in this case A has to be VNNIed (`packed_a`).

I will add clarification.
BTW, it is at this place where we should add a subsection noting that for Nvidia tensorcores, an additional argument is added to specify the use of the argument, and that packed is not needed for the Nvidia tensorcores implementation.
Thanks that's clear.
#### Option3: Restrictive conversion in the interface from SIMD to SIMT
Nvidia wmma interface added a new member to `fragment` class to designate the WI owned part of the matrix.
While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. This puts restriction on the user to implement new operations like sum of rows of a matrix for quantized algorithms.
Suggested change:
```diff
-While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. This puts restriction on the user to implement new operations like sum of rows of a matrix for quantized algorithms.
+While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. However, using the `mma` ptx instructions, as opposed to the `wmma` ptx instructions, the mapping is known.
```
Thanks, I will add this change.
- Construct the matrices using a default shape if user does not provide a combination

- General query interface for sizes, types, static/dynamic, scope. This is needed to void padding by the user, for tuning, and efficient code generation if used by a library.
I don't understand what "void padding" means in this context. I understand that the user will be able to work out whether padding of their matrix will be necessary once they are aware of the possible submatrix sizes. Did you mean something related to this? More detail would be useful.
First, that was a typo :) --> avoid padding.
I provide an example in the General query example section:
```c++
M = 1500; // with msize = 8 and msize = 4,
// M can be broken up into 125 sequences of 8-sized ops
// and the remaining 500 using 125 sequences of 4-sized ops
```
In this case no padding is needed.
I will fix the typo. Let me know if I need to add something to make it more clear.
I see. It makes sense.
- At compile time, inform the user whether a specific combination is valid or not.

- Construct the matrices using a default shape if user does not provide a combination
For people who aren't aware, perhaps clarify that this corresponds to the case where the user provides the sizes of large "tile" matrices but doesn't specify the sizes of the corresponding submatrices of the "tiles": "Construct the matrices" refers to constructing the submatrices of the matrices whose size the user provided.
sounds good
…pecifically:
- John: improve the naming for layouts arguments, fix valid or not example, replace SIMT with SPMD
- Jack: clarify packed_a/b layouts, add a comment about ptx 'mma', fix a typo (void), clarify the default sizes query
- Greg: add layout to the parameter list of the joint_matrix type alias in the query, adding types alias, explain difference between msize and max_msize, remove mt alias as it can confuse the reader, fix indentation, fix the general query example
- Add changing the default sizes in the query to M, N, K to the todo list
- Add missing layout to the alias matrices in query
- Add comment about the combinations array in the size-only query case
- Add the combinations array to the validation case as well, for consistency
- Add a table that explains each of the query class members and type aliases
- Adjust the order of types and sizes in the combination type
- Construct the matrices using a default shape if user does not provide a combination. This corresponds to the case where the user provides the sizes of large `tile` matrices but does not specify the sizes of the corresponding submatrices of the `tiles`. In this case, the query will construct these submatrices of the matrices whose size the user provided.

- General query interface for sizes, types, static/dynamic, scope. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant is applied and only the maximum number is returned. The DPAS implementation, on the other hand, supports a discrete list of numbers, so the `msize, nsize, ksize` variant is applied.
The description here about the meaning of the member variables / type aliases can be removed now that you have the table. In this introduction section, it would be better to focus on:
- The mechanics of how the user specializes `tpu_params` in each of its three modes.
- Explain why each mode is useful.
Example where each WI gets 1 column:
```c++
marray<T, msize> wi_C = get_wi_slice(C, 0, wi_idx, msize, 1, matrix_layout::row_major);
```
If I understand it correctly, `wi_C` seems to be an independent copy of elements of the matrix `C`. And unlike options 1 and 2, whatever we do with `wi_C`, matrix `C` doesn't change. @dkhaldi is it intentional? Maybe we should return a reference or `std::span`?
Yes, this is intentional. The idea is that these post operations will be done using these slices in an owned fashion (independent set of registers).
Having said that, I see why it can be useful to return a reference or a view. I will keep your comment in the open questions list. We will discuss it.
- Greg: Add an additional column to the table to specify for each member, in which forms they are defined
- Greg: Clarify the definition of the default values form
- Alexey S.: add an open question about the returned type of get_wi_slice: owned vs. view object
…listic general query example
We agreed that the remaining issues with the query API can be addressed later since this is an experimental extension. Those issues are captured in the TODO section.
Ping, can we merge this PR?