[SYCL][CUDA][MATRIX] joint_matrix_bmad implementation #5363
Conversation
Signed-off-by: jack.kirk <jack.kirk@codeplay.com>
@@ -495,14 +562,59 @@ struct joint_matrix_mad_impl<
      get_layout_pair_id<LayoutA, LayoutB>(), 0);
    }
  }
} else if constexpr (std::is_same<T1, double>::value) {
} else if constexpr (M == 8 && N == 8 && K == 4) {
Is this change related to the bmad addition?
No, this is a superficial, non-essential change that I made just for better consistency of the if constexpr statements in this function.
    get_layout_id<Layout>());
} else if constexpr (NumRows == 128 && NumCols == 8) {
  int32_t *tileptr = reinterpret_cast<int32_t *>(src.get());
  __bmma_m8n8k128_ld_b_b1(res.data, tileptr, stride,
What types are supported in bmad? Only double and i32?
NVPTX bmad requires that matrix elements are stored in 32-bit untyped registers. int32_t is used here because, when the NVPTX builtins for these functions were created, int32_t register arguments were defined (uint32_t can also be used; there is no difference as far as the NVPTX backend is concerned). As far as the user is concerned, I think they should work with uint32_t for the bmad cases, as in intel/llvm-test-suite#760.
double is not supported as a register storage type for bmad in NVPTX, and I did not create a case for the user to use double with bmad.
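For illustration only, here is a minimal host-side sketch (not code from this PR) of how 32 single-bit matrix elements map onto one uint32_t storage element, which is the register storage type the NVPTX bmad builtins expect. pack_bits is a hypothetical helper, and the bit ordering shown is an assumption rather than something the builtins mandate:

```cpp
#include <cstdint>
#include <vector>

// Pack a sequence of single-bit matrix elements into uint32_t storage,
// 32 single-bit elements per uint32_t word.
std::vector<uint32_t> pack_bits(const std::vector<bool> &elems) {
  std::vector<uint32_t> packed((elems.size() + 31) / 32, 0u);
  for (std::size_t i = 0; i < elems.size(); ++i)
    if (elems[i])
      packed[i / 32] |= 1u << (i % 32); // element i lives in bit i % 32
  return packed;
}
```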
Hi @dkhaldi, if it is preferred for reviewing purposes, I could add the temporary/initial fp19 implementation that uses uint32_t directly to this PR? Hopefully the uint32_t fp19 case should be a bit more straightforward to review than the bmad cases, since in the end we realized we can implement the fp19 cases in a way that is completely compliant with the existing matrix extension, whereas the bmad cases require a different interface. Otherwise it is fine to put them up one at a time; I just thought it might make it easier to review them at once. Thanks
I think separate PRs are better.
OK
… dimension matrix elements divided by 32. The stride argument in the joint_matrix_load function now refers to the number of registers to stride rather than the number of matrix elements. This leads to a cleaner example because all factors of 32 can be removed.
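As a hedged illustration of the convention described in this commit message (assuming an 8x128 single-bit B matrix packed into uint32_t), the stride is now counted in uint32_t registers rather than in single-bit matrix elements:

```cpp
// Sketch only: relation between the old element-based stride and the new
// register-based stride for single-bit matrices packed into uint32_t.
constexpr int K_bits = 128;                 // single-bit cols of a / rows of b
constexpr int K = K_bits / 32;              // uint32_t storage elements per row
constexpr int stride_in_elements = 128;     // old convention (single bits)
constexpr int stride_in_registers = stride_in_elements / 32; // new convention
static_assert(stride_in_registers == K, "stride is now counted in registers");
```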
                       // number of cols of b.
constexpr int N = 8;   // number of cols of accumulator,
                       // number of rows of a.
constexpr int K = 128; // number of cols of a/number of rows of b.
You missed making the change here: K should be 4.
Can you please add a comment where you are making these changes?
Basically, say that the underlying intrinsics expect a shape where K equals the total number of bits, not the number of elements.
Thanks, I forgot that. Fixed now, and I also updated the test so it will work with the legacy pass manager.
I've added a more detailed comment describing Bitwise Dot Product and how it dictates the relation between the number of array elements used for the A/B arrays and the number of single-bit matrix elements that the A/B arrays represent. I've also correspondingly updated the test in intel/llvm-test-suite and the tensor cores matrix extension PR #4695.
                     // number of cols of b.
constexpr int N = 8; // number of cols of accumulator,
                     // number of rows of a.
constexpr int K = 4; // number of cols of a/number of rows of b divided by 32
"number of cols of a/number of rows of b divided by 32"
should be:
"number of bits in cols of a/number of bits in rows of b, divided by 32".
If this is true, do we need the "divided by 32" in the code example?
I meant earlier that the "multiplied by 32" should be added in the implementation code, to explain that this is how we get the number of bits that the intrinsics expect. But is this needed in user-level code?
Here K=4 is not the number of cols in a subgroup matrix: you have to multiply by 32, since K gives a dimension of the arrays A/B which hold the single-bit matrix elements in uint32_t storage type. There are 32 single-bit matrix elements per uint32_t storage element.
I tried to describe the purpose of these bitwise matrix multiplications here without going into too much detail. I added references giving full details on the origins of the single-bit models and how they use bitwise matrix multiplications. I have not found references to the usage of such "bitwise matrix multiplications" outside of such models (although of course this does not mean they don't or won't exist), but I think this functionality was introduced specifically with such use cases in mind.
It is important for such users to understand that each bit is considered an element of the matrix by joint_matrix_bmad (the matrix element is "quantized" to a single bit), which is why in the original implementation I set K = 128. However, as you pointed out, this leads to lots of factors of 32, because we have to divide by 32 to get the number of uint32_t array elements that are used to store the matrix.
In the current implementation it is nice that these factors are gone, but there should still be proper documentation (see here) describing the relationship between "K" and the actual number of (single-bit) matrix elements. Since this is experimental, I think it is normal to expect that once people start using it there could be feedback suggesting small changes to the interface. I'm not sure whether the interface I originally set up, which led to the factors of 32, or the one you suggested is preferable for users, but I imagine that at this experimental stage it can (and most likely will!) be changed in some way in the future anyway.
I could add back the naming scheme A -> A_Packed, B -> B_Packed that I originally used, now that I have switched from K=128 to K=4, to make it clearer that "a" refers to the matrix and "A_Packed" to a packed array representation of the matrix?
Then I could also add a more detailed description in both the tests and the implementation? I did not want to go into too much detail in the tests/implementation because I thought the proper place for such descriptions would be the documentation of the extension; this is why I kept things concise here and did not mention details in the implementation.
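To make the relation between K and the number of single-bit matrix elements concrete, here is a small host-side sketch (my own illustration, not code from this PR) of a bitwise dot product over one row of A and one column of B, each stored as K = 4 uint32_t words holding K * 32 = 128 single-bit elements. It uses AND plus population count; an XOR-based variant is the other common form of bitwise dot product:

```cpp
#include <array>
#include <bit>      // std::popcount (C++20)
#include <cstdint>

// One row of A and one column of B: 128 single-bit matrix elements each,
// packed into K = 4 uint32_t words (32 single-bit elements per word).
constexpr int K = 4;

int32_t bitwise_dot(const std::array<uint32_t, K> &a_row,
                    const std::array<uint32_t, K> &b_col) {
  int32_t acc = 0;
  for (int k = 0; k < K; ++k)
    acc += std::popcount(a_row[k] & b_col[k]); // AND, then count set bits
  return acc; // contribution of 128 single-bit element pairs
}
```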
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
/verify with intel/llvm-test-suite#760
cc @dkhaldi
Implementation corresponding to the matrix extension proposal section "Bitwise Multiply and Add" in #4695
Integration tests here: intel/llvm-test-suite#760