Question about migrating CUDA bmma_sync · Issue #12325 · intel/llvm · GitHub
Closed
jinz2014 opened this issue Jan 8, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

jinz2014 (Contributor) commented on Jan 8, 2024

Does joint_matrix support a similar operation?

bmma_sync
Waits until all warp lanes have executed bmma_sync, and then performs the warp-synchronous bit matrix multiply-accumulate operation D = (A op B) + C, where op consists of a logical operation bmmaBitOp followed by the accumulation defined by bmmaAccumulateOp. The available operations are:

bmmaBitOpXOR, a 128-bit XOR of a row in matrix_a with the 128-bit column of matrix_b

bmmaBitOpAND, a 128-bit AND of a row in matrix_a with the 128-bit column of matrix_b, available on devices with compute capability 8.0 and higher.

The accumulate op is always bmmaAccumulateOpPOPC which counts the number of set bits.
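For reference, here is a minimal CUDA sketch of how this operation is typically invoked through the WMMA API (`nvcuda::wmma::experimental`, 8x8x128 single-bit fragments). The kernel name, pointer types, and packing conventions are illustrative assumptions rather than anything taken from this issue, and should be checked against the CUDA programming guide:

```cuda
#include <mma.h>

using namespace nvcuda;

// One warp computes D (8x8, s32) = popc(A (8x128, b1) op B (128x8, b1)) + C.
// A rows and B columns are bit-packed, 128 bits (four 32-bit words) each.
// Launch with a single warp (32 threads); requires sm_75 or newer.
__global__ void bmma_8x8x128(const unsigned *a_bits, const unsigned *b_bits, int *d) {
  using namespace wmma::experimental;

  wmma::fragment<wmma::matrix_a, 8, 8, 128, precision::b1, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 8, 8, 128, precision::b1, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 8, 8, 128, int> c_frag;

  wmma::fill_fragment(c_frag, 0);
  // For sub-byte types the leading dimension is given in elements (bits here).
  wmma::load_matrix_sync(a_frag, a_bits, 128);
  wmma::load_matrix_sync(b_frag, b_bits, 128);

  // D = popc(A XOR B) + C; pass bmmaBitOpAND instead on sm_80+ devices.
  wmma::bmma_sync(c_frag, a_frag, b_frag, c_frag, bmmaBitOpXOR, bmmaAccumulateOpPOPC);

  wmma::store_matrix_sync(d, c_frag, 8, wmma::mem_row_major);
}
```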

jinz2014 added the enhancement label on Jan 8, 2024
JackAKirk (Contributor) commented on Jan 9, 2024

> Does joint_matrix support a similar operation?
>
> bmma_sync
> Waits until all warp lanes have executed bmma_sync, and then performs the warp-synchronous bit matrix multiply-accumulate operation D = (A op B) + C, where op consists of a logical operation bmmaBitOp followed by the accumulation defined by bmmaAccumulateOp. The available operations are:
>
> bmmaBitOpXOR, a 128-bit XOR of a row in matrix_a with the 128-bit column of matrix_b
>
> bmmaBitOpAND, a 128-bit AND of a row in matrix_a with the 128-bit column of matrix_b, available on devices with compute capability 8.0 and higher.
>
> The accumulate op is always bmmaAccumulateOpPOPC which counts the number of set bits.

There is a draft impl for bmma here that is fully working within its branch:
#5363

The point, however, is that:

  • The only notable hardware that supports bitwise MMA so far is NVIDIA's, and there it is still experimental, so a oneAPI extension, which must support multiple vendors' hardware, is not currently possible. Note also that the XOR operator you mentioned is deprecated and unsupported on the latest NVIDIA hardware.
  • I'm not aware of any notable libraries that currently support bmma (although some may exist, since it has been a long time since I last investigated). It has some usage, but it does not appear to be widely adopted at the moment.
  • More generally, I get the impression that although bmma has been shown to work for certain practical applications, the question of preferred quantized data types for inference (or backprop) has not been settled.

Hence, as far as it concerns us, it is pretty low priority. If users wish to experiment with it, they can do so via the branch mentioned above, or on the latest hardware via inline PTX, as advised by the NVIDIA docs.
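For illustration of that inline-PTX route, here is a rough sketch of a single m8n8k128 b1 MMA using the AND + POPC combination. The per-thread fragment layout is defined by the PTX ISA and is not shown, so the function name, argument packing, and surrounding scaffolding are assumptions rather than a tested implementation:

```cuda
// Per-thread slice of the warp-wide m8n8k128 b1 MMA, using the .and.popc
// variant (sm_80+). a0/b0 hold this thread's 32 bits of the packed A/B
// fragments and c0/c1/d0/d1 its two accumulator values; the exact mapping of
// matrix elements to threads is given by the PTX ISA fragment layout.
__device__ void bmma_and_popc_m8n8k128(unsigned a0, unsigned b0,
                                       int c0, int c1, int &d0, int &d1) {
  asm volatile(
      "mma.sync.aligned.m8n8k128.row.col.s32.b1.b1.s32.and.popc "
      "{%0,%1}, {%2}, {%3}, {%4,%5};\n"
      : "=r"(d0), "=r"(d1)
      : "r"(a0), "r"(b0), "r"(c0), "r"(c1));
}
```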
