Question about migrating CUDA bmma_sync · Issue #12325 · intel/llvm · GitHub
Closed
jinz2014 opened this issue Jan 8, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

jinz2014 (Contributor) commented on Jan 8, 2024

Does joint_matrix support a similar operation?

bmma_sync
Waits until all warp lanes have executed bmma_sync, and then performs the warp-synchronous bit matrix multiply-accumulate operation D = (A op B) + C, where op consists of a logical operation bmmaBitOp followed by the accumulation defined by bmmaAccumulateOp. The available operations are:

bmmaBitOpXOR, a 128-bit XOR of a row in matrix_a with the 128-bit column of matrix_b

bmmaBitOpAND, a 128-bit AND of a row in matrix_a with the 128-bit column of matrix_b, available on devices with compute capability 8.0 and higher.

The accumulate op is always bmmaAccumulateOpPOPC which counts the number of set bits.
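For reference, here is a minimal CUDA sketch of how this operation is typically invoked through the WMMA API (`nvcuda::wmma::experimental`, 8x8x128 single-bit fragments). The kernel name, pointer types, and packing conventions are illustrative assumptions rather than anything taken from this issue, and should be checked against the CUDA programming guide:

```cuda
#include <mma.h>

using namespace nvcuda;

// One warp computes D (8x8, s32) = popc(A (8x128, b1) op B (128x8, b1)) + C.
// A rows and B columns are bit-packed, 128 bits (four 32-bit words) each.
// Launch with a single warp (32 threads); requires sm_75 or newer.
__global__ void bmma_8x8x128(const unsigned *a_bits, const unsigned *b_bits, int *d) {
  using namespace wmma::experimental;

  wmma::fragment<wmma::matrix_a, 8, 8, 128, precision::b1, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 8, 8, 128, precision::b1, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 8, 8, 128, int> c_frag;

  wmma::fill_fragment(c_frag, 0);
  // For sub-byte types the leading dimension is given in elements (bits here).
  wmma::load_matrix_sync(a_frag, a_bits, 128);
  wmma::load_matrix_sync(b_frag, b_bits, 128);

  // D = popc(A XOR B) + C; pass bmmaBitOpAND instead on sm_80+ devices.
  wmma::bmma_sync(c_frag, a_frag, b_frag, c_frag, bmmaBitOpXOR, bmmaAccumulateOpPOPC);

  wmma::store_matrix_sync(d, c_frag, 8, wmma::mem_row_major);
}
```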

jinz2014 added the enhancement label on Jan 8, 2024
JackAKirk (Contributor) commented on Jan 9, 2024

> Does joint_matrix support a similar operation?
>
> bmma_sync
> Waits until all warp lanes have executed bmma_sync, and then performs the warp-synchronous bit matrix multiply-accumulate operation D = (A op B) + C, where op consists of a logical operation bmmaBitOp followed by the accumulation defined by bmmaAccumulateOp. The available operations are:
>
> bmmaBitOpXOR, a 128-bit XOR of a row in matrix_a with the 128-bit column of matrix_b
>
> bmmaBitOpAND, a 128-bit AND of a row in matrix_a with the 128-bit column of matrix_b, available on devices with compute capability 8.0 and higher.
>
> The accumulate op is always bmmaAccumulateOpPOPC which counts the number of set bits.

There is a draft impl for bmma here that is fully working within its branch:
#5363

The point, however, is that:

  • The only notable hardware that supports bitwise MMA so far is NVIDIA's, and there it is still experimental, so a oneAPI extension, which must support multiple vendors' hardware, is not currently possible. Note also that the XOR operator you mentioned is deprecated and unsupported on the latest NVIDIA hardware.
  • I'm not aware of any notable libraries that currently support bmma (although some may exist, since it has been a long time since I last investigated). It has some usage, but it does not appear to be widely adopted at the moment.
  • More generally, I get the impression that although bmma has been shown to work for certain practical applications, the question of preferred quantized data types for inference (or backprop) has not been settled.

Hence, as far as it concerns us, it is pretty low priority. If users wish to experiment with it, they can do so via the branch mentioned above, or on the latest hardware via inline PTX, as advised by the NVIDIA docs.
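For illustration of that inline-PTX route, here is a rough sketch of a single m8n8k128 b1 MMA using the AND + POPC combination. The per-thread fragment layout is defined by the PTX ISA and is not shown, so the function name, argument packing, and surrounding scaffolding are assumptions rather than a tested implementation:

```cuda
// Per-thread slice of the warp-wide m8n8k128 b1 MMA, using the .and.popc
// variant (sm_80+). a0/b0 hold this thread's 32 bits of the packed A/B
// fragments and c0/c1/d0/d1 its two accumulator values; the exact mapping of
// matrix elements to threads is given by the PTX ISA fragment layout.
__device__ void bmma_and_popc_m8n8k128(unsigned a0, unsigned b0,
                                       int c0, int c1, int &d0, int &d1) {
  asm volatile(
      "mma.sync.aligned.m8n8k128.row.col.s32.b1.b1.s32.and.popc "
      "{%0,%1}, {%2}, {%3}, {%4,%5};\n"
      : "=r"(d0), "=r"(d1)
      : "r"(a0), "r"(b0), "r"(c0), "r"(c1));
}
```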
