[Mosaic GPU] Check in WIP grouped GEMM #26997
base: main
Conversation
Currently the kernel hangs in the second iteration because the mbarrier state is not carried over correctly. In the last iteration of the k-loop we arrive on mma_done_barrier instead of ab_empty_barrier, so the phase of ab_empty_barrier is not flipped.
We are not using clusters, and the extra code was only hurting readability.
We accomplish this by:
1. Round-robin iterating through input buffer slots based on `persistent_ki`, which tracks k-loop iterations over the entire lifetime of a CTA worker, not just a single work item.
2. Arriving on an `ab_empty_barrier` even in the last iteration of a k-loop, so that the barrier phase is flipped correctly across work items.
3. Waiting on an `ab_empty_barrier` in all iterations of all work items' k-loops, except for the first `max_concurrent_steps` iterations of the very first work item's k-loop.
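A plain-Python sketch of the bookkeeping described in the three points above. The work list is made up, and `wait`/`arrive` are stand-ins for the mbarrier operations; the real kernel splits this across the TMA and MMA warps and emits Mosaic GPU IR.

```python
# Illustrative only: names follow the commit message, everything else is a stub.
max_concurrent_steps = 2
work_items = [("item0", 4), ("item1", 3)]  # (name, k_steps) -- hypothetical

def wait(barrier, slot):
  print(f"wait   {barrier}[{slot}]")

def arrive(barrier, slot):
  print(f"arrive {barrier}[{slot}]")

persistent_ki = 0  # counts k-loop steps over the CTA worker's whole lifetime
for _, k_steps in work_items:
  for ki in range(k_steps):
    # (1) Round-robin slot choice based on persistent_ki, not ki.
    slot = persistent_ki % max_concurrent_steps
    # (3) Wait on ab_empty everywhere except the first few steps overall.
    if persistent_ki >= max_concurrent_steps:
      wait("ab_empty_barrier", slot)
    # ... TMA loads fill the slot, arrive on ab_full, MMA consumes the slot ...
    # (2) Arrive on ab_empty even on the last ki, so the phase keeps flipping.
    arrive("ab_empty_barrier", slot)
    persistent_ki += 1
```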
Otherwise when TMA box coordinates are not aligned with a subtile, the 4D TMA box goes out of bounds of the 4D TMA tensor in a non-contiguous fashion.
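A tiny arithmetic sketch of why the alignment matters. The tile size and the (M // TILE_M, K // TILE_K, TILE_M, TILE_K) tiled layout are assumed purely for illustration; the kernel's actual constants may differ.

```python
TILE_M = 64  # hypothetical tile size

def box_start(row_offset):
  # 4D coordinates at which a TMA box covering rows
  # [row_offset, row_offset + TILE_M) would have to start.
  return (row_offset // TILE_M, 0, row_offset % TILE_M, 0)

print(box_start(128))  # (2, 0, 0, 0): aligned, the box stays inside one tile row
print(box_start(100))  # (1, 0, 36, 0): unaligned, a TILE_M-tall box would need
                       # in-tile rows 36..99, past the tile extent of 64
```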
Committing with all the messy debug code so I can come back to it later if needed. Caveats/hacks:
- We store group chunks at inexact offsets (aligned to `tile_m` size), so the computation is not fully correct/compliant. We should modify the kernel to store at exact offsets. We could do this by storing directly from registers, or by using `tensormap.replace` before each TMA store. Storing from registers feels simpler.
- We compute the grouped GEMM schedule separately on the host. We should either do this in a single optimized kernel, fuse it into the grouped GEMM, or compute it on the fly somehow.
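For the second caveat, a hedged numpy sketch of what such a host-side schedule could look like. The real schedule's contents aren't shown in this PR, so the function name, outputs, and shapes below are assumptions.

```python
import numpy as np

def make_schedule(group_sizes, tile_m):
  # Hypothetical: map each flat work item to (group id, m-tile within the
  # group, starting row), rounding every group up to whole m-tiles.
  tiles_per_group = -(-np.asarray(group_sizes) // tile_m)  # ceil division
  group_ids = np.repeat(np.arange(len(group_sizes)), tiles_per_group)
  tile_in_group = np.concatenate([np.arange(n) for n in tiles_per_group])
  group_starts = np.cumsum([0, *group_sizes[:-1]])
  row_offsets = np.concatenate(
      [start + np.arange(n) * tile_m
       for start, n in zip(group_starts, tiles_per_group)])
  return group_ids, tile_in_group, row_offsets

print(make_schedule([300, 100, 250], tile_m=128))
```

Note that with exact group starts (300 and 400 in this example) the row offsets are not multiples of `tile_m`, which is exactly what runs into the alignment issue described above.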
Co-authored-by: Adam Paszke <apaszke@google.com>
It's broken; it fails with an illegal instruction. Something is off in the `subgroup_for` loop: `work_id` is not carried correctly. I probably need a while loop instead.
This needs to be moved to Mosaic GPU utilities.
# TODO(andportnoy) move into tcgen05.py
def tmem_dealloc(tmem: ir.Value, ncols: int, collective: bool = False):
I think this can be deleted now, as it's done automatically by MGPU
import jax.numpy as jnp
import jax.random as jr
import numpy as np
from cuda.bindings.runtime import cudaDeviceGetAttribute, cudaDeviceAttr, cudaGetDeviceCount, cudaError_t
I don't think we can depend on the CUDA Python APIs in those files. Is there some way to do what you want using the JAX device APIs? If not, we should consider exposing it
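Not sure this covers everything the CUDA calls were used for, but a possible JAX-only sketch (I'm assuming CUDA devices in JAX expose `core_count` and `compute_capability`; if they don't, that's probably the attribute worth exposing):

```python
import jax

gpus = jax.devices("gpu")
print(len(gpus))             # device count, instead of cudaGetDeviceCount
d = gpus[0]
print(d.core_count)          # SM count, instead of cudaDeviceGetAttribute
print(d.compute_capability)  # e.g. "9.0"
```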
def bytecount(shape, dtype):
  return int(np.prod(shape) * dtype.dtype.itemsize)
nit: whitespace between top-level declarations
@contextlib.contextmanager
def single_warp_thread():
You already have mgpu.single_thread(scope=mgpu.ThreadSubset.WARP)
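i.e. roughly (assuming the same context-manager usage as the helper being replaced):

```python
with mgpu.single_thread(scope=mgpu.ThreadSubset.WARP):
  ...  # emitted for a single thread of each warp
```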
jnp.float16)
smem_buffers = mgpu.Union([compute_buffers, epilogue_buffer])
tmem_cols = tmem_slot_count * tile_n
assert tmem_cols.bit_count() == 1
That should not be necessary
TMA_WARP = 1
MMA_WARP = 0
Those seem unused
warpgroup = allocator.alloc_warpgroup

tma_warp = warp()
mma_warp = warp()
What's the benefit of the whole WarpAllocator indirection? They're still allocated statically, so you can just hardcode them to `TMA_WARP = 0` and `MMA_WARP = 1` as you did before, right?
shape=(tile_m, tile_n),
layout=acc.layout,
dtype=acc.dtype,
)
This is just slicing the TMEM ref, right? Perhaps extend `TMEMRef.slice` to support this instead?
return value


def subgroup_for(work_id, work_id_group_end, worker_count):
What does this do? Docs would be helpful
)

a_smem_slot = mgpu.memref_slice(a_smem, slot)
a_smem_slot_2d_shape, _ = mma_utils.tiled_memref_shape(a_smem_slot)
Yeah you should not use mma_utils in this kernel. It's a private file. What's happening here?
work_id_group_start = cx(0)
initial_work_id = worker_id
group_offset = cx(0)
@mgpu.fori(cx(expert_count), [initial_work_id, work_id_group_start, group_offset])
I'm not sure if I understand the iteration scheme of this kernel. Could you please describe it a bit better?
No description provided.