[float8 moe training] Add TP support #2425

danielvegamyhre · 2025-06-23T17:10:09Z

Note: this should be merged AFTER this bug fix: #2451. I will rebase and retest all of this once that's merged.

Summary

Add TP support for routed experts and shared expert.
- Make the target dim of the scale squeeze() ops explicit to handle both 2D and 3D "A" tensors (routed experts case has 2D "A", shared expert has 3D "A").
- Make offs optional to handle the shared_expert case where num_experts=1 (scaled grouped GEMM only processing 1 expert).
- Add debug logging.
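To illustrate why the squeeze dim needs to be explicit, here is a small pure-Python model of the output shape of squeeze(). It is a sketch, not the code in this PR: `squeeze_shape` is a hypothetical helper, and the scale shapes assume row-wise scales that keep a trailing size-1 dim and a shared expert whose "A" has a leading dim of 1.

```python
def squeeze_shape(shape, dim=None):
    """Model the output shape of Tensor.squeeze() for a given input shape."""
    shape = tuple(shape)
    if dim is None:
        # squeeze() with no argument drops every size-1 dimension.
        return tuple(s for s in shape if s != 1)
    dim = dim % len(shape)  # normalize negative dims such as -1
    if shape[dim] != 1:
        # squeeze(dim) leaves the shape alone when that dim is not size 1.
        return shape
    return shape[:dim] + shape[dim + 1:]


# Assumed (illustrative) scale shapes, not taken from the PR:
routed_scales = (256, 1)     # scales for a 2D "A"
shared_scales = (1, 256, 1)  # scales for a 3D "A" with a leading dim of 1

print(squeeze_shape(routed_scales))          # (256,)
print(squeeze_shape(shared_scales))          # (256,)   leading dim is lost too
print(squeeze_shape(shared_scales, dim=-1))  # (1, 256) leading dim is kept
```

With an implicit squeeze, the 3D case silently loses its leading dim as well as the trailing one; naming the dim removes only the one that was intended.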

Test plan

Added integration test using torchtitan llama4 TP implementation. Test cases for (1) routed experts, and (2) routed experts + shared expert.
Manual testing with torchtitan llama4 debug model with TP=2, targeting routed experts AND shared experts, works (logs).
Manual testing with torchtitan llama4 debug model with FSDP=2 + TP=2 confirms this 2D parallelism is working for routed experts (logs).

Limitations

2D parallelism with FSDP+TP for shared experts is not yet supported (see comment below). Need to debug this, which I will do in a subsequent PR.

pytorch-bot · 2025-06-23T17:10:20Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2425

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

VolumeLimitExceeded Issue for linux.2xlarge and linux.4xlarge

❌ 2 New Failures

As of commit 29be4b2 with merge base 2898903:

NEW FAILURES - The following jobs have failed:

Run Regression Tests / test (CUDA 2.7, linux.g5.12xlarge.nvidia.gpu, torch==2.7.0, cuda, 12.6) / linux-job (gh)
RuntimeError: Command docker exec -t 533117a2d7bafca98cef6a8decab8f278453c27faba267bb77dd267aaa05b584 /exec failed with exit code 2
Run TorchAO Experimental Tests / test-mps-ops (macos-m1-stable) (gh)
Process completed with exit code 127.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

danielvegamyhre · 2025-06-23T21:11:31Z

Error with FSDP=2, TP=2 targeting both routed experts AND shared expert. The issue is specific to the shared expert using 2D parallelism. Will debug and resolve in separate PR.

The logs are a bit confusing. I first see an error in meta registration saying that the "B" tensor is fp32 instead of bf16. This is odd, since I'm not using torch.compile and I thought meta registrations were only used for compile.

File "/home/danvm/.conda/envs/torchtitan/lib/python3.13/site-packages/torch/_meta_registrations.py", line 7527, in _meta_grouped_mm_common
...
RuntimeError: Expected inputs of BF16 type but got mat_a.dtype=torch.bfloat16 and mat_b.dtype=torch.float32.

Then a few lines later, I see my log lines during the forward pass, just before the grouped mm, confirming the "B" tensor (W1) is bf16, not fp32:

[rank0]:X dtype: torch.bfloat16
[rank0]:W1 dtype: torch.bfloat16
[rank0]:W1 type: <class 'torch.distributed.tensor.DTensor'>
[rank0]:W1.to_local() type: <class 'torchao.prototype.moe_training.tensor.ScaledGroupedMMTensor'>
[rank0]:W1.to_local() dtype: torch.bfloat16

(As an aside, it's strange that these log lines appear AFTER the error has already occurred. I assume it must be due to how log writes are buffered.)

Then at the end of the logs, I see a different error related to strides/sizes not matching a storage of size 0, but I'm guessing this is a downstream effect of the first error:

RuntimeError: setStorage: sizes [1, 256, 256], strides [65536, 256, 1], storage offset 0, and itemsize 2 requiring a storage size of 131072 are out of bounds for storage of size 0
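As a sanity check on the numbers in that message, the required storage size can be reproduced from the sizes, strides, and itemsize. This is a minimal sketch of the usual strided-layout arithmetic, not PyTorch's actual implementation, and `required_storage_bytes` is a hypothetical name.

```python
def required_storage_bytes(sizes, strides, storage_offset, itemsize):
    """Bytes needed to hold every element reachable through a strided view."""
    if any(n == 0 for n in sizes):
        return 0  # an empty tensor touches no storage
    # Offset (in elements) of the last reachable element.
    last_index = storage_offset + sum(
        (n - 1) * stride for n, stride in zip(sizes, strides)
    )
    return (last_index + 1) * itemsize


# Values copied from the error message above (itemsize 2 matches bf16).
required = required_storage_bytes(
    sizes=[1, 256, 256],
    strides=[65536, 256, 1],
    storage_offset=0,
    itemsize=2,
)
print(required)  # 131072
```

The result matches the 131072 bytes in the error, so the sizes and strides describe an ordinary contiguous 1x256x256 tensor; the out-of-bounds condition comes from the storage being size 0.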

Full logs: https://www.internalfb.com/phabricator/paste/view/P1850071143

danielvegamyhre · 2025-06-24T17:36:47Z

cc @drisspg @vkuzo for review

fyi @tianyu-l @lessw2020 @ngimel for awareness as well

drisspg · 2025-06-26T23:41:28Z

test/prototype/moe_training/test_tp.py

+from torch.nn import functional as F
+
+# this feature requires CUDA and SM89+
+if not torch.cuda.is_available() or torch.cuda.get_device_capability() < (8, 9):


nit: we have some helpers for this in ao/utils
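For reference, a helper of this kind could look roughly like the sketch below. The name `is_sm_at_least` and its signature are hypothetical (I have not checked what ao/utils actually exposes); only `torch.cuda.is_available()` and `torch.cuda.get_device_capability()` are taken from the diff above.

```python
def is_sm_at_least(major, minor, capability=None):
    """Return True if the CUDA compute capability is >= (major, minor).

    `capability` can be passed explicitly (for example in unit tests);
    otherwise it is queried from the current CUDA device.
    """
    if capability is None:
        # Imported lazily so the comparison logic can be exercised without a GPU.
        import torch

        if not torch.cuda.is_available():
            return False
        capability = torch.cuda.get_device_capability()
    # Tuples compare element by element, so (9, 0) >= (8, 9) is True.
    return tuple(capability) >= (major, minor)


print(is_sm_at_least(8, 9, capability=(9, 0)))  # True
print(is_sm_at_least(8, 9, capability=(8, 0)))  # False
```

The test module's guard would then reduce to a single call such as `if not is_sm_at_least(8, 9): ...`.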

drisspg · 2025-06-26T23:42:12Z

test/prototype/moe_training/test_tp.py

+
+# this test requires torchtitan
+try:
+    from torchtitan.experiments.llama4.infra.parallelize import apply_moe_tp


we should add this test to test_float8 ->

ao/.github/workflows/float8_test.yml

Line 46 in 994a4ba

script: |

drisspg · 2025-06-26T23:43:54Z

test/prototype/moe_training/test_tp.py

+    dist.destroy_process_group()
+
+
+def _validate_model_conversion(


Did I review another PR that had the same util? If so, maybe put it into torchao.testing so we can reuse it.

drisspg · 2025-06-26T23:45:09Z

test/prototype/moe_training/test_tp.py

+    return device_mesh
+
+
+def apply_moe_tp(


this is always specific to module structure, e.g. the FQNs, right?

drisspg · 2025-06-26T23:46:16Z

torchao/prototype/moe_training/conversion_utils.py

@@ -8,6 +14,8 @@
    register_quantize_module_handler,
 )

+logger: logging.Logger = logging.getLogger(__name__)


side note: we should set up better logging in torchao
alas: https://docs.python.org/3/howto/logging.html#configuring-logging-for-a-library

just getting the root logger going w/ null handler
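A minimal sketch of the pattern from the linked Python logging HOWTO, which attaches a NullHandler to the library's top-level logger. Placing it in the package's `__init__.py` and the exact logger names used here are assumptions for illustration, not something this PR does.

```python
import logging

# Assumed placement: the top-level package __init__.py.
# A NullHandler means the library emits nothing unless the application
# configures logging itself.
logging.getLogger("torchao").addHandler(logging.NullHandler())

# Submodules keep using module-level loggers, as in the diff above.
# The dotted name is written out here; in a real module it would be __name__.
logger = logging.getLogger("torchao.prototype.moe_training.conversion_utils")
logger.debug("this record is discarded unless the application adds a handler")
```

An application that wants the debug output can then opt in with its own handler and level on the "torchao" logger.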

make offs optional for scaled grouped mm

efd993f

facebook-github-bot added the CLA Signed label Jun 23, 2025

danielvegamyhre marked this pull request as draft June 23, 2025 17:10

danielvegamyhre added the topic: not user facing label Jun 23, 2025

danielvegamyhre changed the title ~~[float8 moe training] TP support for routed experts~~ [float8 moe training] Add TP support Jun 23, 2025

danielvegamyhre force-pushed the optional-offs branch 2 times, most recently from 80cf6d4 to 44778d0 Compare June 23, 2025 21:40

tp on routed experts working

accbb27

danielvegamyhre force-pushed the optional-offs branch from 44778d0 to accbb27 Compare June 23, 2025 21:49

danielvegamyhre changed the title ~~[float8 moe training] Add TP support~~ [float8 moe training] Add TP and FSDP+TP support Jun 24, 2025

danielvegamyhre force-pushed the optional-offs branch from c61add8 to 976bc10 Compare June 24, 2025 17:30

danielvegamyhre changed the title ~~[float8 moe training] Add TP and FSDP+TP support~~ [float8 moe training] Add TP support Jun 24, 2025

danielvegamyhre marked this pull request as ready for review June 24, 2025 17:32

add tp integration test

a80b9a0

danielvegamyhre force-pushed the optional-offs branch 2 times, most recently from e4ff51d to 074b423 Compare June 24, 2025 17:36

danielvegamyhre requested review from vkuzo and drisspg June 24, 2025 17:36

remove excessive logging

fb0122e

danielvegamyhre force-pushed the optional-offs branch from 074b423 to fb0122e Compare June 24, 2025 17:38

danielvegamyhre mentioned this pull request Jun 24, 2025

[roadmap/tracker] Low precision MoE training #2147

Open

36 tasks

fix dtype bug

29be4b2

danielvegamyhre force-pushed the optional-offs branch from 97e55e8 to 29be4b2 Compare June 26, 2025 23:14

drisspg reviewed Jun 26, 2025

drisspg reviewed Jun 26, 2025

drisspg reviewed Jun 26, 2025

drisspg approved these changes Jun 26, 2025
