8000 [SimpleFSDP] Add support for hsdp+tp by ruisizhang123 · Pull Request #1343 · pytorch/torchtitan · GitHub

[SimpleFSDP] Add support for hsdp+tp #1343

Status: Open — wants to merge 1 commit into base branch main

Conversation

ruisizhang123
Contributor
@ruisizhang123 ruisizhang123 commented Jun 26, 2025

As titled, this PR adds support for HSDP + TP in SimpleFSDP.

The profile trace below shows the three streams: FSDP's all-gather/reduce-scatter, DDP's all-reduce, and TP's communications.

[Screenshot: profile trace showing the three communication streams]

The loss curves below show that SimpleFSDP's and FSDP2's losses match under HSDP + TP mode (seed=42).

[Screenshot: loss curves for SimpleFSDP vs. FSDP2]

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jun 26, 2025
@ruisizhang123 ruisizhang123 force-pushed the ruisi/simplefsdp_hsdp_and_tp branch from dc82ddf to cbd4fd1 Compare June 26, 2025 04:25
@ruisizhang123 ruisizhang123 requested a review from tianyu-l June 26, 2025 04:26
target_spec=target_spec,
)

for placement in placements:
Contributor

Instead of doing this for loop over placements (which tries to be generic but is actually hard to make correct for every possible case), let's be more explicit about the three types DP / FSDP / HSDP, where there is at most one replicate dim and at most one shard dim.

I think we can just have three if-else conditions (instead of the DP / FSDP ones before), and one redistribute_local_tensor outside, just like before.

The FSDP2 code for dealing with HSDP is also here, for your reference
https://github.com/pytorch/pytorch/blob/dfef1e44085bb156abc4aff0f34a0b82a4a337b8/torch/distributed/fsdp/_fully_shard/_fsdp_param.py#L324
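A minimal sketch of the branching the reviewer suggests, with hypothetical stand-in `Replicate`/`Shard` classes in place of `torch.distributed.tensor` placements (the actual SimpleFSDP code would then perform a single `redistribute_local_tensor` call outside the branch):

```python
# Hypothetical sketch: classify a parameter's DP placements explicitly as
# DDP / FSDP / HSDP instead of looping generically over placements.
# Replicate and Shard are stand-ins for the DTensor placement types.
from dataclasses import dataclass


@dataclass(frozen=True)
class Replicate:
    pass


@dataclass(frozen=True)
class Shard:
    dim: int


def classify_dp_mode(placements):
    """Return 'ddp', 'fsdp', or 'hsdp' for the given placements.

    Assumes at most one replicate dim and at most one shard dim,
    matching the DP / FSDP / HSDP configurations discussed above.
    """
    replicates = [p for p in placements if isinstance(p, Replicate)]
    shards = [p for p in placements if isinstance(p, Shard)]
    assert len(replicates) <= 1 and len(shards) <= 1, "unexpected placements"

    if replicates and shards:
        return "hsdp"  # replicate on one mesh dim, shard on another
    if shards:
        return "fsdp"  # fully sharded
    return "ddp"       # fully replicated
```

For example, `classify_dp_mode([Replicate(), Shard(0)])` returns `"hsdp"`, so each mode's redistribution logic can live in its own explicit branch rather than being reconstructed inside a generic loop.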

@ruisizhang123 ruisizhang123 force-pushed the ruisi/simplefsdp_hsdp_and_tp branch from cbd4fd1 to fe081b9 Compare June 29, 2025 00:03
@ruisizhang123 ruisizhang123 requested a review from tianyu-l June 29, 2025 03:26