preserve collectives by theodorbadea · Pull Request #190 · mlcommons/chakra · GitHub

preserve collectives #190


Open
wants to merge 8 commits into main

Conversation

@theodorbadea
Contributor
theodorbadea commented Mar 27, 2025

Mostly addressing #161
Issue: some collectives are missing from the .et file, but are present in the kineto file.

It all starts from the host trace: these missing collectives have host operations with missing parents. To be more specific, the host ops corresponding to these collectives have an ancestor which is not part of the p_trace file.
Specific example:
host_op1: { id: 439, name: "record_param_comms", ctrl_deps: 438, rf_id: 115 } -> this corresponds to a ncclDevKernel_Broadcast
host_op2: { id: 438, name: "c10d::broadcast_", ctrl_deps: 9 }
host_op3: { id: 9, name: "DistributedDataParallel.forward", ctrl_deps: 4 }

So, host_op1 has host_op2 as its parent, host_op2 has host_op3 as its parent, and host_op3 has the node with id 4 as its parent.
However, the host op with id: 4 is not present in the p_trace file, so facebookresearch/param/et_replay will not properly build the children/parent links (as seen in execution_trace.py):
if n.parent_id in self.nodes:  # there is no node with id equal to parent_id in the list of nodes
    self.nodes[n.parent_id].add_child(n)
    n.set_parent(self.nodes[n.parent_id])
When chakra loads the host trace, it does a traversal starting from the root_node, but since some nodes have parents that are not present in the file, the trace is a graph with multiple connected components, so chakra will load only the "main" connected component, the one starting from the root node. For example, for a ResNet50 on 4 ranks that I've used for debugging, it loads 2410/4484 host op nodes.
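
To make the effect concrete, here is a minimal, hypothetical sketch (not the actual chakra/et_replay code; node fields are simplified) of the difference between traversing from the root node and keeping every node:

```python
from collections import defaultdict, deque

def load_reachable(nodes, root_id):
    """Current behavior (simplified): only nodes reachable from the root via
    parent/child links are kept, so components with a missing parent are dropped."""
    children = defaultdict(list)
    for n in nodes:
        children[n["ctrl_deps"]].append(n["id"])
    visited, queue = set(), deque([root_id])
    while queue:
        cur = queue.popleft()
        if cur in visited:
            continue
        visited.add(cur)
        queue.extend(children[cur])
    return visited  # e.g. 2410 of 4484 host ops in the ResNet50 example above

def load_all(nodes):
    """Proposed behavior (simplified): keep every host op, even when its
    ctrl_deps points at an id that is absent from the p_trace file."""
    return {n["id"] for n in nodes}
```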

Attaching several relevant files generated with and without the changes from this branch.
p_trace_rank0.json
kineto_trace_rank0.json
new_jsonized_chakra_trace.json
new_merged.json
old_jsonized_chakra_trace.json
old_merged.json

cc: @winstonliu1111 @danmih-ixia @TaekyungHeo @tushar-krishna

theodorbadea requested a review from a team as a code owner on March 27, 2025 18:49
github-actions bot commented Mar 27, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@spandoescode

Hey @theodorbadea,

Just wanted to confirm whether this PR is done or not, since it says "Draft PR" in the first line. If this is still a draft, could you save this as a draft instead?

@theodorbadea
Contributor Author
theodorbadea commented Apr 14, 2025

Hey @spandoescode, I put "draft" in the first comment because this PR contains just some commented-out code, and I would like to use it only to support my proposal and get feedback from the community for now. If we decide that this change is correct and should be integrated, I will make the proper code changes (a more elegant way of loading the host nodes and entirely removing the method "remove_dangling_nodes").
However, I would like to avoid marking it as Draft because I am not actively working on it and I want to draw some attention to it (I feel that marking this as Draft might make the community ignore it, as people could believe it's still work in progress).

Looking forward to receiving your feedback on this proposal to load all the host nodes and not just the ones from the main connected component.

@danmih-ixia

@tushar-krishna, @srinivas212 can you please advise who can review this PR?

@jinsun-yoo
Contributor

Thank you for the suggestion!
I think this fix will properly include the broadcast in the Chakra trace. However, I'm not sure if this correctly represents the dependency behavior.
For example, in new_jsonized_chakra_trace.json, the broadcasts with ids 116/117 and 175/176 are dependent on id 5, but id 5 is not dependent on anything. I would assume that in reality each of the broadcasts would be dependent on some compute.

What do you think?

@theodorbadea
Contributor Author

Hey, @jinsun-yoo. Indeed, since the parent is missing, it will not depend on any compute node. This is basically the reason why I initiated this discussion: to get feedback from the community about the correctness of this proposal...

@JoongunPark
Contributor
JoongunPark commented May 15, 2025


Hi @theodorbadea, thank you for the PR, and sorry for joining this discussion late.
Originally, when Chakra loads host ops, it traverses the operations. As @theodorbadea explained, this eliminates lots of host ops that are not connected to each other. I believe this is an important problem to solve.
This PR basically preserves all the CPU ops in the host traces and the dangling Chakra nodes in the Chakra trace.

I have a few suggestions.
(1) We need extra logic that searches for a dependency for the broadcast and connects compute -> broadcast (in this case) in a better way (a rough sketch follows below).
(2) We might still need remove_dangling_nodes rather than skipping it. Otherwise, all the dangling nodes will be scheduled at the beginning of the simulation/replay.
(3) We need to understand why some host nodes do not have dependencies on other nodes. That would help us solve this problem.
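
A rough sketch of what suggestion (1) could look like, assuming a simplified node schema ('id', 'tid', 'is_collective', 'ctrl_deps') rather than the real chakra one, and using thread-local ordering as a stand-in for real data dependencies (a hypothetical helper, not part of this PR):

```python
def attach_orphan_collectives(nodes):
    """Hypothetical post-processing: re-parent each orphan collective onto the most
    recent compute op seen on the same thread, so it is not scheduled at t=0.
    `nodes` is assumed to be sorted by timestamp; the field names are assumptions."""
    known_ids = {n["id"] for n in nodes}
    last_compute = {}  # tid -> most recent non-collective node
    for n in nodes:
        if n.get("is_collective"):
            if n["ctrl_deps"] not in known_ids and n["tid"] in last_compute:
                n["ctrl_deps"] = last_compute[n["tid"]]["id"]  # force compute -> collective
        else:
            last_compute[n["tid"]] = n
    return nodes
```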

@JoongunPark
Contributor
JoongunPark commented May 19, 2025


Hi @theodorbadea, thank you for the talk today.
We weren't able to discuss this PR much, but I believe it is very important and must be fixed.
I hope we can discuss it in detail in our next meeting. Thank you!

@theodorbadea
Contributor Author
theodorbadea commented May 21, 2025

@jinsun-yoo I managed to reproduce the situation you were targeting with your comment when training AlexNet with FSDP.
This is what the original .et looks like:
[image]

This is what the .et containing the orphan collectives looks like:
[image]

Considering the two illustrations and analyzing AlexNet in FSDP with size_based_auto_wrap_policy and a minimum of 1,500,000 parameters for wrapping, it seems that the entire forward-pass part and the final AllReduce (produced by explicitly calling all_reduce(loss, op=dist.ReduceOp.SUM)) are missing.
I believe that 3 modules of the AlexNet (the 3 layers from the classifier) and perhaps their nn.Sequential container get wrapped, which explains the 4 AllGather and 4 ReduceScatter visible in the original .et.
However, the AllGathers corresponding to the forward pass are entirely missing.
Additionally, your remark was correct: forcing Chakra to load these orphan nodes is not enough because dependencies are also required. In the second illustration, the AllReduce has no dependency, so it may get executed first, when in fact it should be executed last.
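
For context, a minimal sketch of the kind of setup described above (model, wrapping policy, and the explicit loss all-reduce); the exact training script is not part of this PR and the details here are assumptions:

```python
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
from torchvision.models import alexnet

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

# Wrap AlexNet with FSDP; only modules above 1,500,000 parameters get their own
# FSDP unit, which is why only the classifier layers show up as AllGather/ReduceScatter.
policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_500_000)
model = FSDP(alexnet().cuda(), auto_wrap_policy=policy)

# ... forward/backward (omitted) ...
# loss = criterion(model(batch), target)
# loss.backward()

# The explicit all-reduce of the loss mentioned above; in the orphan-preserving .et
# this is the final AllReduce that currently ends up without any dependency.
# dist.all_reduce(loss, op=dist.ReduceOp.SUM)
```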

cc: @JoongunPark
k_trace_rank=0_.json
p_trace_rank=0_.json

Later Edit: @TaekyungHeo (I think you were interested in seeing an example), please find attached these AlexNet traces. In the p_trace file, the node with "id": 4 is missing; however, 4 appears as the parent of several ops, so these ops will be dropped.

@JoongunPark
Contributor
JoongunPark commented May 23, 2025


Hi @theodorbadea. I realized that this PR, together with your next PR #191, might work better than the current version.
As we weren't able to finalize our discussion in our previous meeting, it would be great to move forward after our next Chakra meeting.

Also, I have a quick question.
Were you able to observe missing operations other than collectives? If so, do you have any idea how to enforce dependencies for those ops?
Thank you!

@theodorbadea
Contributor Author
theodorbadea commented May 27, 2025

Hey, @JoongunPark. After some more digging, I found out that what's captured in the resulting chakra.et corresponds to thread tid=2 and what's missing corresponds to thread tid=1. About your question: yes, there are also missing compute nodes.
So, basically, in the attached example, what we can see in the .et right now corresponds to:
    {
        "id": 345, "name": "[pytorch|profiler|execution_trace|thread]", "ctrl_deps": 1,
        ...
        "tid": 2
    }
and what is missing corresponds to:
    {
        "id": 2, "name": "[pytorch|profiler|execution_trace|thread]", "ctrl_deps": 1,
        ...
        "tid": 1
    }
I noticed that the node with "id": 4 is missing from the p_trace, therefore all the nodes which have "ctrl_deps": 4 will be dropped.
Then, I did a little experiment and manually replaced "ctrl_deps": 4 with "ctrl_deps": 2, basically "forcing" the orphan graph components to be part of the graph by setting {"id": 2, "name": "[pytorch|profiler|execution_trace|thread]"} as their parent (which I suspect it truly is, but there are some missing nodes in between).

Now I intend to check the PyTorch profiler and see if there is something missing there, perhaps some setting, which might explain the missing host operations. I find it very strange that it is able to record {"id": 2, "name": "[pytorch|profiler|execution_trace|thread]", "ctrl_deps": 1} and some other distant children of this node, but fails to record what's in between.
Attaching here a new mermaid chart after hardcoding the proper parent:
[image]
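
For reference, the manual experiment described above (re-parenting the orphan components under the tid=1 thread node) can be reproduced with a small script along these lines, assuming the usual top-level "nodes" array in the p_trace JSON (the patched file name is just an example):

```python
import json

with open("p_trace_rank=0_.json") as f:
    trace = json.load(f)

# The node with "id": 4 is missing from the trace, so re-parent everything that
# points at it onto the thread node {"id": 2, "name": "[pytorch|profiler|execution_trace|thread]"}.
for node in trace["nodes"]:
    if node.get("ctrl_deps") == 4:
        node["ctrl_deps"] = 2

with open("p_trace_rank=0_patched.json", "w") as f:
    json.dump(trace, f)
```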

@JoongunPark
Contributor


Thank you for the confirmation, @theodorbadea!
I think your approach makes sense.
We may need to choose between (1) finding hidden dependencies we might be missing or (2) forcing ops to have dependencies, as you did. Let's discuss this during the meeting.

Labels: None yet
Projects: None yet
5 participants