preserve collectives by theodorbadea · Pull Request #190 · mlcommons/chakra · GitHub

preserve collectives #190


Open
wants to merge 8 commits into main

Conversation

@theodorbadea
Contributor
theodorbadea commented Mar 27, 2025

Mostly addressing #161
Issue: some collectives are missing from the .et file, but are present in the kineto file.

It all starts from the host trace: these missing collectives have host operations with missing parents. To be more specific, the host ops corresponding to these collectives have an ancestor which is not part of the p_trace file.
Specific example:
host_op1: { id: 439, name: "record_param_comms", ctrl_deps: 438, rf_id: 115 } -> this corresponds to a ncclDevKernel_Broadcast
host_op2: { id: 438, name: "c10d::broadcast_", ctrl_deps: 9 }
host_op3: { id: 9, name: "DistributedDataParallel.forward", ctrl_deps: 4 }

So, host_op1 has host_op2 as its parent, host_op2 has host_op3 as its parent, and host_op3 has the node with id 4 as its parent.
However, the host op with id: 4 is not present in the p_trace file, so facebookresearch/param/et_replay will not properly build the children/parent links (as seen in execution_trace.py):
if n.parent_id in self.nodes:  # there is no node with id equal to parent_id in the list of nodes
    self.nodes[n.parent_id].add_child(n)
    n.set_parent(self.nodes[n.parent_id])
When chakra loads the host trace, it does a traversal starting from the root_node, but since some nodes have parents that are not present in the file, the trace is a graph with multiple connected components, so chakra will load only the "main" connected component, the one starting from the root node. For example, for a ResNet50 on 4 ranks that I've used for debugging, it loads 2410/4484 host op nodes.
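
To make the effect concrete, here is a minimal, hypothetical sketch (not the actual chakra/et_replay code; node fields are simplified) of the difference between traversing from the root node and keeping every node:

```python
from collections import defaultdict, deque

def load_reachable(nodes, root_id):
    """Current behavior (simplified): only nodes reachable from the root via
    parent/child links are kept, so components with a missing parent are dropped."""
    children = defaultdict(list)
    for n in nodes:
        children[n["ctrl_deps"]].append(n["id"])
    visited, queue = set(), deque([root_id])
    while queue:
        cur = queue.popleft()
        if cur in visited:
            continue
        visited.add(cur)
        queue.extend(children[cur])
    return visited  # e.g. 2410 of 4484 host ops in the ResNet50 example above

def load_all(nodes):
    """Proposed behavior (simplified): keep every host op, even when its
    ctrl_deps points at an id that is absent from the p_trace file."""
    return {n["id"] for n in nodes}
```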

Attaching several relevant files generated with and without the changes from this branch.
p_trace_rank0.json
kineto_trace_rank0.json
new_jsonized_chakra_trace.json
new_merged.json
old_jsonized_chakra_trace.json
old_merged.json

cc: @winstonliu1111 @danmih-ixia @TaekyungHeo @tushar-krishna

theodorbadea requested a review from a team as a code owner on March 27, 2025 18:49
github-actions bot commented Mar 27, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@spandoescode

Hey @theodorbadea,

Just wanted to confirm whether this PR is done or not, since it says "Draft PR" in the first line. If this is still a draft, could you save this as a draft instead?

@theodorbadea
Contributor Author
theodorbadea commented Apr 14, 2025

Hey @spandoescode, I put "draft" in the first comment because this PR contains just some commented-out code, and I would like to use it only to support my proposal and get feedback from the community for now. If we decide that this change is correct and should be integrated, I will make the proper code changes (a more elegant way of loading the host nodes and entirely removing the method "remove_dangling_nodes").
However, I would like to avoid marking it as Draft because I am not actively working on it and I want to draw some attention to it (I feel that marking this as Draft might make the community ignore it, as people could believe it's still work in progress).

Looking forward to receiving your feedback on this proposal to load all the host nodes and not just the ones from the main connected component.

@danmih-ixia

@tushar-krishna, @srinivas212 can you please advise who can review this PR?

@jinsun-yoo
Contributor

Thank you for the suggestion!
I think this fix will properly include the broadcast in the Chakra trace. However, I'm not sure if this correctly represents the dependency behavior.
For example, in new_jsonized_chakra_trace.json, the broadcasts with ids 116/117 and 175/176 are dependent on id 5, but id 5 is not dependent on anything. I would assume that in reality each of the broadcasts would be dependent on some compute.

What do you think?

@theodorbadea
Contributor Author

Hey, @jinsun-yoo. Indeed, since the parent is missing, it will not depend on any compute node. This is basically the reason why I initiated this discussion: to get feedback from the community about the correctness of this proposal...

@JoongunPark
Contributor
JoongunPark commented May 15, 2025


Hi @theodorbadea, thank you for the PR, and sorry for joining this discussion late.
Originally, when Chakra loads host ops, it traverses the operations. As @theodorbadea explained, this eliminates lots of host ops that are not connected to each other. I believe this is an important problem to solve.
This PR basically preserves all the CPU ops in the host traces and the dangling Chakra nodes in the Chakra trace.

I have a few suggestions.
(1) We need extra logic that searches for a dependency for the broadcast and connects compute -> broadcast (in this case) in a better way (a rough sketch follows below).
(2) We might still need remove_dangling_nodes rather than skipping it. Otherwise, all the dangling nodes will be scheduled at the beginning of the simulation/replay.
(3) We need to understand why some host nodes do not have dependencies on other nodes. That would help us solve this problem.
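
A rough sketch of what suggestion (1) could look like, assuming a simplified node schema ('id', 'tid', 'is_collective', 'ctrl_deps') rather than the real chakra one, and using thread-local ordering as a stand-in for real data dependencies (a hypothetical helper, not part of this PR):

```python
def attach_orphan_collectives(nodes):
    """Hypothetical post-processing: re-parent each orphan collective onto the most
    recent compute op seen on the same thread, so it is not scheduled at t=0.
    `nodes` is assumed to be sorted by timestamp; the field names are assumptions."""
    known_ids = {n["id"] for n in nodes}
    last_compute = {}  # tid -> most recent non-collective node
    for n in nodes:
        if n.get("is_collective"):
            if n["ctrl_deps"] not in known_ids and n["tid"] in last_compute:
                n["ctrl_deps"] = last_compute[n["tid"]]["id"]  # force compute -> collective
        else:
            last_compute[n["tid"]] = n
    return nodes
```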

@JoongunPark
Contributor
JoongunPark commented May 19, 2025


Hi @theodorbadea, thank you for the talk today.
We weren't able to discuss this PR much, but I believe it is very important and must be fixed.
I hope we can discuss it in detail in our next meeting. Thank you!

@theodorbadea
Contributor Author
theodorbadea commented May 21, 2025

@jinsun-yoo I managed to reproduce the situation you were targeting with your comment when training AlexNet with FSDP.
This is what the original .et looks like:
[image]

This is what the .et containing the orphan collectives looks like:
[image]

Considering the two illustrations and analyzing AlexNet in FSDP with size_based_auto_wrap_policy and a minimum of 1,500,000 parameters for wrapping, it seems that the entire forward-pass part and the final AllReduce (produced by explicitly calling all_reduce(loss, op=dist.ReduceOp.SUM)) are missing.
I believe that 3 modules of the AlexNet (the 3 layers from the classifier) and perhaps their nn.Sequential container get wrapped, which explains the 4 AllGather and 4 ReduceScatter visible in the original .et.
However, the AllGathers corresponding to the forward pass are entirely missing.
Additionally, your remark was correct: forcing Chakra to load these orphan nodes is not enough because dependencies are also required. In the second illustration, the AllReduce has no dependency, so it may get executed first, when in fact it should be executed last.
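
For context, a minimal sketch of the kind of setup described above (model, wrapping policy, and the explicit loss all-reduce); the exact training script is not part of this PR and the details here are assumptions:

```python
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy
from torchvision.models import alexnet

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

# Wrap AlexNet with FSDP; only modules above 1,500,000 parameters get their own
# FSDP unit, which is why only the classifier layers show up as AllGather/ReduceScatter.
policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_500_000)
model = FSDP(alexnet().cuda(), auto_wrap_policy=policy)

# ... forward/backward (omitted) ...
# loss = criterion(model(batch), target)
# loss.backward()

# The explicit all-reduce of the loss mentioned above; in the orphan-preserving .et
# this is the final AllReduce that currently ends up without any dependency.
# dist.all_reduce(loss, op=dist.ReduceOp.SUM)
```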

cc: @JoongunPark
k_trace_rank=0_.json
p_trace_rank=0_.json

Later Edit: @TaekyungHeo (I think you were interested in seeing an example), please find attached these AlexNet traces. In the p_trace file, the node with "id": 4 is missing; however, 4 appears as the parent of several ops, so these ops will be dropped.

@JoongunPark
Contributor
JoongunPark commented May 23, 2025


Hi @theodorbadea. I realized that this PR, together with your next PR #191, might work better than the current version.
As we weren't able to finalize our discussion in our previous meeting, it would be great to move forward after our next Chakra meeting.

Also, I have a quick question.
Were you able to observe missing operations other than collectives? If so, do you have any idea how to enforce dependencies for those ops?
Thank you!

@theodorbadea
Contributor Author
theodorbadea commented May 27, 2025

Hey, @JoongunPark. After some more digging, I found out that what's captured in the resulting chakra.et corresponds to thread tid=2 and what's missing corresponds to thread tid=1. About your question: yes, there are also missing compute nodes.
So, basically, in the attached example, what we can see in the .et right now corresponds to:
    {
        "id": 345, "name": "[pytorch|profiler|execution_trace|thread]", "ctrl_deps": 1,
        ...
        "tid": 2
    }
and what is missing corresponds to:
    {
        "id": 2, "name": "[pytorch|profiler|execution_trace|thread]", "ctrl_deps": 1,
        ...
        "tid": 1
    }
I noticed that the node with "id": 4 is missing from the p_trace, therefore all the nodes which have "ctrl_deps": 4 will be dropped.
Then, I did a little experiment and manually replaced "ctrl_deps": 4 with "ctrl_deps": 2, basically "forcing" the orphan graph components to be part of the graph by setting {"id": 2, "name": "[pytorch|profiler|execution_trace|thread]"} as their parent (which I suspect it truly is, but there are some missing nodes in between).

Now I intend to check the PyTorch profiler and see if there is something missing there, perhaps some setting, which might explain the missing host operations. I find it very strange that it is able to record {"id": 2, "name": "[pytorch|profiler|execution_trace|thread]", "ctrl_deps": 1} and some other distant children of this node, but fails to record what's in between.
Attaching here a new mermaid chart after hardcoding the proper parent:
[image]
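
For reference, the manual experiment described above (re-parenting the orphan components under the tid=1 thread node) can be reproduced with a small script along these lines, assuming the usual top-level "nodes" array in the p_trace JSON (the patched file name is just an example):

```python
import json

with open("p_trace_rank=0_.json") as f:
    trace = json.load(f)

# The node with "id": 4 is missing from the trace, so re-parent everything that
# points at it onto the thread node {"id": 2, "name": "[pytorch|profiler|execution_trace|thread]"}.
for node in trace["nodes"]:
    if node.get("ctrl_deps") == 4:
        node["ctrl_deps"] = 2

with open("p_trace_rank=0_patched.json", "w") as f:
    json.dump(trace, f)
```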

@JoongunPark
Contributor


Thank you for the confirmation, @theodorbadea!
I think your approach makes sense.
We may need to choose between (1) finding hidden dependencies we might be missing or (2) forcing ops to have dependencies, as you did. Let's discuss this during the meeting.

Labels: None yet
Projects: None yet
5 participants