preserve collectives #190
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Hey @theodorbadea, Just wanted to confirm whether this PR is done or not, since it says "Draft PR" in the first line. If this is still a draft, could you save this as a draft instead?
Hey @spandoescode, I put "draft" in the first comment because this PR contains just some commented-out code, and I would like to use it only to support my proposal and get feedback from the community for now. If we decide that this change is correct and should be integrated, I will make the proper code changes (a more elegant way of loading the host nodes, and entirely removing the method "remove_dangling_nodes"). Looking forward to receiving your feedback on this proposal to load all the host nodes, not just the ones from the main connected component.
@tushar-krishna, @srinivas212 can you please advise who can review this PR? |
Thank you for the suggestion! What do you think? |
Hey, @jinsun-yoo. Indeed, since the parent is missing, it will not depend on any compute node. This is basically the reason I initiated this discussion: to get feedback from the community about the correctness of this proposal...
Hi @theodorbadea, thank you for the PR, and sorry for joining this discussion late. I have a few suggestions.
Hi @theodorbadea, thank you for the talk today. |
@jinsun-yoo I managed to reproduce the situation you were targeting with your comment when training AlexNet with FSDP. This is what the .et containing the orphan collectives looks like:

Considering the two illustrations and analyzing AlexNet in FSDP with

cc: @JoongunPark

Later edit: @TaekyungHeo (I think you were interested in seeing an example), please find attached these AlexNet traces. In the p_trace file, the node with "id": 4 is missing; however, 4 appears as the parent of several ops, and consequently these ops will be dropped.
Hi @theodorbadea. I realized that this PR might work better than the current version along with your next PR #191. Also, I have a quick question.
Hey, @JoongunPark. After some more digging, I found out that what's captured in the resulting chakra.et corresponds to thread tid=2, and what's missing corresponds to thread tid=1. About your question: yes, there are also missing compute nodes. I now intend to check the PyTorch profiler and see if there is something missing there, perhaps some setting, which might explain the missing host operations. I find it very strange that it is able to record {"id": 2, "name": "[pytorch|profiler|execution_trace|thread]", "ctrl_deps": 1} and some other distant children of this node, but fails to record what's in between.
Thank you for the confirmation, @theodorbadea!
Mostly addressing #161
Issue: some collectives are missing from the .et file but are present in the kineto file.
It all starts with the host trace: the missing collectives correspond to host operations with missing parents. More specifically, the host ops corresponding to these collectives have a parent that is not part of the p_trace file.
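To make the failure mode concrete, here is a small, hypothetical sketch (not code from this PR) that scans a list of host-trace nodes and reports ops whose ctrl_deps parent id does not exist in the trace. The field names follow the PyTorch execution trace JSON quoted below; the sample data is made up to mirror the broadcast example.

```python
# Hypothetical helper: find host ops whose ctrl_deps parent is absent
# from the host trace. Field names follow the PyTorch execution trace
# JSON; the sample nodes below are invented for illustration.
def find_orphan_ops(nodes):
    ids = {n["id"] for n in nodes}
    return [n for n in nodes if "ctrl_deps" in n and n["ctrl_deps"] not in ids]

sample = [
    {"id": 439, "name": "record_param_comms", "ctrl_deps": 438},
    {"id": 438, "name": "c10d::broadcast_", "ctrl_deps": 9},
    {"id": 9, "name": "DistributedDataParallel.forward", "ctrl_deps": 4},
    # no node with id 4 exists -> op 9 is an orphan
]

orphans = find_orphan_ops(sample)
print([n["id"] for n in orphans])  # -> [9]
```

Running such a check on a p_trace file before loading it into Chakra would surface exactly the ops that will later be dropped.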
Specific example:
host_op1: { id: 439, name: "record_param_comms", ctrl_deps: 438, rf_id: 115 } -> this corresponds to a ncclDevKernel_Broadcast
host_op2: { id: 438, name: "c10d::broadcast_", ctrl_deps: 9 }
host_op3: { id: 9, name: "DistributedDataParallel.forward", ctrl_deps: 4 }
So, host_op1 has host_op2 as parent, host_op2 has host_op3 as parent, and host_op3 has a node with id 4 as parent.
However, host op with id: 4 is not present in the p_trace_file and so facebookresearch/param/et_replay will not properly build the children/parent links (as seen in execution_trace.py):
if n.parent_id in self.nodes:  # fails for these ops: no node with id == parent_id exists in self.nodes
    self.nodes[n.parent_id].add_child(n)
    n.set_parent(self.nodes[n.parent_id])
When Chakra loads the host trace, it does a traversal starting from the root node. Since some nodes have parents that are not present in the file, the trace forms a graph with multiple connected components, and Chakra only loads the "main" connected component, the one containing the root node. For example, for a ResNet50 run on 4 ranks that I used for debugging, it loads 2410 of 4484 host op nodes.
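The effect of the root-based traversal can be sketched with a toy example (this is an illustration, not Chakra's actual loader): a subtree hanging off a missing parent id forms a second component that a traversal from the root never visits. The node ids below reuse the broadcast example for illustration only.

```python
from collections import deque

# Illustrative sketch: BFS over a parent->children map starting from the
# root. Nodes reachable only through a missing parent are never visited.
def reachable_from(root_id, children):
    seen, queue = set(), deque([root_id])
    while queue:
        nid = queue.popleft()
        if nid in seen:
            continue
        seen.add(nid)
        queue.extend(children.get(nid, []))
    return seen

# Made-up trace: root 1 -> 2 -> 3 is the "main" component; ops
# 9 -> 438 -> 439 hang off the missing node 4, forming a second
# component that the traversal from the root never reaches.
children = {1: [2], 2: [3], 4: [9], 9: [438], 438: [439]}
main = reachable_from(1, children)
print(sorted(main))  # -> [1, 2, 3]
```

This is the same drop pattern as the 2410/4484 host op count above: everything reachable from the root survives, and every component anchored on a missing parent id is discarded.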
Attaching several relevant files generated with and without the changes from this branch.
p_trace_rank0.json
kineto_trace_rank0.json
new_jsonized_chakra_trace.json
new_merged.json
old_jsonized_chakra_trace.json
old_merged.json
cc: @winstonliu1111 @danmih-ixia @TaekyungHeo @tushar-krishna