-
Notifications
You must be signed in to change notification settings - Fork 24.4k
[distributed] Handle object collectives and NCCL. #79034
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful links
✅ No Failures (0 Pending)As of commit 655d136 (more details on the Dr. CI page): Expand to see more💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Please report bugs/suggestions to the (internal) Dr. CI Users group. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome refactoring! Looks good overall - only suggestion is around merging all of the tests.
f = io.BytesIO() | ||
_pickler(f).dump(obj) | ||
byte_storage = torch.ByteStorage.from_buffer(f.getvalue()) # type: ignore[attr-defined] | ||
# Do not replace `torch.ByteTensor` or `torch.LongTensor` with torch.tensor and specifying dtype. | ||
# Otherwise, it will casue 100X slowdown. | ||
# See: https://github.com/pytorch/pytorch/issues/65696 | ||
byte_tensor = torch.ByteTensor(byte_storage) | ||
local_size = torch.LongTensor([byte_tensor.numel()]) | ||
byte_tensor = torch.ByteTensor(byte_storage).to(device) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reminds me, we should probably figure out the root cause issue of #65696, because we really don't want to be using ByteTensor
and LongTensor
style APIs anymore.
658c7f9
to
655d136
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@pytorchmergebot merge please |
❌ 🤖 pytorchbot command failed:
Try |
@pytorchmergebot merge |
@pytorchbot successfully started a merge job. Check the current status here |
Hey @kumpera. |
Summary: This fixes all object collectives under NCCL and adds some automated tests for them. This PR *does not* fix sending tensors using object collectives. It simplifies device handling by computing the appropriate one earlier and then ensuring all tensor ops happen on it. Pull Request resolved: #79034 Approved by: https://github.com/rohan-varma Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/4ebb326b75024af600a966308965803cdc3437d9 Reviewed By: malfet Differential Revision: D37153126 Pulled By: kumpera fbshipit-source-id: caf01d830545b40d2827c336c350da3cd8ab552c
@pytorchbot revert -m="Diff reverted internally" -c="ghfirst" This Pull Request has been reverted by a revert inside Meta. To re-land this change, please open another pull request, assign the same reviewers, fix the CI failures that caused the revert and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).) |
@pytorchbot successfully started a revert job. Check the current status here |
This reverts commit 4ebb326. Reverted #79034 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
…79034) Test Plan: revert-hammer Differential Revision: D37153126 Original commit changeset: caf01d830545 Original Phabricator Diff: D37153126 fbshipit-source-id: fe023b2eaa028f0677997c63a0472d70df381253
…ctives and NCCL. (#79034) Test Plan: revert-hammer Differential Revision: D37170190 Original commit changeset: fe023b2eaa02 Original Phabricator Diff: D37153126 fbshipit-source-id: 3f28eefe13ed0867f76c39d5866430393af2c153
The docstring for scatter_object_list mentions is doesn't work with NCCL, but this was fixed in #79034 Pull Request resolved: #84596 Approved by: https://github.com/H-Huang
Summary: The docstring for scatter_object_list mentions is doesn't work with NCCL, but this was fixed in #79034 Pull Request resolved: #84596 Approved by: https://github.com/H-Huang Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/e96fb5d58c2accd717f0859b510ae7facb6d6aac Reviewed By: izaitsevfb Differential Revision: D39312639 Pulled By: kumpera fbshipit-source-id: dc1b57b7ad464cf00b44ac6dbfca5349e9fd41b1
This fixes all object collectives under NCCL and adds some automated tests for them.
This PR does not fix sending tensors using object collectives.
It simplifies device handling by computing the appropriate one earlier and then ensuring all tensor ops happen on it.