8000 [distributed] Handle object collectives and NCCL. by kumpera · Pull Request #79034 · pytorch/pytorch · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[distributed] Handle object collectives and NCCL. #79034

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

kumpera
Copy link
Contributor
@kumpera kumpera commented Jun 7, 2022

This fixes all object collectives under NCCL and adds some automated tests for them.

This PR does not fix sending tensors using object collectives.

It simplifies device handling by computing the appropriate one earlier and then ensuring all tensor ops happen on it.

@facebook-github-bot
Copy link
Contributor
facebook-github-bot commented Jun 7, 2022

🔗 Helpful links

✅ No Failures (0 Pending)

As of commit 655d136 (more details on the Dr. CI page):

Expand to see more

💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jun 7, 2022
@rohan-varma rohan-varma self-requested a review June 7, 2022 20:01
Copy link
Member
@rohan-varma rohan-varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome refactoring! Looks good overall - only suggestion is around merging all of the tests.

f = io.BytesIO()
_pickler(f).dump(obj)
byte_storage = torch.ByteStorage.from_buffer(f.getvalue()) # type: ignore[attr-defined]
# Do not replace `torch.ByteTensor` or `torch.LongTensor` with torch.tensor and specifying dtype.
# Otherwise, it will casue 100X slowdown.
# See: https://github.com/pytorch/pytorch/issues/65696
byte_tensor = torch.ByteTensor(byte_storage)
local_size = torch.LongTensor([byte_tensor.numel()])
byte_tensor = torch.ByteTensor(byte_storage).to(device)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reminds me, we should probably figure out the root cause issue of #65696, because we really don't want to be using ByteTensor and LongTensor style APIs anymore.

@kumpera kumpera force-pushed the obj_collectives_fix branch from 658c7f9 to 655d136 Compare June 9, 2022 14:25
@rohan-varma rohan-varma self-requested a review June 13, 2022 19:15
Copy link
Member
@rohan-varma rohan-varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kumpera
Copy link
Contributor Author
kumpera commented Jun 13, 2022

@pytorchmergebot merge please

@pytorch-bot
Copy link
pytorch-bot bot commented Jun 13, 2022

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: please

usage: @pytorchbot {merge,revert,rebase,help} ...

Try @pytorchbot help for more info.

@kumpera
Copy link
Contributor Author
kumpera commented Jun 13, 2022

@pytorchmergebot merge

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a merge job. Check the current status here

@github-actions
Copy link
Contributor

Hey @kumpera.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Jun 15, 2022
Summary:
This fixes all object collectives under NCCL and adds some automated tests for them.

This PR *does not* fix sending tensors using object collectives.

It simplifies device handling by computing the appropriate one earlier and then ensuring all tensor ops happen on it.

Pull Request resolved: #79034
Approved by: https://github.com/rohan-varma

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/4ebb326b75024af600a966308965803cdc3437d9

Reviewed By: malfet

Differential Revision: D37153126

Pulled By: kumpera

fbshipit-source-id: caf01d830545b40d2827c336c350da3cd8ab552c
@facebook-github-bot
Copy link
Contributor

@pytorchbot revert -m="Diff reverted internally" -c="ghfirst"

This Pull Request has been reverted by a revert inside Meta. To re-land this change, please open another pull request, assign the same reviewers, fix the CI failures that caused the revert and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).)

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here

pytorchmergebot added a commit that referenced this pull request Jun 15, 2022
This reverts commit 4ebb326.

Reverted #79034 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
malfet added a commit that referenced this pull request Jun 15, 2022
facebook-github-bot pushed a commit that referenced this pull request Jun 15, 2022
…79034)

Test Plan: revert-hammer

Differential Revision:
D37153126

Original commit changeset: caf01d830545

Original Phabricator Diff: D37153126

fbshipit-source-id: fe023b2eaa028f0677997c63a0472d70df381253
facebook-github-bot pushed a commit that referenced this pull request Jun 15, 2022
…ctives and NCCL. (#79034)

Test Plan: revert-hammer

Differential Revision:
D37170190

Original commit changeset: fe023b2eaa02

Original Phabricator Diff: D37153126

fbshipit-source-id: 3f28eefe13ed0867f76c39d5866430393af2c153
pytorchmergebot pushed a commit that referenced this pull request Sep 7, 2022
The docstring for scatter_object_list mentions is doesn't work with NCCL, but this was fixed in #79034

Pull Request resolved: #84596
Approved by: https://github.com/H-Huang
facebook-github-bot pushed a commit that referenced this pull request Sep 8, 2022
Summary:
The docstring for scatter_object_list mentions is doesn't work with NCCL, but this was fixed in #79034

Pull Request resolved: #84596
Approved by: https://github.com/H-Huang

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/e96fb5d58c2accd717f0859b510ae7facb6d6aac

Reviewed By: izaitsevfb

Differential Revision: D39312639

Pulled By: kumpera

fbshipit-source-id: dc1b57b7ad464cf00b44ac6dbfca5349e9fd41b1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue Reverted
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0