[c10d] Fix docstring of scatter_object_list by kumpera · Pull Request #84596 · pytorch/pytorch · GitHub

[c10d] Fix docstring of scatter_object_list #84596


Closed
wants to merge 1 commit

Conversation

kumpera
Contributor
@kumpera kumpera commented Sep 6, 2022

The docstring for scatter_object_list says it doesn't work with NCCL, but this was fixed in #79034.
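For reference, a minimal sketch (not part of this PR) of what the updated docstring now allows: calling scatter_object_list under an NCCL-backed process group. The per-rank launch harness, environment setup, and object contents below are illustrative assumptions.

```python
import torch
import torch.distributed as dist

def scatter_demo(rank: int, world_size: int):
    # Assumes one process per rank and MASTER_ADDR/MASTER_PORT set by the launcher.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)  # NCCL needs a distinct CUDA device per rank

    if rank == 0:
        # One picklable object per rank on the source rank.
        objects = [{"payload": i} for i in range(world_size)]
    else:
        objects = [None] * world_size

    output = [None]  # receives this rank's object
    dist.scatter_object_list(output, objects, src=0)
    print(f"rank {rank} got {output[0]}")

    dist.destroy_process_group()
```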

@facebook-github-bot
Contributor
facebook-github-bot commented Sep 6, 2022


❌ 2 New Failures

As of commit c1ce320 (more details on the Dr. CI page):

  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-bionic-cuda11.6-py3.10-gcc7 / test (distributed, 2, 3, linux.8xlarge.nvidia.gpu) (1/2)

Step: "Get workflow job id" (full log | diagnosis details)

2022-09-06T21:33:46.6418052Z RuntimeError: Expe...e, but found at least two devices, cuda:0 and cpu!
2022-09-06T21:33:46.6363403Z frame #37: clone + 0x3f (0x7f482fde061f in /lib/x86_64-linux-gnu/libc.so.6)
2022-09-06T21:33:46.6363916Z 
2022-09-06T21:33:46.6363950Z 
2022-09-06T21:33:46.6364316Z On WorkerInfo(id=2, name=worker2):
2022-09-06T21:33:46.6394334Z RuntimeError('Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!\nException raised from compute_types at /var/lib/jenkins/workspace/aten/src/ATen/TensorIterator.cpp:484 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7faaf9cd7cab in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)\nframe #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xce (0x7faaf9cd367e in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)\nframe #2: at::TensorIteratorBase::compute_types(at::TensorIteratorConfig const&) + 0xbbb (0x7fab042e1b3b in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #3: at::TensorIteratorBase::build(at::TensorIteratorConfig&) + 0x7f (0x7fab042e2f5f in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #4: at::TensorIteratorBase::build_borrowing_binary_op(at::TensorBase const&, at::TensorBase const&, at::TensorBase const&) + 0xf2 (0x7fab042e4602 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #5: at::meta::structured_add_Tensor::meta(at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x2e (0x7fab044c18de in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #6: <unknown function> + 0x2a4881e (0x7faafc96181e in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cu.so)\nframe #7: <unknown function> + 0x2a48926 (0x7faafc961926 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cu.so)\nframe #8: at::_ops::add_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x98 (0x7fab04ee5f38 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #9: <unknown function> + 0x32489ba (0x7fab066709ba in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #10: <unknown function> + 0x3249129 (0x7fab06671129 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #11: at::_ops::add_Tensor::call(at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x172 (0x7fab04f19ef2 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #12: <unknown function> + 0x32b497 (0x7fab11131497 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)\nframe #13: <unknown function> + 0x32b7b6 (0x7fab111317b6 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)\nframe #14: <unknown function> + 0x1ddc68 (0x562dc7636c68 in /opt/conda/bin/python)\nframe #15: <unknown function> + 0x199499 (0x562dc75f2499 in /opt/conda/bin/python)\nframe #16: <unknown function> + 0x1995fa (0x562dc75f25fa in /opt/conda/bin/python)\nframe #17: PyNumber_Add + 0x41 (0x562dc759e4b1 in /opt/conda/bin/python)\nframe #18: _PyEval_EvalFrameDefault + 0x1008 (0x562dc763b098 in /opt/conda/bin/python)\nframe #19: <unknown function> + 0x18f742 (0x562dc75e8742 in /opt/conda/bin/python)\nframe #20: _PyObject_Call + 0x20a (0x562dc75a0faa in /opt/conda/bin/python)\nframe #21: _PyEval_EvalFrameDefault + 0x26e4 (0x562dc763c774 in /opt/conda/bin/python)\nframe #22: <unknown function> + 0x18f742 (0x562dc75e8742 in /opt/conda/bin/python)\nframe #23: _PyObject_Call + 0x20a (0x562dc75a0faa in /opt/conda/bin/python)\nframe #24: <unknown function> + 
0xa2572a (0x7fab1182b72a in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)\nframe #25: torch::distributed::rpc::PythonRpcHandler::runPythonUdf(pybind11::object const&) + 0x7d (0x7fab1182996d in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)\nframe #26: torch::distributed::rpc::RequestCallbackImpl::runPythonFunction(pybind11::object const&, std::vector<c10::Stream, std::allocator<c10::Stream> >, bool) const + 0x85 (0x7fab1182cb05 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)\nframe #27: torch::distributed::rpc::RequestCallbackImpl::processPythonCall(torch::distributed::rpc::RpcCommandBase&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x96 (0x7fab118306a6 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)\nframe #28: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x10c (0x7fab07a3b3cc in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #29: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x65 (0x7fab1182c7e5 in /opt/conda/lib/python3.10/site-packages/torch/lib/libto
2022-09-06T21:33:46.6412048Z Traceback (most recent call last):
2022-09-06T21:33:46.6413650Z   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rpc/internal.py", line 206, in _run_function
2022-09-06T21:33:46.6414735Z     result = python_udf.func(*python_udf.args, **python_udf.kwargs)
2022-09-06T21:33:46.6416190Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 5911, in _gpu_add_wrong_gpus
2022-09-06T21:33:46.6417132Z     return x.cpu() + y.cuda()
2022-09-06T21:33:46.6418052Z RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
2022-09-06T21:33:46.6419325Z Exception raised from compute_types at /var/lib/jenkins/workspace/aten/src/ATen/TensorIterator.cpp:484 (most recent call first):
2022-09-06T21:33:46.6421310Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7faaf9cd7cab in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
2022-09-06T21:33:46.6423633Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xce (0x7faaf9cd367e in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
2022-09-06T21:33:46.6425915Z frame #2: at::TensorIteratorBase::compute_types(at::TensorIteratorConfig const&) + 0xbbb (0x7fab042e1b3b in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
2022-09-06T21:33:46.6427830Z frame #3: at::TensorIteratorBase::build(at::TensorIteratorConfig&) + 0x7f (0x7fab042e2f5f in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
2022-09-06T21:33:46.6429958Z frame #4: at::TensorIteratorBase::build_borrowing_binary_op(at::TensorBase const&, at::TensorBase const&, at::TensorBase const&) + 0xf2 (0x7fab042e4602 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
2022-09-06T21:33:46.6432078Z frame #5: at::meta::structured_add_Tensor::meta(at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x2e (0x7fab044c18de in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
2022-09-06T21:33:46.6433765Z frame #6: <unknown function> + 0x2a4881e (0x7faafc96181e in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cu.so)
2022-09-06T21:33:46.6435313Z frame #7: <unknown function> + 0x2a48926 (0x7faafc961926 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cu.so)
2022-09-06T21:33:46.6437284Z frame #8: at::_ops::add_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x98 (0x7fab04ee5f38 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
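The failing distributed test (`_gpu_add_wrong_gpus` in rpc_test.py) deliberately adds tensors living on different devices. A standalone sketch of the same error, independent of RPC (tensor shapes are illustrative, and a CUDA device is assumed to be available):

```python
import torch

# Adding a CPU tensor to a CUDA tensor trips TensorIterator's device check and
# raises the "Expected all tensors to be on the same device" RuntimeError seen
# in the log excerpt above.
x = torch.ones(3, device="cpu")
y = torch.ones(3, device="cuda")
try:
    x + y
except RuntimeError as err:
    print(err)  # Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```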

See GitHub Actions build pull / linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge) (2/2)

Step: "Get workflow job id" (full log | diagnosis details)

2022-09-06T22:09:59.2820075Z test_add_done_ca...arg() takes 0 positional arguments but 1 was given
2022-09-06T22:09:59.2789498Z   /opt/conda/lib/python3.7/unittest/suite.py(122): run
2022-09-06T22:09:59.2789756Z   /opt/conda/lib/python3.7/unittest/suite.py(84): __call__
2022-09-06T22:09:59.2790083Z   /opt/conda/lib/python3.7/site-packages/xmlrunner/runner.py(67): run
2022-09-06T22:09:59.2790366Z   /opt/conda/lib/python3.7/unittest/main.py(271): runTests
2022-09-06T22:09:59.2790809Z   /opt/conda/lib/python3.7/unittest/main.py(101): __init__
2022-09-06T22:09:59.2791177Z   /opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py(786): run_tests
2022-09-06T22:09:59.2791453Z   test_futures.py(331): <module>
2022-09-06T22:09:59.2791575Z 
2022-09-06T22:09:59.2791650Z ok (0.244s)
2022-09-06T22:09:59.2813246Z   test_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.003s)
2022-09-06T22:09:59.2820075Z   test_add_done_callback_no_arg_error_is_ignored (__main__.TestFuture) ... [E pybind_utils.h:212] Got the following error when running the callback: TypeError: no_arg() takes 0 positional arguments but 1 was given
2022-09-06T22:09:59.2821629Z ok (0.001s)
2022-09-06T22:09:59.2835979Z   test_add_done_callback_simple (__main__.TestFuture) ... ok (0.001s)
2022-09-06T22:09:59.2917310Z   test_chained_then (__main__.TestFuture) ... ok (0.008s)
2022-09-06T22:09:59.3936639Z   test_collect_all (__main__.TestFuture) ... ok (0.102s)
2022-09-06T22:09:59.3946602Z   test_done (__main__.TestFuture) ... ok (0.001s)
2022-09-06T22:09:59.3960055Z   test_done_exception (__main__.TestFuture) ... ok (0.001s)
2022-09-06T22:09:59.3981939Z   test_interleaving_then_and_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.002s)
2022-09-06T22:09:59.3992621Z   test_interleaving_then_and_add_done_callback_propagates_error (__main__.TestFuture) ... [E pybind_utils.h:212] Got the following error when running the callback: ValueError: Expected error
2022-09-06T22:09:59.3993437Z 
2022-09-06T22:09:59.3993540Z At:
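The callback errors in this excerpt come from torch.futures.Future.add_done_callback, which invokes each callback with the completed future as its single argument and logs (rather than re-raises) exceptions thrown by callbacks. A minimal local sketch, with hypothetical callback names:

```python
import torch

fut = torch.futures.Future()

def no_arg():
    # Wrong signature on purpose: add_done_callback always passes the future,
    # so calling this raises TypeError, which the Future machinery logs and ignores.
    pass

def with_arg(f):
    print("result:", f.value())

fut.add_done_callback(with_arg)
fut.add_done_callback(no_arg)
fut.set_result(42)  # completes the future and runs both callbacks
```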

This comment was automatically generated by Dr. CI.

@facebook-github-bot facebook-github-bot added the oncall: distributed label Sep 6, 2022
Member
@H-Huang H-Huang left a comment


LGTM!

@kumpera
Contributor Author
kumpera commented Sep 7, 2022

@pytorchmergebot merge -f "This is a quite safe change, all it does is deletes a few lines from a docstring."

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered with the force (-f) flag. This means your change will be merged immediately, bypassing any CI checks (ETA: 1-5 minutes). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@github-actions
Contributor
github-actions bot commented Sep 7, 2022

Hey @kumpera.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Sep 8, 2022
Summary:
The docstring for scatter_object_list says it doesn't work with NCCL, but this was fixed in #79034.

Pull Request resolved: #84596
Approved by: https://github.com/H-Huang

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/e96fb5d58c2accd717f0859b510ae7facb6d6aac

Reviewed By: izaitsevfb

Differential Revision: D39312639

Pulled By: kumpera

fbshipit-source-id: dc1b57b7ad464cf00b44ac6dbfca5349e9fd41b1
Labels
cla signed · Merged · oncall: distributed · release notes: distributed (c10d)
4 participants