[c10d] Fix docstring of scatter_object_list by kumpera · Pull Request #84596 · pytorch/pytorch · GitHub

[c10d] Fix docstring of scatter_object_list #84596


Closed
wants to merge 1 commit

Conversation

kumpera
Contributor
@kumpera kumpera commented Sep 6, 2022

The docstring for scatter_object_list says it doesn't work with NCCL, but this was fixed in #79034.
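For reference, a minimal sketch (not part of this PR) of what the updated docstring now allows: calling scatter_object_list under an NCCL-backed process group. The per-rank launch harness, environment setup, and object contents below are illustrative assumptions.

```python
import torch
import torch.distributed as dist

def scatter_demo(rank: int, world_size: int):
    # Assumes one process per rank and MASTER_ADDR/MASTER_PORT set by the launcher.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)  # NCCL needs a distinct CUDA device per rank

    if rank == 0:
        # One picklable object per rank on the source rank.
        objects = [{"payload": i} for i in range(world_size)]
    else:
        objects = [None] * world_size

    output = [None]  # receives this rank's object
    dist.scatter_object_list(output, objects, src=0)
    print(f"rank {rank} got {output[0]}")

    dist.destroy_process_group()
```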

@facebook-github-bot
Contributor
facebook-github-bot commented Sep 6, 2022


❌ 2 New Failures

As of commit c1ce320 (more details on the Dr. CI page):

  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-bionic-cuda11.6-py3.10-gcc7 / test (distributed, 2, 3, linux.8xlarge.nvidia.gpu) (1/2)

Step: "Get workflow job id" (full log | diagnosis details)

2022-09-06T21:33:46.6418052Z RuntimeError: Expe...e, but found at least two devices, cuda:0 and cpu!
2022-09-06T21:33:46.6363403Z frame #37: clone + 0x3f (0x7f482fde061f in /lib/x86_64-linux-gnu/libc.so.6)
2022-09-06T21:33:46.6363916Z 
2022-09-06T21:33:46.6363950Z 
2022-09-06T21:33:46.6364316Z On WorkerInfo(id=2, name=worker2):
2022-09-06T21:33:46.6394334Z RuntimeError('Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!\nException raised from compute_types at /var/lib/jenkins/workspace/aten/src/ATen/TensorIterator.cpp:484 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7faaf9cd7cab in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)\nframe #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xce (0x7faaf9cd367e in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)\nframe #2: at::TensorIteratorBase::compute_types(at::TensorIteratorConfig const&) + 0xbbb (0x7fab042e1b3b in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #3: at::TensorIteratorBase::build(at::TensorIteratorConfig&) + 0x7f (0x7fab042e2f5f in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #4: at::TensorIteratorBase::build_borrowing_binary_op(at::TensorBase const&, at::TensorBase const&, at::TensorBase const&) + 0xf2 (0x7fab042e4602 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #5: at::meta::structured_add_Tensor::meta(at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x2e (0x7fab044c18de in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #6: <unknown function> + 0x2a4881e (0x7faafc96181e in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cu.so)\nframe #7: <unknown function> + 0x2a48926 (0x7faafc961926 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cu.so)\nframe #8: at::_ops::add_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x98 (0x7fab04ee5f38 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #9: <unknown function> + 0x32489ba (0x7fab066709ba in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #10: <unknown function> + 0x3249129 (0x7fab06671129 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #11: at::_ops::add_Tensor::call(at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x172 (0x7fab04f19ef2 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #12: <unknown function> + 0x32b497 (0x7fab11131497 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)\nframe #13: <unknown function> + 0x32b7b6 (0x7fab111317b6 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)\nframe #14: <unknown function> + 0x1ddc68 (0x562dc7636c68 in /opt/conda/bin/python)\nframe #15: <unknown function> + 0x199499 (0x562dc75f2499 in /opt/conda/bin/python)\nframe #16: <unknown function> + 0x1995fa (0x562dc75f25fa in /opt/conda/bin/python)\nframe #17: PyNumber_Add + 0x41 (0x562dc759e4b1 in /opt/conda/bin/python)\nframe #18: _PyEval_EvalFrameDefault + 0x1008 (0x562dc763b098 in /opt/conda/bin/python)\nframe #19: <unknown function> + 0x18f742 (0x562dc75e8742 in /opt/conda/bin/python)\nframe #20: _PyObject_Call + 0x20a (0x562dc75a0faa in /opt/conda/bin/python)\nframe #21: _PyEval_EvalFrameDefault + 0x26e4 (0x562dc763c774 in /opt/conda/bin/python)\nframe #22: <unknown function> + 0x18f742 (0x562dc75e8742 in /opt/conda/bin/python)\nframe #23: _PyObject_Call + 0x20a (0x562dc75a0faa in /opt/conda/bin/python)\nframe #24: <unknown function> + 
0xa2572a (0x7fab1182b72a in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)\nframe #25: torch::distributed::rpc::PythonRpcHandler::runPythonUdf(pybind11::object const&) + 0x7d (0x7fab1182996d in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)\nframe #26: torch::distributed::rpc::RequestCallbackImpl::runPythonFunction(pybind11::object const&, std::vector<c10::Stream, std::allocator<c10::Stream> >, bool) const + 0x85 (0x7fab1182cb05 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)\nframe #27: torch::distributed::rpc::RequestCallbackImpl::processPythonCall(torch::distributed::rpc::RpcCommandBase&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x96 (0x7fab118306a6 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)\nframe #28: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x10c (0x7fab07a3b3cc in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #29: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x65 (0x7fab1182c7e5 in /opt/conda/lib/python3.10/site-packages/torch/lib/libto
2022-09-06T21:33:46.6412048Z Traceback (most recent call last):
2022-09-06T21:33:46.6413650Z   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rpc/internal.py", line 206, in _run_function
2022-09-06T21:33:46.6414735Z     result = python_udf.func(*python_udf.args, **python_udf.kwargs)
2022-09-06T21:33:46.6416190Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 5911, in _gpu_add_wrong_gpus
2022-09-06T21:33:46.6417132Z     return x.cpu() + y.cuda()
2022-09-06T21:33:46.6418052Z RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
2022-09-06T21:33:46.6419325Z Exception raised from compute_types at /var/lib/jenkins/workspace/aten/src/ATen/TensorIterator.cpp:484 (most recent call first):
2022-09-06T21:33:46.6421310Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7faaf9cd7cab in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
2022-09-06T21:33:46.6423633Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xce (0x7faaf9cd367e in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
2022-09-06T21:33:46.6425915Z frame #2: at::TensorIteratorBase::compute_types(at::TensorIteratorConfig const&) + 0xbbb (0x7fab042e1b3b in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
2022-09-06T21:33:46.6427830Z frame #3: at::TensorIteratorBase::build(at::TensorIteratorConfig&) + 0x7f (0x7fab042e2f5f in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
2022-09-06T21:33:46.6429958Z frame #4: at::TensorIteratorBase::build_borrowing_binary_op(at::TensorBase const&, at::TensorBase const&, at::TensorBase const&) + 0xf2 (0x7fab042e4602 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
2022-09-06T21:33:46.6432078Z frame #5: at::meta::structured_add_Tensor::meta(at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x2e (0x7fab044c18de in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
2022-09-06T21:33:46.6433765Z frame #6: <unknown function> + 0x2a4881e (0x7faafc96181e in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cu.so)
2022-09-06T21:33:46.6435313Z frame #7: <unknown function> + 0x2a48926 (0x7faafc961926 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cu.so)
2022-09-06T21:33:46.6437284Z frame #8: at::_ops::add_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x98 (0x7fab04ee5f38 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
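The failing distributed test (`_gpu_add_wrong_gpus` in rpc_test.py) deliberately adds tensors living on different devices. A standalone sketch of the same error, independent of RPC (tensor shapes are illustrative, and a CUDA device is assumed to be available):

```python
import torch

# Adding a CPU tensor to a CUDA tensor trips TensorIterator's device check and
# raises the "Expected all tensors to be on the same device" RuntimeError seen
# in the log excerpt above.
x = torch.ones(3, device="cpu")
y = torch.ones(3, device="cuda")
try:
    x + y
except RuntimeError as err:
    print(err)  # Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```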

See GitHub Actions build pull / linux-bionic-py3.7-clang9 / test (crossref, 2, 2, linux.2xlarge) (2/2)

Step: "Get workflow job id" (full log | diagnosis details)

2022-09-06T22:09:59.2820075Z test_add_done_ca...arg() takes 0 positional arguments but 1 was given
2022-09-06T22:09:59.2789498Z   /opt/conda/lib/python3.7/unittest/suite.py(122): run
2022-09-06T22:09:59.2789756Z   /opt/conda/lib/python3.7/unittest/suite.py(84): __call__
2022-09-06T22:09:59.2790083Z   /opt/conda/lib/python3.7/site-packages/xmlrunner/runner.py(67): run
2022-09-06T22:09:59.2790366Z   /opt/conda/lib/python3.7/unittest/main.py(271): runTests
2022-09-06T22:09:59.2790809Z   /opt/conda/lib/python3.7/unittest/main.py(101): __init__
2022-09-06T22:09:59.2791177Z   /opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py(786): run_tests
2022-09-06T22:09:59.2791453Z   test_futures.py(331): <module>
2022-09-06T22:09:59.2791575Z 
2022-09-06T22:09:59.2791650Z ok (0.244s)
2022-09-06T22:09:59.2813246Z   test_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.003s)
2022-09-06T22:09:59.2820075Z   test_add_done_callback_no_arg_error_is_ignored (__main__.TestFuture) ... [E pybind_utils.h:212] Got the following error when running the callback: TypeError: no_arg() takes 0 positional arguments but 1 was given
2022-09-06T22:09:59.2821629Z ok (0.001s)
2022-09-06T22:09:59.2835979Z   test_add_done_callback_simple (__main__.TestFuture) ... ok (0.001s)
2022-09-06T22:09:59.2917310Z   test_chained_then (__main__.TestFuture) ... ok (0.008s)
2022-09-06T22:09:59.3936639Z   test_collect_all (__main__.TestFuture) ... ok (0.102s)
2022-09-06T22:09:59.3946602Z   test_done (__main__.TestFuture) ... ok (0.001s)
2022-09-06T22:09:59.3960055Z   test_done_exception (__main__.TestFuture) ... ok (0.001s)
2022-09-06T22:09:59.3981939Z   test_interleaving_then_and_add_done_callback_maintains_callback_order (__main__.TestFuture) ... ok (0.002s)
2022-09-06T22:09:59.3992621Z   test_interleaving_then_and_add_done_callback_propagates_error (__main__.TestFuture) ... [E pybind_utils.h:212] Got the following error when running the callback: ValueError: Expected error
2022-09-06T22:09:59.3993437Z 
2022-09-06T22:09:59.3993540Z At:
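The callback errors in this excerpt come from torch.futures.Future.add_done_callback, which invokes each callback with the completed future as its single argument and logs (rather than re-raises) exceptions thrown by callbacks. A minimal local sketch, with hypothetical callback names:

```python
import torch

fut = torch.futures.Future()

def no_arg():
    # Wrong signature on purpose: add_done_callback always passes the future,
    # so calling this raises TypeError, which the Future machinery logs and ignores.
    pass

def with_arg(f):
    print("result:", f.value())

fut.add_done_callback(with_arg)
fut.add_done_callback(no_arg)
fut.set_result(42)  # completes the future and runs both callbacks
```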

This comment was automatically generated by Dr. CI.

@facebook-github-bot facebook-github-bot added the oncall: distributed label Sep 6, 2022
Member
@H-Huang H-Huang left a comment


LGTM!

@kumpera
Contributor Author
kumpera commented Sep 7, 2022

@pytorchmergebot merge -f "This is a quite safe change, all it does is deletes a few lines from a docstring."

@pytorchmergebot
Collaborator

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered with the force (-f) flag. This means your change will be merged immediately, bypassing any CI checks (ETA: 1-5 minutes). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@github-actions
Contributor
github-actions bot commented Sep 7, 2022

Hey @kumpera.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Sep 8, 2022
Summary:
The docstring for scatter_object_list says it doesn't work with NCCL, but this was fixed in #79034.

Pull Request resolved: #84596
Approved by: https://github.com/H-Huang

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/e96fb5d58c2accd717f0859b510ae7facb6d6aac

Reviewed By: izaitsevfb

Differential Revision: D39312639

Pulled By: kumpera

fbshipit-source-id: dc1b57b7ad464cf00b44ac6dbfca5349e9fd41b1
Labels
cla signed · Merged · oncall: distributed · release notes: distributed (c10d)
4 participants