.github: Switch 8xlarge to 4xlarge instance_type by seemethere · Pull Request #67299 · pytorch/pytorch · GitHub

.github: Switch 8xlarge to 4xlarge instance_type #67299


Closed
seemethere wants to merge 1 commit

Conversation

seemethere
Member
@seemethere seemethere commented Oct 26, 2021

Stack from ghstack:

Switches the linux.8xlarge.nvidia.gpu runner to the 4xlarge instance type to
help with queueing / capacity issues. This change is only meant to be a
bridge until everyone updates their PRs to use the new
linux.4xlarge.nvidia.gpu node type.

NOTE: This node type will be removed, so do not depend on it for any new
workflows.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Differential Revision: D31945507

Switches the linux.8xlarge.nvidia.gpu to the 4xlarge instance type to
help with queueing / capacity issues. This change is only meant to be a
bridge until everyone updates their PRs to use the new
linux.4xlarge.nvidia.gpu node type

NOTE: This node type will be removed so do not depend on it for any new
workflows.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

[ghstack-poisoned]
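For readers unfamiliar with how these runner labels are consumed, here is a minimal Python sketch of the bridging idea described above: a lookup that maps the deprecated linux.8xlarge.nvidia.gpu label to linux.4xlarge.nvidia.gpu before a job is emitted. This is an illustration only, not PyTorch's actual CI generation code; the RUNNER_MIGRATIONS table and resolve_runner_label function are hypothetical names, and only the two label strings come from the PR description.

# Hypothetical sketch, not PyTorch's actual workflow-generation tooling:
# maps the deprecated 8xlarge GPU runner label to its 4xlarge replacement
# so existing workflow requests keep working during the bridge period.

RUNNER_MIGRATIONS = {
    "linux.8xlarge.nvidia.gpu": "linux.4xlarge.nvidia.gpu",
}


def resolve_runner_label(requested: str) -> str:
    """Return the runner label a job should actually run on.

    Deprecated labels are transparently rewritten to their replacements;
    anything else is passed through unchanged.
    """
    return RUNNER_MIGRATIONS.get(requested, requested)


if __name__ == "__main__":
    # A job that still asks for the old node type is pointed at the new one.
    print(resolve_runner_label("linux.8xlarge.nvidia.gpu"))  # linux.4xlarge.nvidia.gpu
    # Jobs already on the new label are unaffected.
    print(resolve_runner_label("linux.4xlarge.nvidia.gpu"))  # linux.4xlarge.nvidia.gpu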
@pytorch-probot
CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/1a0bda01a154c2fa53d66752df91ea0354ebb60b/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows | Labels (bold = enabled) | Status
Triggered Workflows
linux-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla ✅ triggered
linux-vulkan-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-dynamic ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3.6-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers ✅ triggered
linux-xenial-py3.6-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx ✅ triggered
linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
docker-builds ciflow/all 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
linux-xenial-py3-clang5-mobile-code-analysis ciflow/all, ciflow/linux, ciflow/mobile 🚫 skipped
parallelnative-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and triggering the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.
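If you want to trigger the rerun from a script instead of the PR conversation box, the same comment can be posted through GitHub's REST API. The sketch below is a hedged example, assuming the standard issues-comments endpoint, the third-party requests library, and a personal access token exported as GITHUB_TOKEN; it is not part of pytorchbot or the CI Flow tooling itself.

import os

import requests

# Standard GitHub REST endpoint for commenting on an issue or pull request.
COMMENTS_URL = "https://api.github.com/repos/pytorch/pytorch/issues/67299/comments"


def post_ciflow_rerun(extra_labels=()):
    """Post an '@pytorchbot ciflow rerun' comment, optionally with -l labels."""
    body = "@pytorchbot ciflow rerun"
    for label in extra_labels:
        body += f" -l {label}"
    resp = requests.post(
        COMMENTS_URL,
        headers={
            "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"body": body},
    )
    resp.raise_for_status()  # surface 4xx/5xx errors instead of failing silently
    return resp.json()


if __name__ == "__main__":
    # Equivalent to commenting: @pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow
    post_ciflow_rerun(["ciflow/scheduled", "ciflow/slow"])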

@facebook-github-bot
Contributor
facebook-github-bot commented Oct 26, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 1a0bda0 (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build Lint / clang-tidy (1/2)

Step: "Check for warnings" (full log | diagnosis details | 🔁 rerun)

2021-10-27T00:00:09.0200985Z /__w/pytorch/pytor...h/torch.h' file not found [clang-diagnostic-error]
2021-10-27T00:00:09.0194072Z torch/csrc/jit/frontend/tracer.h:197:3: error: use '= default' to define a trivial destructor [modernize-use-equals-default,-warnings-as-errors]
2021-10-27T00:00:09.0195166Z   ~WithNestedTracingFrame() {
2021-10-27T00:00:09.0195540Z   ^
2021-10-27T00:00:09.0196286Z /__w/pytorch/pytorch/torch/csrc/deploy/test_deploy_gpu.cpp:3:10: error: 'torch/cuda.h' file not found [clang-diagnostic-error]
2021-10-27T00:00:09.0196920Z #include <torch/cuda.h>
2021-10-27T00:00:09.0197221Z          ^
2021-10-27T00:00:09.0198410Z /__w/pytorch/pytorch/torch/csrc/deploy/test_deploy_gpu.cpp:39:27: error: variable 'inputs' is not initialized [cppcoreguidelines-init-variables,-warnings-as-errors]
2021-10-27T00:00:09.0199551Z   std::vector<at::IValue> inputs;
2021-10-27T00:00:09.0199865Z                           ^
2021-10-27T00:00:09.0200141Z                                  = 0
2021-10-27T00:00:09.0200985Z /__w/pytorch/pytorch/torch/csrc/deploy/test_deploy_missing_interpreter.cpp:3:10: error: 'torch/torch.h' file not found [clang-diagnostic-error]
2021-10-27T00:00:09.0201695Z #include <torch/torch.h>
2021-10-27T00:00:09.0202000Z          ^
2021-10-27T00:00:09.0202287Z Warnings detected!
2021-10-27T00:00:09.0202601Z Summary:
2021-10-27T00:00:09.0203317Z [cppcoreguidelines-pro-type-member-init] occurred 10 times
2021-10-27T00:00:09.0204172Z     /__w/pytorch/pytorch/torch/csrc/api/include/torch/detail/TensorDataContainer.h:94
2021-10-27T00:00:09.0204933Z     /__w/pytorch/pytorch/torch/csrc/api/include/torch/detail/TensorDataContainer.h:106
2021-10-27T00:00:09.0205679Z     /__w/pytorch/pytorch/torch/csrc/api/include/torch/detail/TensorDataContainer.h:107
2021-10-27T00:00:09.0206442Z     /__w/pytorch/pytorch/torch/csrc/api/include/torch/detail/TensorDataContainer.h:109
2021-10-27T00:00:09.0207202Z     /__w/pytorch/pytorch/torch/csrc/api/include/torch/detail/TensorDataContainer.h:147

See GitHub Actions build linux-xenial-cuda11.3-py3.6-gcc7 / test (distributed, 1, 1, linux.4xlarge.nvidia.gpu) (2/2)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-10-27T04:38:38.0567353Z what(): NCCL er...pNCCL.cpp:1078, invalid usage, NCCL version 21.0.3
2021-10-27T04:08:37.7690932Z [ RUN      ] ProcessGroupNCCLTest.testReduce
2021-10-27T04:08:37.8472365Z [       OK ] ProcessGroupNCCLTest.testReduce (78 ms)
2021-10-27T04:08:37.8473886Z [ RUN      ] ProcessGroupNCCLTest.testAllgather
2021-10-27T04:08:37.9179959Z [       OK ] ProcessGroupNCCLTest.testAllgather (70 ms)
2021-10-27T04:08:37.9181534Z [ RUN      ] ProcessGroupNCCLTest.testAllgatherBase
2021-10-27T04:08:37.9871455Z [       OK ] ProcessGroupNCCLTest.testAllgatherBase (69 ms)
2021-10-27T04:08:37.9873158Z [ RUN      ] ProcessGroupNCCLTest.testReduceScatter
2021-10-27T04:08:38.0560244Z [       OK ] ProcessGroupNCCLTest.testReduceScatter (68 ms)
2021-10-27T04:08:38.0562080Z [ RUN      ] ProcessGroupNCCLTest.testSequenceNumInit
2021-10-27T04:38:38.0566106Z terminate called after throwing an instance of 'c10::Error'
2021-10-27T04:38:38.0567353Z   what():  NCCL error in: /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1078, invalid usage, NCCL version 21.0.3
2021-10-27T04:38:38.0568838Z ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
2021-10-27T04:38:38.0570317Z Exception raised from getNCCLComm at /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1078 (most recent call first):
2021-10-27T04:38:38.0573714Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f268741ceab in /opt/conda/lib/python3.6/site-packages/torch/bin/libc10.so)
2021-10-27T04:38:38.0576960Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xce (0x7f2687418afe in /opt/conda/lib/python3.6/site-packages/torch/bin/libc10.so)
2021-10-27T04:38:38.0579762Z frame #2: c10d::ProcessGroupNCCL::getNCCLComm(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x18a8 (0x7f2687c88758 in /opt/conda/lib/python3.6/site-packages/torch/bin/libtorch_cuda_cpp.so)
2021-10-27T04:38:38.0581774Z frame #3: <unknown function> + 0x148c35 (0x7f2687c88c35 in /opt/conda/lib/python3.6/site-packages/torch/bin/libtorch_cuda_cpp.so)
2021-10-27T04:38:38.0582770Z frame #4: <unknown function> + 0xc9039 (0x7f2694b57039 in /opt/conda/lib/libstdc++.so.6)
2021-10-27T04:38:38.0584022Z frame #5: <unknown function> + 0x76ba (0x7f26899996ba in /lib/x86_64-linux-gnu/libpthread.so.0)
2021-10-27T04:38:38.0585020Z frame #6: clone + 0x6d (0x7f267770851d in /lib/x86_64-linux-gnu/libc.so.6)
2021-10-27T04:38:38.0585440Z 

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


seemethere added a commit that referenced this pull request Oct 26, 2021
Switches the linux.8xlarge.nvidia.gpu to the 4xlarge instance type to
help with queueing / capacity issues. This change is only meant to be a
bridge until everyone updates their PRs to use the new
linux.4xlarge.nvidia.gpu node type

NOTE: This node type will be removed so do not depend on it for any new
workflows.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

ghstack-source-id: 724d2ce
Pull Request resolved: #67299
@seemethere
Member Author

@seemethere has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Oct 27, 2021
Summary:
Pull Request resolved: #67299

Switches the linux.8xlarge.nvidia.gpu to the 4xlarge instance type to
help with queueing / capacity issues. This change is only meant to be a
bridge until everyone updates their PRs to use the new
linux.4xlarge.nvidia.gpu node type

NOTE: This node type will be removed so do not depend on it for any new
workflows.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D31945507

Pulled By: seemethere

fbshipit-source-id: fb8587de7f31da72e968d46eeecc573d3f5b440f
@facebook-github-bot
Contributor

This pull request has been reverted by f5bb83d991d8d425238fda9e992df039797b11f4. To re-land this change, follow these steps.

@facebook-github-bot
Contributor

This pull request has been reverted by 2669e4e. To re-land this change, follow these steps.

@facebook-github-bot facebook-github-bot deleted the gh/seemethere/178/head branch November 26, 2021 15:16
