[CI][CUDA][Distributed] test_assert_nan_float16 unit test hangs with certain Host OS + CUDA KMD 570.133.07 #153479
Comments
Cross-linking with #136390.
While this observation is interesting and could be time-consuming to root-cause and fix, I noticed that test_nan_assert was recently modified in #151723.
Perhaps the "confirm enable/disable flag works" check should be a separate unit test?
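For illustration, a minimal sketch of what such a separate, flag-only test could look like, assuming the private `backend._set_enable_nan_check(bool)` API mentioned later in this thread and a 2-GPU NCCL setup; the helper name and spawn wiring are hypothetical and not the actual code in test_c10d_nccl.py:

```python
# Hypothetical flag-only test sketch (not the real test_c10d_nccl.py code).
# Assumes the private backend._set_enable_nan_check(bool) API referenced in
# this thread; it only toggles the flag and deliberately issues no collective
# on NaN data in between.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _toggle_nan_check_flag(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    torch.cuda.set_device(rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    pg = dist.distributed_c10d._get_default_group()
    backend = pg._get_backend(torch.device("cuda", rank))

    # Round-trip the enable/disable flag only; no allgather on a NaN tensor here.
    backend._set_enable_nan_check(False)
    backend._set_enable_nan_check(True)

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    mp.spawn(_toggle_nan_check_flag, args=(world_size,), nprocs=world_size)
```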
…12.6 (#151594): This PR moves the distributed CUDA CI job from CUDA 11.8 to CUDA 12.6. In doing so, a few unit-test failures were exposed, some if not all of which would take a while to root-cause and fix, so they are temporarily skipped after creating the issues:
- #153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle; Ubuntu 20.04 does not work, Ubuntu 22.04 works, Amazon Linux 2023 skip - what is the Sandcastle OS?)
- #153122 CUDA context related
- #153517 NCCL regression; a future NCCL release may fix it
- #154073 skip test_symmetric_memory for CUDA 12.6 until it is fixed

See: #147383
Pull Request resolved: #151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever, https://github.com/huydhn, https://github.com/kwen2501
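For reference, a hedged sketch of how such a temporary skip might look, assuming `skip_but_pass_in_sandcastle` (named in the commit message above) takes a reason string and returns a decorator; the test class and body below are placeholders, not the real ones:

```python
# Illustrative only: placeholder test class/body; the real test lives in
# test_c10d_nccl.py. Assumes skip_but_pass_in_sandcastle(reason) returns a
# decorator that skips the test in OSS CI and reports a pass in Meta's
# internal Sandcastle CI (inferred from its name).
from torch.testing._internal.common_utils import (
    TestCase,
    run_tests,
    skip_but_pass_in_sandcastle,
)


class ExampleSkippedTests(TestCase):
    @skip_but_pass_in_sandcastle("hangs with CUDA 12.6 on some host OSes; see #153479")
    def test_nan_assert_float16(self):
        ...  # placeholder body


if __name__ == "__main__":
    run_tests()
```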
We need to re-enable this test because there are recent changes that could be relevant to test_nan_assert. I have already verified that there would be a hang if we do not remove the "pg._allgather_base(output, nan_tensor)" call in between the "backend._set_enable_nan_check" calls. Why was it "working" previously? Because previously only the cu118 distributed job was running, and the "backend._set_enable_nan_check" change was never exercised in the merge process (the skip logic is: if not CUDA 12 and above, skip). Workaround for #153479. Pull Request resolved: #154448. Approved by: https://github.com/kwen2501
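To make that skip condition concrete, a small sketch of that kind of CUDA-version gate (the decorator name is made up for this illustration and is not the one used in the test suite):

```python
# Hypothetical version gate, illustrating "if not CUDA 12 and above, skip".
import unittest

import torch


def requires_cuda_12_or_later(fn):
    cuda_ver = torch.version.cuda  # e.g. "12.6"; None for CPU-only builds
    too_old = cuda_ver is None or int(cuda_ver.split(".")[0]) < 12
    return unittest.skipIf(too_old, "requires CUDA 12 or later")(fn)
```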
Update: @xiaofanl-nvidia helped with the debugging, and below are our findings on the host machine: we recommend disabling IOMMU as the more reliable way to avoid the hang. More recent kernels, such as the one shipped with Ubuntu 22.04, may have better support, so there is no hang there even with IOMMU enabled. On the Amazon Linux side, Amazon Linux 2023 behaved similarly to Ubuntu 20.04; newer versions of Amazon Linux could potentially be free of such issues, so we look forward to Amazon Linux 202X (X>3). Closing, as there seems to be little left for us to do here.
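As a quick host-side check, a small sketch that inspects the sysfs IOMMU groups (this assumes typical Linux sysfs behavior rather than anything specific to this issue):

```python
# Quick host-side check: on Linux, /sys/kernel/iommu_groups is populated with
# group directories when the IOMMU is enabled; an empty (or missing) directory
# generally means it is disabled. This reflects common sysfs behavior and is
# an assumption, not something taken from this issue.
from pathlib import Path


def iommu_groups_present() -> bool:
    groups = Path("/sys/kernel/iommu_groups")
    return groups.is_dir() and any(groups.iterdir())


if __name__ == "__main__":
    print("IOMMU groups present:", iommu_groups_present())
```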
🐛 Describe the bug
Example failure job link (raw): https://ossci-raw-job-status.s3.amazonaws.com/log/41906032660
While working on #151594, the distributed job (test_c10d_nccl.py) with CUDA 12.6 hangs on the test_nan_assert checks for the bfloat16, float16, float32, and float64 data types.
Creating this issue to track the fix, as the behavior could be specific to the host operating system: I have reproduced the hang on an Ubuntu 20.04 host, while an Ubuntu 22.04 setup correctly handles the raised NaN and the unit test passes ("OK").
Steps to reproduce:
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @ptrblck @eqy @tinglvv @atalman @malfet
Versions
Nightly: ghcr.io/pytorch/pytorch-nightly:2.8.0.dev20250511-cuda12.6-cudnn9-devel