[CI][CUDA][Distributed] test_assert_nan_float16 unit test hangs with certain Host OS + CUDA KMD 570.133.07 #153479

Closed
nWEIdia opened this issue May 13, 2025 · 3 comments
Assignees: nWEIdia
Labels: module: deadlock (Problems related to deadlocks (hang without exiting)), oncall: distributed (Add this issue/PR to distributed oncall triage queue)

Comments

@nWEIdia (Collaborator) commented on May 13, 2025

🐛 Describe the bug

Example failure job link (raw): https://ossci-raw-job-status.s3.amazonaws.com/log/41906032660

While working on #151594, the distributed job (test_c10d_nccl.py) with CUDA 12.6 would hang on the test_nan_assert checks for the bfloat16, float16, float32, and float64 data types.

Creating this issue to track the fix, as the behavior could be host-operating-system specific: I have reproduced the hang on an Ubuntu 20.04 host, while an Ubuntu 22.04 setup correctly handles the raised NaN and passes the unit test.

Steps to reproduce:

  1. Get a T4 runner.
  2. Provision Ubuntu 20.04 to reproduce the hang, or Ubuntu 22.04 to work around it (not in CI, but locally).
  3. Install Docker and the nvidia-container-toolkit (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html), and install the CUDA driver (e.g. NVIDIA-Linux-x86_64-570.133.07.run).
  4. Download the PyTorch nightly Docker image (e.g. ghcr.io/pytorch/pytorch-nightly:2.8.0.dev20250511-cuda12.6-cudnn9-devel).
  5. Launch the container with: docker run --gpus all -it ghcr.io/pytorch/pytorch-nightly:2.8.0.dev20250511-cuda12.6-cudnn9-devel
  6. apt update && apt install git
  7. Clone pytorch.
  8. cd pytorch/test/distributed/
  9. python test_c10d_nccl.py -v -k test_nan_assert (a minimal standalone sketch of what this test exercises is shown after this list)

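For context, here is a minimal standalone sketch (my own code, not the actual test) of the pattern test_nan_assert exercises: a NCCL collective over a float16 tensor containing a NaN, with NaN checking enabled. The file name nan_assert_sketch.py and the use of the TORCH_NCCL_NAN_CHECK environment variable to enable the check are illustrative assumptions; the real test drives this through test_c10d_nccl.py.

    # nan_assert_sketch.py -- hypothetical standalone repro sketch, not the actual test.
    # Assumes a single node with at least 2 CUDA GPUs; run with, e.g.:
    #   TORCH_NCCL_NAN_CHECK=1 torchrun --nproc_per_node=2 nan_assert_sketch.py
    import torch
    import torch.distributed as dist

    def main() -> None:
        dist.init_process_group(backend="nccl")
        rank = dist.get_rank()
        torch.cuda.set_device(rank)

        # float16 tensor with one element poisoned by NaN.
        nan_tensor = torch.ones(1024, dtype=torch.float16, device="cuda")
        nan_tensor[0] = float("nan")

        try:
            # With NaN checking enabled, this collective is expected to raise a
            # RuntimeError; on the affected Ubuntu 20.04 + 570.133.07 setup it
            # hangs here instead.
            dist.all_reduce(nan_tensor)
            torch.cuda.synchronize()
        except RuntimeError as exc:
            print(f"rank {rank}: NaN check raised as expected: {exc}")
        finally:
            dist.destroy_process_group()

    if __name__ == "__main__":
        main()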
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @ptrblck @eqy @tinglvv @atalman @malfet

Versions

Nightly: ghcr.io/pytorch/pytorch-nightly:2.8.0.dev20250511-cuda12.6-cudnn9-devel 

nWEIdia self-assigned this on May 13, 2025
@nWEIdia (Collaborator, Author) commented on May 13, 2025

Cross-linking with #136390.

malfet added the oncall: distributed (Add this issue/PR to distributed oncall triage queue) and module: deadlock (Problems related to deadlocks (hang without exiting)) labels on May 13, 2025
@nWEIdia (Collaborator, Author) commented on May 13, 2025

While this observation is interesting and could be time-consuming to root-cause and fix, I noticed that test_nan_assert was recently modified in #151723:

    # confirm enable/disable flag works
    backend._set_enable_nan_check(False)
    pg.allreduce(nan_tensor)

    backend._set_enable_nan_check(True)
    with self.assertRaises(RuntimeError):
        pg._allgather_base(output, nan_tensor)

Perhaps the "confirm enable/disable flag works" check should be a separate unit test? A rough sketch of that split follows.
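For what it is worth, here is a rough sketch (my suggestion, not existing PyTorch test code) of how that separate test could look, assuming the usual helpers and decorators from test_c10d_nccl.py (requires_nccl, skip_if_lt_x_gpu, c10d.FileStore, self._create_process_group_nccl, self.opts, self.rank, self.world_size, self.file_name) are available on the test class; the method name test_nan_check_enable_disable_flag is just a placeholder.

    @requires_nccl()
    @skip_if_lt_x_gpu(2)
    def test_nan_check_enable_disable_flag(self):
        # Hypothetical separate test; mirrors the snippet quoted above.
        store = c10d.FileStore(self.file_name, self.world_size)
        pg = self._create_process_group_nccl(store, self.opts())
        device = torch.device("cuda", self.rank)
        backend = pg._get_backend(device)

        nan_tensor = torch.full((64,), float("nan"), dtype=torch.float16, device=device)
        output = torch.empty(64 * self.world_size, dtype=torch.float16, device=device)

        # With the NaN check disabled, the collective should complete.
        backend._set_enable_nan_check(False)
        pg.allreduce(nan_tensor)

        # With the NaN check re-enabled, the same NaN input should raise.
        backend._set_enable_nan_check(True)
        with self.assertRaises(RuntimeError):
            pg._allgather_base(output, nan_tensor)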

pytorchmergebot pushed a commit that referenced this issue May 20, 2025
… to cu126-sm75 (#151594)

This PR moves the distributed CUDA CI job from CUDA 11.8 to CUDA 12.6.
In doing so, a few unit test failures were exposed, some if not all of which would take a while to root-cause and fix, so they are temporarily skipped after creating the following issues:

#153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle, ubuntu 20.04 does not work, ubuntu 22.04 works, Amazon Linux 2023 skip - what is Sandcastle OS?)
#153122 CUDA context related
#153517  NCCL regression, future NCCL may fix it

See: #147383

Pull Request resolved: #151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever
pytorchmergebot pushed a commit that referenced this issue May 22, 2025
…12.6 (#151594)

This PR moves the distributed CUDA CI job from CUDA 11.8 to CUDA 12.6.
In doing so, a few unit test failures were exposed, some if not all of which would take a while to root-cause and fix, so they are temporarily skipped after creating the following issues:

#153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle, ubuntu 20.04 does not work, ubuntu 22.04 works, Amazon Linux 2023 skip - what is Sandcastle OS?)
#153122 CUDA context related
#153517  NCCL regression, future NCCL may fix it
#154073 skip test_symmetric_memory for cuda 12.6 before it is fixed

See: #147383

Pull Request resolved: #151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever, https://github.com/huydhn, https://github.com/kwen2501
pytorchmergebot pushed a commit that referenced this issue Jun 5, 2025
We need to re-enable this test because there are recent changes that could be relevant to test_nan_assert.

I have already verified that there would be a hang if we do not remove the "pg._allgather_base(output, nan_tensor)" call in between the "backend._set_enable_nan_check" calls.
Why was it "working" previously? Because only the cu118 distributed job was running, so this "backend._set_enable_nan_check" change was not exercised in the merge process (the skip logic is: if not CUDA 12 or above, skip).

Workaround #153479

Pull Request resolved: #154448
Approved by: https://github.com/kwen2501
angelayi pushed a commit to angelayi/pytorch that referenced this issue on Jun 5, 2025 (same commit message as above; pytorch#154448).
framoncg pushed a commit to docathon-pytorch-friends/pytorch that referenced this issue on Jun 6, 2025 (same commit message as above; pytorch#154448).
@nWEIdia (Collaborator, Author) commented on Jun 13, 2025

Update: @xiaofanl-nvidia helped with the debugging, and below are our findings.

On the host machine, with the IOMMU disabled, the NCCL test passes. For details on the interaction between the IOMMU/ACS and NCCL, see: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs

We therefore recommend disabling the IOMMU as the more reliable way to avoid the hang. More recent kernels, such as the one shipped with Ubuntu 22.04, appear to handle this better: on those systems there is no hang even with the IOMMU enabled.
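As a quick sanity check before and after changing the BIOS or kernel command line, here is a small sketch (my own heuristic helper, not an official tool) that guesses whether the IOMMU is active on the host by checking whether /sys/kernel/iommu_groups is populated.

    # iommu_check.py -- hypothetical helper; heuristic only.
    import pathlib

    def iommu_enabled(sysfs_path: str = "/sys/kernel/iommu_groups") -> bool:
        # When the kernel IOMMU is active, devices are assigned to groups and
        # this directory contains one subdirectory per group; when it is
        # disabled the directory is empty or absent.
        groups = pathlib.Path(sysfs_path)
        return groups.is_dir() and any(groups.iterdir())

    if __name__ == "__main__":
        print("IOMMU appears", "enabled" if iommu_enabled() else "disabled")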

On the Amazon Linux side, Amazon Linux 2023 behaved similarly to Ubuntu 20.04; newer versions of Amazon Linux may be free of this issue, so we look forward to Amazon Linux 202X (X > 3).

Closing, as there is little left for us to do here.

nWEIdia closed this as completed on Jun 13, 2025