[CI][CUDA][Distributed] test_assert_nan_float16 unit test hangs with certain Host OS + CUDA KMD 570.133.07 #153479

Closed
nWEIdia opened this issue May 13, 2025 · 3 comments
Assignees: nWEIdia
Labels: module: deadlock (Problems related to deadlocks (hang without exiting)), oncall: distributed (Add this issue/PR to distributed oncall triage queue)

Comments

@nWEIdia (Collaborator) commented on May 13, 2025

🐛 Describe the bug

Example failure job link (raw): https://ossci-raw-job-status.s3.amazonaws.com/log/41906032660

While working on #151594, the distributed job (test_c10d_nccl.py) with CUDA 12.6 would hang on the test_nan_assert checks for the bfloat16, float16, float32, and float64 data types.

Creating this issue to track the fix, as the behavior could be host-operating-system specific: I have reproduced the hang on an Ubuntu 20.04 host, while an Ubuntu 22.04 setup correctly handles the raised NaN and passes the unit test.

Steps to reproduce:

  1. Get a T4 runner.
  2. Provision Ubuntu 20.04 to reproduce the hang, or Ubuntu 22.04 to work around it (not in CI, but locally).
  3. Install Docker and the nvidia-container-toolkit (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html), and install the CUDA driver (e.g. NVIDIA-Linux-x86_64-570.133.07.run).
  4. Download the PyTorch nightly Docker image (e.g. ghcr.io/pytorch/pytorch-nightly:2.8.0.dev20250511-cuda12.6-cudnn9-devel).
  5. Launch the container with: docker run --gpus all -it ghcr.io/pytorch/pytorch-nightly:2.8.0.dev20250511-cuda12.6-cudnn9-devel
  6. apt update && apt install git
  7. Clone pytorch.
  8. cd pytorch/test/distributed/
  9. python test_c10d_nccl.py -v -k test_nan_assert (a minimal standalone sketch of what this test exercises is shown after this list)

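For context, here is a minimal standalone sketch (my own code, not the actual test) of the pattern test_nan_assert exercises: a NCCL collective over a float16 tensor containing a NaN, with NaN checking enabled. The file name nan_assert_sketch.py and the use of the TORCH_NCCL_NAN_CHECK environment variable to enable the check are illustrative assumptions; the real test drives this through test_c10d_nccl.py.

    # nan_assert_sketch.py -- hypothetical standalone repro sketch, not the actual test.
    # Assumes a single node with at least 2 CUDA GPUs; run with, e.g.:
    #   TORCH_NCCL_NAN_CHECK=1 torchrun --nproc_per_node=2 nan_assert_sketch.py
    import torch
    import torch.distributed as dist

    def main() -> None:
        dist.init_process_group(backend="nccl")
        rank = dist.get_rank()
        torch.cuda.set_device(rank)

        # float16 tensor with one element poisoned by NaN.
        nan_tensor = torch.ones(1024, dtype=torch.float16, device="cuda")
        nan_tensor[0] = float("nan")

        try:
            # With NaN checking enabled, this collective is expected to raise a
            # RuntimeError; on the affected Ubuntu 20.04 + 570.133.07 setup it
            # hangs here instead.
            dist.all_reduce(nan_tensor)
            torch.cuda.synchronize()
        except RuntimeError as exc:
            print(f"rank {rank}: NaN check raised as expected: {exc}")
        finally:
            dist.destroy_process_group()

    if __name__ == "__main__":
        main()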
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @ptrblck @eqy @tinglvv @atalman @malfet

Versions

Nightly: ghcr.io/pytorch/pytorch-nightly:2.8.0.dev20250511-cuda12.6-cudnn9-devel 

nWEIdia self-assigned this on May 13, 2025
@nWEIdia (Collaborator, Author) commented on May 13, 2025

Cross-linking with #136390.

malfet added the oncall: distributed (Add this issue/PR to distributed oncall triage queue) and module: deadlock (Problems related to deadlocks (hang without exiting)) labels on May 13, 2025
@nWEIdia (Collaborator, Author) commented on May 13, 2025

While this observation is interesting and could be time-consuming to root-cause and fix, I noticed that test_nan_assert was recently modified in #151723:

    # confirm enable/disable flag works
    backend._set_enable_nan_check(False)
    pg.allreduce(nan_tensor)

    backend._set_enable_nan_check(True)
    with self.assertRaises(RuntimeError):
        pg._allgather_base(output, nan_tensor)

Perhaps the "confirm enable/disable flag works" check should be a separate unit test? A rough sketch of that split follows.
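For what it is worth, here is a rough sketch (my suggestion, not existing PyTorch test code) of how that separate test could look, assuming the usual helpers and decorators from test_c10d_nccl.py (requires_nccl, skip_if_lt_x_gpu, c10d.FileStore, self._create_process_group_nccl, self.opts, self.rank, self.world_size, self.file_name) are available on the test class; the method name test_nan_check_enable_disable_flag is just a placeholder.

    @requires_nccl()
    @skip_if_lt_x_gpu(2)
    def test_nan_check_enable_disable_flag(self):
        # Hypothetical separate test; mirrors the snippet quoted above.
        store = c10d.FileStore(self.file_name, self.world_size)
        pg = self._create_process_group_nccl(store, self.opts())
        device = torch.device("cuda", self.rank)
        backend = pg._get_backend(device)

        nan_tensor = torch.full((64,), float("nan"), dtype=torch.float16, device=device)
        output = torch.empty(64 * self.world_size, dtype=torch.float16, device=device)

        # With the NaN check disabled, the collective should complete.
        backend._set_enable_nan_check(False)
        pg.allreduce(nan_tensor)

        # With the NaN check re-enabled, the same NaN input should raise.
        backend._set_enable_nan_check(True)
        with self.assertRaises(RuntimeError):
            pg._allgather_base(output, nan_tensor)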

pytorchmergebot pushed a commit that referenced this issue May 20, 2025
… to cu126-sm75 (#151594)

This PR moves the distributed CUDA CI job from CUDA 11.8 to CUDA 12.6.
In doing so, a few unit test failures were exposed, some if not all of which would take a while to root-cause and fix, so they are temporarily skipped after creating the following issues:

#153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle, ubuntu 20.04 does not work, ubuntu 22.04 works, Amazon Linux 2023 skip - what is Sandcastle OS?)
#153122 CUDA context related
#153517  NCCL regression, future NCCL may fix it

See: #147383

Pull Request resolved: #151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever
pytorchmergebot pushed a commit that referenced this issue May 22, 2025
…12.6 (#151594)

This PR moves the distributed CUDA CI job from CUDA 11.8 to CUDA 12.6.
In doing so, a few unit test failures were exposed, some if not all of which would take a while to root-cause and fix, so they are temporarily skipped after creating the following issues:

#153479 test_nan_assert tricky behavior (e.g. skip_but_pass_in_sandcastle, ubuntu 20.04 does not work, ubuntu 22.04 works, Amazon Linux 2023 skip - what is Sandcastle OS?)
#153122 CUDA context related
#153517  NCCL regression, future NCCL may fix it
#154073 skip test_symmetric_memory for cuda 12.6 before it is fixed

See: #147383

Pull Request resolved: #151594
Approved by: https://github.com/eqy, https://github.com/atalman, https://github.com/cyyever, https://github.com/huydhn, https://github.com/kwen2501
pytorchmergebot pushed a commit that referenced this issue Jun 5, 2025
We need to re-enable this test because there are recent changes that could be relevant to test_nan_assert.

I have already verified that there would be a hang if we do not remove the "pg._allgather_base(output, nan_tensor)" call in between the "backend._set_enable_nan_check" calls.
Why was it "working" previously? Because only the cu118 distributed job was running, so this "backend._set_enable_nan_check" change was not exercised in the merge process (the skip logic is: if not CUDA 12 or above, skip).

Workaround #153479

Pull Request resolved: #154448
Approved by: https://github.com/kwen2501
angelayi pushed a commit to angelayi/pytorch that referenced this issue on Jun 5, 2025 (same commit message as above; pytorch#154448).
framoncg pushed a commit to docathon-pytorch-friends/pytorch that referenced this issue on Jun 6, 2025 (same commit message as above; pytorch#154448).
@nWEIdia (Collaborator, Author) commented on Jun 13, 2025

Update: @xiaofanl-nvidia helped with the debugging, and below are our findings.

On the host machine, with the IOMMU disabled, the NCCL test passes. For details on the interaction between the IOMMU/ACS and NCCL, see: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs

We therefore recommend disabling the IOMMU as the more reliable way to avoid the hang. More recent kernels, such as the one shipped with Ubuntu 22.04, appear to handle this better: on those systems there is no hang even with the IOMMU enabled.
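As a quick sanity check before and after changing the BIOS or kernel command line, here is a small sketch (my own heuristic helper, not an official tool) that guesses whether the IOMMU is active on the host by checking whether /sys/kernel/iommu_groups is populated.

    # iommu_check.py -- hypothetical helper; heuristic only.
    import pathlib

    def iommu_enabled(sysfs_path: str = "/sys/kernel/iommu_groups") -> bool:
        # When the kernel IOMMU is active, devices are assigned to groups and
        # this directory contains one subdirectory per group; when it is
        # disabled the directory is empty or absent.
        groups = pathlib.Path(sysfs_path)
        return groups.is_dir() and any(groups.iterdir())

    if __name__ == "__main__":
        print("IOMMU appears", "enabled" if iommu_enabled() else "disabled")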

On the Amazon Linux side, Amazon Linux 2023 behaved similarly to Ubuntu 20.04; newer versions of Amazon Linux may be free of this issue, so we look forward to Amazon Linux 202X (X > 3).

Closing, as there is little left for us to do here.

nWEIdia closed this as completed on Jun 13, 2025