AllReduce hangs due to no device_id into init_process_group · Issue #142356 · pytorch/pytorch · GitHub
AllReduce hangs due to no device_id into init_process_group #142356

Open
youth123 opened this issue Dec 9, 2024 · 6 comments
Labels
module: c10d Issues/PRs related to collective communications and process groups oncall: distributed Add this issue/PR to distributed oncall triage queue

Comments

@youth123
youth123 commented Dec 9, 2024

🐛 Describe the bug

In PyTorch 2.2 or later, when initializing a distributed environment, if you do not pass device_id to init_process_group but instead set the device via torch.cuda.set_device, the allreduce primitive hangs or NCCL reports an error directly. The issue only comes up with the allreduce primitive; does anyone know why?
You can run the example below with torchrun --nproc_per_node 4 test.py

import torch
import torch.distributed as dist
import os

init_method = 'tcp://'
master_ip = os.getenv('MASTER_ADDR', 'localhost')
world_size = 4

default_master_port = '6000'
master_port = os.getenv('MASTER_PORT', default_master_port)
init_method += master_ip + ':' + master_port
rank = int(os.getenv('RANK', '0'))
# correct init
# torch.distributed.init_process_group(
#     backend="nccl",
#     world_size=world_size, rank=rank, device_id=torch.device(f"cuda:{rank}"), init_method=init_method)

# will cause allreduce hang
torch.cuda.set_device(torch.device(f"cuda:{rank}"))
torch.distributed.init_process_group(
    backend="nccl",
    world_size=world_size, rank=rank, init_method=init_method)

cur_rank = torch.distributed.get_rank()
if cur_rank == 0 or cur_rank == 1:
    gqa_group = torch.distributed.new_group([0, 1])
else:
    gqa_group = torch.distributed.new_group([2, 3])

a = torch.tensor(1, device=cur_rank)
torch.distributed.all_reduce(a, group=gqa_group)

Versions

PyTorch 2.5

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@malfet malfet added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Dec 9, 2024
@XilunWu
Contributor
XilunWu commented Dec 9, 2024

I tried to reproduce the hang but couldn't. Instead, it's an NCCL error:

misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to 2803:6082:c0b4:8500::1<35393> failed : Software caused connection abort
[rank1]: Traceback (most recent call last):
[rank1]:   File "test_allreduce.py", line 49, in <module>
[rank1]:     torch.distributed.all_reduce(a, group=gqa_group)
[rank1]:   File "torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "torch/distributed/distributed_c10d.py", line 2773, in all_reduce
[rank1]:     work = group.allreduce([tensor], opts)
[rank1]: torch.distributed.DistBackendError: NCCL error in: torch/csrc/distributed/c10d/NCCLUtils.hpp:269, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
[rank1]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
[rank1]: Last error:
[rank1]: socketStartConnect: Connect to <host>::1<35393> failed : Software caused connection abort
W1209 11:00:32.851000 1935266 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1935505 closing signal SIGTERM
W1209 11:00:32.852000 1935266 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1935507 closing signal SIGTERM
W1209 11:00:32.853000 1935266 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1935508 closing signal SIGTERM

The test code I'm using:

import os

import torch
import torch.distributed as dist

from torch.testing._internal.common_utils import find_free_port

init_method = "tcp://"
master_ip = os.getenv("MASTER_ADDR", "localhost")
world_size = 4

default_master_port = str(find_free_port())  # convert to str for the concatenation below
master_port = os.getenv("MASTER_PORT", default_master_port)
init_method += master_ip + ":" + master_port
rank = int(os.getenv("RANK", "0"))

# When `use_set_device == True`, the program has a good chance to fail with NCCL error.
use_set_device = True

if use_set_device:
    # will cause allreduce hang
    torch.cuda.set_device(torch.device(f"cuda:{rank}"))
    dist.init_process_group(
        backend="nccl",
        world_size=world_size,
        rank=rank,
        init_method=init_method,
    )
else:
    # correct init
    dist.init_process_group(
        backend="nccl",
        world_size=world_size,
        rank=rank,
        device_id=torch.device(f"cuda:{rank}"),
        init_method=init_method,
    )


# test code
cur_rank = torch.distributed.get_rank()
if cur_rank == 0 or cur_rank == 1:
    gqa_group = torch.distributed.new_group([0, 1])
else:
    gqa_group = torch.distributed.new_group([2, 3])

a = torch.tensor(1, device=cur_rank)
torch.distributed.all_reduce(a, group=gqa_group)

torch.distributed.destroy_process_group()

cc @yifuwang @kwen2501

@wconstab
Contributor
wconstab commented Dec 9, 2024

IIUC this code is actually not correct: according to the docs for new_group, all ranks must call new_group (for each group creation).
[screenshot of the torch.distributed.new_group documentation]

specifically, this code:

if cur_rank == 0 or cur_rank == 1:
    gqa_group = torch.distributed.new_group([0, 1])
else:
    gqa_group = torch.distributed.new_group([2, 3])

should change to this:

if cur_rank == 0 or cur_rank == 1:
    gqa_group = torch.distributed.new_group([0, 1])
    _ = torch.distributed.new_group([2, 3])
else:
    _ = torch.distributed.new_group([0, 1])
    gqa_group = torch.distributed.new_group([2, 3])

cc @kwen2501 please confirm and keep me honest

@youth123
Author

I tried to reproduce the hang but couldn't. Instead, it's an NCCL error

The hang does not happen every time; sometimes NCCL reports a connection refused error instead.

@youth123
Author

if cur_rank == 0 or cur_rank == 1:
    gqa_group = torch.distributed.new_group([0, 1])
    _ = torch.distributed.new_group([2, 3])
else:
    _ = torch.distributed.new_group([0, 1])
    gqa_group = torch.distributed.new_group([2, 3])

After I modified the code as above, I still occasionally got an NCCL connection refused error when running it multiple times, so I don't think the error is caused by new_group.

@wconstab
Contributor

Well, adding device_id to init_process_group opts you into "eager init" for NCCL. Without it, you get lazy init, which means NCCL establishes its connections on the first collective. In general, I've only seen this type of hang when the collective or new_group calls are not well synchronized.

The original code with the incorrect new_group calls does cause the NCCL connect error for me, but the corrected code works fine; I can't repro an error after 5 back-to-back retries.

Can you confirm you're still seeing errors with this script?
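
For reference, here is a minimal sketch (not from this thread) that combines both suggestions: pass device_id for eager NCCL init, and have every rank create every subgroup in the same order. It assumes a single-node torchrun --nproc_per_node 4 launch, so RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT come from the environment and init_method can stay at its env:// default.

import os

import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# device_id opts into eager init: the NCCL communicator for the default
# group is created here instead of lazily on the first collective.
dist.init_process_group(
    backend="nccl",
    rank=rank,
    world_size=world_size,
    device_id=torch.device(f"cuda:{rank}"),
)

# Every rank calls new_group for every subgroup, in the same order;
# new_group returns a usable handle only to the ranks that belong to it.
group_01 = dist.new_group([0, 1])
group_23 = dist.new_group([2, 3])
gqa_group = group_01 if rank in (0, 1) else group_23

a = torch.tensor(1, device=f"cuda:{rank}")
dist.all_reduce(a, group=gqa_group)

dist.destroy_process_group()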

@youth123
Author

Can you confirm you're still seeing errors with this script?

I did see the error after trying it a dozen times. Here's my error log.

[rank3]:[E1211 09:42:50.285517203 ProcessGroupNCCL.cpp:542] [Rank 1] Collective WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) raised the following async exception: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
Exception raised from checkForNCCLErrorsInternal at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2027 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fd8e1b6c446 in /root/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::checkForNCCLErrorsInternal(std::shared_ptr<c10d::NCCLComm>&) + 0x220 (0x7fd897a29f70 in /root/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::checkAndSetException() + 0x7c (0x7fd897a2a1bc in /root/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isStarted() + 0x90 (0x7fd897a2a490 in /root/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::watchdogHandler() + 0x9f8 (0x7fd897a32368 in /root/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fd897a3360d in /root/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0x145c0 (0x7fd8e20185c0 in /root/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #7: + 0x94ac3 (0x7fd8e2869ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x126850 (0x7fd8e28fb850 in /lib/x86_64-linux-gnu/libc.so.6)

@yf225 yf225 added the module: c10d Issues/PRs related to collective communications and process groups label Dec 26, 2024