(Will PR) Multiprocessing with CUDA_VISIBLE_DEVICES seems to give the wrong device · Issue #149196 · pytorch/pytorch · GitHub

(Will PR) Multiprocessing with CUDA_VISIBLE_DEVICES seems to give the wrong device #149196

Open

fzyzcjy opened this issue Mar 14, 2025 · 10 comments · May be fixed by #149248
Labels
module: cuda - Related to torch.cuda, and CUDA support in general
module: multiprocessing - Related to torch.multiprocessing
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

fzyzcjy (Contributor) commented Mar 14, 2025

EDIT: PR to fix this

PR is here: #149248

🐛 Describe the bug

Hi, thanks for the helpful library! When two processes have different CUDA_VISIBLE_DEVICES and pass tensors between them, it seems the .device attribute is incorrect.

Example code:

import os


def _run_second_process(queue):
    # Child process: CUDA_VISIBLE_DEVICES='1,2' here, so its cuda:0 is physical GPU 1
    print(f'[second] {os.environ.get("CUDA_VISIBLE_DEVICES")=}')
    value_from_queue = queue.get()
    print(f'[second] queue.get {value_from_queue=} {value_from_queue.device=}')


def _run_main_process():
    import torch
    # Parent process: CUDA_VISIBLE_DEVICES is unset, so cuda:N is physical GPU N
    print(f'[first] {os.environ.get("CUDA_VISIBLE_DEVICES")=}')
    queue = torch.multiprocessing.Queue()

    # Restrict the child (and only the child) to GPUs 1 and 2
    os.environ['CUDA_VISIBLE_DEVICES'] = '1,2'
    p = torch.multiprocessing.Process(
        target=_run_second_process,
        kwargs=dict(queue=queue),
    )
    p.start()
    del os.environ['CUDA_VISIBLE_DEVICES']

    # Allocate on physical GPU 1, which the child sees as its cuda:0
    value_to_queue = torch.tensor([1.0, 2.0], device='cuda:1')
    print(f'[first] queue.put {value_to_queue=} {value_to_queue.device=}')
    queue.put(value_to_queue)

    p.join()

if __name__ == '__main__':
    _run_main_process()

Output:

[first] os.environ.get("CUDA_VISIBLE_DEVICES")=None
[second] os.environ.get("CUDA_VISIBLE_DEVICES")='1,2'
[first] queue.put value_to_queue=tensor([1., 2.], device='cuda:1') value_to_queue.device=device(type='cuda', index=1)
[second] queue.get value_from_queue=tensor([1., 2.], device='cuda:1') value_from_queue.device=device(type='cuda', index=1)

Since the second process has CUDA_VISIBLE_DEVICES='1,2', its cuda:0 corresponds to cuda:1 in the first process; the received tensor actually lives on that physical GPU, yet the second process wrongly reports it as cuda:1, which is a different physical GPU from its point of view.
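For reference, a minimal sketch of the index remapping involved (assuming CUDA_VISIBLE_DEVICES contains plain integer indices; it may also contain GPU UUIDs, which this does not handle):

import os

def local_to_physical(local_index: int) -> int:
    # Map a process-local CUDA index to the physical GPU index by parsing
    # CUDA_VISIBLE_DEVICES; with the variable unset, the indices coincide.
    visible = os.environ.get('CUDA_VISIBLE_DEVICES')
    if visible is None:
        return local_index
    return int(visible.split(',')[local_index])

# In the second process above, local_to_physical(0) == 1, i.e. its cuda:0
# is the same physical GPU as the first process's cuda:1.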

This seems to be related to issues like github.com/volcengine/verl/pull/490#issuecomment-2720212225.

If I manage to find some spare time, I am happy to PR for this.

Versions

Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.1 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.39

Python version: 3.10.16 (main, Dec 4 2024, 08:53:38) [GCC 13.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-1017-aws-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.8.61
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 550.127.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.7.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R13 Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
Stepping: 1
BogoMIPS: 5299.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 3 MiB (96 instances)
L1i cache: 3 MiB (96 instances)
L2 cache: 48 MiB (96 instances)
L3 cache: 384 MiB (12 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] flashinfer-python==0.2.3+cu124torch2.5
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] optree==0.14.1
[pip3] torch==2.5.1
[pip3] torch_memory_saver==0.0.2
[pip3] torchao==0.9.0
[pip3] torchaudio==2.5.1
[pip3] torchdata==0.11.0
[pip3] torchvision==0.20.1
[pip3] triton==3.1.0
[conda] Could not collect

cc @VitalyFedyunin @albanD @ptrblck @msaroufim @eqy

fzyzcjy changed the title from "Multiprocessing with CUDA_VISIBLE_DEVICES seems to give the wrong device" to "(Will PR) Multiprocessing with CUDA_VISIBLE_DEVICES seems to give the wrong device" Mar 15, 2025
albanD (Collaborator) commented Mar 17, 2025

That would be very BC-breaking...
And exactly the same behavior happens for serialization in general.

I think there are two things here:

  • Make sure that what we have does not segfault in a bad way. Which might be happening right now?
  • What is the behavior we want for this case (keep id or keep physical device)?
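For comparison with the serialization point above, a minimal sketch of the existing behavior and the usual escape hatch (the file path is just a placeholder):

import torch

t = torch.tensor([1.0, 2.0], device='cuda:1')
torch.save(t, '/tmp/t.pt')  # placeholder path

# By default the tensor is restored onto the device index recorded at save
# time ('cuda:1'), regardless of which physical GPU that index maps to in
# the loading process; map_location lets the caller remap explicitly.
t2 = torch.load('/tmp/t.pt', map_location='cuda:0')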

albanD added the module: multiprocessing and module: cuda labels Mar 17, 2025
eqy (Collaborator) commented Mar 17, 2025

In general I'm not sure we are expecting to handle

import torch
os.environ['CUDA_VISIBLE_DEVICES'] = ...

patterns as we often assume env vars cannot be changed after initialization.
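For context, the pattern that is generally expected to work is setting the variable before CUDA is initialized, e.g. (a minimal sketch):

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1,2'  # set before CUDA is initialized

import torch
print(torch.cuda.device_count())  # 2: only physical GPUs 1 and 2 are visible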

zou3519 added the triaged label and removed the triage review label Mar 17, 2025
albanD (Collaborator) commented Mar 17, 2025

Changing environ after cuda initialization is a noop for sure.
This one is a bit special because these are two separate processes that could be launched completely independently, so it is perfectly valid for them to have different env variables.

fzyzcjy (Contributor, Author) commented Mar 17, 2025

That would be very BC-breaking...

The PR is a prototype hack to demonstrate a rough fix rather than a production-ready one :) For backward compatibility, we can e.g. check whether it is an id or a uuid and act accordingly.

Make sure that what we have does not segfault in a bad way. Which might be happening right now?

Firstly, the .device attribute is incorrect, so code relying on it gets confused.

Secondly, when peer GPUs do not have a connection, @Dutch-voyage reported an error when deserializing (volcengine/verl#490 (comment)), but since I do not have that hardware, I cannot say whether it can be reproduced by this exact script or is caused by something else. (@Dutch-voyage, could you please run this script and see whether it errors as well?)

What is the behavior we want for this case (keep id or keep physical device)?

I think the physical device is the correct thing, because the pointer that we send points to the physical device rather than to the local id. When a user sees a big tensor on cuda:0, they expect that their physical cuda:0 device is occupied by it, that using the tensor on cuda:0 needs no cross-GPU communication while using it on cuda:1 does, and so on.
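As an illustration of "keep physical device", a receiving process could translate a physical index back to its own local index roughly like this (again assuming integer entries in CUDA_VISIBLE_DEVICES; a robust version would compare GPU UUIDs instead):

import os

def physical_to_local(physical_index: int):
    # Inverse of the earlier mapping: which local index, if any, does this
    # process use for a given physical GPU?
    visible = os.environ.get('CUDA_VISIBLE_DEVICES')
    if visible is None:
        return physical_index
    indices = [int(x) for x in visible.split(',')]
    return indices.index(physical_index) if physical_index in indices else None

# In the second process above, physical_to_local(1) == 0, so a tensor that
# physically lives on GPU 1 should be reported as cuda:0 there.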

This one is a bit special because these are two separate processes, that could be launched completely independently

Yes, exactly. For a more complicated demo, the main process may be unaware of PyTorch and launch two subprocesses with different env vars.

zhaochenyang20 commented

@fzyzcjy xuehai pan told me that torch.distributed.rpc can assign a device mapping. Maybe that would work?
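For reference, a rough sketch of how such a device map is declared with torch.distributed.rpc (the worker names, ranks, rendezvous settings, and the 0 -> 1 mapping below are just placeholders for illustration):

import os
import torch.distributed.rpc as rpc

# Assumed rendezvous settings for this sketch
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29500')

options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=8)
# Declare that this worker's cuda:0 corresponds to cuda:1 on "worker1", so
# CUDA tensors sent over RPC land on the intended device.
options.set_device_map('worker1', {0: 1})

rpc.init_rpc('worker0', rank=0, world_size=2, rpc_backend_options=options)
# ... rpc.rpc_sync / rpc.remote calls that move CUDA tensors go here ...
rpc.shutdown()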

fzyzcjy (Contributor, Author) commented Mar 18, 2025

I have not tried it, but I guess that may work; still, it would be great if this bug were fixed as well.

zhaochenyang20 commented

@fzyzcjy I think we've already handled multi-node with the help of hancheng?

fzyzcjy (Contributor, Author) commented Mar 19, 2025

@zhaochenyang20 Not sure, do you mean volcengine/verl#652?

Anyway, that is orthogonal to this PyTorch bug.

zhaochenyang20 commented

@zhaochenyang20 Not sure, do you mean volcengine/verl#652?

Anyway, that is orthogonal to this PyTorch bug.

great!

Dutch-voyage commented Mar 19, 2025

@fzyzcjy @zhaochenyang20
Hi guys, thanks for working on this issue. To me, the trigger of the bug mentioned in volcengine/verl#490 (comment) is very straightforward. If the current torch.mp returns the device id from the local environment, then we would expect exactly this: a to-be-updated tensor is serialized in the Ray subprocess ActorRefRolloutWorker (specifically in ShardingManager) and deserialized in the Sglang subprocess TP_worker (ModelRunner), whereas the former has isolated env vars provided by Ray and the latter has global ones (which seems to be a temporary implementation, really).
As asked, I ran the script in the provided docker container, with the torch monkey patch disabled (in verl_engine.py and model_runner.py).
This is what I get:

2025-03-19 08:57:52,816 INFO worker.py:1832 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
[... full verl config dump, model loading, and FSDP startup logs omitted for brevity; the run fails while deserializing tensors in the SGLang worker with the following traceback ...]
(WorkerDict pid=28537)     (name, _unwrap_tensor(tensor, tp_rank=self.tp_rank))
(WorkerDict pid=28537)   File "/root/sglang/python/sglang/srt/model_executor/model_runner.py", line 1040, in _unwrap_tensor
(WorkerDict pid=28537)     tensor = tensor.get(tp_rank)
(WorkerDict pid=28537)   File "/root/sglang/python/sglang/srt/model_executor/model_runner.py", line 1052, in get
(WorkerDict pid=28537)     return MultiprocessingSerializer.deserialize(self.values[rank])
(WorkerDict pid=28537)   File "/root/sglang/python/sglang/srt/utils.py", line 1386, in deserialize
(WorkerDict pid=28537)     return ForkingPickler.loads(data)
(WorkerDict pid=28537)   File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/reductions.py", line 181, in rebuild_cuda_tensor
(WorkerDict pid=28537)     storage = storage_cls._new_shared_cuda(
(WorkerDict pid=28537)   File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 1434, in _new_shared_cuda
(WorkerDict pid=28537)     return torch.UntypedStorage._new_shared_cuda(*args, **kwargs)
(WorkerDict pid=28537) RuntimeError: CUDA error: peer access is not supported between these two devices
(WorkerDict pid=28537) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[... C++ captured traceback omitted ...]
(WorkerDict pid=28537) [2025-03-19 08:59:19] Received sigquit from a child process. It usually means the child failed.
[... Ray main_task traceback omitted; it ends with ...]
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task. The actor is dead because its worker process has died. Worker exit type: INTENDED_USER_EXIT Worker exit detail: Worker exits by an user request. Worker exits with an exit code 0.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(WorkerDict pid=30331) kwargs: {'n': 1, 'max_new_tokens': 1, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}
(WorkerDict pid=30331) after resume, sleep for 5 second to check nvidia-smi...

As for the script in this issue, I get:

[first] os.environ.get("CUDA_VISIBLE_DEVICES")=None
[second] os.environ.get("CUDA_VISIBLE_DEVICES")='1,2'
[first] queue.put value_to_queue=tensor([1., 2.], device='cuda:1') value_to_queue.device=device(type='cuda', index=1)
Process Process-1:
Traceback (most recent call last):
  File "/home/yyx/miniconda3/envs/sgl-fix/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/yyx/miniconda3/envs/sgl-fix/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yyx/my-Logic/test.py", line 7, in _run_second_process
    value_from_queue = queue.get()
  File "/home/yyx/miniconda3/envs/sgl-fix/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/home/yyx/miniconda3/envs/sgl-fix/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 181, in rebuild_cuda_tensor
    storage = storage_cls._new_shared_cuda(
  File "/home/yyx/miniconda3/envs/sgl-fix/lib/python3.10/site-packages/torch/storage.py", line 1434, in _new_shared_cuda
    return torch.UntypedStorage._new_shared_cuda(*args, **kwargs)
RuntimeError: CUDA error: peer access is not supported between these two devices
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[W319 19:34:23.470562418 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

So I'm pretty sure the bug volcengine/verl#490 (comment) is related to this issue.

To me, if torch isn't designed to provide a global device id (or perhaps exposing another interface to support this feature would be more reasonable), then verl can work around it on its side. A quick idea is to have sglang run with local environments. If this affects part of sglang's features, then we should re-design the ActorRollout worker to separately manage sharding and weight syncs, e.g. one global rollout worker and multiple actor workers, along with a global sharding manager. I guess the current framework already allows for such a design.
