(Will PR) Multiprocessing with CUDA_VISIBLE_DEVICES seems to give the wrong device · Issue #149196 · pytorch/pytorch · GitHub

(Will PR) Multiprocessing with CUDA_VISIBLE_DEVICES seems to give the wrong device #149196

Open

fzyzcjy opened this issue Mar 14, 2025 · 10 comments · May be fixed by #149248
Labels
module: cuda - Related to torch.cuda, and CUDA support in general
module: multiprocessing - Related to torch.multiprocessing
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

fzyzcjy (Contributor) commented Mar 14, 2025

EDIT: PR to fix this

PR is here: #149248

🐛 Describe the bug

Hi, thanks for the helpful library! When two processes have different CUDA_VISIBLE_DEVICES and pass tensors between them, it seems the .device attribute is incorrect.

Example code:

import os


def _run_second_process(queue):
    # Child process: CUDA_VISIBLE_DEVICES='1,2' here, so its cuda:0 is physical GPU 1
    print(f'[second] {os.environ.get("CUDA_VISIBLE_DEVICES")=}')
    value_from_queue = queue.get()
    print(f'[second] queue.get {value_from_queue=} {value_from_queue.device=}')


def _run_main_process():
    import torch
    # Parent process: CUDA_VISIBLE_DEVICES is unset, so cuda:N is physical GPU N
    print(f'[first] {os.environ.get("CUDA_VISIBLE_DEVICES")=}')
    queue = torch.multiprocessing.Queue()

    # Restrict the child (and only the child) to GPUs 1 and 2
    os.environ['CUDA_VISIBLE_DEVICES'] = '1,2'
    p = torch.multiprocessing.Process(
        target=_run_second_process,
        kwargs=dict(queue=queue),
    )
    p.start()
    del os.environ['CUDA_VISIBLE_DEVICES']

    # Allocate on physical GPU 1, which the child sees as its cuda:0
    value_to_queue = torch.tensor([1.0, 2.0], device='cuda:1')
    print(f'[first] queue.put {value_to_queue=} {value_to_queue.device=}')
    queue.put(value_to_queue)

    p.join()

if __name__ == '__main__':
    _run_main_process()

Output:

[first] os.environ.get("CUDA_VISIBLE_DEVICES")=None
[second] os.environ.get("CUDA_VISIBLE_DEVICES")='1,2'
[first] queue.put value_to_queue=tensor([1., 2.], device='cuda:1') value_to_queue.device=device(type='cuda', index=1)
[second] queue.get value_from_queue=tensor([1., 2.], device='cuda:1') value_from_queue.device=device(type='cuda', index=1)

Since the second process has CUDA_VISIBLE_DEVICES='1,2', its cuda:0 corresponds to cuda:1 in the first process; the received tensor actually lives on that physical GPU, yet the second process wrongly reports it as cuda:1, which is a different physical GPU from its point of view.
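For reference, a minimal sketch of the index remapping involved (assuming CUDA_VISIBLE_DEVICES contains plain integer indices; it may also contain GPU UUIDs, which this does not handle):

import os

def local_to_physical(local_index: int) -> int:
    # Map a process-local CUDA index to the physical GPU index by parsing
    # CUDA_VISIBLE_DEVICES; with the variable unset, the indices coincide.
    visible = os.environ.get('CUDA_VISIBLE_DEVICES')
    if visible is None:
        return local_index
    return int(visible.split(',')[local_index])

# In the second process above, local_to_physical(0) == 1, i.e. its cuda:0
# is the same physical GPU as the first process's cuda:1.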

This seems to be related to issues like github.com/volcengine/verl/pull/490#issuecomment-2720212225.

If I manage to find some spare time, I am happy to PR for this.

Versions

Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.1 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.39

Python version: 3.10.16 (main, Dec 4 2024, 08:53:38) [GCC 13.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-1017-aws-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.8.61
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version: 550.127.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.7.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R13 Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
Stepping: 1
BogoMIPS: 5299.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 3 MiB (96 instances)
L1i cache: 3 MiB (96 instances)
L2 cache: 48 MiB (96 instances)
L3 cache: 384 MiB (12 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] flashinfer-python==0.2.3+cu124torch2.5
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] optree==0.14.1
[pip3] torch==2.5.1
[pip3] torch_memory_saver==0.0.2
[pip3] torchao==0.9.0
[pip3] torchaudio==2.5.1
[pip3] torchdata==0.11.0
[pip3] torchvision==0.20.1
[pip3] triton==3.1.0
[conda] Could not collect

cc @VitalyFedyunin @albanD @ptrblck @msaroufim @eqy

fzyzcjy changed the title from "Multiprocessing with CUDA_VISIBLE_DEVICES seems to give the wrong device" to "(Will PR) Multiprocessing with CUDA_VISIBLE_DEVICES seems to give the wrong device" Mar 15, 2025
albanD (Collaborator) commented Mar 17, 2025

That would be very BC-breaking...
And exactly the same behavior happens for serialization in general.

I think there are two things here:

  • Make sure that what we have does not segfault in a bad way. Which might be happening right now?
  • What is the behavior we want for this case (keep id or keep physical device)?
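For comparison with the serialization point above, a minimal sketch of the existing behavior and the usual escape hatch (the file path is just a placeholder):

import torch

t = torch.tensor([1.0, 2.0], device='cuda:1')
torch.save(t, '/tmp/t.pt')  # placeholder path

# By default the tensor is restored onto the device index recorded at save
# time ('cuda:1'), regardless of which physical GPU that index maps to in
# the loading process; map_location lets the caller remap explicitly.
t2 = torch.load('/tmp/t.pt', map_location='cuda:0')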

albanD added the module: multiprocessing and module: cuda labels Mar 17, 2025
eqy (Collaborator) commented Mar 17, 2025

In general I'm not sure we are expecting to handle

import torch
os.environ['CUDA_VISIBLE_DEVICES'] = ...

patterns as we often assume env vars cannot be changed after initialization.
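For context, the pattern that is generally expected to work is setting the variable before CUDA is initialized, e.g. (a minimal sketch):

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1,2'  # set before CUDA is initialized

import torch
print(torch.cuda.device_count())  # 2: only physical GPUs 1 and 2 are visible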

zou3519 added the triaged label and removed the triage review label Mar 17, 2025
albanD (Collaborator) commented Mar 17, 2025

Changing environ after cuda initialization is a noop for sure.
This one is a bit special because these are two separate processes that could be launched completely independently, so it is perfectly valid for them to have different env variables.

fzyzcjy (Contributor, Author) commented Mar 17, 2025

That would be very BC-breaking...

The PR is a prototype hack to demonstrate a rough fix rather than a production-ready one :) For backward compatibility, we can e.g. check whether it is an id or a uuid and act accordingly.

Make sure that what we have does not segfault in a bad way. Which might be happening right now?

Firstly, the .device attribute is incorrect, so code relying on it gets confused.

Secondly, when peer GPUs do not have a connection, @Dutch-voyage reported an error when deserializing (volcengine/verl#490 (comment)), but since I do not have that hardware, I cannot say whether it can be reproduced by this exact script or is caused by something else. (@Dutch-voyage, could you please run this script and see whether it errors as well?)

What is the behavior we want for this case (keep id or keep physical device)?

I think the physical device is the correct thing, because the pointer that we send points to the physical device rather than to the local id. When a user sees a big tensor on cuda:0, they expect that their physical cuda:0 device is occupied by it, that using the tensor on cuda:0 needs no cross-GPU communication while using it on cuda:1 does, and so on.
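As an illustration of "keep physical device", a receiving process could translate a physical index back to its own local index roughly like this (again assuming integer entries in CUDA_VISIBLE_DEVICES; a robust version would compare GPU UUIDs instead):

import os

def physical_to_local(physical_index: int):
    # Inverse of the earlier mapping: which local index, if any, does this
    # process use for a given physical GPU?
    visible = os.environ.get('CUDA_VISIBLE_DEVICES')
    if visible is None:
        return physical_index
    indices = [int(x) for x in visible.split(',')]
    return indices.index(physical_index) if physical_index in indices else None

# In the second process above, physical_to_local(1) == 0, so a tensor that
# physically lives on GPU 1 should be reported as cuda:0 there.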

This one is a bit special because these are two separate processes, that could be launched completely independently

Yes, exactly. For a more complicated demo, the main process may be unaware of PyTorch and launch two subprocesses with different env vars.

zhaochenyang20 commented

@fzyzcjy xuehai pan told me that torch.distributed.rpc can assign a device mapping. Maybe that would work?
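For reference, a rough sketch of how such a device map is declared with torch.distributed.rpc (the worker names, ranks, rendezvous settings, and the 0 -> 1 mapping below are just placeholders for illustration):

import os
import torch.distributed.rpc as rpc

# Assumed rendezvous settings for this sketch
os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29500')

options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=8)
# Declare that this worker's cuda:0 corresponds to cuda:1 on "worker1", so
# CUDA tensors sent over RPC land on the intended device.
options.set_device_map('worker1', {0: 1})

rpc.init_rpc('worker0', rank=0, world_size=2, rpc_backend_options=options)
# ... rpc.rpc_sync / rpc.remote calls that move CUDA tensors go here ...
rpc.shutdown()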

fzyzcjy (Contributor, Author) commented Mar 18, 2025

I have not tried it, but I guess that may work; still, it would be great if this bug were fixed as well.

zhaochenyang20 commented

@fzyzcjy I think we've already handled multi-node with the help of hancheng?

fzyzcjy (Contributor, Author) commented Mar 19, 2025

@zhaochenyang20 Not sure, do you mean volcengine/verl#652?

Anyway, that is orthogonal to this PyTorch bug.

zhaochenyang20 commented

@zhaochenyang20 Not sure, do you mean volcengine/verl#652?

Anyway, that is orthogonal to this PyTorch bug.

great!

Dutch-voyage commented Mar 19, 2025

@fzyzcjy @zhaochenyang20
Hi guys, thanks for working on this issue. To me, the trigger of the bug mentioned in volcengine/verl#490 (comment) is very straightforward. If the current torch.mp returns the device id from the local environment, then we would expect exactly this: a to-be-updated tensor is serialized in the Ray subprocess ActorRefRolloutWorker (specifically in ShardingManager) and deserialized in the Sglang subprocess TP_worker (ModelRunner), whereas the former has isolated env vars provided by Ray and the latter has global ones (which seems to be a temporary implementation, really).
As asked, I ran the script in the provided docker container, with the torch monkey patch disabled (in verl_engine.py and model_runner.py).
This is what I get:

2025-03-19 08:57:52,816 INFO worker.py:1832 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
[... full verl config dump, model loading, and FSDP startup logs omitted for brevity; the run fails while deserializing tensors in the SGLang worker with the following traceback ...]
(WorkerDict pid=28537)     (name, _unwrap_tensor(tensor, tp_rank=self.tp_rank))
(WorkerDict pid=28537)   File "/root/sglang/python/sglang/srt/model_executor/model_runner.py", line 1040, in _unwrap_tensor
(WorkerDict pid=28537)     tensor = tensor.get(tp_rank)
(WorkerDict pid=28537)   File "/root/sglang/python/sglang/srt/model_executor/model_runner.py", line 1052, in get
(WorkerDict pid=28537)     return MultiprocessingSerializer.deserialize(self.values[rank])
(WorkerDict pid=28537)   File "/root/sglang/python/sglang/srt/utils.py", line 1386, in deserialize
(WorkerDict pid=28537)     return ForkingPickler.loads(data)
(WorkerDict pid=28537)   File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/reductions.py", line 181, in rebuild_cuda_tensor
(WorkerDict pid=28537)     storage = storage_cls._new_shared_cuda(
(WorkerDict pid=28537)   File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 1434, in _new_shared_cuda
(WorkerDict pid=28537)     return torch.UntypedStorage._new_shared_cuda(*args, **kwargs)
(WorkerDict pid=28537) RuntimeError: CUDA error: peer access is not supported between these two devices
(WorkerDict pid=28537) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[... C++ captured traceback omitted ...]
(WorkerDict pid=28537) [2025-03-19 08:59:19] Received sigquit from a child process. It usually means the child failed.
[... Ray main_task traceback omitted; it ends with ...]
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task. The actor is dead because its worker process has died. Worker exit type: INTENDED_USER_EXIT Worker exit detail: Worker exits by an user request. Worker exits with an exit code 0.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(WorkerDict pid=30331) kwargs: {'n': 1, 'max_new_tokens': 1, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}
(WorkerDict pid=30331) after resume, sleep for 5 second to check nvidia-smi...

As for the script in this issue, I get:

[first] os.environ.get("CUDA_VISIBLE_DEVICES")=None
[second] os.environ.get("CUDA_VISIBLE_DEVICES")='1,2'
[first] queue.put value_to_queue=tensor([1., 2.], device='cuda:1') value_to_queue.device=device(type='cuda', index=1)
Process Process-1:
Traceback (most recent call last):
  File "/home/yyx/miniconda3/envs/sgl-fix/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/yyx/miniconda3/envs/sgl-fix/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/yyx/my-Logic/test.py", line 7, in _run_second_process
    value_from_queue = queue.get()
  File "/home/yyx/miniconda3/envs/sgl-fix/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/home/yyx/miniconda3/envs/sgl-fix/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 181, in rebuild_cuda_tensor
    storage = storage_cls._new_shared_cuda(
  File "/home/yyx/miniconda3/envs/sgl-fix/lib/python3.10/site-packages/torch/storage.py", line 1434, in _new_shared_cuda
    return torch.UntypedStorage._new_shared_cuda(*args, **kwargs)
RuntimeError: CUDA error: peer access is not supported between these two devices
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[W319 19:34:23.470562418 CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

So I'm pretty sure the bug volcengine/verl#490 (comment) is related to this issue.

To me, if torch isn't designed to provide a global device id (or perhaps exposing another interface to support this feature would be more reasonable), then verl can work around it on its side. A quick idea is to have sglang run with local environments. If this affects part of sglang's features, then we should re-design the ActorRollout worker to separately manage sharding and weight syncs, e.g. one global rollout worker and multiple actor workers, along with a global sharding manager. I guess the current framework already allows for such a design.
