(Will PR) Multiprocessing with CUDA_VISIBLE_DEVICES seems to give the wrong device #149196
Comments
That would be very BC-breaking... I think there are two things here:
In general I'm not sure we are expecting to handle such patterns, as we often assume env vars cannot be changed after initialization.
Changing the environment variable after CUDA initialization is a no-op for sure.
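A quick way to see this on a machine with multiple GPUs (an illustrative check only; the exact count depends on the hardware):

```python
import os
import torch

# Initialize CUDA first, then change the env var afterwards.
torch.cuda.init()
count_before = torch.cuda.device_count()

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # too late: CUDA is already initialized
count_after = torch.cuda.device_count()

# Expected: the same number twice, i.e. the late env change has no effect.
print(count_before, count_after)
```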
This is a prototype hack to demonstrate a rough fix, rather than a production-ready fix :) For backward compatibility, we can e.g. check whether it is an id or a UUID and do different things accordingly.
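As a rough illustration of that backward-compatible idea (a sketch only, not PyTorch's actual deserialization code; `resolve_local_device` is a hypothetical helper, and it relies on the `uuid` field that recent PyTorch builds expose via `torch.cuda.get_device_properties`):

```python
import torch

def resolve_local_device(device_id_or_uuid):
    """Map a received device identifier to a local torch.device.

    An integer is treated as a legacy logical index (today's behaviour);
    a UUID is looked up among the GPUs visible to this process.
    """
    if isinstance(device_id_or_uuid, int):
        return torch.device("cuda", device_id_or_uuid)
    for local_index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(local_index)
        if str(props.uuid) == str(device_id_or_uuid):
            return torch.device("cuda", local_index)
    raise RuntimeError(f"GPU with UUID {device_id_or_uuid} is not visible in this process")
```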
Firstly, the `.device` attribute is wrong, as described in the issue. Secondly, when peer GPUs do not have a connection, @Dutch-voyage reported an error when deserializing (volcengine/verl#490 (comment)), but since I do not have that hardware, I cannot say whether it can be reproduced by this exact script or is caused by something else. (@Dutch-voyage, could you please run this script and see whether it errors as well?)
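For reference, whether two visible GPUs support peer-to-peer access can be checked along these lines (assuming at least two GPUs are visible to the process); the deserialization error mentioned above can show up when this returns False:

```python
import torch

# Peer-to-peer access check between visible GPUs 0 and 1.
if torch.cuda.device_count() >= 2:
    print(torch.cuda.can_device_access_peer(0, 1))
```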
I think the physical device is the correct thing, because when sending the tensor, the pointer that we send points to the physical device rather than the logical id. When the user sees a big tensor on
Yes, exactly. For a more complicated demo, one may have the main process unaware of PyTorch and launch two subprocesses with different env vars.
@fzyzcjy xuehai pan told me that
I have not tried it, but I guess that may work; still, it would be great if this bug were fixed as well.
@fzyzcjy I think we've got multi-node working with the help of hancheng?
@zhaochenyang20 Not sure, do you mean volcengine/verl#652? Anyway, that is orthogonal to this PyTorch bug.
great!
@fzyzcjy @zhaochenyang20
2025-03-19 08:57:52,816 INFO worker.py:1832 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(main_task pid=28236) {'actor_rollout_ref': {'actor': {'clip_ratio': 0.2,
(main_task pid=28236) 'entropy_coeff': 0.001,
(main_task pid=28236) 'fsdp_config': {'fsdp_size': -1,
(main_task pid=28236) 'optimizer_offload': True,
(main_task pid=28236) 'param_offload': True,
(main_task pid=28236) 'wrap_policy': {'min_num_params': 0}},
(main_task pid=28236) 'grad_clip': 1.0,
(main_task pid=28236) 'kl_loss_coef': 0.001,
(main_task pid=28236) 'kl_loss_type': 'low_var_kl',
(main_task pid=28236) 'optim': {'lr': 1e-06,
(main_task pid=28236) 'lr_warmup_steps_ratio': 0.0,
(main_task pid=28236) 'min_lr_ratio': None,
(main_task pid=28236) 'total_training_steps': -1,
(main_task pid=28236) 'warmup_style': 'constant'},
(main_task pid=28236) 'ppo_epochs': 1,
(main_task pid=28236) 'ppo_max_token_len_per_gpu': 16384,
(main_task pid=28236) 'ppo_micro_batch_size': None,
(main_task pid=28236) 'ppo_micro_batch_size_per_gpu': 16,
(main_task pid=28236) 'ppo_mini_batch_size': 64,
(main_task pid=28236) 'shuffle': False,
(main_task pid=28236) 'strategy': 'fsdp',
(main_task pid=28236) 'ulysses_sequence_parallel_size': 1,
(main_task pid=28236) 'use_dynamic_bsz': False,
(main_task pid=28236) 'use_kl_loss': False},
(main_task pid=28236) 'hybrid_engine': True,
(main_task pid=28236) 'model': {'enable_gradient_checkpointing': True,
(main_task pid=28236) 'external_lib': None,
(main_task pid=28236) 'override_config': {},
(main_task pid=28236) 'path': 'Qwen/Qwen2.5-0.5B-Instruct',
(main_task pid=28236) 'use_remove_padding': True},
(main_task pid=28236) 'ref': {'fsdp_config': {'param_offload': True,
(main_task pid=28236) 'wrap_policy': {'min_num_params': 0}},
(main_task pid=28236) 'log_prob_max_token_len_per_gpu': 16384,
(main_task pid=28236) 'log_prob_micro_batch_size': 16,
(main_task pid=28236) 'log_prob_micro_batch_size_per_gpu': None,
(main_task pid=28236) 'log_prob_use_dynamic_bsz': False,
(main_task pid=28236) 'ulysses_sequence_parallel_size': 1},
(main_task pid=28236) 'rollout': {'disable_custom_all_reduce': True,
(main_task pid=28236) 'disable_log_stats': True,
(main_task pid=28236) 'do_sample': True,
(main_task pid=28236) 'dtype': 'bfloat16',
(main_task pid=28236) 'enable_chunked_prefill': True,
(main_task pid=28236) 'enforce_eager': True,
(main_task pid=28236) 'free_cache_engine': True,
(main_task pid=28236) 'gpu_memory_utilization': 0.4,
(main_task pid=28236) 'ignore_eos': False,
(main_task pid=28236) 'load_format': 'dummy_dtensor',
(main_task pid=28236) 'log_prob_max_token_len_per_gpu': 16384,
(main_task pid=28236) 'log_prob_micro_batch_size': None,
(main_task pid=28236) 'log_prob_micro_batch_size_per_gpu': 16,
(main_task pid=28236) 'log_prob_use_dynamic_bsz': False,
(main_task pid=28236) 'max_model_len': None,
(main_task pid=28236) 'max_num_batched_tokens': 8192,
(main_task pid=28236) 'max_num_seqs': 1024,
(main_task pid=28236) 'n': 1,
(main_task pid=28236) 'name': 'sglang',
(main_task pid=28236) 'prompt_length': 512,
(main_task pid=28236) 'response_length': 1,
(main_task pid=28236) 'sampling_params': {'max_new_tokens': 100},
(main_task pid=28236) 'temperature': 1.0,
(main_task pid=28236) 'tensor_model_parallel_size': 2,
(main_task pid=28236) 'top_k': -1,
(main_task pid=28236) 'top_p': 1,
(main_task pid=28236) 'use_fire_sampling': False,
(main_task pid=28236) 'val_kwargs': {'do_sample': False,
(main_task pid=28236) 'n': 1,
(main_task pid=28236) 'temperature': 0,
(main_task pid=28236) 'top_k': -1,
(main_task pid=28236) 'top_p': 1.0}}},
(main_task pid=28236) 'algorithm': {'adv_estimator': 'gae',
(main_task pid=28236) 'gamma': 1.0,
(main_task pid=28236) 'kl_ctrl': {'kl_coef': 0.001, 'type': 'fixed'},
(main_task pid=28236) 'kl_penalty': 'kl',
(main_task pid=28236) 'lam': 1.0},
(main_task pid=28236) 'critic': {'cliprange_value': 0.5,
(main_task pid=28236) 'forward_max_token_len_per_gpu': 32768,
(main_task pid=28236) 'forward_micro_batch_size': 16,
(main_task pid=28236) 'forward_micro_batch_size_per_gpu': None,
(main_task pid=28236) 'grad_clip': 1.0,
(main_task pid=28236) 'model': {'enable_gradient_checkpointing': True,
(main_task pid=28236) 'external_lib': None,
(main_task pid=28236) 'fsdp_config': {'fsdp_size': -1,
(main_task pid=28236) 'optimizer_offload': True,
(main_task pid=28236) 'param_offload': True,
(main_task pid=28236) 'wrap_policy': {'min_num_params': 0}},
(main_task pid=28236) 'override_config': {},
(main_task pid=28236) 'path': 'Qwen/Qwen2.5-0.5B-Instruct',
(main_task pid=28236) 'tokenizer_path': 'Qwen/Qwen2.5-0.5B-Instruct',
(main_task pid=28236) 'use_remove_padding': True},
(main_task pid=28236) 'optim': {'lr': 1e-05,
(main_task pid=28236) 'lr_warmup_steps_ratio': 0.0,
(main_task pid=28236) 'min_lr_ratio': None,
(main_task pid=28236) 'total_training_steps': -1,
(main_task pid=28236) 'warmup_style': 'constant'},
(main_task pid=28236) 'ppo_epochs': 1,
(main_task pid=28236) 'ppo_max_token_len_per_gpu': 32768,
(main_task pid=28236) 'ppo_micro_batch_size': 16,
(main_task pid=28236) 'ppo_micro_batch_size_per_gpu': None,
(main_task pid=28236) 'ppo_mini_batch_size': 64,
(main_task pid=28236) 'shuffle': False,
(main_task pid=28236) 'strategy': 'fsdp',
(main_task pid=28236) 'ulysses_sequence_parallel_size': 1,
(main_task pid=28236) 'use_dynamic_bsz': False},
(main_task pid=28236) 'custom_reward_function': {'name': 'compute_score', 'path': None},
(main_task pid=28236) 'data': {'filter_overlong_prompts': False,
(main_task pid=28236) 'image_key': 'images',
(main_task pid=28236) 'max_prompt_length': 512,
(main_task pid=28236) 'max_response_length': 1,
(main_task pid=28236) 'prompt_key': 'prompt',
(main_task pid=28236) 'return_raw_chat': False,
(main_task pid=28236) 'return_raw_input_ids': False,
(main_task pid=28236) 'shuffle': True,
(main_task pid=28236) 'tokenizer': None,
(main_task pid=28236) 'train_batch_size': 64,
(main_task pid=28236) 'train_files': '/root/data/gsm8k/train.parquet',
(main_task pid=28236) 'truncation': 'error',
(main_task pid=28236) 'val_batch_size': 1312,
(main_task pid=28236) 'val_files': '/root/data/gsm8k/test.parquet'},
(main_task pid=28236) 'reward_model': {'enable': False,
(main_task pid=28236) 'forward_max_token_len_per_gpu': 32768,
(main_task pid=28236) 'max_length': None,
(main_task pid=28236) 'micro_batch_size': None,
(main_task pid=28236) 'micro_batch_size_per_gpu': None,
(main_task pid=28236) 'model': {'external_lib': None,
(main_task pid=28236) 'fsdp_config': {'fsdp_size': -1,
(main_task pid=28236) 'param_offload': False,
(main_task pid=28236) 'wrap_policy': {'min_num_params': 0}},
(main_task pid=28236) 'input_tokenizer': 'Qwen/Qwen2.5-0.5B-Instruct',
(main_task pid=28236) 'path': '~/models/FsfairX-LLaMA3-RM-v0.1',
(main_task pid=28236) 'use_remove_padding': False},
(main_task pid=28236) 'reward_manager': 'naive',
(main_task pid=28236) 'strategy': 'fsdp',
(main_task pid=28236) 'ulysses_sequence_parallel_size': 1,
(main_task pid=28236) 'use_dynamic_bsz': False},
(main_task pid=28236) 'trainer': {'balance_batch': True,
(main_task pid=28236) 'critic_warmup': 0,
(main_task pid=28236) 'default_hdfs_dir': None,
(main_task pid=28236) 'default_local_dir': 'checkpoints/verl_examples/gsm8k',
(main_task pid=28236) 'del_local_ckpt_after_load': False,
(main_task pid=28236) 'experiment_name': 'gsm8k',
(main_task pid=28236) 'logger': ['console'],
(main_task pid=28236) 'n_gpus_per_node': 2,
(main_task pid=28236) 'nnodes': 1,
(main_task pid=28236) 'project_name': 'verl_examples',
(main_task pid=28236) 'remove_previous_ckpt_in_save': False,
(main_task pid=28236) 'resume_from_path': False,
(main_task pid=28236) 'resume_mode': 'auto',
(main_task pid=28236) 'save_freq': -1,
(main_task pid=28236) 'test_freq': 10,
(main_task pid=28236) 'total_epochs': 1,
(main_task pid=28236) 'total_training_steps': None,
(main_task pid=28236) 'val_before_train': True,
(main_task pid=28236) 'val_generations_to_log_to_wandb': 0}}
(main_task pid=28236) WARNING: val_batch_size is deprecated. Validation datasets are sent to inference engines as a whole batch, which will schedule the memory themselves.
(main_task pid=28236) [validate_config] All configuration checks passed successfully!
(main_task pid=28236) dataset len: 7473
(main_task pid=28236) dataset len: 1319
(main_task pid=28236) Size of train dataloader: 116
(main_task pid=28236) Total training steps: 116
(main_task pid=28236) DeprecationWarning: `ray.state.available_resources_per_node` is a private attribute and access will be removed in a future Ray version.
(WorkerDict pid=28537) Critic overriding config {'bos_token_id': None, 'eos_token_id': 151645, 'pad_token_id': 151643}
(WorkerDict pid=28537) Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2ForTokenClassification is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
(WorkerDict pid=28537) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
(WorkerDict pid=28537) Some weights of Qwen2ForTokenClassification were not initialized from the model checkpoint at Qwen/Qwen2.5-0.5B-Instruct and are newly initialized: ['score.bias', 'score.weight']
(WorkerDict pid=28537) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(WorkerDict pid=28537) Qwen2ForTokenClassification contains 494.03M parameters
(WorkerDict pid=28537) Before critic FSDP, memory allocated (GB): 0.0, memory reserved (GB): 0.0
(WorkerDict pid=28537) NCCL version 2.21.5+cuda12.4
(WorkerDict pid=28537) After critic FSDP, memory allocated (GB): 0.9205479621887207, memory reserved (GB): 2.103515625
(WorkerDict pid=28537) Total steps: 116, num_warmup_steps: 0
(WorkerDict pid=28537) Critic use_remove_padding=True
(WorkerDict pid=28537) Model config after override: Qwen2Config {
(WorkerDict pid=28537) "_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
(WorkerDict pid=28537) "architectures": [
(WorkerDict pid=28537) "Qwen2ForCausalLM"
(WorkerDict pid=28537) ],
(WorkerDict pid=28537) "attention_dropout": 0.0,
(WorkerDict pid=28537) "eos_token_id": 151645,
(WorkerDict pid=28537) "hidden_act": "silu",
(WorkerDict pid=28537) "hidden_size": 896,
(WorkerDict pid=28537) "initializer_range": 0.02,
(WorkerDict pid=28537) "intermediate_size": 4864,
(WorkerDict pid=28537) "max_position_embeddings": 32768,
(WorkerDict pid=28537) "max_window_layers": 21,
(WorkerDict pid=28537) "model_type": "qwen2",
(WorkerDict pid=28537) "num_attention_heads": 14,
(WorkerDict pid=28537) "num_hidden_layers": 24,
(WorkerDict pid=28537) "num_key_value_heads": 2,
(WorkerDict pid=28537) "pad_token_id": 151643,
(WorkerDict pid=28537) "rms_norm_eps": 1e-06,
(WorkerDict pid=28537) "rope_scaling": null,
(WorkerDict pid=28537) "rope_theta": 1000000.0,
(WorkerDict pid=28537) "sliding_window": null,
(WorkerDict pid=28537) "tie_word_embeddings": true,
(WorkerDict pid=28537) "torch_dtype": "bfloat16",
(WorkerDict pid=28537) "transformers_version": "4.48.3",
(WorkerDict pid=28537) "use_cache": true,
(WorkerDict pid=28537) "use_sliding_window": false,
(WorkerDict pid=28537) "vocab_size": 151936
(WorkerDict pid=28537) }
(WorkerDict pid=28537)
(WorkerDict pid=28537) Qwen2ForCausalLM contains 494.03M parameters
(WorkerDict pid=28537) wrap_policy: functools.partial(, policies=[functools.partial(, transformer_layer_cls={})])
(WorkerDict pid=28537) Actor use_remove_padding=True
(WorkerDict pid=28537) Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2ForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)` [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(WorkerDict pid=30331) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
(WorkerDict pid=30331) Some weights of Qwen2ForTokenClassification were not initialized from the model checkpoint at Qwen/Qwen2.5-0.5B-Instruct and are newly initialized: ['score.bias', 'score.weight']
(WorkerDict pid=30331) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(WorkerDict pid=28537) Model config after override: Qwen2Config {
(WorkerDict pid=28537) "_name_or_path": "Qwen/Qwen2.5-0.5B-Instruct",
(WorkerDict pid=28537) "architectures": [
(WorkerDict pid=28537) "Qwen2ForCausalLM"
(WorkerDict pid=28537) ],
(WorkerDict pid=28537) "attention_dropout": 0.0,
(WorkerDict pid=28537) "eos_token_id": 151645,
(WorkerDict pid=28537) "hidden_act": "silu",
(WorkerDict pid=28537) "hidden_size": 896,
(WorkerDict pid=28537) "initializer_range": 0.02,
(WorkerDict pid=28537) "intermediate_size": 4864,
(WorkerDict pid=28537) "max_position_embeddings": 32768,
(WorkerDict pid=28537) "max_window_layers": 21,
(WorkerDict pid=28537) "model_type": "qwen2",
(WorkerDict pid=28537) "num_attention_heads": 14,
(WorkerDict pid=28537) "num_hidden_layers": 24,
(WorkerDict pid=28537) "num_key_value_heads": 2,
(WorkerDict pid=28537) "pad_token_id": 151643,
(WorkerDict pid=28537) "rms_norm_eps": 1e-06,
(WorkerDict pid=28537) "rope_scaling": null,
(WorkerDict pid=28537) "rope_theta": 1000000.0,
(WorkerDict pid=28537) "sliding_window": null,
(WorkerDict pid=28537) "tie_word_embeddings": true,
(WorkerDict pid=28537) "torch_dtype": "bfloat16",
(WorkerDict pid=28537) "transformers_version": "4.48.3",
(WorkerDict pid=28537) "use_cache": true,
(WorkerDict pid=28537) "use_sliding_window": false,
(WorkerDict pid=28537) "vocab_size": 151936
(WorkerDict pid=28537) }
(WorkerDict pid=28537)
(WorkerDict pid=30331) Total steps: 116, num_warmup_steps: 0
(WorkerDict pid=30331) Critic use_remove_padding=True
(WorkerDict pid=28537) Qwen2ForCausalLM contains 494.03M parameters
(WorkerDict pid=28537) wrap_policy: functools.partial(, policies=[functools.partial(, transformer_layer_cls={})]) [repeated 2x across cluster]
(WorkerDict pid=30331) Actor use_remove_padding=True
(WorkerDict pid=28537) Total steps: 116, num_warmup_steps: 0
(WorkerDict pid=28537) Actor use_remove_padding=True
(WorkerDict pid=28537) Before building sglang rollout, memory allocated (GB): 0.9205489158630371, memory reserved (GB): 2.107421875
(WorkerDict pid=30331) wrap_policy: functools.partial(, policies=[functools.partial(, transformer_layer_cls={})])
(WorkerDict pid=30331) Total steps: 116, num_warmup_steps: 0
(WorkerDict pid=30331) Actor use_remove_padding=True
(WorkerDict pid=28537) NCCL version 2.21.5+cuda12.4
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00
(WorkerDict pid=28537) (name, _unwrap_tensor(tensor, tp_rank=self.tp_rank))
(WorkerDict pid=28537) File "/root/sglang/python/sglang/srt/model_executor/model_runner.py", line 1040, in _unwrap_tensor
(WorkerDict pid=28537) tensor = tensor.get(tp_rank)
(WorkerDict pid=28537) File "/root/sglang/python/sglang/srt/model_executor/model_runner.py", line 1052, in get
(WorkerDict pid=28537) return MultiprocessingSerializer.deserialize(self.values[rank])
(WorkerDict pid=28537) File "/root/sglang/python/sglang/srt/utils.py", line 1386, in deserialize
(WorkerDict pid=28537) return ForkingPickler.loads(data)
(WorkerDict pid=28537) File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/reductions.py", line 181, in rebuild_cuda_tensor
(WorkerDict pid=28537) storage = storage_cls._new_shared_cuda(
(WorkerDict pid=28537) File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 1434, in _new_shared_cuda
(WorkerDict pid=28537) return torch.UntypedStorage._new_shared_cuda(*args, **kwargs)
(WorkerDict pid=28537) RuntimeError: CUDA error: peer access is not supported between these two devices
(WorkerDict pid=28537) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(WorkerDict pid=28537)
(WorkerDict pid=28537) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
(WorkerDict pid=28537) C++ CapturedTraceback:
(WorkerDict pid=28537) #4 std::_Function_handler const> (), c10::SetStackTraceFetcher(std::function)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
(WorkerDict pid=28537) #5 c10::Error::Error(c10::SourceLocation, std::string) from ??:0
(WorkerDict pid=28537) #6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) from ??:0
(WorkerDict pid=28537) #7 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) from ??:0
(WorkerDict pid=28537) #8 c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::getIpcDevPtr(std::string) from :0
(WorkerDict pid=28537) #9 THPStorage_newSharedCuda(_object*, _object*) from StorageSharing.cpp:0
(WorkerDict pid=28537) #10 PyObject_CallFunctionObjArgs from ??:0
(WorkerDict pid=28537) #11 PyObject_Call from ??:0
(WorkerDict pid=28537) #12 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #13 PyMethod_New from ??:0
(WorkerDict pid=28537) #14 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #15 _PyFunction_Vectorcall from ??:0
(WorkerDict pid=28537) #16 _Py_VaBuildValue_SizeT from ??:0
(WorkerDict pid=28537) #17 PyUnicodeDecodeError_SetReason from ??:0
(WorkerDict pid=28537) #18 _PyDict_NewPresized from ??:0
(WorkerDict pid=28537) #19 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #20 _PyFunction_Vectorcall from ??:0
(WorkerDict pid=28537) #21 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #22 _PyFunction_Vectorcall from ??:0
(WorkerDict pid=28537) #23 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #24 _PyFunction_Vectorcall from ??:0
(WorkerDict pid=28537) #25 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #26 _PyFunction_Vectorcall from ??:0
(WorkerDict pid=28537) #27 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #28 PyMethod_New from ??:0
(WorkerDict pid=28537) #29 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #30 _PyFunction_Vectorcall from ??:0
(WorkerDict pid=28537) #31 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #32 _PyFunction_Vectorcall from ??:0
(WorkerDict pid=28537) #33 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #34 PyMethod_New from ??:0
(WorkerDict pid=28537) #35 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #36 _PyObject_FastCallDictTstate from ??:0
(WorkerDict pid=28537) #37 _PyObject_Call_Prepend from ??:0
(WorkerDict pid=28537) #38 PyInit__datetime from ??:0
(WorkerDict pid=28537) #39 _PyObject_MakeTpCall from ??:0
(WorkerDict pid=28537) #40 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #41 _PyFunction_Vectorcall from ??:0
(WorkerDict pid=28537) #42 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #43 _PyFunction_Vectorcall from ??:0
(WorkerDict pid=28537) #44 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #45 _PyFunction_Vectorcall from ??:0
(WorkerDict pid=28537) #46 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #47 _PyFunction_Vectorcall from ??:0
(WorkerDict pid=28537) #48 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #49 _PyFunction_Vectorcall from ??:0
(WorkerDict pid=28537) #50 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #51 _PyFunction_Vectorcall from ??:0
(WorkerDict pid=28537) #52 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #53 _PyFunction_Vectorcall from ??:0
(WorkerDict pid=28537) #54 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #55 _PyFunction_Vectorcall from ??:0
(WorkerDict pid=28537) #56 _PyEval_EvalFrameDefault from ??:0
(WorkerDict pid=28537) #57 PyEval_EvalCode from ??:0
(WorkerDict pid=28537) #58 PyEval_EvalCode from ??:0
(WorkerDict pid=28537) #59 PyUnicode_Tailmatch from ??:0
(WorkerDict pid=28537) #60 PyInit__collections from ??:0
(WorkerDict pid=28537) #61 PyRun_StringFlags from ??:0
(WorkerDict pid=28537) #62 PyRun_SimpleStringFlags from ??:0
(WorkerDict pid=28537) #63 Py_RunMain from ??:0
(WorkerDict pid=28537) #64 Py_BytesMain from ??:0
(WorkerDict pid=28537) #65 __libc_start_call_main from ./csu/../sysdeps/nptl/libc_start_call_main.h:58
(WorkerDict pid=28537) #66 __libc_start_main_impl from ./csu/../csu/libc-start.c:392
(WorkerDict pid=28537) #67 _start from ??:0
(WorkerDict pid=28537)
(WorkerDict pid=28537)
(WorkerDict pid=28537) [2025-03-19 08:59:19] Received sigquit from a child process. It usually means the child failed.
Traceback (most recent call last):
File "/root/verl/verl/trainer/main_ppo.py", line 54, in main
run_ppo(config)
File "/root/verl/verl/trainer/main_ppo.py", line 70, in run_ppo
ray.get(main_task.remote(config))
File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2771, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 919, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ray::main_task() (pid=28236, ip=172.17.0.2)
File "/root/verl/verl/trainer/main_ppo.py", line 167, in main_task
trainer.fit()
File "/root/verl/verl/trainer/ppo/ray_trainer.py", line 909, in fit
val_metrics = self._validate()
File "/root/verl/verl/trainer/ppo/ray_trainer.py", line 676, in _validate
test_output_gen_batch_padded = self.actor_rollout_wg.generate_sequences(test_gen_batch_padded)
File "/root/verl/verl/single_controller/ray/base.py", line 42, in func
output = ray.get(output)
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
The actor is dead because its worker process has died. Worker exit type: INTENDED_USER_EXIT Worker exit detail: Worker exits by an user request. Worker exits with an exit code 0.
[first] os.environ.get("CUDA_VISIBLE_DEVICES")=None
[second] os.environ.get("CUDA_VISIBLE_DEVICES")='1,2'
[first] queue.put value_to_queue=tensor([1., 2.], device='cuda:1') value_to_queue.device=device(type='cuda', index=1)
Process Process-1:
Traceback (most recent call last):
File "/home/yyx/miniconda3/envs/sgl-fix/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/yyx/miniconda3/envs/sgl-fix/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/yyx/my-Logic/test.py", line 7, in _run_second_process
value_from_queue = queue.get()
File "/home/yyx/miniconda3/envs/sgl-fix/lib/python3.10/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "/home/yyx/miniconda3/envs/sgl-fix/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 181, in rebuild_cuda_tensor
storage = storage_cls._new_shared_cuda(
File "/home/yyx/miniconda3/envs/sgl-fix/lib/python3.10/site-packages/torch/storage.py", line 1434, in _new_shared_cuda
return torch.UntypedStorage._new_shared_cuda(*args, **kwargs)
RuntimeError: CUDA error: peer access is not supported between these two devices
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
So I'm pretty sure the bug in volcengine/verl#490 (comment) is related to this issue. To me, if torch isn't designed to provide a global device id (or perhaps exposing another interface to support this feature would be more reasonable), then verl can work around it on its side. A quick idea is to have sglang run with local environments. If this affects part of sglang's functionality, then we should re-design the ActorRollout worker to manage its sharding and weight syncs separately, e.g. one global rollout worker and multiple actor workers, along with a global sharding manager. I guess the current framework already allows for such a design.
EDIT: the PR to fix this is here: #149248
🐛 Describe the bug
Hi, thanks for the helpful library! When two processes have different CUDA_VISIBLE_DEVICES and pass a tensor between them, it seems the `.device` attribute is incorrect.
Example code:
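A minimal sketch consistent with the output and traceback quoted earlier in the thread (the function name `_run_second_process` and the `value_to_queue`/`value_from_queue` names follow the traceback; how exactly the env var is set for the child is an assumption):

```python
import os

import torch
import torch.multiprocessing as mp


def _run_second_process(queue):
    # Started with CUDA_VISIBLE_DEVICES="1,2", so this process's cuda:0 is physical GPU 1.
    print(f'[second] {os.environ.get("CUDA_VISIBLE_DEVICES")=}')
    value_from_queue = queue.get()
    print(f"[second] queue.get {value_from_queue=} {value_from_queue.device=}")


if __name__ == "__main__":
    mp.set_start_method("spawn")
    print(f'[first] {os.environ.get("CUDA_VISIBLE_DEVICES")=}')

    queue = mp.Queue()

    # Assumption: the child inherits CUDA_VISIBLE_DEVICES="1,2" set just before spawn,
    # while the parent itself keeps all GPUs visible.
    os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"
    process = mp.Process(target=_run_second_process, args=(queue,))
    process.start()
    del os.environ["CUDA_VISIBLE_DEVICES"]

    # Physical GPU 1 in the parent; the child should see it as its cuda:0, not cuda:1.
    value_to_queue = torch.tensor([1.0, 2.0], device="cuda:1")
    print(f"[first] queue.put {value_to_queue=} {value_to_queue.device=}")
    queue.put(value_to_queue)
    process.join()
```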
Output:
It seems `cuda:0` in the second process should mean `cuda:1` in the first process, so the second process wrongly recognizes the tensor as `cuda:1`.
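Concretely, the logical index a receiving process should see for a given physical GPU follows from its own CUDA_VISIBLE_DEVICES (an illustrative helper only; it assumes the variable lists plain integer indices rather than UUIDs):

```python
import os

def physical_to_logical(physical_index: int) -> int:
    """Return the logical CUDA index this process should use for a physical GPU."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return physical_index  # all GPUs visible: identity mapping
    ordering = [int(x) for x in visible.split(",")]
    return ordering.index(physical_index)

# With CUDA_VISIBLE_DEVICES="1,2", physical GPU 1 maps to cuda:0:
# physical_to_logical(1) == 0
```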
This seems to be related to issues like github.com/volcengine/verl/pull/490#issuecomment-2720212225.
If I manage to find some spare time, I am happy to PR for this.
Versions
Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 24.04.1 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.39
Python version: 3.10.16 (main, Dec 4 2024, 08:53:38) [GCC 13.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-1017-aws-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.8.61
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3
Nvidia driver version: 550.127.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.7.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.7.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R13 Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
Stepping: 1
BogoMIPS: 5299.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 3 MiB (96 instances)
L1i cache: 3 MiB (96 instances)
L2 cache: 48 MiB (96 instances)
L3 cache: 384 MiB (12 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] flashinfer-python==0.2.3+cu124torch2.5
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] optree==0.14.1
[pip3] torch==2.5.1
[pip3] torch_memory_saver==0.0.2
[pip3] torchao==0.9.0
[pip3] torchaudio==2.5.1
[pip3] torchdata==0.11.0
[pip3] torchvision==0.20.1
[pip3] triton==3.1.0
[conda] Could not collect
cc @VitalyFedyunin @albanD @ptrblck @msaroufim @eqy