The question about gradient miss. · Issue #46 · tulerfeng/Video-R1 · GitHub

The question about gradient miss. #46


Closed
Haoyu-Xie opened this issue May 8, 2025 · 1 comment

Comments

Haoyu-Xie commented May 8, 2025

Thanks for your amazing work!

When I run GRPO, I ran into a problem with the gradient:
Traceback (most recent call last):
File "/mnt/workspace/haoyu/music_codebase/Video-R1/src/r1-v/src/open_r1/grpo_hy.py", line 372, in
main(script_args, training_args, model_args)
File "/mnt/workspace/haoyu/music_codebase/Video-R1/src/r1-v/src/open_r1/grpo_hy.py", line 358, in main
trainer.train()
File "/root/anaconda3/envs/video-r1/lib/python3.11/site-packages/transformers/trainer.py", line 2241, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/video-r1/lib/python3.11/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/anaconda3/envs/video-r1/lib/python3.11/site-packages/transformers/trainer.py", line 3740, in training_step
self.accelerator.backward(loss, **kwargs)
File "/root/anaconda3/envs/video-r1/lib/python3.11/site-packages/accelerate/accelerator.py", line 2329, in backward
loss.backward(**kwargs)
File "/root/anaconda3/envs/video-r1/lib/python3.11/site-packages/torch/_tensor.py", line 581, in backward
torch.autograd.backward(
File "/root/anaconda3/envs/video-r1/lib/python3.11/site-packages/torch/autograd/init.py", line 347, in backward
_engine_run_backward(
File "/root/anaconda3/envs/video-r1/lib/python3.11/site-packages/torch/autograd/graph.py", line 825, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

After I revised the code in compute_loss() before the return:

if not loss.requires_grad:
    loss.requires_grad = True

return loss

The error is temporarily resolved, but the loss is always 0 (the reward is not zero/None). I checked the components of the loss:

loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()

requires_grad is False for all components of the loss, while requires_grad is True for the model parameters.

Why does this happen?
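
For reference, forcing requires_grad = True on a loss that is already detached from the computation graph only suppresses the error: the loss becomes a fresh leaf tensor with no grad_fn, so backward() cannot reach the model parameters. A minimal, self-contained sketch (with a toy parameter w, not the actual trainer code):

import torch

# Toy parameter standing in for the model weights.
w = torch.nn.Parameter(torch.tensor(1.0))

# Simulate a loss whose graph was cut somewhere upstream.
loss = (2.0 * w).detach()

# The workaround from above: it silences the RuntimeError only.
loss.requires_grad = True
loss.backward()

print(w.grad)  # None: no gradient ever reaches the parameter

So if per_token_loss itself already shows requires_grad == False inside compute_loss, the graph is being cut upstream of the loss, for example by a forward pass that runs without gradients enabled.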

Owner
tulerfeng commented May 8, 2025

Hi, this is strange.

Have you changed anything in the training code besides these two lines? In my environment, these two lines should not need to be added:

if not loss.requires_grad:
    loss.requires_grad = True

In my environment, directly printing loss.requires_grad shows True.
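
One generic way such a graph cut can happen (not necessarily the cause here) is when the policy model's forward pass runs under torch.no_grad() or torch.inference_mode(): every output then has no grad_fn, and any loss built from it raises exactly this RuntimeError. A minimal reproduction sketch:

import torch

layer = torch.nn.Linear(4, 1)
x = torch.randn(2, 4)

with torch.no_grad():          # same effect with torch.inference_mode()
    out = layer(x)

loss = out.mean()
print(loss.requires_grad)      # False: no grad_fn attached

loss.backward()                # RuntimeError: element 0 of tensors does not
                               # require grad and does not have a grad_fn

PEFT/LoRA setups combined with gradient checkpointing are another common source of this symptom (the checkpointed inputs need requires_grad, e.g. via model.enable_input_require_grads()), though whether that applies depends on how grpo_hy.py differs from the original training script.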
