The significance of temporal reward value shifting · Issue #51 · tulerfeng/Video-R1 · GitHub
The significance of temporal reward value shifting #51

Open

M3Dade opened this issue May 16, 2025 · 1 comment

Comments

M3Dade commented May 16, 2025

Hi, thank you for your great work on this project!

I have a question regarding the temporal reward calculation. Suppose there is a group of 8 completions. If the accuracy on the normal video sequence is higher than on the chaotic temporal sequence, a reward of 0.3 is added to every completion with accuracy = 1 in the normal sequence. However, when computing the advantage (i.e., after normalization), the relative values of the rewards seem unchanged.

For example:

import torch

# accuracy reward plus the 0.3 temporal bonus on the correct completions
a = torch.tensor([1.3, 0, 1.3, 0, 1.3, 0, 0, 0]).to(torch.float)
print((a - a.mean()) / a.std())
# tensor([ 1.2076, -0.7246,  1.2076, -0.7246,  1.2076, -0.7246, -0.7246, -0.7246])

# accuracy reward only
a = torch.tensor([1, 0, 1, 0, 1, 0, 0, 0]).to(torch.float)
print((a - a.mean()) / a.std())
# tensor([ 1.2076, -0.7246,  1.2076, -0.7246,  1.2076, -0.7246, -0.7246, -0.7246])

It seems that after normalization the relative advantage values remain unchanged even though the accuracy = 1 completions are shifted up by a constant amount: since the base rewards are binary (0 or 1), adding 0.3 to every correct completion is equivalent to rescaling the whole reward vector by 1.3, and the mean/std normalization removes any such rescaling.
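To double-check this (the group_advantage helper below is purely illustrative, not code from this repo), the two reward vectors give identical advantages:

import torch

def group_advantage(rewards):
    # Group-relative advantage: z-score normalization within one group.
    r = torch.as_tensor(rewards, dtype=torch.float)
    return (r - r.mean()) / r.std()

base    = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # accuracy reward only
shifted = [1.3, 0.0, 1.3, 0.0, 1.3, 0.0, 0.0, 0.0]  # accuracy + 0.3 temporal bonus on correct ones

# shifted == 1.3 * base, and the z-score of c * r equals the z-score of r for any c > 0.
print(torch.allclose(group_advantage(base), group_advantage(shifted)))
# True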

Could you please help me understand what the effect or significance of this reward adjustment is in practice? Am I missing something about how this influences the training or learning process?

Thank you very much for your time!

tulerfeng (Owner) commented May 16, 2025

Hi, thank you for pointing this out! We had not noticed this before.

It seems that more distinct reward values may be required within one group for this reward to take effect, for example:

import torch

a = torch.tensor([2.3, 1, 2.3, 1, 2.3, 1, 0, 0]).to(torch.float)  # 0 denotes format reward = 0
print((a - a.mean()) / a.std())
# tensor([ 1.0927, -0.2442,  1.0927, -0.2442,  1.0927, -0.2442, -1.2726, -1.2726])

a = torch.tensor([2, 1, 2, 1, 2, 1, 0, 0]).to(torch.float)  # 0 denotes format reward = 0
print((a - a.mean()) / a.std())
# tensor([ 1.0485, -0.1498,  1.0485, -0.1498,  1.0485, -0.1498, -1.3481, -1.3481])

a = torch.tensor([2.5, 1, 2.3, 1, 2.3, 1, 1, 1]).to(torch.float)  # 2.5 denotes adding length reward 0.2
print((a - a.mean()) / a.std())
# tensor([ 1.3908, -0.7218,  1.1091, -0.7218,  1.1091, -0.7218, -0.7218, -0.7218])

a = torch.tensor([2.2, 1, 2, 1, 2, 1, 1, 1]).to(torch.float)  # 2.2 denotes adding length reward 0.2
print((a - a.mean()) / a.std())
# tensor([ 1.4402, -0.7201,  1.0801, -0.7201,  1.0801, -0.7201, -0.7201, -0.7201])

Besides, for free-form questions the rewards inherently take more than two distinct values.

However, as you pointed out, the temporal reward appears to diminish or vanish in certain cases.

Therefore, directly adding the temporal reward to the advantage may be a more suitable choice.
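For illustration, here is a minimal sketch of that alternative on the same 8-completion example (the group_advantage helper and variable names are only illustrative, not the actual Video-R1 implementation): a bonus folded into the reward before normalization is absorbed, whereas a bonus added to the advantage after normalization is preserved.

import torch

def group_advantage(rewards):
    # Group-relative advantage: z-score normalization within one group.
    r = torch.as_tensor(rewards, dtype=torch.float)
    return (r - r.mean()) / r.std()

accuracy = torch.tensor([1., 0., 1., 0., 1., 0., 0., 0.])
temporal_bonus = 0.3 * accuracy  # +0.3 only for completions correct on the normal frame order

# Bonus added to the reward before normalization: absorbed, identical to the plain advantage.
adv_reward_level = group_advantage(accuracy + temporal_bonus)
print(torch.allclose(adv_reward_level, group_advantage(accuracy)))
# True

# Bonus added to the advantage after normalization: correct completions keep an extra +0.3.
adv_advantage_level = group_advantage(accuracy) + temporal_bonus
print(adv_advantage_level - group_advantage(accuracy))
# tensor([0.3000, 0.0000, 0.3000, 0.0000, 0.3000, 0.0000, 0.0000, 0.0000])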
