The significance of temporal reward value shifting · Issue #51 · tulerfeng/Video-R1 · GitHub
The significance of temporal reward value shifting #51
Closed
@M3Dade

Description

Hi, thank you for your great work on this project!

I have a question regarding the temporal reward calculation. Suppose there is a group of 8 completions. If the accuracy on the normal (correctly ordered) video sequence is higher than on the temporally shuffled sequence, a bonus of 0.3 is added to the reward of every completion with accuracy = 1 in the normal sequence. However, when the advantage is calculated (i.e., after normalization), the relative values between the rewards appear unchanged.
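
To make sure I am reading this correctly, here is a minimal sketch of how I understand the adjustment (the function and variable names below are my own placeholders, not the ones used in this repo):

import torch

def apply_temporal_bonus(acc_normal, acc_shuffled, bonus=0.3):
    # acc_normal / acc_shuffled: 0/1 accuracy rewards for the same group of
    # completions, evaluated on the normal and temporally shuffled frame orders.
    rewards = acc_normal.clone().float()
    # Grant the bonus only if the normal order beats the shuffled order on average.
    if acc_normal.float().mean() > acc_shuffled.float().mean():
        rewards[acc_normal == 1] += bonus  # 1.0 -> 1.3 for correct completions
    return rewards

acc_normal = torch.tensor([1, 0, 1, 0, 1, 0, 0, 0])
acc_shuffled = torch.tensor([0, 0, 1, 0, 0, 0, 0, 0])
print(apply_temporal_bonus(acc_normal, acc_shuffled))
# tensor([1.3000, 0.0000, 1.3000, 0.0000, 1.3000, 0.0000, 0.0000, 0.0000])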

For example, normalizing the group rewards with and without the +0.3 bonus:

import torch

# Rewards with the +0.3 temporal bonus applied to the accuracy = 1 completions
a = torch.tensor([1.3, 0, 1.3, 0, 1.3, 0, 0, 0]).to(torch.float)
print((a - a.mean()) / a.std())
# tensor([ 1.2076, -0.7246,  1.2076, -0.7246,  1.2076, -0.7246, -0.7246, -0.7246])

# Rewards without the bonus
a = torch.tensor([1, 0, 1, 0, 1, 0, 0, 0]).to(torch.float)
print((a - a.mean()) / a.std())
# tensor([ 1.2076, -0.7246,  1.2076, -0.7246,  1.2076, -0.7246, -0.7246, -0.7246])

It seems that after normalization the relative advantage values remain unchanged, even though the accuracy = 1 rewards are shifted up by a constant amount. This makes sense to me: since all other rewards in the group are 0, adding 0.3 to the accuracy = 1 rewards simply rescales the group, and the per-group standardization is invariant to such scaling and shifting.

Could you please help me understand what the effect or significance of this reward adjustment is in practice? Am I missing something about how this influences the training or learning process?

Thank you very much for your time!
