Description
Hi, thank you for your great work on this project!
I have a question regarding the temporal reward calculation. Suppose there is a group of 8 completions. If the accuracy on the normal video sequence is higher than on the chaotic (shuffled) temporal sequence, a reward of 0.3 is added to every completion with accuracy = 1 in the normal sequence. However, when the advantage is calculated (i.e., after normalization), the relative values of the rewards appear unchanged.
For example:
import torch

a = torch.tensor([1.3, 0, 1.3, 0, 1.3, 0, 0, 0], dtype=torch.float)
print((a - a.mean()) / a.std())
# tensor([ 1.2076, -0.7246,  1.2076, -0.7246,  1.2076, -0.7246, -0.7246, -0.7246])

a = torch.tensor([1, 0, 1, 0, 1, 0, 0, 0], dtype=torch.float)
print((a - a.mean()) / a.std())
# tensor([ 1.2076, -0.7246,  1.2076, -0.7246,  1.2076, -0.7246, -0.7246, -0.7246])
It seems that after normalization, the relative advantage values remain unchanged even though the accuracy = 1 rewards are shifted up by a constant amount.
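To make the question concrete, here is a minimal sketch, assuming the advantage is just the per-group z-score shown above (the zscore helper below is hypothetical, not from the repository). With binary base rewards, adding the 0.3 bonus only to the accuracy = 1 completions is equivalent to rescaling the whole group, and the z-score is invariant to such a positive rescaling or shift:

import torch

def zscore(r: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: per-group reward normalization, assumed to be the advantage.
    return (r - r.mean()) / r.std()

base = torch.tensor([1, 0, 1, 0, 1, 0, 0, 0], dtype=torch.float)
# Add the 0.3 temporal bonus only to the accuracy = 1 completions:
bonus = base + 0.3 * (base == 1).float()  # -> [1.3, 0, 1.3, 0, 1.3, 0, 0, 0]
print(torch.allclose(zscore(base), zscore(bonus)))  # True: normalized advantages identical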
Could you please help me understand what the effect or significance of this reward adjustment is in practice? Am I missing something about how this influences the training or learning process?
Thank you very much for your time!