Add entropy based filtering inside the GRPOTrainer. #3563
Conversation
You should be able to calculate the entropy directly from the log probs as H = -∑ p · log p, which means we don't have to modify the existing per-token log-prob computation.
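For reference, a minimal sketch of that calculation in PyTorch (the function name is illustrative, not the trainer's actual helper):

```python
import torch
import torch.nn.functional as F

def entropy_from_logits(logits: torch.Tensor) -> torch.Tensor:
    """Per-position entropy H = -sum_j p_j log p_j, computed from the logits.

    logits: (batch, seq_len, vocab_size) -> entropies: (batch, seq_len)
    """
    logps = F.log_softmax(logits, dim=-1)      # log p_j for every vocab entry
    return -(logps.exp() * logps).sum(dim=-1)  # -sum_j p_j log p_j
```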
Also, just realized that the temperature parameter won't affect the ranking order of entropy, since all positions are affected by the same temperature, so the temp param shouldn't matter here.
Actually this isn't true: temperature rescales the logits before the softmax, which changes each position's distribution non-uniformly, so it can change the entropy ranking across positions.
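A quick numerical illustration of this point (toy logits, not from a real model): dividing the logits by a temperature is not an order-preserving transform of per-position entropy, so the percentile ranking can change with the temperature.

```python
import torch
import torch.nn.functional as F

def entropy(logits, temperature):
    logps = F.log_softmax(logits / temperature, dim=-1)
    return -(logps.exp() * logps).sum(dim=-1)

# Two positions over the same 10-token vocabulary: one with two near-tied top
# tokens, one with a single dominant token over a flatter tail.
pos_a = torch.tensor([3.0, 3.0, -10, -10, -10, -10, -10, -10, -10, -10])
pos_b = torch.tensor([5.0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
logits = torch.stack([pos_a, pos_b])

print(entropy(logits, temperature=1.0))  # pos_a has the higher entropy (~0.69 vs ~0.35)
print(entropy(logits, temperature=5.0))  # pos_b has the higher entropy (~2.23 vs ~1.55)
```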
Yes of course, my bad. Not really a fan of the proposed refactor, though.
Nice! Thanks! Another recommendation: one argument is probably enough. Instead of `if self.filter_on_entropy:`, use `if self.token_entropy_percentile_threshold < 1.0:`, and add that the recommended value is 0.2 in the documentation.
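If that single argument is adopted, end-user configuration would presumably look something like the sketch below (hypothetical usage; the argument is the one proposed in this PR and its name and default could still change before merge):

```python
from trl import GRPOConfig

# Per the suggestion above, the filter would be active only when the value is
# below 1.0, and 0.2 is the recommended setting from the paper.
config = GRPOConfig(
    output_dir="grpo-entropy-filtering",
    token_entropy_percentile_threshold=0.2,
)
```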
Cool, let me look into making the entropy calculation less memory intensive.
Updated the code to make sure that only a mini-batch of logits is materialized at any given point in time, and entropies for those mini-batches of logits are optionally calculated.
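As a rough illustration of that chunking idea (not the PR's actual code; the helper name and `chunk_size` are made up, and it assumes a Hugging Face-style causal LM), the inputs can be processed in slices along the batch dimension so the full `(batch, seq, vocab)` logits tensor is never materialized at once, and entropies are only computed when the filter is enabled:

```python
import torch
import torch.nn.functional as F

def per_token_logps_and_entropies(model, input_ids, attention_mask,
                                  chunk_size=4, compute_entropy=False):
    """Compute per-token log-probs (and optionally entropies) one mini-batch at a time."""
    all_logps, all_entropies = [], []
    for start in range(0, input_ids.size(0), chunk_size):
        ids = input_ids[start:start + chunk_size]
        mask = attention_mask[start:start + chunk_size]
        # Logits at position t predict token t + 1, so drop the last position.
        logits = model(input_ids=ids, attention_mask=mask).logits[:, :-1]
        logps = F.log_softmax(logits, dim=-1)
        # Log-prob of each token that was actually generated.
        token_logps = logps.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        all_logps.append(token_logps)
        if compute_entropy:
            all_entropies.append(-(logps.exp() * logps).sum(-1))
        del logits, logps  # free the vocab-sized tensors before the next chunk
    entropies = torch.cat(all_entropies) if compute_entropy else None
    return torch.cat(all_logps), entropies
```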
Nice work. Left a few minor comments.
Thanks for the review, Leon! Made the suggested changes.
@qgallouedec please take another look at the PR when you have the time. |
What does this PR do?
This PR is related to #3555, which proposes masking out the policy loss coming from completion tokens at positions whose entropy score falls below the bottom-k percentile.
This idea was proposed by the Qwen team in their accompanying paper, Beyond the 80/20 Rule.
Key Proposals of the paper that guided the implementation
The key difference from the standard objective is the indicator term $\mathbb{1}[H_t \ge \tau_B]$ that multiplies the per-token loss, so only tokens whose entropy is at or above the batch-level percentile threshold $\tau_B$ contribute to the gradient.
From the paper, the entropy at each position is calculated as usual via $H_t = -\sum_{j} p_{t,j} \log p_{t,j}$, where $p_t = \mathrm{softmax}(z_t / T)$ is the model's next-token distribution at position $t$ given logits $z_t$ and sampling temperature $T$.
The paper applies the entropy mask to the DAPO loss function in its experiments, but I think we can leave it to the user to decide which loss (i.e. GRPO, Dr. GRPO, or DAPO) to apply it to.
The paper finds that the best-performing threshold keeps the top 20% of tokens by entropy.
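To make the mechanism concrete, here is a minimal sketch of how such an entropy mask could be combined with a per-token policy loss (variable names are illustrative, and this is not the PR's exact integration with the GRPO/Dr. GRPO/DAPO losses):

```python
import torch

def entropy_mask(entropies: torch.Tensor, completion_mask: torch.Tensor,
                 keep_fraction: float = 0.2) -> torch.Tensor:
    """Keep only the top `keep_fraction` highest-entropy completion tokens."""
    # The percentile is computed over real (non-padding) completion tokens only.
    valid = entropies[completion_mask.bool()]
    threshold = torch.quantile(valid.float(), 1.0 - keep_fraction)
    return (entropies >= threshold) & completion_mask.bool()

# per_token_loss: (batch, seq) policy-gradient loss before reduction
# mask = entropy_mask(entropies, completion_mask)
# loss = (per_token_loss * mask).sum() / mask.sum().clamp(min=1)
```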
I didn't run the vLLM tests inside `test_grpo_trainer.py` since my machine/VM didn't have access to a GPU.

Fixes #3555
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.