Description
Feature request
In the paper Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning, the authors from the Qwen team show that masking out (i.e., zeroing) the policy loss for the bottom 80% of tokens in a model's response, ranked by entropy, can yield better results than including the loss from all tokens.
"We ultimately improve RLVR by restricting policy gradient updates to forking tokens and uncover a finding even beyond the 80/20 rule: utilizing only 20% of the tokens while maintaining performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpassing full-gradient updates on the Qwen3-32B"
It'd be great to add a flag to the GRPO Trainer so that a user can exclude the bottom-k tokens (ranked by entropy) from the policy loss.
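To illustrate the idea, here is a minimal sketch of how such a mask could be computed from the policy logits and applied to the per-token loss. The helper name entropy_keep_mask, the keep_quantile parameter, and the per_token_loss / completion_mask variables are hypothetical placeholders, not existing TRL API; the actual flag name and integration point would be decided in the implementation.

```python
import torch


def entropy_keep_mask(logits, completion_mask, keep_quantile=0.2):
    """Keep only the top `keep_quantile` fraction of completion tokens by entropy.

    Hypothetical helper, not part of the TRL API.
    logits: (batch, seq_len, vocab_size) policy logits over completion tokens
    completion_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    # Per-token entropy of the policy distribution: (batch, seq_len)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)

    # Threshold at the (1 - keep_quantile) quantile over non-padded tokens only.
    valid_entropies = token_entropy[completion_mask.bool()].float()
    threshold = torch.quantile(valid_entropies, 1.0 - keep_quantile)

    # Zero out low-entropy tokens so they contribute nothing to the policy loss.
    return (token_entropy >= threshold).float() * completion_mask


# Usage inside the loss computation (per_token_loss is the usual token-level
# GRPO objective; only the masking step is new):
# mask = entropy_keep_mask(logits, completion_mask, keep_quantile=0.2)
# loss = (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)
```

Following the paper, keep_quantile=0.2 keeps only the top 20% highest-entropy ("forking") tokens; the threshold is computed per batch in this sketch, but it could just as well be computed per sequence.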
Motivation
It can help users train better reasoning models!
Your contribution
I'd like to implement this parameter/feature!