Description
Feature request
In the paper Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning, the authors from the Qwen team show that masking out (i.e., zeroing) the policy loss for the bottom 80% of tokens in a model's response, ranked by entropy, can yield better results than including the loss from all tokens.
"We ultimately improve RLVR by restricting policy gradient updates to forking tokens and uncover a finding even beyond the 80/20 rule: utilizing only 20% of the tokens while maintaining performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpassing full-gradient updates on the Qwen3-32B"
It'd be great to add a flag to the GRPO Trainer so that a user can exclude the bottom-k tokens (ranked by entropy) from the policy loss.
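To illustrate the idea, here is a minimal sketch of how such a mask could be computed from the policy logits and applied to the per-token loss. The helper name entropy_keep_mask, the keep_quantile parameter, and the per_token_loss / completion_mask variables are hypothetical placeholders, not existing TRL API; the actual flag name and integration point would be decided in the implementation.

```python
import torch


def entropy_keep_mask(logits, completion_mask, keep_quantile=0.2):
    """Keep only the top `keep_quantile` fraction of completion tokens by entropy.

    Hypothetical helper, not part of the TRL API.
    logits: (batch, seq_len, vocab_size) policy logits over completion tokens
    completion_mask: (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    # Per-token entropy of the policy distribution: (batch, seq_len)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)

    # Threshold at the (1 - keep_quantile) quantile over non-padded tokens only.
    valid_entropies = token_entropy[completion_mask.bool()].float()
    threshold = torch.quantile(valid_entropies, 1.0 - keep_quantile)

    # Zero out low-entropy tokens so they contribute nothing to the policy loss.
    return (token_entropy >= threshold).float() * completion_mask


# Usage inside the loss computation (per_token_loss is the usual token-level
# GRPO objective; only the masking step is new):
# mask = entropy_keep_mask(logits, completion_mask, keep_quantile=0.2)
# loss = (per_token_loss * mask).sum() / mask.sum().clamp(min=1.0)
```

Following the paper, keep_quantile=0.2 keeps only the top 20% highest-entropy ("forking") tokens; the threshold is computed per batch in this sketch, but it could just as well be computed per sequence.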
Motivation
It can help users train better reasoning models!
Your contribution
I'd like to implement this parameter/feature!