Add entropy-based token filtering to the GRPO Trainer/Loss Function. #3555
Open
@pramodith

Feature request

In the paper Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning, the authors from the Qwen team show that filtering out (i.e., zeroing out) the policy loss for the bottom 80% of tokens in a model's response, ranked by entropy, can yield better results than including the loss from all tokens.

We ultimately improve RLVR by restricting policy gradient updates to forking tokens and uncover a finding even beyond the 80/20 rule: utilizing only 20% of the tokens while maintaining performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpassing full-gradient updates on the Qwen3-32B.

It'd be great to add a flag to the GRPO Trainer so that a user can exclude the bottom-k tokens, ranked by entropy, from the policy loss.
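
For illustration, here is a minimal sketch of what such filtering could look like when applied to the per-token GRPO loss. The helper name `entropy_mask`, the `keep_ratio` argument, and the batch-level quantile threshold are all assumptions for this sketch, not existing TRL code (the paper thresholds entropies over the batch; a real implementation would need to settle on the exact granularity):

```python
import torch

# Hypothetical sketch: `entropy_mask` and `keep_ratio` are illustrative names,
# not part of the TRL API. Logits are assumed to cover completion tokens only.
def entropy_mask(logits: torch.Tensor, completion_mask: torch.Tensor,
                 keep_ratio: float = 0.2) -> torch.Tensor:
    """Return a 0/1 mask keeping only the top `keep_ratio` highest-entropy tokens."""
    # Per-token entropy of the policy distribution: H = -sum_v p(v) log p(v).
    logps = torch.log_softmax(logits, dim=-1)     # (batch, seq_len, vocab)
    entropy = -(logps.exp() * logps).sum(dim=-1)  # (batch, seq_len)

    # Threshold at the (1 - keep_ratio) quantile over real (non-padding) tokens.
    valid = entropy[completion_mask.bool()]
    threshold = torch.quantile(valid, 1.0 - keep_ratio)

    # Tokens at or above the threshold keep their loss; the rest are zeroed out.
    return (entropy >= threshold).float() * completion_mask
```

The trainer would then multiply the per-token policy loss by this mask, e.g. `per_token_loss = per_token_loss * entropy_mask(logits, completion_mask)`, before reducing it to a scalar.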

Motivation

Can help users create better reasoning models!

Your contribution

I'd like to implement this parameter/feature!
