Feature to mark invalid outputs from the reward system · Issue #431 · NVIDIA-NeMo/RL · GitHub

Feature to mark invalid outputs from the reward system #431


Open

suhara opened this issue May 21, 2025 · 0 comments

Comments

@suhara
Collaborator
suhara commented May 21, 2025

Is your feature request related to a problem? Please describe.

In math_environment.py, a reward of 0 is assigned whenever an exception is raised. That may be fine for this math case, but in general, should we skip such examples and use None or np.nan instead (assuming len(results) == len(pred_responses) has to hold)?

https://github.com/NVIDIA/NeMo-RL/blob/cec9a60ff798554279ca494a0ef40ce5f283e0d8/nemo_rl/environments/math_environment.py#L87-L92

    def verify(
        self, pred_responses: List[str], ground_truths: List[str]
    ) -> List[float]:
        """Verify the correctness of the predicted responses against the ground truth.

        Args:
            pred_responses: List[str]. The predicted responses from the LLM.
            ground_truths: List[str]. The ground truth responses.

        Returns:
            List[float]. The rewards for each predicted response.
        """
        results = []
        for response, ground_truth in zip(pred_responses, ground_truths):
            try:
                ground_truth_parsable = "\\boxed{" + ground_truth + "}"
                with _mute_output():
                    try:
                        ret_score, _ = self.verify_func(
                            [ground_truth_parsable], [response]
                        )
                    except Exception:
                        ret_score = 0.0  # <= Should we skip this sample if verification fails?

                results.append(float(ret_score))
            except Exception:
                results.append(0.0)  # <= Same here
        return results
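
A rough sketch of the None/np.nan alternative raised above (not the actual NeMo-RL code; verify_with_nan_sentinel is a hypothetical standalone variant of the method, with _mute_output omitted for brevity). Verification failures are marked with float("nan") so they stay distinguishable from a genuine 0 reward while len(results) == len(pred_responses) still holds:

    from typing import Callable, List


    def verify_with_nan_sentinel(
        pred_responses: List[str],
        ground_truths: List[str],
        verify_func: Callable,
    ) -> List[float]:
        """Variant of verify() that returns float('nan') when verification raises,
        so invalid samples stay distinguishable from genuinely wrong answers and
        len(results) == len(pred_responses) still holds."""
        results = []
        for response, ground_truth in zip(pred_responses, ground_truths):
            try:
                ground_truth_parsable = "\\boxed{" + ground_truth + "}"
                ret_score, _ = verify_func([ground_truth_parsable], [response])
                results.append(float(ret_score))
            except Exception:
                # Sentinel meaning "could not verify", not "answer is wrong".
                results.append(float("nan"))
        return results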

Describe the solution you'd like

Per our internal discussion, adding a feature to mark invalid outputs so that those samples can be masked out during training might be a good solution (see the sketch below).
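
A hedged sketch of what that masking could look like downstream (build_valid_mask and the surrounding code are hypothetical, not part of NeMo-RL): rewards marked as NaN by the environment become a validity mask that zeroes out those samples' contribution to the loss.

    import torch


    def build_valid_mask(rewards: torch.Tensor):
        """Hypothetical helper: split NaN-marked rewards into (clean_rewards, valid_mask).

        Samples whose reward is NaN are treated as invalid and contribute nothing
        to the loss; valid rewards pass through unchanged."""
        valid_mask = ~torch.isnan(rewards)
        clean_rewards = torch.where(valid_mask, rewards, torch.zeros_like(rewards))
        return clean_rewards, valid_mask


    # Example: average a per-sample loss over valid samples only.
    rewards = torch.tensor([1.0, 0.0, float("nan"), 1.0])
    per_sample_loss = torch.tensor([0.2, 0.9, 0.5, 0.1])

    clean_rewards, valid_mask = build_valid_mask(rewards)
    masked_loss = (per_sample_loss * valid_mask).sum() / valid_mask.sum().clamp(min=1)
    # Only the three valid samples contribute to masked_loss.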

Describe alternatives you've considered

N/A

Additional context

N/A
