Feature to mark invalid outputs from the reward system · Issue #431 · NVIDIA-NeMo/RL · GitHub

Feature to mark invalid outputs from the reward system #431


Open

suhara opened this issue May 21, 2025 · 0 comments

Comments

@suhara
Collaborator
suhara commented May 21, 2025

Is your feature request related to a problem? Please describe.

In math_environment.py, a reward of 0 is assigned whenever an exception is raised. That may be fine for this math case, but in general, should we skip such examples and use None or np.nan instead (assuming len(results) == len(pred_responses) has to hold)?

https://github.com/NVIDIA/NeMo-RL/blob/cec9a60ff798554279ca494a0ef40ce5f283e0d8/nemo_rl/environments/math_environment.py#L87-L92

    def verify(
        self, pred_responses: List[str], ground_truths: List[str]
    ) -> List[float]:
        """Verify the correctness of the predicted responses against the ground truth.

        Args:
            pred_responses: List[str]. The predicted responses from the LLM.
            ground_truths: List[str]. The ground truth responses.

        Returns:
            List[float]. The rewards for each predicted response.
        """
        results = []
        for response, ground_truth in zip(pred_responses, ground_truths):
            try:
                ground_truth_parsable = "\\boxed{" + ground_truth + "}"
                with _mute_output():
                    try:
                        ret_score, _ = self.verify_func(
                            [ground_truth_parsable], [response]
                        )
                    except Exception:
                        ret_score = 0.0  # <= Should we skip this sample if verification fails?

                results.append(float(ret_score))
            except Exception:
                results.append(0.0)  # <= Same here
        return results
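
A rough sketch of the None/np.nan alternative raised above (not the actual NeMo-RL code; verify_with_nan_sentinel is a hypothetical standalone variant of the method, with _mute_output omitted for brevity). Verification failures are marked with float("nan") so they stay distinguishable from a genuine 0 reward while len(results) == len(pred_responses) still holds:

    from typing import Callable, List


    def verify_with_nan_sentinel(
        pred_responses: List[str],
        ground_truths: List[str],
        verify_func: Callable,
    ) -> List[float]:
        """Variant of verify() that returns float('nan') when verification raises,
        so invalid samples stay distinguishable from genuinely wrong answers and
        len(results) == len(pred_responses) still holds."""
        results = []
        for response, ground_truth in zip(pred_responses, ground_truths):
            try:
                ground_truth_parsable = "\\boxed{" + ground_truth + "}"
                ret_score, _ = verify_func([ground_truth_parsable], [response])
                results.append(float(ret_score))
            except Exception:
                # Sentinel meaning "could not verify", not "answer is wrong".
                results.append(float("nan"))
        return results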

Describe the solution you'd like

Per our internal discussion, adding a feature to mark invalid outputs so that those samples can be masked out during training might be a good solution (see the sketch below).
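
A hedged sketch of what that masking could look like downstream (build_valid_mask and the surrounding code are hypothetical, not part of NeMo-RL): rewards marked as NaN by the environment become a validity mask that zeroes out those samples' contribution to the loss.

    import torch


    def build_valid_mask(rewards: torch.Tensor):
        """Hypothetical helper: split NaN-marked rewards into (clean_rewards, valid_mask).

        Samples whose reward is NaN are treated as invalid and contribute nothing
        to the loss; valid rewards pass through unchanged."""
        valid_mask = ~torch.isnan(rewards)
        clean_rewards = torch.where(valid_mask, rewards, torch.zeros_like(rewards))
        return clean_rewards, valid_mask


    # Example: average a per-sample loss over valid samples only.
    rewards = torch.tensor([1.0, 0.0, float("nan"), 1.0])
    per_sample_loss = torch.tensor([0.2, 0.9, 0.5, 0.1])

    clean_rewards, valid_mask = build_valid_mask(rewards)
    masked_loss = (per_sample_loss * valid_mask).sum() / valid_mask.sum().clamp(min=1)
    # Only the three valid samples contribute to masked_loss.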

Describe alternatives you've considered

N/A

Additional context

N/A
