v2 code updates by saumyamalik · Pull Request #236 · allenai/reward-bench · GitHub

v2 code updates #236

Merged
merged 37 commits into main on Jun 2, 2025

Conversation

saumyamalik (Contributor) commented May 31, 2025

V2 Code Updates

Big updates are summarized as follows:

  1. Ties scoring in the v2 script:
  • scripts/run_v2.py now supports processing the Ties subset, which uses a different scoring function applied over the whole Ties subset.
  • Correspondingly, rewardbench/utils.py gains a new function process_single_model and a helper sample_stats for Ties scoring.
  • Fixed general scoring in reroll_and_score_dataset in utils.py so it now reasonably penalizes 2/3/4-way ties for the top score; also generalized the function from best-of-4 to a flexible number of completions to accommodate Ties.
  • run_v2.py also picks up a number of miscellaneous small updates copied over from a diff with run_rm.py (device map, bfloat16 handling, quantization, etc.); they were missing because run_v2.py was initially based on the older run_bon.py.
  2. Generative models: new v2 scripts for generative models. I'm leaning toward keeping rewardbench/generative_v2.py separate, since each prompt and function is different; we could consolidate, but I'm not sure how much benefit that gives. Happy to iterate. The generative v2 scripts now:
  • load the dataset with the new function utils.load_eval_dataset_multi, similar to utils.load_eval_dataset from v1 but more flexible for multiple chosen/rejected completions
  • allow scoring with absolute ratings via the --score_w_ratings argument
  • allow scoring with 4-way rankings (the default)
  • Note: the Ties subset must be scored with ratings even if the rest is scored with rankings; the logic of run_generative_v2.py enforces this.
  • In general, the ties logic is a bit convoluted, for two reasons: (1) as mentioned above, the Ties subset has to be handled differently with regard to the evaluation method, and (2) the Ties subset is scored differently. I have tested this fairly thoroughly to be sure, but I'm very open to refactoring suggestions.
  • Mostly just modified the code for API models, not vLLM yet.
  3. Bumped transformers to 4.51.0. (This causes issues for Gemma 2 27B models, but so does 4.48.1.)
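The tie-penalized best-of-N accuracy described in point 1 can be sketched as follows. This is an illustrative stand-in, not the actual reroll_and_score_dataset implementation; the function name score_best_of_n and the fractional-credit rule (credit 1/k when the chosen completion ties with k-1 rejected completions for the top score) are assumptions about the general approach:

```python
# Hypothetical sketch of tie-aware best-of-N scoring. If the chosen
# completion's score strictly beats all rejected scores, it gets full
# credit; if it ties k-1 rejected completions for the top score, it
# gets 1/k credit; otherwise zero. Works for any number of completions,
# not just best-of-4.
def score_best_of_n(chosen_score, rejected_scores):
    top = max([chosen_score] + rejected_scores)
    if chosen_score < top:
        return 0.0  # some rejected completion strictly wins
    # count completions sharing the top score, including the chosen one
    n_tied = 1 + sum(1 for s in rejected_scores if s == top)
    return 1.0 / n_tied
```

Under this scheme a 2-way tie yields 0.5 rather than a full point, so a reward model that assigns identical scores everywhere no longer gets credited with a win on every sample.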

Testing
I have tested this branch's code with the image saumyam/rewardbench-2-pr-0530-tie-penalty on:

Small things still to fix

  • make quality complains that the prompt templates in generative_v2.py are too long, but they are just as long in generative.py, which doesn't hit this issue. Not sure why.
  • I have the worldpm code in here but haven't debugged it yet. We shouldn't merge that yet, but I'll keep working on it on this branch.
  • Documentation/readme

natolambert (Collaborator) left a comment


Did a quick skim; some minor things to fix. Approving so you're not blocked on merging.


###Feedback: """

AUTOJ_COARSE_SCORE_RUBRIC = """
Collaborator:

If not run for v2, I'd remove all the code we aren't using right now so we start with a cleaner file.

self.tokenizer = tokenizer

def __call__(self, samples, **kwargs):
_ = kwargs.get("batch_size", 1)
Collaborator:

If this is only tested with batch size 1, we need to raise a NotImplementedError if the user passes a batch size greater than 1.
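The guard the reviewer is asking for could look like the sketch below. The class name RewardPipeline and the placeholder body are illustrative assumptions, not the PR's actual pipeline code; only the batch-size check is the point:

```python
class RewardPipeline:
    """Illustrative stand-in for the PR's pipeline class (hypothetical name)."""

    def __init__(self, tokenizer=None):
        self.tokenizer = tokenizer

    def __call__(self, samples, **kwargs):
        batch_size = kwargs.get("batch_size", 1)
        if batch_size > 1:
            # Fail loudly instead of silently ignoring an untested setting.
            raise NotImplementedError(
                "Only batch_size=1 has been tested for this pipeline."
            )
        return [0.0 for _ in samples]  # placeholder scores
```

Failing fast like this is cheap to add and easy to delete once larger batch sizes are verified.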

Contributor Author:

Now tested with batch size 4, works fine!

@natolambert natolambert mentioned this pull request Jun 2, 2025
This was linked to issues Jun 2, 2025
@natolambert natolambert mentioned this pull request Jun 2, 2025
@saumyamalik saumyamalik changed the title v2 code updates [don't merge yet] v2 code updates Jun 2, 2025
@natolambert natolambert merged commit fdc742d into main Jun 2, 2025
3 checks passed
Development
Successfully merging this pull request may close these issues: Plan for RewardBench v2; Unpin transformers version in setup.py
2 participants