v2 code updates by saumyamalik · Pull Request #236 · allenai/reward-bench · GitHub

v2 code updates #236

Merged
merged 37 commits into main on Jun 2, 2025

Conversation

saumyamalik (Contributor) commented May 31, 2025

V2 Code Updates

Big updates are summarized as follows:

  1. Ties scoring in the v2 script:
  • scripts/run_v2.py now supports processing the Ties subset, which uses a different scoring function applied over the whole Ties subset.
  • Correspondingly, rewardbench/utils.py gains a new function process_single_model and a helper sample_stats for Ties scoring.
  • Fixed general scoring in reroll_and_score_dataset in utils.py so it now reasonably penalizes 2/3/4-way ties for the top score; also generalized the function from best-of-4 to a flexible number of completions to accommodate Ties.
  • run_v2.py also picks up a number of miscellaneous small updates copied over from a diff with run_rm.py (device map, bfloat16 handling, quantization, etc.); they were missing because run_v2.py was initially based on the older run_bon.py.
  2. Generative models: new v2 scripts for generative models. I'm leaning toward keeping rewardbench/generative_v2.py separate, since each prompt and function is different; we could consolidate, but I'm not sure how much benefit that gives. Happy to iterate. The generative v2 scripts now:
  • load the dataset with the new function utils.load_eval_dataset_multi, similar to utils.load_eval_dataset from v1 but more flexible for multiple chosen/rejected completions
  • allow scoring with absolute ratings via the --score_w_ratings argument
  • allow scoring with 4-way rankings (the default)
  • Note: the Ties subset must be scored with ratings even if the rest is scored with rankings; the logic of run_generative_v2.py enforces this.
  • In general, the ties logic is a bit convoluted, for two reasons: (1) as mentioned above, the Ties subset has to be handled differently with regard to the evaluation method, and (2) the Ties subset is scored differently. I have tested this fairly thoroughly to be sure, but I'm very open to refactoring suggestions.
  • Mostly just modified the code for API models, not vLLM yet.
  3. Bumped transformers to 4.51.0. (This causes issues for Gemma 2 27B models, but so does 4.48.1.)
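The tie-penalized best-of-N accuracy described in point 1 can be sketched as follows. This is an illustrative stand-in, not the actual reroll_and_score_dataset implementation; the function name score_best_of_n and the fractional-credit rule (credit 1/k when the chosen completion ties with k-1 rejected completions for the top score) are assumptions about the general approach:

```python
# Hypothetical sketch of tie-aware best-of-N scoring. If the chosen
# completion's score strictly beats all rejected scores, it gets full
# credit; if it ties k-1 rejected completions for the top score, it
# gets 1/k credit; otherwise zero. Works for any number of completions,
# not just best-of-4.
def score_best_of_n(chosen_score, rejected_scores):
    top = max([chosen_score] + rejected_scores)
    if chosen_score < top:
        return 0.0  # some rejected completion strictly wins
    # count completions sharing the top score, including the chosen one
    n_tied = 1 + sum(1 for s in rejected_scores if s == top)
    return 1.0 / n_tied
```

Under this scheme a 2-way tie yields 0.5 rather than a full point, so a reward model that assigns identical scores everywhere no longer gets credited with a win on every sample.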

Testing
I have tested this branch's code with the image saumyam/rewardbench-2-pr-0530-tie-penalty on:

Small things still to fix

  • make quality complains that the prompt templates in generative_v2.py are too long, but they are just as long in generative.py, which doesn't hit this issue. Not sure why.
  • I have the worldpm code in here but haven't debugged it yet. We shouldn't merge that yet, but I'll keep working on it on this branch.
  • Documentation/readme

natolambert (Collaborator) left a comment


Did a quick skim; some minor things to fix. Approving so you're not blocked on merging.


###Feedback: """

AUTOJ_COARSE_SCORE_RUBRIC = """
Collaborator:

If not run for v2, I'd remove all the code we aren't using right now so we start with a cleaner file.

self.tokenizer = tokenizer

def __call__(self, samples, **kwargs):
_ = kwargs.get("batch_size", 1)
Collaborator:

If this is only tested with batch size 1, we need to raise a NotImplementedError if the user passes a batch size greater than 1.
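The guard the reviewer is asking for could look like the sketch below. The class name RewardPipeline and the placeholder body are illustrative assumptions, not the PR's actual pipeline code; only the batch-size check is the point:

```python
class RewardPipeline:
    """Illustrative stand-in for the PR's pipeline class (hypothetical name)."""

    def __init__(self, tokenizer=None):
        self.tokenizer = tokenizer

    def __call__(self, samples, **kwargs):
        batch_size = kwargs.get("batch_size", 1)
        if batch_size > 1:
            # Fail loudly instead of silently ignoring an untested setting.
            raise NotImplementedError(
                "Only batch_size=1 has been tested for this pipeline."
            )
        return [0.0 for _ in samples]  # placeholder scores
```

Failing fast like this is cheap to add and easy to delete once larger batch sizes are verified.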

Contributor Author:

Now tested with batch size 4, works fine!

@natolambert natolambert mentioned this pull request Jun 2, 2025
This was linked to issues Jun 2, 2025
@natolambert natolambert mentioned this pull request Jun 2, 2025
@saumyamalik saumyamalik changed the title v2 code updates [don't merge yet] v2 code updates Jun 2, 2025
@natolambert natolambert merged commit fdc742d into main Jun 2, 2025
3 checks passed
Development
Successfully merging this pull request may close these issues: Plan for RewardBench v2; Unpin transformers version in setup.py
2 participants