v2 code updates #236
Conversation
Did a quick skim; there are some minor things to fix, but I'm approving so this isn't blocked on merging.
rewardbench/generative_v2.py (Outdated)

    ###Feedback: """

    AUTOJ_COARSE_SCORE_RUBRIC = """
If this isn't run for v2, I'd remove all the code we aren't using right now so we start with a cleaner file.
        self.tokenizer = tokenizer

    def __call__(self, samples, **kwargs):
        _ = kwargs.get("batch_size", 1)
If this is only tested with batch size 1, we need to raise a `NotImplementedError` if the user passes a batch size > 1.
Now tested with batch size 4, works fine!
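The guard requested above could look like the following; this is a minimal sketch, assuming a scorer class shaped like the snippet in the diff (the class name, the `SUPPORTED_BATCH_SIZES` set, and the surrounding structure are illustrative assumptions, not the repo's actual code):

```python
class GenerativeScorer:
    """Sketch of a scorer that rejects batch sizes it hasn't been tested with."""

    # Batch sizes that have actually been exercised (1 initially, 4 after testing).
    SUPPORTED_BATCH_SIZES = {1, 4}

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, samples, **kwargs):
        batch_size = kwargs.get("batch_size", 1)
        if batch_size not in self.SUPPORTED_BATCH_SIZES:
            # Fail loudly instead of silently running an untested configuration.
            raise NotImplementedError(
                f"batch_size={batch_size} is untested; "
                f"use one of {sorted(self.SUPPORTED_BATCH_SIZES)}"
            )
        # Real scoring over `samples` would go here.
        return None
```

Failing fast here is cheaper than debugging silently wrong scores from an untested batching path.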
V2 Code Updates
Big updates are summarized as follows:
- `scripts/run_v2.py` now supports processing the Ties subset, which uses a different scoring function over the whole Ties subset.
- `rewardbench/utils.py` has a new function `process_single_model` and a helper function `sample_stats` for Ties scoring.
- `reroll_and_score_dataset` in `utils.py` now reasonably penalizes 2/3/4-way ties for the top score. This function was also generalized from best-of-4 to a flexible number of completions to accommodate Ties.
- `run_v2.py` also has a bunch of miscellaneous small updates copied over from a diff against `run_rm.py`: things like device map, bfloat16 handling, quantization, etc. They were missing because `run_v2.py` was initially based on the older `run_bon.py`.
- `rewardbench/generative_v2.py`
is kept separate since each prompt and function is different; these could be consolidated, but I'm not sure how much benefit that gives. Happy to iterate. The generative v2 scripts now:
  - use `utils.load_eval_dataset_multi`, similar to `utils.load_eval_dataset` from v1 but more flexible for multiple chosen/rejected
  - `--score_w_ratings` argument (`run_generative_v2.py` enforces this)

Testing
I have tested this branch's code with the image `saumyam/rewardbench-2-pr-0530-tie-penalty` on:

Small things still to fix
- `generative_v2.py`, but they are just as long in `generative.py`, which doesn't have issues?
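The tie penalty described in the summary above could be scored roughly as follows; this is a minimal sketch assuming a 1/k-credit scheme for k-way ties (the function name `score_best_of_n` and the exact credit scheme are illustrative assumptions, not the repo's actual `reroll_and_score_dataset` logic):

```python
def score_best_of_n(scores, chosen_idx):
    """Score one best-of-n sample, penalizing ties for the top reward.

    If k completions tie for the highest score, a chosen completion that is
    among them earns 1/k credit instead of full credit, so a reward model
    cannot benefit from emitting flat, uninformative scores. Works for any
    number of completions, not just best-of-4.
    """
    top = max(scores)
    tied = [i for i, s in enumerate(scores) if s == top]
    if chosen_idx not in tied:
        return 0.0  # the chosen completion did not get the top score
    return 1.0 / len(tied)  # 1.0 unique win; 0.5 / 0.25 for 2-way / 4-way ties

# Unique top score: full credit.
score_best_of_n([1.0, 0.5, 0.2, 0.1], chosen_idx=0)   # -> 1.0
# Two-way tie including the chosen completion: half credit.
score_best_of_n([0.9, 0.9, 0.1, 0.2], chosen_idx=0)   # -> 0.5
# Chosen completion loses outright: no credit.
score_best_of_n([0.5, 0.9, 0.9, 0.9], chosen_idx=0)   # -> 0.0
```

The 1/k scheme interpolates smoothly between a clean win and a full 4-way tie, which matches the goal of penalizing 2/3/4-way ties proportionally.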