8000 Revamped Eval Function · Issue #38 · ScalingIntelligence/KernelBench · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Revamped Eval Function #38

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
simonguozirui opened this issue May 7, 2025 · 1 comment
Open

Revamped Eval Function #38

simonguozirui opened this issue May 7, 2025 · 1 comment
< 8000 /div>
Assignees

Comments

@simonguozirui
Copy link
Collaborator

During investigation with Sakana's Kernel in #25, we created a stronger eval function to avoid that kind of exploits that some observed.
I didn't merge it in (sit on a branch) because we want to make sure our paper result didn't change from such an update (for ICML rebuttal).

During ICML rebuttal, I have also checked if any of our existing kernels have similar kind of exploits. Luckily, none of our kernels are smart enough to do that yet.

Now that ICML is over, I plan to merge the more robust eval function in.

In particular, the simple fix is

compute reference, Model
clear cache
compute reference, ModelNew
check if they are equivalent

AND

compute reference, ModelNew
clear cache
compute reference, Model
check if they are equivalent

Check both directions to be extra sure!

@simonguozirui
Copy link
Collaborator Author

Actually shout out to the @CognitionAI-AI folks for spotting that again in their Kevin blog post. They resort to by first running the tested kernel and then the reference implementation, thus avoiding this hack. So this proposed eval update will address that as well.

We should keep the infra updated and robust so more people can build on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants
0