See the original KernelBench repo. The original README is KernelBench.md.
Implementations of test-time scaling methods are in the main
directory. The following methods are implemented:
- Best-of-N (KernelBench): sample $N$ independent kernels and pick the best-performing one.
- Iterative Refinement (KernelBench): sample a kernel, then use execution feedback to iteratively refine it for $N$ steps.
- METR: initially, independently generate $N_0$ kernels. Each new kernel is generated by sampling one of the previously generated kernels based on its efficiency and evolving it. Repeat until there are $N$ kernels in total.
- Stanford (Beam-search): from the current kernel, generate $P$ independent natural-language ideas and turn each into a kernel. Pick the best-performing kernel to use for the next step. Repeat for $N$ steps.
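
The four search loops can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code. The `generate(parent, feedback)` and `evaluate(kernel)` callables are hypothetical stand-ins for the LLM call and the benchmark harness. The sketch also assumes that a higher score means a faster kernel, that efficiency-based sampling is proportional to score, that the beam search starts from scratch, and that the natural-language idea step is folded into `generate`.

```python
import random
from typing import Callable, Optional, Tuple

# Hypothetical interfaces (assumptions, not the repository's API):
#   generate(parent, feedback) -> source of a new kernel
#   evaluate(kernel) -> (score, feedback), where a higher score is better
Generate = Callable[[Optional[str], Optional[str]], str]
Evaluate = Callable[[str], Tuple[float, str]]


def best_of_n(generate: Generate, evaluate: Evaluate, n: int) -> str:
    """Sample n independent kernels and return the best-scoring one."""
    kernels = [generate(None, None) for _ in range(n)]
    return max(kernels, key=lambda k: evaluate(k)[0])


def iterative_refinement(generate: Generate, evaluate: Evaluate, n: int) -> str:
    """Refine one kernel for n steps, feeding back execution results."""
    kernel, feedback = None, None
    best_kernel, best_score = None, float("-inf")
    for _ in range(n):
        kernel = generate(kernel, feedback)
        score, feedback = evaluate(kernel)
        if score > best_score:  # assumption: report the best kernel seen
            best_kernel, best_score = kernel, score
    return best_kernel


def metr_evolution(generate: Generate, evaluate: Evaluate,
                   n0: int, n: int, rng: random.Random) -> str:
    """Seed n0 kernels, then evolve score-weighted parents up to n total."""
    pool = []
    for _ in range(n0):
        kernel = generate(None, None)
        pool.append((*evaluate(kernel), kernel))  # (score, feedback, kernel)
    while len(pool) < n:
        # Assumption: sampling probability is proportional to the score.
        weights = [max(score, 1e-6) for score, _, _ in pool]
        _, feedback, parent = rng.choices(pool, weights=weights, k=1)[0]
        child = generate(parent, feedback)
        pool.append((*evaluate(child), child))
    return max(pool, key=lambda entry: entry[0])[2]


def idea_search(generate: Generate, evaluate: Evaluate, p: int, n: int) -> str:
    """Each step, expand the current kernel p ways and keep the best."""
    current, feedback = None, None
    for _ in range(n):
        candidates = [generate(current, feedback) for _ in range(p)]
        scored = [(*evaluate(k), k) for k in candidates]
        _, feedback, current = max(scored, key=lambda entry: entry[0])
    return current
```

All four functions take the same two callables, so the methods can be compared under a matched budget of `generate` calls: $N$ for the first three and $P \cdot N$ for the beam search.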