Welcome! This repo is a modified fork of the original MMLU benchmark, updated to support open-source Hugging Face models like `google/gemma-3b-it` and `stabilityai/stable-code-3b`.
MMLU (Massive Multitask Language Understanding) is a benchmark designed to test language models across 57 subjects.
For a deep dive, check out my blog post: Paper Breakdown #1 – MMLU: LLMs Have Exams Too!
- Modified `evaluate.py` to work with Hugging Face models
- A random subset of 20 subjects from MMLU (because Colab runtimes aren't infinite)
- Scripts to run few-shot evaluation (a sketch of the prompt format follows this list)
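The few-shot scripts follow the usual MMLU prompt format: a subject header, a handful of solved dev-set questions, then the test question ending in `Answer:`. The helper names below are illustrative, not necessarily the exact functions in `evaluate.py`:

```python
# Illustrative sketch of the standard MMLU few-shot prompt format.
# Function names are hypothetical; the real evaluate.py may differ.
CHOICE_LABELS = ["A", "B", "C", "D"]

def format_example(question, choices, answer=None):
    """Render one MMLU question; include the answer letter for few-shot demos."""
    lines = [question]
    lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, choices)]
    lines.append(f"Answer:{' ' + answer if answer else ''}")
    return "\n".join(lines)

def build_prompt(subject, dev_examples, test_question, test_choices, k=5):
    """k-shot prompt: subject header, k solved dev examples, then the test question."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    demos = "\n\n".join(format_example(q, c, a) for q, c, a in dev_examples[:k])
    return header + demos + "\n\n" + format_example(test_question, test_choices)
```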
You can swap in any Hugging Face causal LM (`AutoModelForCausalLM`-compatible); a loading and scoring sketch follows the table below.
| Model | Description |
|---|---|
| `google/gemma-3b-it` | General-purpose instruction-tuned LLM |
| `stabilityai/stable-code-3b` | Code-first model, tested just for fun |
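To drop in a different model, the only real requirement is that it loads with `AutoModelForCausalLM` and exposes next-token logits so the A/B/C/D options can be scored. Here is a minimal sketch (the model id comes from the table above; the scoring helper is illustrative, not the exact code in `evaluate.py`):

```python
# Minimal sketch: load any AutoModelForCausalLM-compatible checkpoint and
# score one MMLU prompt by comparing the logits of the answer-letter tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3b-it"  # swap in any Hugging Face causal LM here
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def predict_choice(prompt: str) -> str:
    """Return the answer letter whose token gets the highest next-token logit."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Token ids for " A", " B", " C", " D" (the leading space matters for most tokenizers)
    choice_ids = [
        tokenizer(f" {letter}", add_special_tokens=False).input_ids[-1]
        for letter in "ABCD"
    ]
    scores = next_token_logits[choice_ids]
    return "ABCD"[int(scores.argmax())]
```

Per-subject accuracy is then just the fraction of test questions where the predicted letter matches the gold answer.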
The 20 subjects in the subset: Abstract Algebra, Anatomy, College Biology, College Chemistry, College Mathematics, Global Facts, High School Biology, High School Computer Science, High School Government and Politics, High School World History, Human Sexuality, Management, Medical Genetics, Miscellaneous, Moral Disputes, Professional Accounting, Public Relations, Sociology, Virology, World Religions.
LLMs are smart, but they're not magic. This repo exists to help you measure just how smart (or not) they really are.
Got suggestions? Found a bug? Want to run it on another model? Open an issue or shoot me a message. Let’s benchmark responsibly.