Welcome! This repo is a modified fork of the original MMLU benchmark, updated to support open-source Hugging Face models like `google/gemma-3b-it` and `stabilityai/stable-code-3b`.
MMLU (Massive Multitask Language Understanding) is a benchmark designed to test language models across 57 subjects.
For a deep dive, check out my blog post: Paper Breakdown #1 – MMLU: LLMs Have Exams Too!
- Modified `evaluate.py` to work with Hugging Face models
- A random subset of 20 subjects from MMLU (because Colab runtimes aren't infinite)
- Scripts to run few-shot evaluation (a sketch of the prompt format follows this list)
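The few-shot scripts follow the usual MMLU prompt format: a subject header, a handful of solved dev-set questions, then the test question ending in `Answer:`. The helper names below are illustrative, not necessarily the exact functions in `evaluate.py`:

```python
# Illustrative sketch of the standard MMLU few-shot prompt format.
# Function names are hypothetical; the real evaluate.py may differ.
CHOICE_LABELS = ["A", "B", "C", "D"]

def format_example(question, choices, answer=None):
    """Render one MMLU question; include the answer letter for few-shot demos."""
    lines = [question]
    lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, choices)]
    lines.append(f"Answer:{' ' + answer if answer else ''}")
    return "\n".join(lines)

def build_prompt(subject, dev_examples, test_question, test_choices, k=5):
    """k-shot prompt: subject header, k solved dev examples, then the test question."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    demos = "\n\n".join(format_example(q, c, a) for q, c, a in dev_examples[:k])
    return header + demos + "\n\n" + format_example(test_question, test_choices)
```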
You can swap in any Hugging Face causal LM (`AutoModelForCausalLM`-compatible); a loading and scoring sketch follows the table below.
| Model | Description |
|---|---|
| `google/gemma-3b-it` | General-purpose instruction-tuned LLM |
| `stabilityai/stable-code-3b` | Code-first model, tested just for fun |
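To drop in a different model, the only real requirement is that it loads with `AutoModelForCausalLM` and exposes next-token logits so the A/B/C/D options can be scored. Here is a minimal sketch (the model id comes from the table above; the scoring helper is illustrative, not the exact code in `evaluate.py`):

```python
# Minimal sketch: load any AutoModelForCausalLM-compatible checkpoint and
# score one MMLU prompt by comparing the logits of the answer-letter tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3b-it"  # swap in any Hugging Face causal LM here
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def predict_choice(prompt: str) -> str:
    """Return the answer letter whose token gets the highest next-token logit."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Token ids for " A", " B", " C", " D" (the leading space matters for most tokenizers)
    choice_ids = [
        tokenizer(f" {letter}", add_special_tokens=False).input_ids[-1]
        for letter in "ABCD"
    ]
    scores = next_token_logits[choice_ids]
    return "ABCD"[int(scores.argmax())]
```

Per-subject accuracy is then just the fraction of test questions where the predicted letter matches the gold answer.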
The 20 subjects in the subset: Abstract Algebra, Anatomy, College Biology, College Chemistry, College Mathematics, Global Facts, High School Biology, High School Computer Science, High School Government and Politics, High School World History, Human Sexuality, Management, Medical Genetics, Miscellaneous, Moral Disputes, Professional Accounting, Public Relations, Sociology, Virology, World Religions.
LLMs are smart, but they're not magic. This repo exists to help you measure just how smart (or not) they really are.
Got suggestions? Found a bug? Want to run it on another model? Open an issue or shoot me a message. Let’s benchmark responsibly.