GitHub - aki-au/test: Measuring Massive Multitask Language Understanding | ICLR 2021
forked from hendrycks/test


MMLU Benchmark (Hugging Face Edition)

Welcome! This repo is a modified fork of the original MMLU benchmark, updated to support open-source Hugging Face models like google/gemma-3b-it and stabilityai/stable-code-3b.


What's MMLU?

MMLU (Massive Multitask Language Understanding) is a benchmark of multiple-choice questions designed to test language models across 57 subjects, ranging from STEM to the humanities and social sciences.

For a deep dive, check out my blog post:
Paper Breakdown #1 – MMLU: LLMs Have Exams Too!


What's in this repo?

  • Modified evaluate.py to work with Hugging Face models
  • A random subset of 20 subjects from MMLU (because Colab runtimes aren't infinite)
  • Scripts to run few-shot evaluation
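To make "few-shot evaluation" concrete, here is a minimal sketch of how a k-shot multiple-choice prompt can be assembled. The function names (`format_example`, `build_prompt`), the header wording, and the toy questions are assumptions for illustration; the exact prompt built by this repo's evaluate.py may differ.

```python
CHOICES = ["A", "B", "C", "D"]


def format_example(question, options, answer=None):
    """Render one question; include the answer letter only for solved examples."""
    lines = [question]
    for letter, option in zip(CHOICES, options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_prompt(subject, shots, test_question, test_options):
    """Join a header, the k solved examples, and the unanswered test question."""
    # Assumed header wording, not necessarily what this repo uses.
    header = f"The following are multiple choice questions (with answers) about {subject}."
    solved = [format_example(q, opts, ans) for q, opts, ans in shots]
    unsolved = format_example(test_question, test_options)
    return "\n\n".join([header, *solved, unsolved])


# Toy, made-up data purely for illustration.
shots = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
prompt = build_prompt("college mathematics", shots, "What is 3 * 3?", ["6", "7", "8", "9"])
print(prompt)
```

The prompt ends on a bare "Answer:" so that the model's next token can be read as its choice of A, B, C, or D.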

You can swap in any Hugging Face causal LM (AutoModelForCausalLM compatible).
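As a rough sketch of what swapping in a model can look like, the snippet below loads a causal LM by name and reads the next-token logits for the four answer letters. It is not taken from this repo's evaluate.py: the helper names (`pick_answer`, `next_token_choice_logits`) are hypothetical, and mapping each letter to the last token id of its encoding is an assumption that may need adjusting per tokenizer. The `transformers` and `torch` imports are kept inside the function so the pure-Python part runs without them.

```python
import math

CHOICES = ["A", "B", "C", "D"]


def pick_answer(choice_logits):
    """Return (predicted letter, softmax probabilities) from per-letter logits."""
    top = max(choice_logits.values())
    exps = {k: math.exp(v - top) for k, v in choice_logits.items()}  # stable softmax
    total = sum(exps.values())
    probs = {k: e / total for k, e in exps.items()}
    return max(CHOICES, key=lambda k: probs[k]), probs


def next_token_choice_logits(model_name, prompt):
    """Score A/B/C/D as the next token after `prompt` (downloads the model)."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    scores = {}
    for letter in CHOICES:
        # Assumption: the last token of the encoded letter represents that choice.
        token_id = tokenizer(letter, add_special_tokens=False).input_ids[-1]
        scores[letter] = logits[token_id].item()
    return scores


# Made-up logits, only to show the shape of the data.
prediction, probabilities = pick_answer({"A": 0.1, "B": 2.0, "C": -1.0, "D": 0.5})
```

With a real model, something like `pick_answer(next_token_choice_logits("stabilityai/stable-code-3b", prompt))` would give the predicted letter for one question.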


Models Used in My Tests

| Model | Description |
| --- | --- |
| google/gemma-3b-it | General-purpose instruction-tuned LLM |
| stabilityai/stable-code-3b | Code-first model, tested just for fun |

Subjects Tested (20)

Abstract Algebra, Anatomy, College Biology, College Chemistry, College Mathematics, Global Facts, High School Biology, High School Computer Science, High School Government and Politics, High School World History, Human Sexuality, Management, Medical Genetics, Miscellaneous, Moral Disputes, Professional Accounting, Public Relations, Sociology, Virology, World Religions
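Once each subject has been scored, results are usually reported per subject and as one overall number. The sketch below shows one way to do that, assuming a hypothetical `results` mapping from subject name to a list of 0/1 correctness flags. The flags are made up for illustration and are not results from this repo.

```python
def subject_slug(name):
    """Turn a display name such as 'Abstract Algebra' into 'abstract_algebra'."""
    return "_".join(name.lower().split())


def summarize(results):
    """Return (per-subject accuracy, accuracy over all questions pooled together)."""
    per_subject = {s: sum(flags) / len(flags) for s, flags in results.items()}
    total_correct = sum(sum(flags) for flags in results.values())
    total_questions = sum(len(flags) for flags in results.values())
    return per_subject, total_correct / total_questions


# Made-up correctness flags (1 = correct, 0 = wrong), not real results.
toy_results = {
    subject_slug("Abstract Algebra"): [1, 0, 0, 1],
    subject_slug("World Religions"): [1, 1],
}
per_subject, overall = summarize(toy_results)
for subject, accuracy in per_subject.items():
    print(f"{subject}: {accuracy:.3f}")
print(f"overall: {overall:.3f}")
```

Pooling all questions weights larger subjects more heavily; averaging the per-subject accuracies instead would treat every subject equally.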


Credits & References


Final Thoughts

LLMs are smart, but they're not magic. This repo exists to help you measure just how smart (or not) they really are.

Got suggestions? Found a bug? Want to run it on another model? Open an issue or shoot me a message. Let’s benchmark responsibly.
