A comprehensive benchmark for evaluating Large Language Model performance on Wordle word-guessing games. The framework tests how strategically different LLMs play Wordle, using tool calls to submit guesses and interpret feedback.
- Real Wordle Gameplay: Authentic Wordle rules with proper Green/Yellow/Gray feedback
- Multi-Provider Support: Test models from OpenAI, Anthropic, Google, and more
- Individual Puzzle Tracking: Separate results for each of the 20 included Wordle puzzles
- Simple Scoring: 1-6 points based on guesses used (lower is better), 7 for failed attempts
- Detailed Logging: Track all guesses made by each model
- Comprehensive Analysis: Success rates, average scores, and strategic play patterns
- TypeScript: Fully typed for better development experience
Install Bun:
curl -fsSL https://bun.sh/install | bash
Install dependencies:
bun install
Copy the example environment file:

cp .env.example .env

Configure API keys in .env:

OPENROUTER_API_KEY="your-api-key-here"
ANTHROPIC_API_KEY="your-api-key-here"
OPENAI_API_KEY="your-api-key-here"
GOOGLE_GENERATIVE_AI_API_KEY="your-api-key-here"
# Test with GPT-4o on a random Wordle puzzle
bun run run-task --model gpt-4o --task wordle
# Test with Claude Sonnet
bun run run-task --model claude-4-sonnet-20250514-32k-thinking --task wordle
# Run benchmarks for all models on all 20 Wordle puzzles
bun run run-benchmarks
# Run benchmarks for specific models only
bun run run-benchmarks --model gpt-4o --model claude-4-sonnet-20250514-32k-thinking
# Run with lower concurrency for stability
bun run run-benchmarks --concurrency 2
# Process all benchmark results and generate summary
bun run aggregate-results
- 1-6 points: Number of guesses used to solve (1 = solved in 1 guess, 6 = solved in 6 guesses)
- 7 points: Failed to solve within 6 guesses
- Lower scores are better
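The scoring rule above can be sketched as a small helper. This is an illustrative sketch only; the function name is an assumption, not the framework's exported API.

```typescript
// Illustrative sketch of the 1-7 scoring rule (assumed helper, not the real API).
// A correct guess at index i scores i + 1; never guessing the target scores 7.
function puzzleScore(guesses: string[], target: string): number {
  const solvedAt = guesses.indexOf(target); // -1 if the target was never guessed
  return solvedAt >= 0 ? solvedAt + 1 : 7;  // 1-6 on success, 7 on failure
}
```

For example, solving CRANE on the third guess scores 3, while six wrong guesses score 7.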
The benchmark includes 20 real Wordle puzzles from recent dates:
- VIXEN (06/12/25), PLAID (06/11/25), TAFFY (06/10/25), BOARD (06/09/25)
- LEASE (06/08/25), REUSE (06/07/25), EDIFY (06/06/25), DATUM (06/05/25)
- And 12 more challenging words...
- Tool-based interaction: Models use the makeGuess tool to submit guesses
- Real Wordle feedback: G (Green) = correct position, Y (Yellow) = wrong position, X (Gray) = not in word
- Strategic gameplay: Models must use feedback to eliminate possibilities and make informed guesses
- Word validation: Only accepts valid 5-letter uppercase English words
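The G/Y/X feedback described above can be computed as in the sketch below. This is a standard two-pass Wordle scorer written for illustration (the function name is hypothetical); the second pass consumes letter counts so duplicate letters in a guess are not over-credited with Yellow.

```typescript
// Sketch of Wordle feedback using the G/Y/X codes described above.
// Hypothetical helper for illustration, not the framework's actual API.
function scoreGuess(guess: string, target: string): string {
  const result: ("G" | "Y" | "X")[] = Array(5).fill("X");
  const remaining: Record<string, number> = {};

  // First pass: mark exact matches (Green) and count unmatched target letters.
  for (let i = 0; i < 5; i++) {
    if (guess[i] === target[i]) {
      result[i] = "G";
    } else {
      remaining[target[i]] = (remaining[target[i]] ?? 0) + 1;
    }
  }

  // Second pass: mark letters present elsewhere (Yellow), decrementing counts
  // so a duplicated guess letter only earns as many Yellows as the target has.
  for (let i = 0; i < 5; i++) {
    if (result[i] === "X" && (remaining[guess[i]] ?? 0) > 0) {
      result[i] = "Y";
      remaining[guess[i]]--;
    }
  }

  return result.join("");
}
```

For example, guessing ADIEU against CRANE yields YXXYX: A and E are in the word but misplaced, while D, I, and U are absent.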
- Wordle Game Engine (src/task-runner.ts):
  - WordleGame: Complete game logic with authentic Wordle rules
  - WordleTaskRunner: LLM interface for playing Wordle games
  - Zod schemas for type safety and validation
- Puzzle Management:
  - Individual benchmark files for each puzzle date
  - Deterministic puzzle selection for consistent testing
  - Separate scoring for each word difficulty
- Results Storage:
  - benchmarks/{model}-wordle-{date}.json: Individual puzzle results
  - Minimal data storage: just guesses, target word, and score
  - Easy aggregation and analysis
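Based on the "just guesses, target word, and score" description, a result file might look like the following. The field names here are assumptions for illustration; the actual JSON keys may differ.

```typescript
// Hypothetical shape of an individual result file in benchmarks/.
// Field names are assumptions inferred from the description above.
interface WordleBenchmarkResult {
  model: string;       // e.g. "gpt-4o"
  date: string;        // puzzle date, e.g. "06-12-25"
  targetWord: string;  // the answer, e.g. "VIXEN"
  guesses: string[];   // guesses in the order they were made
  score: number;       // 1-6 if solved, 7 if failed
}

const sample: WordleBenchmarkResult = {
  model: "gpt-4o",
  date: "06-12-25",
  targetWord: "VIXEN",
  guesses: ["ADIEU", "VIXEN"],
  score: 2,
};
```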
🎯 Target Word: CRANE (06/01/25)
🎮 Guesses Made: 3
📊 Score: 3 ✅
📝 Guess History:
1. ADIEU
2. NORTH
3. CRANE
benchmarks/
├── gpt-4o-wordle-06-12-25.json
├── gpt-4o-wordle-06-11-25.json
├── claude-4-sonnet-wordle-06-12-25.json
└── ...
The framework includes pre-configured models from major providers:
- OpenAI: GPT-4o, GPT-4.1, o3, o1-mini (various reasoning efforts)
- Anthropic: Claude 4 Sonnet/Opus, Claude 3.7 Sonnet, Claude 3.5 Sonnet
- Google: Gemini 2.5 Pro, Gemini 2.5 Flash
- Others: Grok, Qwen, and more via OpenRouter
- Success Rate: Percentage of puzzles solved within 6 guesses
- Average Score: Mean score across all attempted puzzles (1-7 scale)
- Guess Efficiency: How quickly models solve puzzles
- Strategic Play: Quality of starting words and guess patterns
- Token Usage: Computational efficiency of different models
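The first two metrics above follow directly from per-puzzle scores under the 1-7 convention (7 = unsolved). A minimal sketch, with an assumed function name:

```typescript
// Sketch of success rate and average score over per-puzzle results,
// assuming the 1-7 scoring convention where 7 means unsolved.
function summarize(scores: number[]): { successRate: number; averageScore: number } {
  const solved = scores.filter((s) => s <= 6).length;
  return {
    successRate: solved / scores.length,                        // fraction solved within 6 guesses
    averageScore: scores.reduce((a, b) => a + b, 0) / scores.length, // mean over the 1-7 scale
  };
}
```

For example, scores [3, 4, 7, 2] give a 75% success rate and an average score of 4.0.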
This project is licensed under the MIT License.