A comprehensive benchmark for evaluating Large Language Model performance on Wordle word-guessing games. The framework tests how strategically different LLMs play Wordle, using tool calls to submit guesses and interpret feedback.
- Real Wordle Gameplay: Authentic Wordle rules with proper Green/Yellow/Gray feedback
- Multi-Provider Support: Test models from OpenAI, Anthropic, Google, and more
- Individual Puzzle Tracking: Separate results for each of the 20 included Wordle puzzles
- Simple Scoring: 1-6 points based on guesses used (lower is better), 7 for failed attempts
- Detailed Logging: Track all guesses made by each model
- Comprehensive Analysis: Success rates, average scores, and strategic play patterns
- TypeScript: Fully typed for better development experience
Install Bun:
curl -fsSL https://bun.sh/install | bash
Install dependencies:
bun install
Copy the example environment file:

cp .env.example .env

Configure API keys in .env:

OPENROUTER_API_KEY="your-api-key-here"
ANTHROPIC_API_KEY="your-api-key-here"
OPENAI_API_KEY="your-api-key-here"
GOOGLE_GENERATIVE_AI_API_KEY="your-api-key-here"
# Test with GPT-4o on a random Wordle puzzle
bun run run-task --model gpt-4o --task wordle
# Test with Claude Sonnet
bun run run-task --model claude-4-sonnet-20250514-32k-thinking --task wordle
# Run benchmarks for all models on all 20 Wordle puzzles
bun run run-benchmarks
# Run benchmarks for specific models only
bun run run-benchmarks --model gpt-4o --model claude-4-sonnet-20250514-32k-thinking
# Run with lower concurrency for stability
bun run run-benchmarks --concurrency 2
# Process all benchmark results and generate summary
bun run aggregate-results
- 1-6 points: Number of guesses used to solve (1 = solved in 1 guess, 6 = solved in 6 guesses)
- 7 points: Failed to solve within 6 guesses
- Lower scores are better
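The scoring rule above can be sketched as a small helper. This is an illustrative sketch only; the function name is an assumption, not the framework's exported API.

```typescript
// Illustrative sketch of the 1-7 scoring rule (assumed helper, not the real API).
// A correct guess at index i scores i + 1; never guessing the target scores 7.
function puzzleScore(guesses: string[], target: string): number {
  const solvedAt = guesses.indexOf(target); // -1 if the target was never guessed
  return solvedAt >= 0 ? solvedAt + 1 : 7;  // 1-6 on success, 7 on failure
}
```

For example, solving CRANE on the third guess scores 3, while six wrong guesses score 7.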
The benchmark includes 20 real Wordle puzzles from recent dates:
- VIXEN (06/12/25), PLAID (06/11/25), TAFFY (06/10/25), BOARD (06/09/25)
- LEASE (06/08/25), REUSE (06/07/25), EDIFY (06/06/25), DATUM (06/05/25)
- And 12 more challenging words...
- Tool-based interaction: Models use the makeGuess tool to submit guesses
- Real Wordle feedback: G (Green) = correct position, Y (Yellow) = wrong position, X (Gray) = not in word
- Strategic gameplay: Models must use feedback to eliminate possibilities and make informed guesses
- Word validation: Only accepts valid 5-letter uppercase English words
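The G/Y/X feedback described above can be computed as in the sketch below. This is a standard two-pass Wordle scorer written for illustration (the function name is hypothetical); the second pass consumes letter counts so duplicate letters in a guess are not over-credited with Yellow.

```typescript
// Sketch of Wordle feedback using the G/Y/X codes described above.
// Hypothetical helper for illustration, not the framework's actual API.
function scoreGuess(guess: string, target: string): string {
  const result: ("G" | "Y" | "X")[] = Array(5).fill("X");
  const remaining: Record<string, number> = {};

  // First pass: mark exact matches (Green) and count unmatched target letters.
  for (let i = 0; i < 5; i++) {
    if (guess[i] === target[i]) {
      result[i] = "G";
    } else {
      remaining[target[i]] = (remaining[target[i]] ?? 0) + 1;
    }
  }

  // Second pass: mark letters present elsewhere (Yellow), decrementing counts
  // so a duplicated guess letter only earns as many Yellows as the target has.
  for (let i = 0; i < 5; i++) {
    if (result[i] === "X" && (remaining[guess[i]] ?? 0) > 0) {
      result[i] = "Y";
      remaining[guess[i]]--;
    }
  }

  return result.join("");
}
```

For example, guessing ADIEU against CRANE yields YXXYX: A and E are in the word but misplaced, while D, I, and U are absent.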
- Wordle Game Engine (src/task-runner.ts):
  - WordleGame: Complete game logic with authentic Wordle rules
  - WordleTaskRunner: LLM interface for playing Wordle games
  - Zod schemas for type safety and validation
- Puzzle Management:
  - Individual benchmark files for each puzzle date
  - Deterministic puzzle selection for consistent testing
  - Separate scoring for each word difficulty
- Results Storage:
  - benchmarks/{model}-wordle-{date}.json: Individual puzzle results
  - Minimal data storage: just guesses, target word, and score
  - Easy aggregation and analysis
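Based on the "just guesses, target word, and score" description, a result file might look like the following. The field names here are assumptions for illustration; the actual JSON keys may differ.

```typescript
// Hypothetical shape of an individual result file in benchmarks/.
// Field names are assumptions inferred from the description above.
interface WordleBenchmarkResult {
  model: string;       // e.g. "gpt-4o"
  date: string;        // puzzle date, e.g. "06-12-25"
  targetWord: string;  // the answer, e.g. "VIXEN"
  guesses: string[];   // guesses in the order they were made
  score: number;       // 1-6 if solved, 7 if failed
}

const sample: WordleBenchmarkResult = {
  model: "gpt-4o",
  date: "06-12-25",
  targetWord: "VIXEN",
  guesses: ["ADIEU", "VIXEN"],
  score: 2,
};
```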
🎯 Target Word: CRANE (06/01/25)
🎮 Guesses Made: 3
📊 Score: 3 ✅
📝 Guess History:
1. ADIEU
2. NORTH
3. CRANE
benchmarks/
├── gpt-4o-wordle-06-12-25.json
├── gpt-4o-wordle-06-11-25.json
├── claude-4-sonnet-wordle-06-12-25.json
└── ...
The framework includes pre-configured models from major providers:
- OpenAI: GPT-4o, GPT-4.1, o3, o1-mini (various reasoning efforts)
- Anthropic: Claude 4 Sonnet/Opus, Claude 3.7 Sonnet, Claude 3.5 Sonnet
- Google: Gemini 2.5 Pro, Gemini 2.5 Flash
- Others: Grok, Qwen, and more via OpenRouter
- Success Rate: Percentage of puzzles solved within 6 guesses
- Average Score: Mean score across all attempted puzzles (1-7 scale)
- Guess Efficiency: How quickly models solve puzzles
- Strategic Play: Quality of starting words and guess patterns
- Token Usage: Computational efficiency of different models
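The first two metrics above follow directly from per-puzzle scores under the 1-7 convention (7 = unsolved). A minimal sketch, with an assumed function name:

```typescript
// Sketch of success rate and average score over per-puzzle results,
// assuming the 1-7 scoring convention where 7 means unsolved.
function summarize(scores: number[]): { successRate: number; averageScore: number } {
  const solved = scores.filter((s) => s <= 6).length;
  return {
    successRate: solved / scores.length,                        // fraction solved within 6 guesses
    averageScore: scores.reduce((a, b) => a + b, 0) / scores.length, // mean over the 1-7 scale
  };
}
```

For example, scores [3, 4, 7, 2] give a 75% success rate and an average score of 4.0.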
This project is licensed under the MIT License.