
ashryanbeats/wordle-benchmark

Wordle Benchmark for LLMs

A comprehensive benchmark for evaluating Large Language Model performance on Wordle word-guessing games. This framework tests how well different LLMs can strategically play Wordle using tool calling to make guesses and interpret feedback.

Features

  • Real Wordle Gameplay: Authentic Wordle rules with proper Green/Yellow/Gray feedback
  • Multi-Provider Support: Test models from OpenAI, Anthropic, Google, and more
  • Individual Puzzle Tracking: Separate results for each of the 20 included Wordle puzzles
  • Simple Scoring: 1-6 points based on guesses used (lower is better), 7 for failed attempts
  • Detailed Logging: Track all guesses made by each model
  • Comprehensive Analysis: Success rates, average scores, and strategic play patterns
  • TypeScript: Fully typed for better development experience

Quick Start

Installation

Install Bun:

curl -fsSL https://bun.sh/install | bash

Install dependencies:

bun install

Configuration

  1. Copy the example environment file:

    cp .env.example .env
  2. Configure API keys in .env:

    OPENROUTER_API_KEY="your-api-key-here"
    ANTHROPIC_API_KEY="your-api-key-here"
    OPENAI_API_KEY="your-api-key-here"
    GOOGLE_GENERATIVE_AI_API_KEY="your-api-key-here"
    

Basic Usage

Test a Single Wordle Game

# Test with GPT-4o on a random Wordle puzzle
bun run run-task --model gpt-4o --task wordle

# Test with Claude Sonnet
bun run run-task --model claude-4-sonnet-20250514-32k-thinking --task wordle

Run Full Wordle Benchmarks

# Run benchmarks for all models on all 20 Wordle puzzles
bun run run-benchmarks

# Run benchmarks for specific models only
bun run run-benchmarks --model gpt-4o --model claude-4-sonnet-20250514-32k-thinking

# Run with lower concurrency for stability
bun run run-benchmarks --concurrency 2

Aggregate Results

# Process all benchmark results and generate summary
bun run aggregate-results

Wordle Benchmark Details

Scoring System

  • 1-6 points: Number of guesses used to solve (1 = solved in 1 guess, 6 = solved in 6 guesses)
  • 7 points: Failed to solve within 6 guesses
  • Lower scores are better
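The scoring rule above can be sketched as a small function. This is a minimal illustration, not the repository's code: the names `scoreGame`, `MAX_GUESSES`, and `FAILED_SCORE` are hypothetical.

```typescript
// Hypothetical constants; the repository may define these differently.
const MAX_GUESSES = 6;
const FAILED_SCORE = 7;

// Returns the 1-based index of the winning guess, or 7 if the target
// was not found within the first six guesses.
function scoreGame(guesses: string[], target: string): number {
  const solvedAt = guesses.slice(0, MAX_GUESSES).indexOf(target);
  return solvedAt === -1 ? FAILED_SCORE : solvedAt + 1;
}

console.log(scoreGame(["ADIEU", "NORTH", "CRANE"], "CRANE")); // 3
console.log(scoreGame(["ADIEU", "NORTH"], "CRANE")); // 7
```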

Included Puzzles

The benchmark includes 20 real Wordle puzzles from recent dates:

  • VIXEN (06/12/25), PLAID (06/11/25), TAFFY (06/10/25), BOARD (06/09/25)
  • LEASE (06/08/25), REUSE (06/07/25), EDIFY (06/06/25), DATUM (06/05/25)
  • And 12 more challenging words...

Game Mechanics

  • Tool-based interaction: Models use the makeGuess tool to submit guesses
  • Real Wordle feedback: G (Green) = letter is in the correct position, Y (Yellow) = letter is in the word but in a different position, X (Gray) = letter is not in the word
  • Strategic gameplay: Models must use feedback to eliminate possibilities and make informed guesses
  • Word validation: Only accepts valid 5-letter uppercase English words
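To make the feedback rules concrete, here is a minimal sketch of how G/Y/X feedback is commonly computed, including the standard Wordle handling of repeated letters. It assumes a two-pass approach; `getFeedback` and `isWellFormedGuess` are hypothetical names, and the actual logic in `WordleGame` may differ. The format check only covers the "5 uppercase letters" part of validation, not the dictionary lookup.

```typescript
type Mark = "G" | "Y" | "X";

// Shape check only: five uppercase A-Z letters (no dictionary lookup here).
function isWellFormedGuess(guess: string): boolean {
  return /^[A-Z]{5}$/.test(guess);
}

// Two-pass feedback: mark greens first, then hand out yellows from the
// target letters that were not matched exactly.
function getFeedback(guess: string, target: string): string {
  const marks = new Array<Mark>(guess.length).fill("X");
  const unmatched = new Map<string, number>();

  for (let i = 0; i < guess.length; i++) {
    const t = target.charAt(i);
    if (guess.charAt(i) === t) {
      marks[i] = "G";
    } else {
      unmatched.set(t, (unmatched.get(t) ?? 0) + 1);
    }
  }

  for (let i = 0; i < guess.length; i++) {
    if (marks[i] === "G") continue;
    const g = guess.charAt(i);
    const left = unmatched.get(g) ?? 0;
    if (left > 0) {
      marks[i] = "Y";
      unmatched.set(g, left - 1);
    }
  }

  return marks.join("");
}

console.log(getFeedback("ADIEU", "CRANE")); // YXXYX
console.log(getFeedback("EERIE", "LEASE")); // XGXXG
```

The second call shows why two passes matter: LEASE has two E's and both are matched exactly, so the first E in EERIE is gray rather than yellow.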

Architecture

Core Components

  1. Wordle Game Engine (src/task-runner.ts):

    • WordleGame: Complete game logic with authentic Wordle rules
    • WordleTaskRunner: LLM interface for playing Wordle games
    • Zod schemas for type safety and validation
  2. Puzzle Management:

    • Individual benchmark files for each puzzle date
    • Deterministic puzzle selection for consistent testing
    • Separate scoring for each puzzle, so results can be compared by word difficulty
  3. Results Storage:

    • benchmarks/{model}-wordle-{date}.json: Individual puzzle results
    • Minimal data storage: just guesses, target word, and score
    • Easy aggregation and analysis
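As an illustration of the storage layout described above, the sketch below builds a result path from a model name and a puzzle date and defines a record shape holding guesses, target word, and score. The field names in `PuzzleResult` and the helper `benchmarkPath` are assumptions made for this example; the actual JSON keys are not shown in this README.

```typescript
// Hypothetical record shape; the real JSON keys may be named differently.
interface PuzzleResult {
  guesses: string[];
  targetWord: string;
  score: number;
}

// Builds benchmarks/{model}-wordle-{date}.json, converting a date written
// as MM/DD/YY into the dash-separated form used in file names.
function benchmarkPath(model: string, date: string): string {
  const fileDate = date.replace(/\//g, "-");
  return `benchmarks/${model}-wordle-${fileDate}.json`;
}

const example: PuzzleResult = {
  guesses: ["ADIEU", "NORTH", "CRANE"],
  targetWord: "CRANE",
  score: 3,
};

console.log(benchmarkPath("gpt-4o", "06/12/25")); // benchmarks/gpt-4o-wordle-06-12-25.json
console.log(JSON.stringify(example));
```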

Example Output

🎯 Target Word: CRANE (06/01/25)
🎮 Guesses Made: 3
📊 Score: 3 ✅

📝 Guess History:
  1. ADIEU
  2. NORTH
  3. CRANE

Benchmark File Structure

benchmarks/
├── gpt-4o-wordle-06-12-25.json
├── gpt-4o-wordle-06-11-25.json
├── claude-4-sonnet-wordle-06-12-25.json
└── ...

Available Models

The framework includes pre-configured models from major providers:

  • OpenAI: GPT-4o, GPT-4.1, o3, o1-mini (various reasoning efforts)
  • Anthropic: Claude 4 Sonnet/Opus, Claude 3.7 Sonnet, Claude 3.5 Sonnet
  • Google: Gemini 2.5 Pro, Gemini 2.5 Flash
  • Others: Grok, Qwen, and more via OpenRouter

Metrics Tracked

  • Success Rate: Percentage of puzzles solved within 6 guesses
  • Average Score: Mean score across all attempted puzzles (1-7 scale)
  • Guess Efficiency: How quickly models solve puzzles
  • Strategic Play: Quality of starting words and guess patterns
  • Token Usage: Computational efficiency of different models
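The first two metrics follow directly from the per-puzzle scores. A minimal sketch, assuming each result exposes a numeric `score` on the 1-7 scale; `summarize` and `ScoredResult` are hypothetical names and this is not the `aggregate-results` implementation.

```typescript
interface ScoredResult {
  score: number; // 1-6 = solved in that many guesses, 7 = failed
}

interface Summary {
  successRate: number; // percentage of puzzles solved within 6 guesses
  averageScore: number; // mean score on the 1-7 scale
}

function summarize(results: ScoredResult[]): Summary {
  if (results.length === 0) {
    return { successRate: 0, averageScore: 0 };
  }
  const solved = results.filter((r) => r.score <= 6).length;
  const total = results.reduce((sum, r) => sum + r.score, 0);
  return {
    successRate: (solved / results.length) * 100,
    averageScore: total / results.length,
  };
}

const summary = summarize([{ score: 3 }, { score: 4 }, { score: 7 }, { score: 2 }]);
console.log(summary.successRate); // 75
console.log(summary.averageScore); // 4
```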

License

This project is licensed under the MIT License.
