A comprehensive benchmarking suite for evaluating Graphical User Interface (GUI) agents (i.e. agents that act on your screen, like our Computer Agent) across three areas of ability: perception, single-step, and multi-step agentic behaviour.

The suite does not aim to compare agent implementations, only the MLLMs that power them; we therefore provide only simple agent implementations based on smolagents.
**Perception**

| Data Source | Evaluation Type | Platform | Link |
|---|---|---|---|
| ScreenSpot | BBox + click accuracy | Web | HuggingFace |
| ScreenSpot v2 | BBox + click accuracy | Web | HuggingFace |
| ScreenSpot-Pro | BBox + click accuracy | Web | HuggingFace |
| Visual-WebBench | Multi-task (Caption, OCR, QA, Grounding, Action) | Web | HuggingFace |
| WebSRC | Web QA | Web | HuggingFace |
| ScreenQA-short | Mobile QA | Mobile | HuggingFace |
| ScreenQA-complex | Mobile QA | Mobile | HuggingFace |
| Showdown-Clicks | Click prediction | Web | HuggingFace |
**Single-step agentic behaviour**

| Data Source | Evaluation Type | Platform | Link |
|---|---|---|---|
| Multimodal-Mind2Web | Web navigation | Web | HuggingFace |
| AndroidControl | Mobile control | Mobile | GitHub |
**Multi-step agentic behaviour**

| Data Source | Evaluation Type | Platform | Link |
|---|---|---|---|
| Mind2Web-Live | URL matching | Web | HuggingFace |
| GAIA | Exact match | Web | HuggingFace |
| BrowseComp | LLM judge | Web | Link |
| AndroidWorld | Task-specific | Mobile | GitHub |
| MobileMiniWob | Task-specific | Mobile | Included in AndroidWorld GitHub |
| OSWorld | Task-specific | Desktop | GitHub |
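The identifiers used to select these benchmarks in code differ slightly from the display names above. Here is a minimal sketch for listing what is registered, assuming the same `get_registry()` / `list_all()` API used in the full usage example further down:

```python
from screensuite import get_registry

# Print the name of every registered benchmark; list_all() and the
# .name attribute follow the usage example later in this README.
registry = get_registry()
for benchmark in registry.list_all():
    print(benchmark.name)
```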
Make sure to clone the repository with the required submodules:

```bash
git clone --recurse-submodules git@github.com:huggingface/screensuite.git
```

or, if you already cloned the repository:

```bash
git submodule update --init --recursive
```

Run the second command again after pulling branches to keep the submodules up to date.
The following need to be installed:

- Docker
- Python >= 3.11
- uv
For multi-step agent benchmarks, we need to spawn containerized environments, which requires KVM virtualization to be enabled. To check whether your host supports KVM, run the following on Linux:

```bash
egrep -c '(vmx|svm)' /proc/cpuinfo
```

If the returned count is greater than zero, the processor supports KVM. Note: macOS hosts generally do not support KVM.
```bash
# Using uv (faster)
uv sync --extra submodules --python 3.11
```
If you encounter issues with the `evdev` Python package, you can try installing the `build-essential` package:

```bash
sudo apt-get install build-essential
```
```bash
# Install development dependencies
uv sync --all-extras

# Run tests
uv run pytest

# Code quality
uv run pre-commit run --all-files --show-diff-on-failure
```
The example script below runs a selection of benchmarks against a single model and appends the results to a JSONL file:

```python
#!/usr/bin/env python
import json
import os
from datetime import datetime

from dotenv import load_dotenv
from smolagents.models import InferenceClientModel

from screensuite import (
    EvaluationConfig,
    ImageResizeConfig,
    OSWorldEnvironmentConfig,
    get_registry,
)

load_dotenv()

# Setup results directory
RESULTS_DIR = os.path.join(os.path.dirname(__file__), "results")
os.makedirs(RESULTS_DIR, exist_ok=True)


def run_benchmarks():
    # Get benchmarks to run
    registry = get_registry()
    # benchmarks = registry.list_all()
    benchmarks = registry.get(
        [
            "screenqa_short",
            "screenqa_complex",
            "screenspot-v1-click-prompt",
            "screenspot-v1-bounding-box-prompt",
            "screenspot-v2-click-prompt",
            "screenspot-v2-bounding-box-prompt",
            "screenspot-pro-click-prompt",
            "screenspot-pro-bounding-box-prompt",
            "websrc_dev",
            "visualwebbench",
            "android_control",
            "showdown_clicks",
            "mmind2web",
            "android_world",
            "osworld",
            "gaia_web",
        ]
    )
    for bench in benchmarks:
        print(bench.name)

    # Configure your model (choose one)
    model = InferenceClientModel(
        model_id="Qwen/Qwen2.5-VL-32B-Instruct",
        provider="fireworks-ai",
        max_tokens=4096,
    )
    # Alternative models:
    # model = OpenAIServerModel(model_id="gpt-4o", max_tokens=4096)
    # model = LiteLLMModel(model_id="anthropic/claude-sonnet-4-20250514", max_tokens=4096)
    # See the smolagents documentation for more models -> https://github.com/huggingface/smolagents/blob/main/examples/agent_from_any_llm.py

    # Run benchmarks
    run_name = f"test_{datetime.now().strftime('%Y-%m-%d')}"
    max_samples_to_test = 1
    parallel_workers = 1
    osworld_env_config = OSWorldEnvironmentConfig(provider_name="docker")

    for benchmark in benchmarks:
        print(f"Running: {benchmark.name}")

        # Configure based on benchmark type
        config = EvaluationConfig(
            parallel_workers=parallel_workers,
            run_name=run_name,
            max_samples_to_test=max_samples_to_test,
            # Optionally resize the images given to the model - defaults to Qwen2.5-VL resize values
            image_resize_config=ImageResizeConfig(),
        )

        try:
            results = benchmark.evaluate(
                model,
                evaluation_config=config,
                env_config=osworld_env_config if "osworld" in benchmark.tags else None,
            )
            print(f"Results: {results._metrics}")

            # Save results
            with open(f"{RESULTS_DIR}/results_{run_name}.jsonl", "a") as f:
                entry = {"benchmark_name": benchmark.name, "metrics": results._metrics}
                f.write(json.dumps(entry) + "\n")
        except Exception as e:
            print(f"Error in {benchmark.name}: {e}")
            continue


if __name__ == "__main__":
    run_benchmarks()
```
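Each line of the results file is a JSON object containing the benchmark name and its metrics, as written by the script above. Below is a minimal sketch for loading a run back in, assuming the same results layout; the run name is a placeholder for whatever you passed to `EvaluationConfig`:

```python
import json
import os

# Placeholder run name - replace with the run_name used in your evaluation.
run_name = "test_2025-01-01"
results_path = os.path.join(os.path.dirname(__file__), "results", f"results_{run_name}.jsonl")

# Print the stored metrics for every benchmark in the run.
with open(results_path) as f:
    for line in f:
        entry = json.loads(line)
        print(entry["benchmark_name"], entry["metrics"])
```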
To run OSWorld Google tasks, you need to create a Google account and a Google Cloud project; see the OSWorld documentation for more details.
This project is licensed under the terms of the Apache License 2.0.