ScreenSuite

A comprehensive benchmarking suite for evaluating Graphical User Interface (GUI) agents (i.e. agents that act on your screen, like our Computer Agent) across key areas of ability: perception, single-step and multi-step agentic behaviour.

ScreenSuite does not aim to compare agent implementations, only the MLLMs that power them: we therefore provide only simple agent implementations based on smolagents.

GUI Agent Benchmarks Overview

Grounding/Perception Benchmarks

| Data Source | Evaluation Type | Platform | Link |
|---|---|---|---|
| ScreenSpot | BBox + click accuracy | Web | HuggingFace |
| ScreenSpot v2 | BBox + click accuracy | Web | HuggingFace |
| ScreenSpot-Pro | BBox + click accuracy | Web | HuggingFace |
| Visual-WebBench | Multi-task (Caption, OCR, QA, Grounding, Action) | Web | HuggingFace |
| WebSRC | Web QA | Web | HuggingFace |
| ScreenQA-short | Mobile QA | Mobile | HuggingFace |
| ScreenQA-complex | Mobile QA | Mobile | HuggingFace |
| Showdown-Clicks | Click prediction | Web | HuggingFace |

Single Step - Offline Agent Benchmarks

| Data Source | Evaluation Type | Platform | Link |
|---|---|---|---|
| Multimodal-Mind2Web | Web navigation | Web | HuggingFace |
| AndroidControl | Mobile control | Mobile | GitHub |

Multi-step - Online Agent Benchmarks

| Data Source | Evaluation Type | Platform | Link |
|---|---|---|---|
| Mind2Web-Live | URL matching | Web | HuggingFace |
| GAIA | Exact match | Web | HuggingFace |
| BrowseComp | LLM judge | Web | Link |
| AndroidWorld | Task-specific | Mobile | GitHub |
| MobileMiniWob | Task-specific | Mobile | Included in AndroidWorld (GitHub) |
| OSWorld | Task-specific | Desktop | GitHub |
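
The benchmarks above are also exposed programmatically through a registry, as used in the example script under "Running the benchmarks" below. A minimal sketch, assuming the get_registry() / list_all() API from that script, to enumerate what is registered locally:

# Minimal sketch: list registered benchmarks and their tags.
# Assumes the get_registry() / list_all() API shown in the example script below;
# the `tags` attribute is inferred from that script's "osworld" tag check.
from screensuite import get_registry

registry = get_registry()
for benchmark in registry.list_all():
    print(benchmark.name, getattr(benchmark, "tags", []))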

[Figure: ScreenSuite benchmark scores across evaluated models]

Cloning the Repository

Make sure to clone the repository with the required submodules:

git clone --recurse-submodules git@github.com:huggingface/screensuite.git

or, if you already cloned the repository (also run this after pulling to keep the submodules up to date):

git submodule update --init --recursive

Requirements

  • Docker
  • Python >= 3.11
  • uv

The multi-step agent benchmarks spawn containerized environments, which requires KVM virtualization to be enabled. To check whether your host supports KVM, run

egrep -c '(vmx|svm)' /proc/cpuinfo

on Linux. If the output is greater than zero, the processor supports KVM. Note that macOS hosts generally do not support KVM.
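
The same check can be scripted. A minimal Python sketch mirroring the egrep command above (Linux only):

# Minimal sketch: count /proc/cpuinfo lines exposing vmx/svm flags, mirroring
# `egrep -c '(vmx|svm)' /proc/cpuinfo`. Linux only.
import re
from pathlib import Path

def kvm_capable_cpus() -> int:
    lines = Path("/proc/cpuinfo").read_text().splitlines()
    return sum(1 for line in lines if re.search(r"\b(vmx|svm)\b", line))

if __name__ == "__main__":
    count = kvm_capable_cpus()
    print(f"{count} logical CPUs expose vmx/svm flags" if count else "No vmx/svm flags found; KVM likely unsupported")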

Installation

# Using uv (faster)
uv sync --extra submodules --python 3.11

If you encounter issues with the evdev Python package, try installing the build-essential package:

sudo apt-get install build-essential

Development

# Install development dependencies
uv sync --all-extras

# Run tests
uv run pytest

# Code quality
uv run pre-commit run --all-files --show-diff-on-failure

Running the benchmarks

#!/usr/bin/env python
import json
import os
from datetime import datetime

from dotenv import load_dotenv
from smolagents.models import InferenceClientModel

from screensuite import (
    EvaluationConfig,
    ImageResizeConfig,
    OSWorldEnvironmentConfig,
    get_registry,
)

load_dotenv()

# Setup results directory
RESULTS_DIR = os.path.join(os.path.dirname(__file__), "results")
os.makedirs(RESULTS_DIR, exist_ok=True)


def run_benchmarks():
    # Get benchmarks to run
    registry = get_registry()
    # benchmarks = registry.list_all()
    benchmarks = registry.get(
        [
            "screenqa_short",
            "screenqa_complex",
            "screenspot-v1-click-prompt",
            "screenspot-v1-bounding-box-prompt",
            "screenspot-v2-click-prompt",
            "screenspot-v2-bounding-box-prompt",
            "screenspot-pro-click-prompt",
            "screenspot-pro-bounding-box-prompt",
            "websrc_dev",
            "visualwebbench",
            "android_control",
            "showdown_clicks",
            "mmind2web",
            "android_world",
            "osworld",
            "gaia_web",
        ]
    )

    for bench in benchmarks:
        print(bench.name)

    # Configure your model (choose one)
    model = InferenceClientModel(
        model_id="Qwen/Qwen2.5-VL-32B-Instruct",
        provider="fireworks-ai",
        max_tokens=4096,
    )

    # Alternative models:
    # model = OpenAIServerModel(model_id="gpt-4o", max_tokens=4096)
    # model = LiteLLMModel(model_id="anthropic/claude-sonnet-4-20250514", max_tokens=4096)
    # see smolagents documentation for more models -> https://github.com/huggingface/smolagents/blob/main/examples/agent_from_any_llm.py

    # Run benchmarks
    run_name = f"test_{datetime.now().strftime('%Y-%m-%d')}"
    max_samples_to_test = 1
    parallel_workers = 1
    osworld_env_config = OSWorldEnvironmentConfig(provider_name="docker")

    for benchmark in benchmarks:
        print(f"Running: {benchmark.name}")

        # Configure based on benchmark type
        config = EvaluationConfig(
            parallel_workers=parallel_workers,
            run_name=run_name,
            max_samples_to_test=max_samples_to_test,
            image_resize_config=ImageResizeConfig(),  # Optional: resize images given to the model; defaults to Qwen2.5-VL resize values
        )

        try:
            results = benchmark.evaluate(
                model,
                evaluation_config=config,
                env_config=osworld_env_config if "osworld" in benchmark.tags else None,
            )
            print(f"Results: {results._metrics}")

            # Save results
            with open(f"{RESULTS_DIR}/results_{run_name}.jsonl", "a") as f:
                entry = {"benchmark_name": benchmark.name, "metrics": results._metrics}
                f.write(json.dumps(entry) + "\n")

        except Exception as e:
            print(f"Error in {benchmark.name}: {e}")
            continue


if __name__ == "__main__":
    run_benchmarks()
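
Each run appends one JSON object per benchmark to results/results_<run_name>.jsonl, matching the entry dict written in the script above. A minimal sketch for reading the results back (the run name below is a hypothetical example):

# Minimal sketch: read back the JSONL results written by the script above.
# The run name is a hypothetical example; substitute your own.
import json
from pathlib import Path

results_file = Path("results") / "results_test_2025-01-01.jsonl"

with results_file.open() as f:
    for line in f:
        entry = json.loads(line)
        print(entry["benchmark_name"], entry["metrics"])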

OSWorld Google Tasks

To run the OSWorld Google tasks, you need to create a Google account and a Google Cloud project. See the OSWorld documentation for more details.

License

This project is licensed under the terms of the Apache License 2.0.
