
bytedance/web-bench

Web-Bench

Chinese | Install | Paper | Datasets | LeaderBoard | Citation

📖 Overview

Web-Bench is a benchmark designed to evaluate the performance of LLMs in real-world Web development. It contains 50 projects, each consisting of 20 tasks with sequential dependencies. The tasks implement project features in sequence, simulating real-world human development workflows. Web-Bench is designed to cover the foundational elements of Web development: Web Standards and Web Frameworks. The projects were designed by engineers with 5-10 years of experience, and their scale and complexity make each one a significant challenge: on average, a single project takes a senior engineer 4–8 hours to complete. With our benchmark agent (Web-Agent), the SOTA model (Claude 3.7 Sonnet) achieves only 25.1% Pass@1.

The distribution of the experimental data aligns well with the current code generation capabilities of mainstream LLMs.

[Figure: pass@1]

HumanEval and MBPP have approached saturation, and APPS and EvalPlus are approaching it. The SOTA for Web-Bench is 25.1%, which is lower than that of the SWE-bench Full and Verified sets. A lower SOTA is better here, because it means the benchmark is further from saturation.

[Figure: SOTAs]

🚀 Installation

  1. Install Node.js 22+
  2. Init

     git clone https://github.com/bytedance/Web-Bench.git
     cd Web-Bench
     npm i -g pnpm@9.12.0 @microsoft/rush@5.140.0 playwright@1.49.1
     cd projects/angular && npx playwright install
     rush update
     rush build
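
Before running the install steps, it can help to confirm that the prerequisites are in place. The following is a minimal sketch, not part of Web-Bench: the function names `node_major`, `node_is_supported`, and `preflight` are hypothetical helpers introduced here for illustration. It assumes only that `node --version` prints a string such as `v22.3.0`, and it checks that Node.js is version 22 or newer and that `pnpm` and `rush` are on `PATH`.

```shell
# Preflight sketch (illustrative helpers, not shipped with Web-Bench).

# Print the major version from a string such as "v22.3.0" or "22.3.0".
node_major() {
  nm_version="${1#v}"          # drop a leading "v" if present
  echo "${nm_version%%.*}"     # keep everything before the first dot
}

# Succeed only if the given version string is Node.js 22 or newer.
node_is_supported() {
  nis_major="$(node_major "$1")"
  case "$nis_major" in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: treat as unsupported
  esac
  [ "$nis_major" -ge 22 ]
}

# Check that node (22+), pnpm, and rush are available; return non-zero if not.
preflight() {
  pf_status=0
  if ! command -v node >/dev/null 2>&1; then
    echo "node not found; install Node.js 22+" >&2
    return 1
  fi
  if ! node_is_supported "$(node --version)"; then
    echo "Node.js 22+ required, found $(node --version)" >&2
    pf_status=1
  fi
  for pf_tool in pnpm rush; do
    if ! command -v "$pf_tool" >/dev/null 2>&1; then
      echo "$pf_tool not found on PATH" >&2
      pf_status=1
    fi
  done
  return "$pf_status"
}
```

One way to use it is to source the file and run `preflight && rush update && rush build`, so the Rush steps run only when the checks pass. The sketch checks only that `pnpm` and `rush` are present; it does not verify the pinned versions from the `npm i -g` line.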

If you wish to use Docker, refer to Docker Guide.

📘 Usage

Complete the Configuration step, then run:

rush eval

🛠️ Contribution

📚 Citation

@article{xu2025webbench,
  title={Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks},
  author={Xu, Kai and Mao, YiWei and Guan, XinYi and Feng, ZiLong},
  journal={arXiv preprint arXiv:2505.07473},
  year={2025}
}

📄 License

Apache 2.0
