GitHub - WPENGxs/X-WebAgentBench: [ACL 2025 Findings] X-WebAgentBench: A Multilingual Interactive Web Benchmark for Evaluating Global Agentic System

X-WebAgentBench: A Multilingual Interactive Web Benchmark for Evaluating Global Agentic System

X-WebAgentBench

Data Preparation

Download the product data and instruction data (about 5 GB after unzipping). Download link: Google Drive.
Unzip the files into X-WebAgentBench/data/.

Setup

Create conda environment:

conda create -n xwebagentbench python=3.8.13
conda activate xwebagentbench

Install the dependencies and the JDK:

conda install -c pytorch faiss-cpu
conda install -c conda-forge openjdk=11
pip install -r requirements.txt

Note: If you get the error "command 'g++' failed: No such file or directory", install g++ with sudo apt-get install g++.

Download the spaCy en_core_web_lg model:

python -m spacy download en_core_web_lg

Create the index files (15 languages, about 30 GB in total). This takes about 30 minutes on our machine:

(xwebagentbench) root@server:~/X-WebAgentBench$ ./reset_index.sh

Set the host in web_agent_site/app_multi.py (the default port is 3000):

--> app.run(host='your ip address', port=3000)

Launch X-WebAgentBench (requires 16 GB of RAM):

(xwebagentbench) root@server:~/X-WebAgentBench$ python -m web_agent_site.app_multi --log

Note: We recommend a server with at least 16 GB of memory.

Start in Specific Languages: To start X-WebAgentBench in specific languages only, modify X-WebAgentBench/web_agent_site/app_multi.py:

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="WebShop flask app backend configuration")
    parser.add_argument("--log", action='store_true', help="Log actions on WebShop in trajectory file")
    parser.add_argument("--attrs", action='store_true', help="Show attributes tab in item page")

--> languages = ['en', 'zh', 'fr', 'es', 'de', 'el', 'bg', 'ru', 'tr', 'ar', 'vi', 'th', 'hi', 'sw', 'ur']

Adjust the languages list to the languages you want to start. 'en' MUST be in the list.
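As a minimal sketch of the constraint above, the check below validates a language selection before you write it into app_multi.py. The helper name check_languages is hypothetical and is not part of the repository; the supported codes are copied from the default list shown above.

```python
# Supported language codes, copied from the default list in app_multi.py.
SUPPORTED = ['en', 'zh', 'fr', 'es', 'de', 'el', 'bg', 'ru', 'tr',
             'ar', 'vi', 'th', 'hi', 'sw', 'ur']


def check_languages(languages):
    """Hypothetical helper: return a copy of the list if the selection is valid."""
    unknown = [lang for lang in languages if lang not in SUPPORTED]
    if unknown:
        raise ValueError(f"unsupported language codes: {unknown}")
    if 'en' not in languages:
        raise ValueError("'en' must be in the list")
    return list(languages)


# Example selection: English plus Chinese and Arabic.
languages = check_languages(['en', 'zh', 'ar'])
```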

Usage

Webpage: You can access X-WebAgentBench directly in a browser at the following URL:

http://webshop_url:<port>/<language>/fixed_<num>

<port> is the port set when X-WebAgentBench was launched; it defaults to 3000.
<language> is the abbreviation of the language (for example, en or zh).
<num> is the human_ins id, between 0 and 199, inclusive.
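The URL pattern above can be assembled with a few lines of Python. This is only a sketch: the helper name task_url is hypothetical, and webshop_url stands in for your own host as it does in the pattern.

```python
def task_url(host, language, num, port=3000):
    """Hypothetical helper: build the URL of one fixed task page."""
    # The task id range (0 to 199, inclusive) is taken from the description above.
    if not 0 <= num <= 199:
        raise ValueError("num must be between 0 and 199, inclusive")
    return f"http://{host}:{port}/{language}/fixed_{num}"


print(task_url("webshop_url", "zh", 5))
# prints http://webshop_url:3000/zh/fixed_5
```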

Text: You can also interact with X-WebAgentBench from code. Eval/WebShopEnv.py parses the HTML into text, so you can send actions and fetch the text content:

from WebShopEnv import webshopEnv
env = webshopEnv()
res = env.step(f'{<language>}/fixed_{<num_id>}', <action>, <action_attr>)
observation = res[0]

<action>: Action input
<action_attr>: ['reset', 'think', 'search', 'click']

Example observation (init page):

WebShop 
Instruction:  
im looking for a earbud headphones for stereo sound quality of style je-04b which will be more comfortable for me to use without disturbing others , and price lower than 60.00 dollars 
[Search]

Example observation (search page):

Instruction: 
im looking for a earbud headphones for stereo sound quality of style je-04b which will be more comfortable for me to use without disturbing others , and price lower than 60.00 dollars 
[Back to Search] 
Page 1 (Total results: 50) 
[Next >] 
[B09743DFJC] 
Jinpei Cute Panda Wireless Earphones, Waterproof, Noise Cancelling in-Ear erbuds, TWS Stereo Headphones, Built in mic Headset Premium Sound with deep Bass 
$39.9 
[B097445J5R] 
Jinpei Cute Pink cat Wireless Earphones, Waterproof, Noise Cancelling in-Ear erbuds, TWS Stereo Headphones, Built in mic Headset Premium Sound with deep bass 
$39.9 
[B084TSH1YW] 
TWS Headphones Wireless Earbuds Earphones for Moto Z4, True Wireless Stereo Headset Hands-Free Mic Charging Case Compatible with Motorola Moto Z4 
$59.99 
[B09QVFV6GG] 
TWS Headphones Wireless Earbuds Earphones for Galaxy A03s - True Wireless Stereo Headset Hands-Free Mic Charging Case Compatible with Samsung Galaxy A03s 
$54.99 
[B084Z3GPP7] 
TWS Headphones Wireless Earbuds Earphones for Moto G Stylus, True Wireless Stereo Headset Hands-Free Mic Charging Case Compatible with Motorola Moto G Stylus 
$59.99
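If an agent needs the search results as structured data, the text observation can be split into product entries. The sketch below assumes the layout shown in the example above: a bracketed product id on its own line, followed by the title line and the price line. The function name parse_search_page and the 10-character id pattern are assumptions inferred from that example, not part of the repository. The sample titles are shortened for readability.

```python
import re

# Assumption: product ids are 10 uppercase letters/digits, as in the example above.
ID_LINE = re.compile(r"^\[([A-Z0-9]{10})\]$")


def parse_search_page(observation):
    """Hypothetical helper: extract (id, title, price) entries from a search page."""
    lines = [line.strip() for line in observation.splitlines() if line.strip()]
    results = []
    for i, line in enumerate(lines):
        match = ID_LINE.match(line)
        # An entry needs two more lines after the id: title and price.
        if match and i + 2 < len(lines):
            results.append({
                "id": match.group(1),
                "title": lines[i + 1],
                "price": lines[i + 2],  # kept as text, e.g. "$39.9"
            })
    return results


sample = """[Back to Search]
Page 1 (Total results: 50)
[Next >]
[B09743DFJC]
Jinpei Cute Panda Wireless Earphones
$39.9
[B084TSH1YW]
TWS Headphones Wireless Earbuds Earphones for Moto Z4
$59.99"""

for item in parse_search_page(sample):
    print(item["id"], item["price"])
```

Navigation buttons such as [Back to Search] and [Next >] do not match the id pattern, so they are skipped.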

Example observation (item page):

Instruction: 
im looking for a earbud headphones for stereo sound quality of style je-04b which will be more comfortable for me to use without disturbing others , and price lower than 60.00 dollars 
[Back to Search] 
[< Prev] 
style [je-01b][je-02b][je-03b][je-04b][je-05b]
Jinpei Cute Panda Wireless Earphones, Waterproof, Noise Cancelling in-Ear erbuds, TWS Stereo Headphones, Built in mic Headset Premium Sound with deep Bass 
Price: $39.9 
Rating: N.A. 
[Description] 
[Features] 
[Reviews] 
[Buy Now]

Example observation (buy page):

Your score (min 0.0, max 1.0): 0.6666666666666666
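The reward can be read back from the buy-page text with a regular expression. This is a sketch that assumes the exact wording shown in the example above; parse_score is a hypothetical helper, not a function from Eval/WebShopEnv.py.

```python
import re

# Matches the buy-page line shown above and captures the numeric score.
SCORE_LINE = re.compile(r"Your score \(min 0\.0, max 1\.0\): ([0-9.]+)")


def parse_score(observation):
    """Hypothetical helper: return the score as a float, or None if absent."""
    match = SCORE_LINE.search(observation)
    return float(match.group(1)) if match else None


print(parse_score("Your score (min 0.0, max 1.0): 0.6666666666666666"))
```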

Evaluation

Setup

Create conda environment:

conda create -n eval python=3.10.0
conda activate eval

Install base environment:

pip install tqdm requests bs4 deep_translator

(Optional) To run GPT-3.5-turbo/GPT-4o, install openai and set your OpenAI API key in X-WebAgentBench/Eval/model.py:

--> client_gpt = OpenAI(api_key="api key", base_url="base url")

(Optional) For Llama, we call the Llama series models (Llama3-8B/70B, Llama3.1-8B/70B/405B) through an API. To run them locally, follow the official tutorial and modify our code. For Qwen2 and Mistral, install the open-source LLM dependencies as follows:

Qwen2: transformers>=4.40.0
Mistral: pip install transformers torch torchvision accelerate sentencepiece protobuf

Fill in the URL in X-WebAgentBench/Eval/WebShopEnv.py (for example, http://webshop_url:<port>):

--> WEBSHOP_URL = "webshop_url"

Run our code:

(eval) root@server:~/X-WebAgentBench/Eval$ python main.py --model MODEL --method METHOD --language LANGUAGE [--test_n NUM --device DEVICE]

MODEL = ['gpt-3.5-turbo', 'gpt-4o', 'qwen2', 'mistral', 'llama3']
METHOD = ['direct', 'translate_en', 'self-translate_en', 'clp_en']
LANGUAGE = ['zh', 'fr', 'es', 'de', 'el', 'bg', 'ru', 'tr', 'ar', 'vi', 'th', 'hi', 'sw', 'ur']
NUM = 200 (default)
DEVICE = 'cuda:0' (default)
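To evaluate one model and method across every language, the command line above can be generated in a loop. The sketch below only builds the argument lists, using the flags and language codes listed above; build_commands is a hypothetical helper and is not part of the repository.

```python
# Language codes accepted by --language, copied from the list above.
LANGUAGES = ['zh', 'fr', 'es', 'de', 'el', 'bg', 'ru', 'tr',
             'ar', 'vi', 'th', 'hi', 'sw', 'ur']


def build_commands(model, method, test_n=200, device='cuda:0'):
    """Hypothetical helper: one main.py argument list per language."""
    return [
        ['python', 'main.py',
         '--model', model,
         '--method', method,
         '--language', language,
         '--test_n', str(test_n),
         '--device', device]
        for language in LANGUAGES
    ]


for command in build_commands('gpt-4o', 'direct'):
    print(' '.join(command))
```

Each list can then be passed to subprocess.run(command, check=True) from the X-WebAgentBench/Eval directory, inside the eval environment.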

Final Result

After the evaluation code finishes, the final result (Task Score = 100 * average reward) is printed:

########################################
model: MODEL
method: METHOD
language: LANGUAGE
total_score: SCORE
########################################
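The Task Score formula is simple enough to check by hand. The sketch below assumes each episode yields a reward between 0.0 and 1.0, as on the buy page; task_score is a hypothetical helper, and the rewards are made-up illustration values, not benchmark results.

```python
def task_score(rewards):
    """Hypothetical helper: Task Score = 100 * average reward."""
    if not rewards:
        raise ValueError("need at least one reward")
    return 100 * sum(rewards) / len(rewards)


# Illustration values only: four episodes with rewards in [0, 1].
print(task_score([1.0, 0.5, 0.0, 0.5]))  # prints 50.0
```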

In addition, the output log is saved to X-WebAgentBench/Eval/saved_log/MODEL/METHOD/METHOD_LANGUAGE.json.
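The log location follows directly from the model, method, and language names. The sketch below builds that path with pathlib; log_path is a hypothetical helper, and the structure of the JSON file itself is not described here, so the example stops at locating the file.

```python
from pathlib import Path


def log_path(model, method, language, root="X-WebAgentBench/Eval/saved_log"):
    """Hypothetical helper: path of the log written for one evaluation run."""
    return Path(root) / model / method / f"{method}_{language}.json"


path = log_path("gpt-4o", "direct", "zh")
print(path.as_posix())
# prints X-WebAgentBench/Eval/saved_log/gpt-4o/direct/direct_zh.json
```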

Reference

If you find this project useful for your research, please consider citing the following paper:

@article

Contact

If you have any questions or suggestions, please create a GitHub issue here or email Peng Wang, Ruihan Tao, and Libo Qin.
