| Website | Paper | WebVoyager Model | WebArena Model |
Code repository for the preprint "Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction".
Junhong Shen*, Hao Bai*, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar
CMU, Scribe, UIUC, U Toronto, UC Berkeley, The AGI Company, NYU
*Equal contribution
TTI (Test-Time Interaction) is a framework for improving agent reasoning through scaled interaction during test time. The codebase supports training and evaluation on WebArena and WebVoyager benchmarks using online reinforcement learning techniques (filtered behavioral cloning).
This code base supports:
- Filtered Behavior Cloning: Advanced imitation learning with trajectory filtering
- Multi-GPU Support: Distributed training with DeepSpeed
- Vision-Language Models: Support for multimodal agents with image understanding
- Web Environment Integration: Native support for WebArena and WebVoyager benchmarks
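To make the filtered behavior cloning idea above concrete, here is a minimal sketch of the filtering step: collect rollouts, keep only those judged successful, and flatten the survivors into supervised examples. The `Trajectory` class, the `reward >= threshold` criterion, and the function names are assumptions for illustration, not the repository's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One rollout: (observation, action) steps plus a scalar outcome reward."""
    steps: list
    reward: float


def filter_trajectories(trajectories, threshold=1.0):
    """Keep rollouts whose reward reaches the threshold (assumed success criterion)."""
    return [t for t in trajectories if t.reward >= threshold]


def to_sft_examples(trajectories):
    """Flatten the kept rollouts into (observation, action) pairs for cloning."""
    return [step for t in trajectories for step in t.steps]


# Toy rollouts: one successful, one failed.
rollouts = [
    Trajectory(steps=[("page A", "click [3]"), ("page B", "answer: 42")], reward=1.0),
    Trajectory(steps=[("page A", "scroll down")], reward=0.0),
]
kept = filter_trajectories(rollouts)
examples = to_sft_examples(kept)
print(len(kept), "trajectory kept,", len(examples), "training examples")
```

Only the steps of successful rollouts reach the supervised update, which is what distinguishes filtered cloning from plain imitation on all collected data.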
Requirements:
- Python 3.10+
- CUDA-compatible GPU(s)
- Conda/Miniconda
- Clone the repository:
    git clone https://github.com/test-time-interaction/TTI.git
    cd TTI
- Create and activate the conda environment:
    conda env create -f environment.yml
    conda activate tti
- Alternative pip installation:
    pip install -e .
Before running training or evaluation, you need to set up the following credentials:
- OpenAI API Key (required for some evaluations):
    export OPENAI_API_KEY='your_openai_api_key_here'
- Hugging Face Token (required for model access):
    export HF_TOKEN='your_huggingface_token_here'
- Weights & Biases Key (optional, for experiment tracking):
  - Update wandb_key in configuration files
  - Set use_wandb: True in configs to enable logging
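Before launching a long run, it can help to confirm that the credentials are visible to the Python process. The following is a small sketch of such a check. The helper name `missing_credentials` is hypothetical, and treating both variables as mandatory is an assumption (the OpenAI key is only needed for some evaluations).

```python
import os

# Environment variable names taken from the setup steps above.
REQUIRED = ("OPENAI_API_KEY", "HF_TOKEN")


def missing_credentials(env=None, required=REQUIRED):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All required credentials are set.")
```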
For WebArena experiments, you'll need to set up the WebArena containers following the official instructions. To facilitate parallel data collection, we recommend creating multiple containers on a single machine. We provide a reference script to do so in scripts/create_webarena_containers.sh. Then, update the webarena_host addresses in the configuration files with your container endpoints.
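Once the containers are running, a quick TCP check can confirm that each endpoint accepts connections before you start collecting data. This is a minimal sketch using only the standard library. The host names and ports in `endpoints` are placeholders, not values taken from the repository, so substitute the addresses you wrote into the configuration files.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Placeholder endpoints (assumption): replace with your own container addresses.
endpoints = [("localhost", 8000), ("localhost", 8001)]
for host, port in endpoints:
    status = "ok" if is_reachable(host, port, timeout=1.0) else "unreachable"
    print(f"{host}:{port} {status}")
```

A successful connection only shows that something is listening on the port. It does not verify that the WebArena site inside the container has finished initializing.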
The main configuration files are located in scripts/config/main/:
- webvoyager_rl.yaml - WebVoyager training configuration
- webvoyager_eval.yaml - WebVoyager evaluation configuration
- webarena_rl.yaml - WebArena training configuration
- webarena_eval.yaml - WebArena evaluation configuration
- default.yaml - Base configuration with common settings
Key parameters you may want to modify:
- policy_lm: The base language model to use
- lm_lr: Learning rate for the language model
- batch_size: Training batch size
- rollout_size: Number of trajectories to collect
- vllm_tensor_parallel_size: Number of GPUs for model parallelism
- min_try: Number of times the agent is allowed to attempt submitting an answer. For example, setting it to 2 allows the model to re-check its answer (this is the inference-time interaction scaling strategy described in the paper)
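The effect of min_try can be pictured with a small episode loop. The sketch below is one plausible reading, not the repository's implementation: it assumes that an answer is only accepted on the agent's min_try-th submission, and that earlier submissions are turned into a re-check prompt. The `run_episode` function, the `answer:` action prefix, and the prompt wording are all hypothetical.

```python
from typing import Callable, Optional


def run_episode(agent_step: Callable[[str], str], task: str,
                min_try: int = 2, max_steps: int = 10) -> Optional[str]:
    """Run one episode, asking the agent to re-check until its min_try-th answer."""
    attempts = 0
    observation = task
    for _ in range(max_steps):
        action = agent_step(observation)
        if action.startswith("answer:"):
            attempts += 1
            if attempts >= min_try:
                return action  # accept the answer and end the episode
            # Assumed behavior: do not stop yet, prompt the agent to verify.
            observation = f"{task}\nYou proposed '{action}'. Re-check before submitting."
        else:
            observation = f"{task}\nLast action: {action}"
    return None  # step budget exhausted without an accepted answer


# Scripted stand-in for a real policy: it revises its answer after the re-check.
script = iter(["click [1]", "answer: 41", "answer: 42"])
result = run_episode(lambda obs: next(script), "What is 6 x 7?", min_try=2)
print(result)
```

With min_try=1 the same loop accepts the first answer immediately, so a larger value trades extra interaction steps for a chance to correct mistakes.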
We use at least 4x NVIDIA H100 GPUs for both training and evaluation.
The configuration file specifies the agent and evaluation prompts, which are stored under the prompts directory. You can generate new prompts by
Training data is stored in the tasks/ directory:
- webvoyager_train_data.jsonl - WebVoyager training tasks
- webvoyager_test_data.jsonl - WebVoyager test tasks
- webvoyager_subset.jsonl - WebVoyager subset for real-time progress tracking
- webarena_train_data.jsonl - WebArena training tasks
- webarena_test_data.jsonl - WebArena test tasks
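The task files use the JSON Lines format, so each line is one JSON record. A minimal sketch for inspecting them is shown below. The `load_tasks` helper is hypothetical, and the sketch makes no assumption about the field names inside each record: it only prints whatever keys it finds.

```python
import json
from pathlib import Path


def load_tasks(path):
    """Read a JSON Lines file into a list, skipping blank lines."""
    tasks = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                tasks.append(json.loads(line))
    return tasks


# Run from the repository root; the file name comes from the list above.
path = Path("tasks/webvoyager_train_data.jsonl")
if path.exists():
    tasks = load_tasks(path)
    first_keys = sorted(tasks[0]) if tasks else []
    print(len(tasks), "tasks; fields of first record:", first_keys)
else:
    print(path, "not found")
```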
Navigate to the scripts directory and run the appropriate training script:
cd scripts
./webvoyager_train.sh
./webarena_train.sh
The training scripts will:
- Activate the conda environment
- Set required environment variables
- Alternate between data collection (using vLLM) and model updates (using DeepSpeed)
In the training script, you can modify the curriculum learning schedule and the number of GPUs used.
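The alternation between data collection and model updates, combined with a curriculum, can be sketched as a simple loop. This is an illustration under stated assumptions rather than the logic of the actual training scripts: it assumes the curriculum controls the per-episode interaction budget, and the schedule values, `horizon_schedule`, `train`, and the stub `collect` and `update` callables are all hypothetical.

```python
def horizon_schedule(iteration, start=10, step=10, every=2, cap=30):
    """Hypothetical curriculum: raise the step budget every `every` iterations, up to `cap`."""
    return min(cap, start + step * (iteration // every))


def train(num_iterations, collect, update):
    """Alternate rollout collection and policy updates, logging each iteration."""
    log = []
    for it in range(num_iterations):
        horizon = horizon_schedule(it)
        rollouts = collect(horizon)                   # data collection phase
        kept = [r for r in rollouts if r["success"]]  # keep successful rollouts only
        update(kept)                                  # model update phase
        log.append((it, horizon, len(kept)))
    return log


# Stubs standing in for the real rollout workers and trainer.
def collect(horizon):
    return [{"success": i % 2 == 0, "horizon": horizon} for i in range(4)]


updates = []
log = train(3, collect, updates.append)
print(log)
```

In a real run the two phases would be separate processes, since the inference engine and the trainer both need the GPUs.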
The TTI checkpoints of WebVoyager and WebArena are released:
You can directly specify these in the configuration file.
To evaluate on WebArena:
cd scripts
./webarena_eval.sh
To evaluate on WebVoyager:
cd scripts
./webvoyager_eval.sh
- CUDA out of memory: Reduce batch_size, safe_batch_size, or vllm_tensor_parallel_size
- Port conflicts: Change master_port values in training scripts
- WebArena connection errors: Verify container setup and host addresses
- Model download failures: Check HuggingFace token and internet connection
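For the port conflict case, one way to pick a replacement value for master_port is to ask the operating system for a currently unused port. The sketch below uses only the standard library, and the helper name `find_free_port` is hypothetical. Note that the port is released before the function returns, so another process could claim it before training starts.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Bind to port 0 so the OS assigns an unused port, then return its number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


port = find_free_port()
print("Candidate master_port:", port)
```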
If you use this code in your research, please cite:
@misc{shenbai2025tti,
title={Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction},
author={Junhong Shen and Hao Bai and Lunjun Zhang and Yifei Zhou and Amrith Setlur and Shengbang Tong and Diego Caples and Nan Jiang and Tong Zhang and Ameet Talwalkar and Aviral Kumar},
year={2025},
eprint={2506.07976},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.07976},
}