
Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction

Code repository for the preprint "Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction".

Junhong Shen*, Hao Bai*, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar
CMU, Scribe, UIUC, U Toronto, UC Berkeley, The AGI Company, NYU
*Equal contribution

Overview

TTI (Test-Time Interaction) is a framework for improving agent reasoning by scaling interaction at test time. The codebase supports training and evaluation on the WebArena and WebVoyager benchmarks using online reinforcement learning (filtered behavior cloning).

The codebase supports:

Filtered Behavior Cloning: Advanced imitation learning with trajectory filtering
Multi-GPU Support: Distributed training with DeepSpeed
Vision-Language Models: Support for multimodal agents with image understanding
Web Environment Integration: Native support for WebArena and WebVoyager benchmarks
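
As a rough illustration of the filtered behavior cloning idea listed above, the sketch below keeps only rollouts whose final reward meets a threshold and flattens them into supervised examples. The `Trajectory` class, the `min_reward` threshold, and both function names are hypothetical and are not taken from the TTI code.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """One rollout: the task, its (observation, action) steps, and a final reward."""
    task: str
    steps: list = field(default_factory=list)
    reward: float = 0.0


def filter_trajectories(trajectories, min_reward=1.0):
    """Keep only trajectories whose final reward reaches the threshold (assumed criterion)."""
    return [t for t in trajectories if t.reward >= min_reward]


def to_sft_examples(trajectories):
    """Flatten the kept trajectories into (observation, action) pairs for fine-tuning."""
    return [(obs, act) for t in trajectories for obs, act in t.steps]
```

The filtered pairs would then be used as ordinary supervised fine-tuning data, which is what distinguishes this approach from policy-gradient methods.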

Installation

Prerequisites

Python 3.10+
CUDA-compatible GPU(s)
Conda/Miniconda

Environment Setup

Clone the repository:

git clone https://github.com/test-time-interaction/TTI.git
cd TTI

Create and activate conda environment:

conda env create -f environment.yml
conda activate tti

Alternative pip installation:
```
pip install -e .
```

Before running training or evaluation, you need to set up the following credentials:

OpenAI API Key (required for some evaluations):

export OPENAI_API_KEY='your_openai_api_key_here'

Hugging Face Token (required for model access):

export HF_TOKEN='your_huggingface_token_here'

Weights & Biases Key (optional, for experiment tracking):
- Update wandb_key in configuration files
- Set use_wandb: True in configs to enable logging
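
A small pre-flight check can catch missing credentials before a long run starts. This is a minimal sketch: the variable names come from the exports above, while `missing_credentials` is a hypothetical helper and not part of the repository.

```python
import os

# Environment variables named in the setup instructions above.
REQUIRED_VARS = ("OPENAI_API_KEY", "HF_TOKEN")


def missing_credentials(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        raise SystemExit("Missing credentials: " + ", ".join(missing))
    print("All credentials are set.")
```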

WebArena Setup

For WebArena experiments, you'll need to set up the WebArena containers following the official instructions. To facilitate parallel data collection, we recommend creating multiple containers on a single machine. We provide a reference script to do so in scripts/create_webarena_containers.sh. Then, update the webarena_host addresses in the configuration files with your container endpoints.
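
Before pointing the configuration at your containers, it can help to confirm that each endpoint accepts TCP connections. The sketch below uses only the standard library. The host and port pairs you pass in are your own, and the function names are hypothetical rather than anything shipped with the repository.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def unreachable_endpoints(endpoints, timeout=2.0):
    """Return the (host, port) pairs that could not be reached."""
    return [(h, p) for h, p in endpoints if not is_reachable(h, p, timeout)]
```

An empty result from `unreachable_endpoints` only shows that the ports are open. It does not verify that the WebArena sites behind them are healthy.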

Usage

Configuration Files

The main configuration files are located in scripts/config/main/:

webvoyager_rl.yaml - WebVoyager training configuration
webvoyager_eval.yaml - WebVoyager evaluation configuration
webarena_rl.yaml - WebArena training configuration
webarena_eval.yaml - WebArena evaluation configuration
default.yaml - Base configuration with common settings
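
The README does not say how default.yaml is combined with the task-specific files. A common pattern is a recursive merge in which task-specific values override the base, sketched below on plain dictionaries. `deep_merge` is a hypothetical helper, not the repository's loader, and the values in the usage example are placeholders.

```python
import copy


def deep_merge(base, override):
    """Return a new dict where values in `override` take precedence over `base`.

    Nested dicts are merged recursively; all other values are replaced.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    # Placeholder values for illustration only.
    base = {"batch_size": 4, "use_wandb": False}
    task = {"batch_size": 8}
    print(deep_merge(base, task))
```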

Key parameters you may want to modify:

policy_lm: The base language model to use
lm_lr: Learning rate for the language model
batch_size: Training batch size
rollout_size: Number of trajectories to collect
vllm_tensor_parallel_size: Number of GPUs for model parallelism
min_try: Number of times the agent may attempt to submit an answer. For example, setting it to 2 lets the model re-check its answer (this is the inference-time interaction scaling strategy described in the paper, see here for implementation)
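
To make the effect of min_try concrete, here is a toy rollout loop. It assumes that an episode ends only once the agent has tried to answer min_try times, and that earlier attempts trigger a re-check prompt. That reading, the `ANSWER` and `SYSTEM` string conventions, and the `run_episode` function are assumptions for illustration and do not reflect the repository's actual implementation.

```python
def run_episode(policy, min_try=1, max_steps=30):
    """Toy rollout: the episode ends only after the agent has answered `min_try` times.

    `policy` is any callable mapping the history (a list of strings) to the next action.
    Returns (final_answer or None, history).
    """
    history = []
    attempts = 0
    for _ in range(max_steps):
        action = policy(history)
        history.append(action)
        if action.startswith("ANSWER"):
            attempts += 1
            if attempts >= min_try:
                return action, history
            # Not enough attempts yet: prompt the agent to keep interacting.
            history.append("SYSTEM: please re-check your answer before submitting")
    return None, history
```

With min_try=1 the first answer ends the episode. With min_try=2 the agent gets extra interaction steps to verify or revise its answer before it is accepted.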

We use at least 4x NVIDIA H100 GPUs for both training and evaluation.

Prompts

The configuration file specifies the agent and evaluation prompts, which are stored under the directory prompts. You can generate new prompts by

Data

Training and test data are stored in the tasks/ directory:

webvoyager_train_data.jsonl - WebVoyager training tasks
webvoyager_test_data.jsonl - WebVoyager test tasks
webvoyager_subset.jsonl - WebVoyager subset for real-time progress tracking
webarena_train_data.jsonl - WebArena training tasks
webarena_test_data.jsonl - WebArena test tasks
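
The task files are in JSON Lines format, so each non-empty line is one JSON object. The sketch below loads such a file without assuming any particular field names. `parse_jsonl` and `load_tasks` are hypothetical helpers, not functions from the repository.

```python
import json
from pathlib import Path


def parse_jsonl(lines):
    """Parse an iterable of JSON Lines strings into a list of objects, skipping blank lines."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_no}: invalid JSON ({err})") from err
    return records


def load_tasks(path):
    """Load all task records from a .jsonl file."""
    with Path(path).open(encoding="utf-8") as f:
        return parse_jsonl(f)


if __name__ == "__main__":
    tasks = load_tasks("tasks/webvoyager_test_data.jsonl")
    print(f"Loaded {len(tasks)} tasks")
```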

Training

Navigate to the scripts directory and run the appropriate training script:

cd scripts

WebVoyager Training

./webvoyager_train.sh

WebArena Training

./webarena_train.sh

The training scripts will:

Activate the conda environment
Set required environment variables
Alternate between data collection (using vllm) and model updates (using DeepSpeed)

In the training script, you can modify the curriculum learning schedule and the number of GPUs used.
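
The alternation between rollouts and updates, together with a curriculum over the interaction budget, can be sketched as follows. This is a schematic only: `horizon_schedule`, its default numbers, the `collect` and `update` callables, and the `reward` key are hypothetical stand-ins for the vLLM data collection and DeepSpeed update phases in the real scripts.

```python
def horizon_schedule(iteration, start=10, step=10, every=2, cap=30):
    """Hypothetical curriculum: grow the per-episode step budget every `every` iterations."""
    return min(cap, start + step * (iteration // every))


def train(num_iterations, collect, update):
    """Alternate between data collection and model updates.

    collect(horizon) -> list of trajectory dicts with a "reward" key (assumed format).
    update(trajectories) -> None; fine-tunes the policy on the kept trajectories.
    Returns a log of (iteration, horizon, number of kept trajectories).
    """
    log = []
    for it in range(num_iterations):
        horizon = horizon_schedule(it)
        trajectories = collect(horizon)  # rollout phase
        kept = [t for t in trajectories if t["reward"] > 0]  # filtering step
        update(kept)  # update phase
        log.append((it, horizon, len(kept)))
    return log
```

Starting with a short horizon and lengthening it over iterations is one way to express a curriculum. The schedule used by the training scripts may differ.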

Evaluation

The TTI checkpoints for WebVoyager and WebArena are released:

You can directly specify these in the configuration file.

WebArena Evaluation

cd scripts
./webarena_eval.sh

WebVoyager Evaluation

cd scripts
./webvoyager_eval.sh

Troubleshooting

CUDA out of memory: Reduce batch_size, safe_batch_size, or vllm_tensor_parallel_size
Port conflicts: Change master_port values in training scripts
WebArena connection errors: Verify container setup and host addresses
Model download failures: Check HuggingFace token and internet connection
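
For port conflicts, one option is to ask the operating system for an unused port and use that value for master_port. The helper below is a generic standard-library sketch, not part of the repository.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    print(find_free_port())
```

The port is released as soon as the function returns, so another process could claim it before the training job starts. Treat the result as a good candidate rather than a guarantee.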

Citation

If you use this code in your research, please cite:

@misc{shenbai2025tti,
      title={Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction}, 
      author={Junhong Shen and Hao Bai and Lunjun Zhang and Yifei Zhou and Amrith Setlur and Shengbang Tong and Diego Caples and Nan Jiang and Tong Zhang and Ameet Talwalkar and Aviral Kumar},
      year={2025},
      eprint={2506.07976},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.07976}, 
}
