MLE-Dojo is a Gym-style framework for systematically training, evaluating, and improving autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Built upon 200+ real-world Kaggle challenges, MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic machine learning engineering scenarios such as data processing, architecture search, hyperparameter tuning, and code debugging. MLE-Dojo's fully executable environment and flexible interface support comprehensive agent training via both supervised fine-tuning and reinforcement learning, facilitating iterative experimentation, realistic data sampling, and real-time outcome verification.
- Docker (optional): required for building and running the execution environment.
- NVIDIA Container Toolkit: required if using GPUs within Docker containers.
- Conda: the most convenient way to build the Python environment is with `conda`.
- Git: basic git support for cloning the repository.
Clone the repository:
```bash
git clone https://github.com/MLE-Dojo/MLE-Dojo.git
cd MLE-Dojo
```
Build the Docker image (optional):
```bash
docker build -t mle-dojo .
```
Build the conda environment:
```bash
conda create -y -n mle-dojo python=3.11
conda activate mle-dojo
pip install --upgrade pip setuptools wheel
pip install -e .
```
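To verify the installation, a quick import check (a minimal sanity sketch) should succeed inside the activated environment:
```python
# Sanity check: both the package root and the main environment entry point
# should import cleanly after `pip install -e .`.
import mledojo
from mledojo.gym.env import KaggleEnvironment

print("mledojo installed and importable.")
```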
MLE-Dojo currently supports 200+ Kaggle competitions. We aggregate 68 competitions from MLE-Bench (excluding 7 tasks that are unavailable, excessively large, or tightly coupled to specific packages), 74 from DSBench, and 75 additional competitions carefully scraped and prepared from Kaggle's official website. After removing duplicate entries across sources, we obtain a diverse collection of over 200 unique tasks.
First, install the Kaggle API client:
```bash
pip install kaggle
```
Next, go to your Kaggle account settings, generate a new API token, and move the downloaded `kaggle.json` file to `~/.kaggle/kaggle.json` on Linux/macOS or `C:\Users\<YourUsername>\.kaggle\kaggle.json` on Windows. Refer to the Kaggle API documentation for details.
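Once the token is in place, a quick authentication check (a minimal sketch using the official `kaggle` package) can catch setup problems early:
```python
# Verify that kaggle.json is found and valid before preparing any data.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # raises an error if the token is missing or malformed
print("Kaggle API authenticated.")
```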
Before downloading a competition, you may need to manually accept its terms and conditions on its official Kaggle page. We provide the competition details in `prepare/*.json`, including website URLs, in ascending order of competition data size. Feel free to prepare and play with any competition you want!

❗️Note: this step is required for each competition you want to prepare and use.
To prepare newly introduced or MLE-Bench competitions, specify the `--competitions-file` (a text file with one competition per line), `--data-dir`, and `--logs-dir`. We actively maintain and update the supported competitions in `prepare/mle.json`.
```bash
PYTHONPATH="." python prepare/mle.py \
    --competitions-file prepare/competitions.txt \
    --data-dir ./data/prepared \
    --logs-dir ./data/prepare_logs
```
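For reference, the competitions file is plain text with one competition slug per line, for example (the second slug is purely illustrative):
```text
random-acts-of-pizza
some-other-competition-slug
```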
Alternatively, you can specify the `--competitions` to prepare directly via arguments:
```bash
PYTHONPATH="." python prepare/mle.py \
    --competitions random-acts-of-pizza \
    --data-dir ./data/prepared \
    --logs-dir ./data/prepare_logs
```
To prepare the DS-Bench data, specify paths for the raw and prepared data. This will prepare all competitions listed in `prepare/dsbench.json`:
```bash
PYTHONPATH="." python prepare/dsbench.py \
    --raw-dir ./data/raw \
    --prepared-dir ./data/prepared
```
⏰ Reminder: data preparation is both time- and space-consuming. Allocate sufficient resources based on the data size, and make sure you have accepted each competition's terms and conditions on the official website in advance.
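For example, a quick free-space check before kicking off a large preparation run (a small sketch using only the Python standard library):
```python
# Check available disk space before downloading large competition data.
import shutil

free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free disk space here: {free_gb:.1f} GB")
```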
Here's a quick example to show how to interact with MLE-Dojo:
```python
import os
from pathlib import Path

from mledojo.gym.competition import CompetitionRegistry, CompInfo  # CompInfo assumed to live alongside CompetitionRegistry
from mledojo.competitions import get_metric
from mledojo.gym.env import KaggleEnvironment

competition_name = "random-acts-of-pizza"
data_dir = ...    # path to the prepared competition data
output_dir = ...  # directory for generated submissions

# Create a registry and register the competition
registry = CompetitionRegistry()
registry.register(
    name=competition_name,
    data_dir=data_dir,  # e.g. "random-acts-of-pizza/data"
    comp_info=CompInfo(
        category="General",
        level="beginner",
        output_type="submission.csv",
        higher_is_better=True
    ),
    metric_class=get_metric(competition_name)
)

# Initialize the environment
env = KaggleEnvironment.make(
    competition_name=competition_name,
    output_dir=output_dir,
    competition_registry=registry,
    score_mode="position",
    gpu_device=0,
    gpu_memory_limit=32,
    execution_timeout=3600
)

# request_info: fetch competition details
env.step("request_info", info_type="overview")

# validate_code: preliminary checks inside the sandbox
env.step("validate_code", code="import pandas as pd\nprint('Welcome to MLE-Dojo!')")

# execute_code: run code that writes a submission
absolute_data_dir = Path(os.path.abspath(data_dir))
absolute_output_dir = Path(os.path.abspath(output_dir))
code_to_execute = f'''
import pandas as pd
submission = pd.read_csv('{absolute_data_dir / "public" / "sample_submission.csv"}')
submission.to_csv('{absolute_output_dir / "submission.csv"}', index=False)
print("Submission created successfully.")
'''
env.step("execute_code", code=code_to_execute)
```
We support running experiments both with and without Docker. We recommend Docker for running experiments with our supported agent scaffolds and models.
Create `.env` under the main directory and save your API keys (`HF_TOKEN` is used for local models):
```text
OPENAI_API_KEY=""
GEMINI_API_KEY=""
ANTHROPIC_API_KEY=""
DEEPSEEK_API_KEY=""
XAI_API_KEY=""
HF_TOKEN=""
```
We also support the Azure OpenAI API in `mledojo/chat/utils`; you can leave `OPENAI_API_KEY` blank and fill `AZURE_API_CONFIG` with the corresponding information.
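For reference, a minimal sketch of loading these keys at runtime with the `python-dotenv` package (MLE-Dojo's own loading logic may differ):
```python
# Load API keys from .env into the process environment.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
print("OPENAI_API_KEY set:", os.getenv("OPENAI_API_KEY") is not None)
```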
Once you have prepared some competitions, you can start running experiments on the prepared data.
Experiment with `o3-mini` and the `MLE Agent` scaffold on the prepared `random-acts-of-pizza` competition, or on the competitions in `prepare/competitions.txt` (replace `--competitions random-acts-of-pizza` with `--competitions-file prepare/competitions.txt`):
```bash
python run.py \
    --gpu-indices 0 \
    --max-tasks-per-gpu 1 \
    --competitions random-acts-of-pizza \
    --docker-image mle-dojo \
    --task-runtime-seconds 43200 \
    --kaggle-data ./data/prepared \
    --log-dir ./results/logs \
    --output-dir ./results/output
```
We also support running experiments without Docker; refer to `main.py` for more arguments:
```bash
python main.py \
    --output-dir ./output \
    --data-dir ./data/prepared \
    --competition-name random-acts-of-pizza \
    --agent-type mle
```
We use separate configs for the Agent and the Experiment: the Agent config lives at `mledojo/agent/*/config.yaml`, and the Experiment config is `config.yaml`.
MLE Agent scaffold supports most mainstream LLMs through API or local serving.
| `model_mode` | Supported `model_name` variants |
|---|---|
| `gpt` | gpt-4o, gpt-4o-mini, o1-mini, o3-mini... |
| `deepseek` | deepseek-reasoner, deepseek-chat |
| `gemini` | gemini-2.0-flash, gemini-2.0-pro, gemini-2.5-pro-preview-03-25, gemini-2.5-pro-exp-03-25... |
| `grok` | grok-3-beta, grok-3-mini-beta... |
| `claude` | claude-3-7-sonnet-latest, claude-3-5-sonnet-latest... |
| `local` | LLaMA, Qwen, ... |
For running local models, refer to `run_local.sh`. We use vLLM for local model serving.
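For reference, a minimal offline-inference sketch with vLLM looks like the following (the model name is illustrative; see `run_local.sh` for the actual serving setup):
```python
# Minimal vLLM offline inference; run_local.sh handles the full serving setup.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # illustrative model choice
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Write a pandas one-liner to read train.csv"], params)
print(outputs[0].outputs[0].text)
```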
We currently support:
- MLE Agent: agents with the originally supported action space in `mledojo`.
- AIDE (AI-Driven Exploration): an LLM agent that starts by generating a set of initial solution drafts and then iteratively refines and improves them based on performance feedback. We modify the source of the feedback from direct code outputs to `mle-dojo` environment feedback.
- OpenAI Agents: a package that enables building agentic AI apps with abstractions. We implement a version of MLE Agent with the `openai-agents` package.
We provide quick examples of the APIs below.

MLE-Dojo provides flexible Gym-style APIs that allow users to build personalized environments, introduce new datasets, and develop or utilize different agent scaffolds. Specifically, `mledojo` serves as a powerful and flexible toolkit for interacting with machine learning competitions, designed for ease of use and extensibility. Its core components offer a seamless development experience:
- `Interface`: provides the primary way to interact with a competition. It manages distinct sub-interfaces for specific tasks:
  - `InfoInterface`: fetches competition details (overview, data structure, etc.).
  - `CodeValidationInterface`: performs safe, preliminary checks on user code syntax and basic runtime behavior within a sandbox.
  - `CodeExecutionInterface`: manages the full execution of user code within the sandbox, handles submission generation, and triggers evaluation.
  - Flexibility: the `Interface` is modular, allowing developers to easily register custom interaction components tailored to specific needs.
- `Sandbox`: ensures secure and fair code execution by running user scripts in an isolated environment with configurable resource limits (GPU, memory, time).
- `FeedbackManager`: processes the results from validation and execution, generating structured and informative feedback.
  - Extensibility: designed to be extensible, allowing the registration of custom feedback providers (e.g., LLM-based analysis, specific error pattern detection) alongside the default `BaseFeedback`.
- `KaggleEnvironment`: the main entry point, wrapping all components into a standardized, Gymnasium-compatible environment. It orchestrates the entire workflow (setup, action handling via `step`, state tracking, feedback generation), providing a consistent API for users or automated agents.
MLE-Dojo combines well-defined interfaces, secure execution, structured feedback, and centralized management within a familiar Gym-style environment. Its modular and extensible design grants developers significant flexibility to adapt and build upon the core framework.
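Putting these pieces together, a typical Gym-style interaction loop might look like the following sketch. It reuses the `registry` from the quick-start example above; `generate_code` stands in for any LLM-backed policy and is hypothetical, and the exact structure of `step`'s return value should be checked against the environment's documentation.
```python
# A hedged sketch of an agent loop over the three built-in actions.
from mledojo.gym.env import KaggleEnvironment

env = KaggleEnvironment.make(
    competition_name="random-acts-of-pizza",
    output_dir="./results/output",
    competition_registry=registry,  # built as in the quick-start example
)

feedback = env.step("request_info", info_type="overview")
for _ in range(5):  # a few refinement iterations
    code = generate_code(feedback)           # hypothetical LLM-backed policy
    env.step("validate_code", code=code)     # cheap sandbox sanity check
    feedback = env.step("execute_code", code=code)
    # condition the next generation on the latest execution feedback
```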
MLE-Dojo provides a detailed history management system and a well-defined reward feedback mechanism, including both final outcome rewards and intermediate step rewards. This design enables flexible use for model training via Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). We provide both Agent History and Environment History structures.
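As an illustration of how such histories could feed SFT, a sketch of flattening step records into prompt/completion pairs is shown below; all field names are hypothetical, so consult the actual Agent History and Environment History structures for the real schema.
```python
# Hypothetical sketch: convert (observation, action, reward) history records
# into SFT examples. Field names are illustrative only.
def history_to_sft(history: list[dict]) -> list[dict]:
    examples = []
    for step in history:
        examples.append({
            "prompt": step["observation"],   # environment feedback shown to the agent
            "completion": step["action"],    # code the agent produced
            "reward": step["reward"],        # outcome or intermediate step reward
        })
    return examples
```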
Contributions are welcome! We will soon release instructions on Contributing.
The dataset is provided solely for educational and academic research purposes, with the goal of supporting scholarly work in relevant fields. Users must comply with the following terms:
- Data Integrity: While efforts have been made to curate and organize the dataset, no guarantees are made regarding its accuracy, completeness, or timeliness. Users are responsible for independently verifying the data and for any interpretations or outcomes based on its use.
- Permitted Use: The dataset is restricted to non-commercial use only. Any commercial application, product development, or monetization based on the dataset requires prior written permission from the competition organizers.
- Legal and Privacy Compliance: Users must comply with applicable laws and regulations, especially those related to data privacy and security. The dataset providers are not liable for any legal issues resulting from improper or unauthorized usage.
- Intellectual Property: The dataset may include content derived from external sources and is intended for non-commercial research only. The providers do not claim ownership over such content and acknowledge the rights of the original creators. Users must ensure their usage does not infringe upon any intellectual property rights.
- Disclaimer: The dataset is provided "as is", without warranties of any kind. The providers are not responsible for any damages, direct or indirect, that may result from its use, including but not limited to financial loss, misinterpretation, or third-party claims.
The code of this project is licensed under the MIT License; see the LICENSE file for details. We adapt some of the data preparation code from DS-Bench and MLE-Bench. Different competitions may be governed by different licenses, and users are responsible for reviewing and complying with the specific license associated with each competition. We provide the URLs of the specific rules and license for each competition; please refer to them for details.
If you find this useful, you are more than welcome to cite:
```bibtex
@misc{qiang2025mledojointeractiveenvironmentsempowering,
  title={MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering},
  author={Rushi Qiang and Yuchen Zhuang and Yinghao Li and Dingu Sagar V K and Rongzhi Zhang and Changhao Li and Ian Shu-Hei Wong and Sherry Yang and Percy Liang and Chao Zhang and Bo Dai},
  year={2025},
  eprint={2505.07782},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2505.07782},
}
```