MLE-Dojo is a Gym-style framework for systematically training, evaluating, and improving autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Built upon 200+ real-world Kaggle challenges, MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic machine learning engineering scenarios such as data processing, architecture search, hyperparameter tuning, and code debugging. MLE-Dojo's fully executable environment and flexible interface support comprehensive agent training via both supervised fine-tuning and reinforcement learning, facilitating iterative experimentation, realistic data sampling, and real-time outcome verification.
- Docker (optional): required for building and running the execution environment.
- NVIDIA Container Toolkit: required if using GPUs within Docker containers.
- Conda: the most convenient way to build the Python environment is with `conda`.
- Git: basic git support for cloning the repository.
Clone the repository:
```bash
git clone https://github.com/MLE-Dojo/MLE-Dojo.git
cd MLE-Dojo
```
Build the Docker image (optional):
```bash
docker build -t mle-dojo .
```
Build the conda environment:
```bash
conda create -y -n mle-dojo python=3.11
conda activate mle-dojo
pip install --upgrade pip setuptools wheel
pip install -e .
```
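To verify the installation, a quick import check (a minimal sanity sketch) should succeed inside the activated environment:
```python
# Sanity check: both the package root and the main environment entry point
# should import cleanly after `pip install -e .`.
import mledojo
from mledojo.gym.env import KaggleEnvironment

print("mledojo installed and importable.")
```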
MLE-Dojo currently supports 200+ Kaggle competitions. We aggregate 68 competitions from MLE-Bench (excluding 7 tasks that are unavailable, excessively large, or tightly coupled to specific packages), 74 from DSBench, and 75 additional competitions carefully scraped and prepared from Kaggle's official website. After removing duplicate entries across sources, we obtain a diverse collection of over 200 unique tasks.
First, install the Kaggle API client:
```bash
pip install kaggle
```
Next, go to your Kaggle account settings, generate a new API token, and move the downloaded `kaggle.json` file to `~/.kaggle/kaggle.json` on Linux/macOS or `C:\Users\<YourUsername>\.kaggle\kaggle.json` on Windows. Refer to the Kaggle API documentation for details.
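Once the token is in place, a quick authentication check (a minimal sketch using the official `kaggle` package) can catch setup problems early:
```python
# Verify that kaggle.json is found and valid before preparing any data.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # raises an error if the token is missing or malformed
print("Kaggle API authenticated.")
```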
Before downloading a competition, you may need to manually accept its terms and conditions on its official Kaggle page. We provide the competition details in `prepare/*.json`, including website URLs, in ascending order of competition data size. Feel free to prepare and play with any competition you want!

❗️Note: this step is required for each competition you want to prepare and use.
To prepare newly introduced or MLE-Bench competitions, specify the `--competitions-file` (a text file with one competition per line), `--data-dir`, and `--logs-dir`. We actively maintain and update the supported competitions in `prepare/mle.json`.
```bash
PYTHONPATH="." python prepare/mle.py \
    --competitions-file prepare/competitions.txt \
    --data-dir ./data/prepared \
    --logs-dir ./data/prepare_logs
```
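For reference, the competitions file is plain text with one competition slug per line, for example (the second slug is purely illustrative):
```text
random-acts-of-pizza
some-other-competition-slug
```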
Alternatively, you can specify the `--competitions` to prepare directly via arguments:
```bash
PYTHONPATH="." python prepare/mle.py \
    --competitions random-acts-of-pizza \
    --data-dir ./data/prepared \
    --logs-dir ./data/prepare_logs
```
To prepare the DS-Bench data, specify paths for the raw and prepared data. This will prepare all competitions listed in `prepare/dsbench.json`:
```bash
PYTHONPATH="." python prepare/dsbench.py \
    --raw-dir ./data/raw \
    --prepared-dir ./data/prepared
```
⏰ Reminder: data preparation is both time- and space-consuming. Allocate sufficient resources based on the data size, and make sure you have accepted each competition's terms and conditions on the official website in advance.
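For example, a quick free-space check before kicking off a large preparation run (a small sketch using only the Python standard library):
```python
# Check available disk space before downloading large competition data.
import shutil

free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free disk space here: {free_gb:.1f} GB")
```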
Here's a quick example to show how to interact with MLE-Dojo:
```python
import os
from pathlib import Path

from mledojo.gym.competition import CompetitionRegistry, CompInfo  # CompInfo assumed to live alongside CompetitionRegistry
from mledojo.competitions import get_metric
from mledojo.gym.env import KaggleEnvironment

competition_name = "random-acts-of-pizza"
data_dir = ...    # path to the prepared competition data
output_dir = ...  # directory for generated submissions

# Create a registry and register the competition
registry = CompetitionRegistry()
registry.register(
    name=competition_name,
    data_dir=data_dir,  # e.g. "random-acts-of-pizza/data"
    comp_info=CompInfo(
        category="General",
        level="beginner",
        output_type="submission.csv",
        higher_is_better=True
    ),
    metric_class=get_metric(competition_name)
)

# Initialize the environment
env = KaggleEnvironment.make(
    competition_name=competition_name,
    output_dir=output_dir,
    competition_registry=registry,
    score_mode="position",
    gpu_device=0,
    gpu_memory_limit=32,
    execution_timeout=3600
)

# request_info: fetch competition details
env.step("request_info", info_type="overview")

# validate_code: preliminary checks inside the sandbox
env.step("validate_code", code="import pandas as pd\nprint('Welcome to MLE-Dojo!')")

# execute_code: run code that writes a submission
absolute_data_dir = Path(os.path.abspath(data_dir))
absolute_output_dir = Path(os.path.abspath(output_dir))
code_to_execute = f'''
import pandas as pd
submission = pd.read_csv('{absolute_data_dir / "public" / "sample_submission.csv"}')
submission.to_csv('{absolute_output_dir / "submission.csv"}', index=False)
print("Submission created successfully.")
'''
env.step("execute_code", code=code_to_execute)
```
We support running experiments both with and without Docker. We recommend Docker for running experiments with our supported agent scaffolds and models.
Create `.env` under the main directory and save your API keys (`HF_TOKEN` is used for local models):
```text
OPENAI_API_KEY=""
GEMINI_API_KEY=""
ANTHROPIC_API_KEY=""
DEEPSEEK_API_KEY=""
XAI_API_KEY=""
HF_TOKEN=""
```
We also support the Azure OpenAI API in `mledojo/chat/utils`; you can leave `OPENAI_API_KEY` blank and fill `AZURE_API_CONFIG` with the corresponding information.
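For reference, a minimal sketch of loading these keys at runtime with the `python-dotenv` package (MLE-Dojo's own loading logic may differ):
```python
# Load API keys from .env into the process environment.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
print("OPENAI_API_KEY set:", os.getenv("OPENAI_API_KEY") is not None)
```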
Once you have prepared some competitions, you can start running experiments on the prepared data.
Experiment with `o3-mini` and the `MLE Agent` scaffold on the prepared `random-acts-of-pizza` competition, or on the competitions in `prepare/competitions.txt` (replace `--competitions random-acts-of-pizza` with `--competitions-file prepare/competitions.txt`):
```bash
python run.py \
    --gpu-indices 0 \
    --max-tasks-per-gpu 1 \
    --competitions random-acts-of-pizza \
    --docker-image mle-dojo \
    --task-runtime-seconds 43200 \
    --kaggle-data ./data/prepared \
    --log-dir ./results/logs \
    --output-dir ./results/output
```
We also support running experiments without Docker; refer to `main.py` for more arguments:
```bash
python main.py \
    --output-dir ./output \
    --data-dir ./data/prepared \
    --competition-name random-acts-of-pizza \
    --agent-type mle
```
We use separate configs for the Agent and the Experiment: the Agent config lives at `mledojo/agent/*/config.yaml`, and the Experiment config is `config.yaml`.
MLE Agent scaffold supports most mainstream LLMs through API or local serving.
| `model_mode` | Supported `model_name` variants |
|---|---|
| `gpt` | gpt-4o, gpt-4o-mini, o1-mini, o3-mini... |
| `deepseek` | deepseek-reasoner, deepseek-chat |
| `gemini` | gemini-2.0-flash, gemini-2.0-pro, gemini-2.5-pro-preview-03-25, gemini-2.5-pro-exp-03-25... |
| `grok` | grok-3-beta, grok-3-mini-beta... |
| `claude` | claude-3-7-sonnet-latest, claude-3-5-sonnet-latest... |
| `local` | LLaMA, Qwen, ... |
For running local models, refer to `run_local.sh`. We use vLLM for local model serving.
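For reference, a minimal offline-inference sketch with vLLM looks like the following (the model name is illustrative; see `run_local.sh` for the actual serving setup):
```python
# Minimal vLLM offline inference; run_local.sh handles the full serving setup.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # illustrative model choice
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Write a pandas one-liner to read train.csv"], params)
print(outputs[0].outputs[0].text)
```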
We currently support:
- MLE Agent: agents with the originally supported action space in `mledojo`.
- AIDE (AI-Driven Exploration): an LLM agent that starts by generating a set of initial solution drafts and then iteratively refines and improves them based on performance feedback. We modify the source of the feedback from direct code outputs to `mle-dojo` environment feedback.
- OpenAI Agents: a package that enables building agentic AI apps with abstractions. We implement a version of MLE Agent with the `openai-agents` package.
We provide quick examples of the APIs below.

MLE-Dojo provides flexible Gym-style APIs that allow users to build personalized environments, introduce new datasets, and develop or utilize different agent scaffolds. Specifically, `mledojo` serves as a powerful and flexible toolkit for interacting with machine learning competitions, designed for ease of use and extensibility. Its core components offer a seamless development experience:
- `Interface`: provides the primary way to interact with a competition. It manages distinct sub-interfaces for specific tasks:
  - `InfoInterface`: fetches competition details (overview, data structure, etc.).
  - `CodeValidationInterface`: performs safe, preliminary checks on user code syntax and basic runtime behavior within a sandbox.
  - `CodeExecutionInterface`: manages the full execution of user code within the sandbox, handles submission generation, and triggers evaluation.
  - Flexibility: the `Interface` is modular, allowing developers to easily register custom interaction components tailored to specific needs.
- `Sandbox`: ensures secure and fair code execution by running user scripts in an isolated environment with configurable resource limits (GPU, memory, time).
- `FeedbackManager`: processes the results from validation and execution, generating structured and informative feedback.
  - Extensibility: designed to be extensible, allowing the registration of custom feedback providers (e.g., LLM-based analysis, specific error pattern detection) alongside the default `BaseFeedback`.
- `KaggleEnvironment`: the main entry point, wrapping all components into a standardized, Gymnasium-compatible environment. It orchestrates the entire workflow (setup, action handling via `step`, state tracking, feedback generation), providing a consistent API for users or automated agents.
MLE-Dojo combines well-defined interfaces, secure execution, structured feedback, and centralized management within a familiar Gym-style environment. Its modular and extensible design grants developers significant flexibility to adapt and build upon the core framework.
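Putting these pieces together, a typical Gym-style interaction loop might look like the following sketch. It reuses the `registry` from the quick-start example above; `generate_code` stands in for any LLM-backed policy and is hypothetical, and the exact structure of `step`'s return value should be checked against the environment's documentation.
```python
# A hedged sketch of an agent loop over the three built-in actions.
from mledojo.gym.env import KaggleEnvironment

env = KaggleEnvironment.make(
    competition_name="random-acts-of-pizza",
    output_dir="./results/output",
    competition_registry=registry,  # built as in the quick-start example
)

feedback = env.step("request_info", info_type="overview")
for _ in range(5):  # a few refinement iterations
    code = generate_code(feedback)           # hypothetical LLM-backed policy
    env.step("validate_code", code=code)     # cheap sandbox sanity check
    feedback = env.step("execute_code", code=code)
    # condition the next generation on the latest execution feedback
```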
MLE-Dojo provides a detailed history management system and a well-defined reward feedback mechanism, including both final outcome rewards and intermediate step rewards. This design enables flexible use for model training via Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). We provide both Agent History and Environment History structures.
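As an illustration of how such histories could feed SFT, a sketch of flattening step records into prompt/completion pairs is shown below; all field names are hypothetical, so consult the actual Agent History and Environment History structures for the real schema.
```python
# Hypothetical sketch: convert (observation, action, reward) history records
# into SFT examples. Field names are illustrative only.
def history_to_sft(history: list[dict]) -> list[dict]:
    examples = []
    for step in history:
        examples.append({
            "prompt": step["observation"],   # environment feedback shown to the agent
            "completion": step["action"],    # code the agent produced
            "reward": step["reward"],        # outcome or intermediate step reward
        })
    return examples
```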
Contributions are welcome! We will soon release instructions on Contributing.
The dataset is provided solely for educational and academic research purposes, with the goal of supporting scholarly work in relevant fields. Users must comply with the following terms:
- Data Integrity: While efforts have been made to curate and organize the dataset, no guarantees are made regarding its accuracy, completeness, or timeliness. Users are responsible for independently verifying the data and for any interpretations or outcomes based on its use.
- Permitted Use: The dataset is restricted to non-commercial use only. Any commercial application, product development, or monetization based on the dataset requires prior written permission from the competition organizers.
- Legal and Privacy Compliance: Users must comply with applicable laws and regulations, especially those related to data privacy and security. The dataset providers are not liable for any legal issues resulting from improper or unauthorized usage.
- Intellectual Property: The dataset may include content derived from external sources and is intended for non-commercial research only. The providers do not claim ownership over such content and acknowledge the rights of the original creators. Users must ensure their usage does not infringe upon any intellectual property rights.
- Disclaimer: The dataset is provided "as is", without warranties of any kind. The providers are not responsible for any damages, direct or indirect, that may result from its use, including but not limited to financial loss, misinterpretation, or third-party claims.
The code of this project is licensed under the MIT License; see the LICENSE file for details. We adapt some of the data preparation code from DS-Bench and MLE-Bench. Different competitions may be governed by different licenses, and users are responsible for reviewing and complying with the specific license associated with each competition. We provide the URLs of the specific rules and license for each competition; please refer to them for details.
If you find this useful, you are more than welcome to cite:
```bibtex
@misc{qiang2025mledojointeractiveenvironmentsempowering,
  title={MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering},
  author={Rushi Qiang and Yuchen Zhuang and Yinghao Li and Dingu Sagar V K and Rongzhi Zhang and Changhao Li and Ian Shu-Hei Wong and Sherry Yang and Percy Liang and Chao Zhang and Bo Dai},
  year={2025},
  eprint={2505.07782},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2505.07782},
}
```