📃 Paper • 📜 Document • 🤗 Data & Models
This is the official repository for the paper: "MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale". In the paper, we introduce MedAgentGym, the first publicly available training environment designed to enhance coding-based medical reasoning capabilities in large language model (LLM) agents.
MedAgentGym has been carefully curated with strict adherence to ethical standards, leveraging datasets that are publicly available or that incorporate rigorous privacy protection and anonymization measures. Table 7 in the Appendix provides detailed access requirements for each of the 12 datasets included in MedAgentGym. Researchers seeking access to preprocessed task and data files should first obtain and attach all required data usage agreements and submit a formal request via email to medagentgym@gmail.com, using the subject line "MedAgentGym Preprocessed Data Access".
This repository includes the basic task files `train_tasks.jsonl` and `test_tasks.jsonl`, which contain the task definitions for the training and test splits.
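A minimal sketch for inspecting these JSONL task files, assuming each non-empty line is one JSON task record (the exact field names are not specified here and may differ in the release):

```python
import json

def load_tasks(path: str) -> list[dict]:
    """Read a JSONL file where each non-empty line is one task record."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

train_tasks = load_tasks("train_tasks.jsonl")
print(len(train_tasks), "training tasks")
print(sorted(train_tasks[0].keys()))  # inspect whichever fields a record carries
```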
Once the previous step is complete and access is approved, we will send applicants a `download_data.py` script to download the entire pre-processed dataset from HuggingFace. The script automatically fetches the full datasets we have prepared and uploaded to a private repository under an anonymous HuggingFace account. Please download the data into the `./data/` directory; the downloaded dataset should follow the layout `./data/biocoder/*`.
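For reference, a minimal sketch of what such a download step might look like using `huggingface_hub.snapshot_download`; the actual `download_data.py` sent to approved applicants may differ, and the repository id and token below are placeholders:

```python
from huggingface_hub import snapshot_download

# Placeholders: the real repo id and token are provided upon access approval.
snapshot_download(
    repo_id="<private-dataset-repo>",
    repo_type="dataset",
    local_dir="./data",   # matches the expected ./data/<task>/* layout
    token="<hf_access_token>",
)
```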
Details of the datasets used in the paper are listed below:
MedAgentGym relies on a Docker environment for isolated coding and execution, so you need to build the Docker image first. Please run the following command:
docker buildx build -t ehr_gym:latest .
or directly run the script we have prepared:
bash build_docker.sh
Please prepare the experiment scripts in the `entrypoint.sh` file. For example, to run the experiments on the biocoder task and test the performance of gpt-4.1-mini, we can run the following command for 5-way parallel execution:
python3 /home/main.py --config /home/configs/gpt_4_1_mini/exp-gpt_4_1_mini-biocoder.yaml --async_run --parallel_backend joblib --n_jobs 5
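Conceptually, the `--parallel_backend joblib --n_jobs 5` flags correspond to running independent task episodes across five worker processes. A minimal sketch of that pattern (not the repository's `main.py`; `run_episode` and the task ids are hypothetical):

```python
from joblib import Parallel, delayed

def run_episode(task_id: str) -> bool:
    """Placeholder for one agent rollout on a single task."""
    return True

task_ids = [f"biocoder-{i}" for i in range(20)]  # hypothetical task ids
results = Parallel(n_jobs=5)(delayed(run_episode)(t) for t in task_ids)
print(f"{sum(results)}/{len(results)} episodes succeeded")
```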
The figure below highlights substantial performance gains from SFT across four open-source (OSS) backbone LLMs of varying sizes.
The table below compares several post-training methods. Simple SFT over successful trajectories significantly boosts performance on structured coding tasks, demonstrating its effectiveness in capturing structured coding patterns. In addition, DPO is particularly beneficial for optimizing open-ended task performance. Although DPO alone slightly underperforms SFT, combining an initial SFT warm-up with subsequent DPO further improves overall results by leveraging their complementary strengths.
Inference-Time Scaling: The left figure illustrates performance scaling with increased trajectory sampling. Pass@K improves substantially from 17.0% at K = 1 to 45.0% at K = 16, while Best@K advances steadily from 17.0% to 41.7%. The relatively small gap between the two metrics indicates that our trained verifier effectively identifies successful trajectories, highlighting its potential as a reward model for integration into advanced online RL frameworks such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO).
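A minimal sketch of how these two metrics can be computed from sampled trajectories; the unbiased Pass@K estimator follows Chen et al. (2021), while the Best@K sketch assumes per-trajectory verifier scores and may differ from the paper's exact evaluation code:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K: probability that at least one of k trajectories
    drawn from n samples is successful, given c of the n succeeded."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def best_at_k(scores: list[float], successes: list[bool], k: int) -> float:
    """Best@K for one task: the verifier picks the highest-scoring of the
    first k trajectories; report whether that pick actually succeeded."""
    top = max(range(k), key=lambda i: scores[i])
    return float(successes[top])

# Example: 16 sampled trajectories for one task, 4 of which succeed.
print(round(pass_at_k(n=16, c=4, k=8), 3))
```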
Training-Time Scaling: The right figure examines agent performance as a function of increased training data volumes (25%, 50%, 75%, and 100%) in SFT. We observe consistent performance improvements with greater training data availability, suggesting additional computational resources dedicated to sampling further trajectories are likely to yield continued performance gains.
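If you find MedAgentGym useful, please cite our paper: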
@article{xu2025medagentgym,
title={MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale},
author={Xu, Ran and Zhuang, Yuchen and Zhong, Yishan and Yu, Yue and Tang, Xiangru and Wu, Hang and Wang, May D and Ruan, Peifeng and Yang, Donghan and Wang, Tao and others},
journal={arXiv preprint arXiv:2506.04405},
year={2025}
}