MineWorld
A Real-time Interactive World Model on Minecraft

arXiv   Project   HuggingFace

We introduce MineWorld, an interactive world model on Minecraft that brings several key advancements over existing approaches:

  • 🕹️ High generation quality. Built on a visual-action autoregressive Transformer, MineWorld generates coherent, high-fidelity frames conditioned on both visuals and actions.
  • 🕹️ Strong controllability. We propose benchmarks for the action-following capacity, where MineWorld shows precise and consistent behavior.
  • 🕹️ Fast inference speed. With Diagonal Decoding, MineWorld achieves a generation rate of 4 to 7 frames per second, enabling real-time interaction in open-ended game environments.

🔥 News

🔧 Setup

  1. Clone this repository and navigate to the mineworld folder:
git clone https://github.com/microsoft/mineworld.git
cd mineworld
  2. We provide a requirements.txt file for installing the dependencies with pip inside a conda environment:
# 1. Prepare conda environment
conda create -n mineworld python=3.10
# 2. Activate the environment
conda activate mineworld
# 3. Install the dependencies
pip3 install -r requirements.txt

We recommend using a high-end GPU for inference. All testing and development were done on A100 and H100 GPUs.

🎈 Checkpoints

Download pre-trained models here. Each checkpoint has a corresponding config file with the same name in the configs folder of this repository. All models share the same VAE checkpoint and config. The directory structure is as follows:

└── checkpoints
    ├── 300M_16f.ckpt
    ├── 700M_16f.ckpt
    ├── 700M_32f.ckpt
    ├── 1200M_16f.ckpt
    ├── 1200M_32f.ckpt
    └── vae
        ├── config.json
        └── vae.ckpt
└── validation
    └── validation.zip
└── gradio_scene
    ├── scene.mp4
    └── scene.jsonl
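
The naming rule above (each checkpoint has a config file with the same name) can be checked before launching anything. The following is a minimal sketch, assuming the folders are laid out as shown. The helpers `find_config` and `missing_vae_files` are hypothetical and not part of this repository, and because the config file extension is not stated here, the sketch matches on the file stem only.

```python
from pathlib import Path


def find_config(ckpt_path, configs_dir):
    """Return the config file whose stem matches the checkpoint's stem.

    Hypothetical helper (not part of the repository). It relies only on the
    rule that each checkpoint has a config file with the same name.
    """
    stem = Path(ckpt_path).stem  # e.g. "700M_16f" for "700M_16f.ckpt"
    matches = sorted(
        p for p in Path(configs_dir).iterdir() if p.is_file() and p.stem == stem
    )
    if not matches:
        raise FileNotFoundError(f"no config named {stem!r} in {configs_dir}")
    return matches[0]


def missing_vae_files(checkpoints_dir):
    """List the shared VAE files that are absent from checkpoints/vae."""
    vae_dir = Path(checkpoints_dir) / "vae"
    return [
        name for name in ("config.json", "vae.ckpt")
        if not (vae_dir / name).is_file()
    ]
```

Calling `find_config("checkpoints/700M_16f.ckpt", "configs")` returns the matching config path or raises `FileNotFoundError`, which surfaces a mismatched download early rather than at model load time.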

🚀 Inference

We provide two ways to use our model: interacting with it in a web demo, and running it locally to reproduce the evaluation results in our paper. In addition to downloading the checkpoints and placing them in the checkpoints folder, you must also download scene.mp4 and scene.jsonl to run the web demo. Make sure both files are placed in the same directory.

Run Web Demo

To launch the webpage game, run the following command:

python mineworld.py --scene "path/to/scene.mp4" \
    --model_ckpt "path/to/ckpt" \
    --config "path/to/config"


Once the demo is running, you can access the website through the local URL or the public URL displayed in the command line. Initialization and the first action may take some time due to compilation.

You can specify a reference frame using the --reference_frame option; its value should be larger than 4 and smaller than the context length of the model (i.e., 16 or 32, depending on the model used). A higher reference frame number generally corresponds to better visual quality. Once the initial state has been set, perform game actions by selecting options in each chatbox. The game advances when you press the "Run" button, and the demo then displays the last 8 frames and the most recent frame separately. Players can also set an action count to repeat an action multiple times.
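
The bounds on --reference_frame can be validated before the demo starts. The sketch below is illustrative only: `context_length` and `demo_command` are hypothetical helpers, and reading the context length from the `_16f` / `_32f` suffix of the checkpoint file name is an assumption drawn from the checkpoint names listed earlier, not documented behavior. The flags themselves are the ones shown in this section.

```python
import re
import shlex
from pathlib import Path


def context_length(ckpt_path):
    """Guess the context length from a name such as 700M_16f.ckpt (assumption)."""
    match = re.search(r"_(\d+)f$", Path(ckpt_path).stem)
    if match is None:
        raise ValueError(f"cannot infer context length from {ckpt_path!r}")
    return int(match.group(1))


def demo_command(scene, model_ckpt, config, reference_frame):
    """Build the mineworld.py argument list after checking the frame bounds."""
    limit = context_length(model_ckpt)
    # The text asks for a value larger than 4 and smaller than the context length.
    if not 4 < reference_frame < limit:
        raise ValueError(f"reference_frame must be between 5 and {limit - 1}")
    return [
        "python", "mineworld.py",
        "--scene", str(scene),
        "--model_ckpt", str(model_ckpt),
        "--config", str(config),
        "--reference_frame", str(reference_frame),
    ]


if __name__ == "__main__":
    cmd = demo_command(
        "path/to/scene.mp4", "checkpoints/700M_16f.ckpt", "path/to/config", 8
    )
    print(shlex.join(cmd))  # shell-quoted command line, ready to copy
```

Building the command as a list keeps paths with spaces intact if it is later passed to `subprocess.run`.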

The buttons in the web demo work as follows:

Start frame: select a frame in scene.mp4 with its frame index
Jump to start frame: use the selected frame as the initial state
Camera `X` and `Y`: control the camera movements between `-90` and `90` degrees
Other action buttons: same as the actions in Minecraft 
Generate video: save previous game progress

Run Local Inference

To run inference locally, use the following command:

python inference.py \
        --data_root "/path/to/validation/dataset" \
        --model_ckpt "path/to/ckpt" \
        --config "path/to/config" \
        --demo_num 1 \
        --frames 15 \
        --accelerate-algo 'naive' \
        --top_p 0.8 \
        --output_dir "path/to/output"

Check scripts/inference_16f_models.sh for examples. To switch between naive autoregressive decoding and diagonal decoding, set the --accelerate-algo option to naive or image_diagd, respectively.
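
To compare the two decoding modes on the same data, the command above can be generated once per mode. This is a minimal sketch using only the flags shown in this section. The helper `inference_command` is hypothetical, and writing each mode to its own output subfolder is an illustrative choice rather than something the repository requires.

```python
import shlex

# Values accepted by --accelerate-algo according to the text above.
DECODING_MODES = ("naive", "image_diagd")


def inference_command(data_root, model_ckpt, config, output_dir, algo,
                      demo_num=1, frames=15, top_p=0.8):
    """Build the inference.py argument list for one decoding mode."""
    if algo not in DECODING_MODES:
        raise ValueError(f"unknown decoding mode: {algo!r}")
    return [
        "python", "inference.py",
        "--data_root", data_root,
        "--model_ckpt", model_ckpt,
        "--config", config,
        "--demo_num", str(demo_num),
        "--frames", str(frames),
        "--accelerate-algo", algo,
        "--top_p", str(top_p),
        "--output_dir", output_dir,
    ]


if __name__ == "__main__":
    for algo in DECODING_MODES:
        cmd = inference_command(
            "/path/to/validation/dataset", "path/to/ckpt", "path/to/config",
            f"path/to/output/{algo}", algo,
        )
        # Print only; pass cmd to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```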

After running inference on a set of videos, you can compute the metrics and reproduce the numerical results in our paper by running the following scripts:

bash scripts/setup_metrics.sh # only required the first time
bash scripts/compute_metrics.sh 

The evaluation outputs will have the following structure:

└── videos
    ├── inference_setting1
        ├── clip_1.mp4
        └── clip_1.json
    ├── inference_setting2
        ├── clip_1.mp4
        └── clip_1.json
└── metrics_log
    ├── fvd_inference_setting1.json
    ├── fvd_inference_setting2.json
    ├── idm_inference_setting1.json
    ├── idm_inference_setting2.json
    └── latest_metrics.csv

All results will be aggregated into metrics_log/latest_metrics.csv.
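
The aggregated file can be read with the standard library. The sketch below is a hedged example: `load_metrics` and `unpaired_clips` are hypothetical helpers, and since the column names of latest_metrics.csv are not described here, the code makes no assumption about them and simply returns each row as a dictionary keyed by the CSV header.

```python
import csv
from pathlib import Path


def load_metrics(metrics_dir):
    """Read metrics_log/latest_metrics.csv into a list of row dictionaries."""
    csv_path = Path(metrics_dir) / "latest_metrics.csv"
    with csv_path.open(newline="") as handle:
        return list(csv.DictReader(handle))


def unpaired_clips(setting_dir):
    """Return clip names in one setting folder that lack an .mp4 or a .json."""
    folder = Path(setting_dir)
    videos = {p.stem for p in folder.glob("*.mp4")}
    records = {p.stem for p in folder.glob("*.json")}
    # Symmetric difference: names present in exactly one of the two sets.
    return sorted(videos ^ records)
```

Running `unpaired_clips("videos/inference_setting1")` before computing metrics is one way to catch an interrupted inference run, where a clip was written without its JSON record or the reverse.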

💡 Intended Uses

Our model is trained solely in the Minecraft game domain. As a world model, it is given an initial image of a game scene, and the user selects an action from the action list. The model then generates the next scene, in which the selected action takes place.

🪧 Out-of-scope Uses

Our models are not specifically designed for any tasks or scenarios other than the Minecraft game domain.

Developers should expect failures in generation results in out-of-scope scenarios.

Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case, and evaluate and mitigate for privacy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios.

🤖 Risks and Limitations

Some of the limitations of this model to be aware of include:

  • Quality of Service: MineWorld is trained solely on Minecraft, so it cannot generate results for other video domains (such as internet video), and it cannot generate videos at a higher resolution.
  • Information Reliability: MineWorld is trained on videos with a fixed resolution, so the results may lose fine detail due to the low resolution.
  • MineWorld inherits any biases, errors, or omissions characteristic of its training data, which may be amplified by any AI-generated interpretations.
  • MineWorld was developed for research and experimental purposes. Further testing and validation are needed before considering its application in commercial or real-world scenarios.
  • Inputting images from outside Minecraft will result in incoherent imagery and should not be attempted.
  • Users are responsible for sourcing their datasets legally and ethically. This could include securing appropriate copyrights, ensuring consent for use of audio/images, and/or the anonymization of data prior to use in research.

✏️ BibTeX

@article{guo2025mineworld,
  title={MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft}, 
  author={Guo, Junliang and Ye, Yang and He, Tianyu and Wu, Haoyu and Jiang, Yushu and Pearce, Tim and Bian, Jiang},
  year={2025},
  journal={arXiv preprint arXiv:2504.08388},
}

🤗 Acknowledgments

This codebase borrows code from VPT and generative-models. We thank them for their efforts and innovations, which have made the development process more efficient and convenient.

Thank you to everyone who contributed their wisdom and efforts to this project.

☎️ Contact

We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected/offensive behavior in our technology, please contact us through junliangguo AT microsoft.com.

📄 Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
