Chengqi Duan1*, Rongyao Fang2*, Yuqing Wang1*, Kun Wang3, Linjiang Huang4, Xingyu Zeng, Hongsheng Li2, Xihui Liu1 ✉️
1HKU MMLab, 2CUHK MMLab, 3Sensetime, 4Beihang University
*Equal contribution, ✉️Corresponding authors
Visual generation models have made remarkable progress but still struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. This limitation often stems from a direct mapping from text embeddings to visual features without explicit reasoning about the compositional structure.
We present GoT-R1, a framework that significantly enhances semantic-spatial reasoning in visual generation by applying reinforcement learning. Building upon the Generation Chain-of-Thought (GoT) approach, GoT-R1 enables models to autonomously discover effective reasoning strategies that go beyond predefined templates. This is achieved through a carefully designed dual-stage, multi-dimensional reward framework that leverages Multimodal Large Language Models (MLLMs) to evaluate both the intermediate reasoning process and the final visual output. Our reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified manner. Experimental results demonstrate significant improvements on benchmarks like T2I-CompBench, particularly in compositional tasks requiring precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art by successfully transferring sophisticated reasoning capabilities to the visual generation domain.
GoT-R1 pioneers advancements in reasoning-driven visual generation by:
- Enhanced Semantic-Spatial Reasoning: Utilizes reinforcement learning to improve the model's ability to understand and plan complex scenes with accurate object attributes and spatial arrangements.
- Autonomous Reasoning Chain Discovery: Moves beyond fixed templates by allowing the model to autonomously explore and learn more effective reasoning chains.
- Comprehensive MLLM-based Rewards: Implements a novel dual-stage, multi-dimensional reward system for effective supervision across the entire generation pipeline.
| Model | Link |
|---|---|
| GoT-R1-1B | 🤗 HuggingFace |
| GoT-R1-7B | 🤗 HuggingFace |
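If you prefer to fetch a checkpoint programmatically, the snippet below is a minimal sketch using `huggingface_hub`; the repository ID is a placeholder, so substitute the actual ID linked in the table above.

```python
# Minimal checkpoint download sketch (pip install huggingface_hub).
# "<org>/GoT-R1-7B" is a placeholder; use the actual repo ID linked above.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="<org>/GoT-R1-7B",
    local_dir="ckpts/GoT-R1-7B",  # matches the expected directory layout below
)
print(f"Checkpoint downloaded to {ckpt_dir}")
```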
GoT-R1 builds upon the Generation Chain-of-Thought (GoT) framework by introducing reinforcement learning (RL) to refine the model's semantic-spatial reasoning capabilities. The base model is a unified MLLM architecture (e.g., Janus-Pro) that autoregressively generates a textual reasoning chain followed by image tokens.
The RL process involves:
- Sampling multiple reasoning chains (GoT) and corresponding images for a given prompt.
- Evaluating these samples using our multi-dimensional MLLM-based reward model.
- Updating the model parameters using Group Relative Policy Optimization (GRPO) to encourage high-reward reasoning and generation strategies.
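The core of GRPO is group-relative advantage estimation: every sampled (reasoning chain, image) pair for a prompt is scored by the reward model, and each reward is normalized against the other samples in the same group. Below is a minimal sketch of that normalization (the clipped policy-gradient objective and KL regularization of full GRPO are omitted):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each reward by the mean and std
    of the rewards from samples that share the same prompt (one row per group)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled (reasoning chain, image) pairs each.
rewards = torch.tensor([[0.2, 0.8, 0.5, 0.9],
                        [0.1, 0.1, 0.7, 0.3]])
print(grpo_advantages(rewards))
```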
A core innovation of GoT-R1 is its comprehensive reward framework designed to address the unique challenges of applying RL to visual generation. This system evaluates both the intermediate reasoning process and the final image:
- **Prompt-to-Reasoning Semantic Reward ($R_{sem}$):** Assesses whether the reasoning chain accurately captures all semantic elements (objects, attributes) from the prompt without contradiction, considering completeness, faithfulness, consistency, and clarity.
- **Prompt-to-Reasoning Spatial Reward ($R_{spa}$):** Evaluates the correctness of the spatial arrangements planned in the reasoning chain relative to the prompt. To strengthen the MLLM's spatial evaluation, the planned textual coordinates are rendered as bounding boxes on a blank canvas for visual assessment (see the rendering sketch after this list).
- **Reasoning-to-Image Reward ($R_{RI}$):** Measures how faithfully the generated image follows the planned reasoning, checking whether objects appear at their specified locations via the IoU between planned and grounded bounding boxes.
- **Prompt-to-Image Reward ($R_{PI}$):** Assesses the overall quality and compositional accuracy of the final generated image against the initial prompt.
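The rendering step behind $R_{spa}$ can be pictured as follows: planned coordinates are drawn as labeled boxes on a blank canvas and the resulting image is passed to the MLLM judge. The sketch below uses Pillow and assumes the plan is a list of (label, [x1, y1, x2, y2]) pairs in pixel coordinates; both the format and the `render_layout` helper are illustrative assumptions.

```python
from PIL import Image, ImageDraw

def render_layout(boxes, canvas_size=(512, 512)):
    """Draw planned (label, [x1, y1, x2, y2]) boxes on a blank canvas so an
    MLLM judge can visually assess the spatial plan."""
    canvas = Image.new("RGB", canvas_size, "white")
    draw = ImageDraw.Draw(canvas)
    for label, (x1, y1, x2, y2) in boxes:
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1 + 4, y1 + 4), label, fill="red")
    return canvas

# Hypothetical plan for "a cat to the left of a dog".
render_layout([("cat", [40, 200, 220, 420]),
               ("dog", [300, 180, 480, 420])]).save("layout.png")
```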
The total reward combines these reasoning-level and image-level signals, so reinforcement learning supervises both the intermediate plan and the final image.
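For intuition, the sketch below shows the IoU check behind $R_{RI}$ and one way the four components could be reduced to a single scalar per sample; the equal weighting is purely illustrative, and the actual aggregation follows the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def total_reward(r_sem, r_spa, r_ri, r_pi):
    """Illustrative aggregation of the four reward components (equal weights
    are an assumption, not the paper's exact formulation)."""
    return (r_sem + r_spa + r_ri + r_pi) / 4.0

# R_RI intuition: planned box vs. box grounded from the generated image.
print(iou([40, 200, 220, 420], [60, 210, 230, 400]))
```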
- Python >= 3.8 (Anaconda recommended)
- PyTorch >= 2.0.1
- NVIDIA GPU + CUDA
Clone the repository and install the required packages.
git clone git@github.com:gogoduan/GoT-R1.git
cd GoT-R1
pip install -r requirements.txt
This automatically installs PyTorch 2.0.1 with CUDA 11.7. If you are using sm_90 GPUs such as the NVIDIA H100, please install the CUDA 11.8 build instead.
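After installation, a quick check (assuming PyTorch is installed in the active environment) confirms the build and that your GPU is visible:

```python
import torch

# Print the installed PyTorch/CUDA versions and whether a GPU is visible.
print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
print("gpu available:", torch.cuda.is_available())
```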
The expected directory structure is:
GoT-R1
├── ckpts
│ ├── GoT-R1-1B
│ ├── GoT-R1-7B
├── ...
python infer.py --ckpt_path <Your GoT-R1 checkpoint path>
This code is released under the MIT License.
If you find this work helpful, please consider citing our paper:
@article{duan2025got,
title={GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning},
author={Duan, Chengqi and Fang, Rongyao and Wang, Yuqing and Wang, Kun and Huang, Linjiang and Zeng, Xingyu and Li, Hongsheng and Liu, Xihui},
journal={arXiv preprint arXiv:2505.17022},
year={2025}
}
If you have any questions, please raise an issue or contact us at duancq24@connect.hku.hk.