GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning


Chengqi Duan1*, Rongyao Fang2*, Yuqing Wang1*, Kun Wang3, Linjiang Huang4, Xingyu Zeng, Hongsheng Li2, Xihui Liu1 ✉️

1HKU MMLab, 2CUHK MMLab, 3Sensetime, 4Beihang University

*Equal contribution, ✉️Corresponding authors

 

Paper • Introduction • Framework • Key Features • License • Citation

Introduction

Visual generation models have made remarkable progress but still struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. This limitation often stems from a direct mapping from text embeddings to visual features without explicit reasoning about the compositional structure.

We present GoT-R1, a framework that significantly enhances semantic-spatial reasoning in visual generation by applying reinforcement learning. Building upon the Generation Chain-of-Thought (GoT) approach, GoT-R1 enables models to autonomously discover effective reasoning strategies that go beyond predefined templates. This is achieved through a carefully designed dual-stage, multi-dimensional reward framework that leverages Multimodal Large Language Models (MLLMs) to evaluate both the intermediate reasoning process and the final visual output. Our reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified manner. Experimental results demonstrate significant improvements on benchmarks like T2I-CompBench, particularly in compositional tasks requiring precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art by successfully transferring sophisticated reasoning capabilities to the visual generation domain.

GoT-R1 pioneers advancements in reasoning-driven visual generation by:

  • Enhanced Semantic-Spatial Reasoning: Utilizes reinforcement learning to improve the model's ability to understand and plan complex scenes with accurate object attributes and spatial arrangements.
  • Autonomous Reasoning Chain Discovery: Moves beyond fixed templates by allowing the model to autonomously explore and learn more effective reasoning chains.
  • Comprehensive MLLM-based Rewards: Implements a novel dual-stage, multi-dimensional reward system for effective supervision across the entire generation pipeline.

Released Model: GoT-R1

Model      Link
GoT-R1-1B 🤗 HuggingFace
GoT-R1-7B 🤗 HuggingFace

Framework Overview

GoT-R1 builds upon the Generation Chain-of-Thought (GoT) framework by introducing reinforcement learning (RL) to refine the model's semantic-spatial reasoning capabilities. The base model is a unified MLLM architecture (e.g., Janus-Pro) that autoregressively generates a textual reasoning chain followed by image tokens.
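
As a rough illustration, the generation flow can be sketched as follows; the model interface below is a hypothetical placeholder, not the repository's actual API:

# Hypothetical sketch of the GoT-R1 generation flow (not the actual API).
def generate(model, prompt):
    # Stage 1: autoregressively generate the textual reasoning chain (GoT),
    # planning objects, attributes, and bounding-box coordinates for the scene.
    got_text = model.generate_text(prompt)

    # Stage 2: continue decoding to produce discrete image tokens conditioned on
    # the prompt and the reasoning chain, then decode them into an image.
    image_tokens = model.generate_image_tokens(prompt, got_text)
    return model.decode_image(image_tokens)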

The RL process involves:

  1. Sampling multiple reasoning chains (GoT) and corresponding images for a given prompt.
  2. Evaluating these samples using our multi-dimensional MLLM-based reward model.
  3. Updating the model parameters using Group Relative Policy Optimization (GRPO) to encourage high-reward reasoning and generation strategies.
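
A minimal sketch of one GRPO update following these three steps is shown below; the model.sample, model.log_prob, and reward_model.total_reward helpers are hypothetical placeholders, and the actual training code may differ:

import torch

def grpo_step(model, prompts, reward_model, group_size=8, clip_eps=0.2):
    # One GRPO update sketch: sample a group of GoT chains and images per prompt,
    # score them with the MLLM-based reward, and reinforce above-average samples.
    losses = []
    for prompt in prompts:
        # 1. Sample G reasoning chains (GoT) and corresponding images.
        samples = [model.sample(prompt) for _ in range(group_size)]

        # 2. Score each sample with the multi-dimensional MLLM reward.
        rewards = torch.tensor([reward_model.total_reward(prompt, s) for s in samples])

        # 3. Group-relative advantage: normalize rewards within the group
        #    (GRPO needs no learned value function).
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

        # 4. PPO-style clipped objective on the sample log-probabilities.
        for sample, adv in zip(samples, advantages):
            ratio = torch.exp(model.log_prob(sample) - sample.old_log_prob)
            unclipped = ratio * adv
            clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
            losses.append(-torch.min(unclipped, clipped))

    return torch.stack(losses).mean()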

Key Features

MLLM-based Dual-stage Multi-dimensional Reward

A core innovation of GoT-R1 is its comprehensive reward framework designed to address the unique challenges of applying RL to visual generation. This system evaluates both the intermediate reasoning process and the final image:

  • Prompt-to-Reasoning Semantic Reward ($R_{sem}$): Assesses whether the reasoning chain accurately captures all semantic elements (objects, attributes) from the prompt without contradiction, considering completeness, faithfulness, consistency, and clarity.
  • Prompt-to-Reasoning Spatial Reward ($R_{spa}$): Evaluates the correctness of the spatial arrangements planned in the reasoning chain relative to the prompt. To strengthen the MLLM's spatial evaluation, the textual coordinates are rendered as bounding boxes on a blank canvas for visual assessment.
  • Reasoning-to-Image Reward ($R_{RI}$): Measures how faithfully the generated image reflects the planned reasoning, checking whether objects appear at their planned locations via the IoU between planned and grounded bounding boxes.
  • Prompt-to-Image Reward ($R_{PI}$): Assesses the overall quality and compositional accuracy of the final generated image against the initial prompt.

The total reward $R_{total}$ combines these signals multiplicatively, with the two prompt-to-reasoning rewards summed, to ensure holistic optimization: $R_{total} = R_{PI} \cdot (R_{sem} + R_{spa}) \cdot R_{RI}$.
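
To make the composition concrete, the sketch below pairs a standard IoU helper (used for the reasoning-to-image check) with the formula above; the four reward values are assumed to come from the MLLM judges described earlier:

def iou(box_a, box_b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def total_reward(r_pi, r_sem, r_spa, r_ri):
    # R_total = R_PI * (R_sem + R_spa) * R_RI, as defined above.
    return r_pi * (r_sem + r_spa) * r_ri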

Usage

Dependencies

Installation

Clone the repository and install the required packages.

git clone git@github.com:gogoduan/GoT-R1.git
cd GoT-R1
pip install -r requirements.txt

This automatically installs PyTorch 2.0.1 with the CUDA 11.7 runtime. If you are using sm_90 GPUs such as the NVIDIA H100, please install the CUDA 11.8 build instead.

Model Weights

The expected directory structure is:

GoT-R1
├── ckpts
│   ├── GoT-R1-1B 
│   ├── GoT-R1-7B 
├── ...
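
One way to populate this layout is with the huggingface_hub client; the repository IDs below are placeholders, so substitute the actual HuggingFace pages linked in the model table above:

from huggingface_hub import snapshot_download

# Placeholder repo IDs -- replace with the model pages linked in the
# "Released Model: GoT-R1" table above.
snapshot_download(repo_id="gogoduan/GoT-R1-1B", local_dir="ckpts/GoT-R1-1B")
snapshot_download(repo_id="gogoduan/GoT-R1-7B", local_dir="ckpts/GoT-R1-7B")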

Inference

python infer.py --ckpt_path <Your GoT-R1 checkpoint path>

License

This code is released under the MIT License.

Citation

If you find this work helpful, please consider citing our paper:

@article{duan2025got,
  title={GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning},
  author={Duan, Chengqi and Fang, Rongyao and Wang, Yuqing and Wang, Kun and Huang, Linjiang and Zeng, Xingyu and Li, Hongsheng and Liu, Xihui},
  journal={arXiv preprint arXiv:2505.17022},
  year={2025}
}

Contact

If you have any questions, please raise an issue or contact us at duancq24@connect.hku.hk.
