MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation (ICLR 2025)
This repository provides the official PyTorch implementation of the following paper:
MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation
Donggon Jang* (KAIST), Yucheol Cho* (KAIST), Suin Lee (KAIST), Taehyeon Kim (KAIST), and Dae-Shik Kim (KAIST) (*The authors have equally contributed.)
Accepted at ICLR 2025.
Abstract: The fusion of Large Language Models (LLMs) with vision models is pioneering new possibilities in user-interactive vision-language tasks. A notable application is reasoning segmentation, where models generate pixel-level segmentation masks by comprehending implicit meanings in human instructions. However, seamless human-AI interaction demands more than just object-level recognition; it requires understanding both objects and the functions of their detailed parts, particularly in multi-target scenarios. For example, when instructing a robot to "turn on the TV," there could be various ways to accomplish this command. Recognizing multiple objects capable of turning on the TV, such as the TV itself or a remote control (multi-target), provides more flexible options and aids in finding the optimized scenario. Furthermore, understanding specific parts of these objects, like the TV's button or the remote's button (part-level), is important for completing the action. Unfortunately, current reasoning segmentation datasets predominantly focus on a single target object-level reasoning, which limits the detailed recognition of an object's parts in multi-target contexts. To address this gap, we construct a large-scale dataset called Multi-target and Multi-granularity Reasoning (MMR). MMR comprises 194K complex and implicit instructions that consider multi-target, object-level, and part-level aspects, based on pre-existing image-mask sets. This dataset supports diverse and context-aware interactions by hierarchically providing object and part information. Moreover, we propose a straightforward yet effective framework for multi-target, object-level, and part-level reasoning segmentation. Experimental results on MMR show that the proposed method can reason effectively in multi-target and multi-granularity scenarios, while the existing reasoning segmentation model still has room for improvement.
The MMR dataset is a large-scale benchmark for multi-target and multi-granularity reasoning segmentation.
To generate a multi-target, object-level, and part-level reasoning segmentation dataset, we adopt the ChatGPT/GPT-4V API, which has robust visual understanding capabilities. To guide the GPT-4V API effectively, we carefully craft prompts that include the GPT role, object and part information, task prompts, and requirements. The GPT-4V-assisted data generation follows a two-step process: 1) Global Caption Generation: the GPT-4V API first generates a global caption for the image to foster a deep understanding of its context. 2) Question-Answer Pair Generation: leveraging this global caption alongside object and part information, GPT-4V autonomously crafts multi-target and multi-granularity question-answer pairs.
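For illustration, a minimal sketch of this two-step loop with the OpenAI Python client is shown below; the prompt templates, model name, and function names are placeholders, not the exact prompts or settings used to build MMR.

# Hedged sketch of the two-step GPT-4V-assisted generation described above.
# GLOBAL_CAPTION_PROMPT, QA_PROMPT, and the model name are placeholders,
# not the exact prompts/settings used to build MMR. Requires OPENAI_API_KEY.
import base64
from openai import OpenAI

client = OpenAI()

GLOBAL_CAPTION_PROMPT = "Describe the overall scene in this image."
QA_PROMPT = (
    "Given the caption and the object/part annotations below, write implicit, "
    "multi-target questions whose answers are the listed objects and parts.\n"
    "Caption: {caption}\nAnnotations: {annotations}"
)

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def generate_qa(image_path, annotations):
    image_b64 = encode_image(image_path)
    # Step 1: global caption generation from the image.
    caption = client.chat.completions.create(
        model="gpt-4o",  # placeholder vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": GLOBAL_CAPTION_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    ).choices[0].message.content
    # Step 2: question-answer pair generation from the caption plus object/part info.
    qa = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": QA_PROMPT.format(caption=caption, annotations=annotations),
        }],
    ).choices[0].message.content
    return caption, qa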
The MMR dataset includes 194,398 intricate and implicit question-answer pairs with 57,643 corresponding images and masks selected from PACO-LVIS. The entire dataset is split into distinct sets for training (154,127 pairs), validation (8,194 pairs), and test (32,077 pairs).
The MMR dataset can be downloaded from this Google Drive link.
The downloaded MMR dataset is structured as follows:
MMR/
├── MMR_test_mixed.json
├── MMR_test_obj_only.json
├── MMR_test_part_only.json
├── MMR_train.json
├── MMR_val.json
Each entry in the JSON files has the following fields:
{
    "file_name": str, file name of the image,
    "height": int, height of the image,
    "width": int, width of the image,
    "image_id": int, id of the image,
    "not_exhaustive_category_ids": List[int], list of category ids that don't have all of their instances marked exhaustively,
    "neg_category_ids": List[int], list of category ids that were verified as not present in the image,
    "coco_url": str, image URL,
    "questions": List[str], the complex and implicit questions about the objects and parts within an image,
    "annotations":
    {
        "bbox": List[float], bounding box of the object or part,
        "segmentation":
        {
            "size": List[int], the size of the image,
            "counts": RLE format, segmentation binary mask information,
        }
        "image_id": int, id of the image,
        "category_name": str, category name of the object or part,
        "category_id": int, category id,
        "sorted_category_id": int, sorted id in ascending order,
    }
    "answers": List[dict], the annotations corresponding to the questions,
    "text_answers": List[str], the text answers to the questions,
    "raw_answers": List[str], the raw answers from the GPT API to the questions,
}
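A minimal sketch for inspecting one entry and decoding its RLE masks with pycocotools is shown below; the file path and the exact nesting of "answers" are assumptions, while the field names follow the schema above.

# Hedged sketch: load an MMR split and decode the RLE masks of one entry.
# The path is illustrative and the nesting of "answers" is assumed.
import json
from pycocotools import mask as mask_utils

with open("./dataset/MMR/MMR_val.json") as f:
    data = json.load(f)

entry = data[0]  # assumes the file is a list of per-image entries
print(entry["file_name"], entry["questions"][0])

for question, answer in zip(entry["questions"], entry["answers"]):
    anns = answer if isinstance(answer, list) else [answer]  # per-question annotations
    for ann in anns:
        rle = ann["segmentation"]             # {"size": [h, w], "counts": ...}
        binary_mask = mask_utils.decode(rle)  # numpy uint8 array of shape (h, w)
        print(question, ann["category_name"], int(binary_mask.sum()))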
- Clone this repository
git clone https://github.com/jdg900/MMR.git
cd MMR
- To install requirements using a conda environment:
conda env create -n [env name] -f M2SA.yaml
conda activate [env name]
pip install flash-attn --no-build-isolation
The training datasets are composed in the same way as LISA.
The training datasets consist of 4 types of data:
- Semantic segmentation datasets: ADE20K, COCO-Stuff, Mapillary, PACO-LVIS, PASCAL-Part, COCO Images
  Note: You should also add COCO train2017 and COCO val2017 under the refer_seg path.
- Referring expression segmentation datasets: RefCOCO, RefCOCO+, RefCOCOg, RefCLEF (saiapr_tc-12), RefCOCOm
- Visual question answering dataset: LLaVA-Instruct-150k
- Our MMR dataset: MMR
  Note: Images and masks in the MMR dataset are based on COCO images.
Download all of the datasets from the links above and organize them as follows.
├── dataset
│   ├── ade20k
│   │   ├── annotations
│   │   └── images
│   ├── coco
│   │   └── train2017
│   │       ├── 000000000009.jpg
│   │       └── ...
│   ├── cocostuff
│   │   └── train2017
│   │       ├── 000000000009.png
│   │       └── ...
│   ├── llava_dataset
│   │   └── llava_instruct_150k.json
│   ├── mapillary
│   │   ├── config_v2.0.json
│   │   ├── testing
│   │   ├── training
│   │   └── validation
│   ├── refer_seg
│   │   ├── images
│   │   │   ├── saiapr_tc-12
│   │   │   └── mscoco
│   │   │       └── images
│   │   │           └── train2014
│   │   ├── refclef
│   │   ├── refcoco
│   │   ├── refcoco+
│   │   ├── refcocog
│   │   └── RefCOCOm
│   │       ├── masks
│   │       └── annotations
│   ├── vlpart
│   │   ├── paco
│   │   │   └── annotations
│   │   └── pascal_part
│   │       ├── train.json
│   │       └── VOCdevkit
│   └── MMR
│       ├── MMR_train.json
│       ├── MMR_val.json
│       ├── MMR_test_mixed.json
│       ├── MMR_test_obj_only.json
│       └── MMR_test_part_only.json
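Before launching training, a quick sanity check of this layout can save a failed run. A minimal sketch, assuming the ./dataset root shown above (adjust the list if your layout differs):

# Hedged sketch: verify the expected dataset layout before training.
from pathlib import Path

root = Path("./dataset")
expected = [
    "ade20k/images", "ade20k/annotations",
    "coco/train2017", "cocostuff/train2017",
    "llava_dataset/llava_instruct_150k.json",
    "mapillary/config_v2.0.json",
    "refer_seg/images/saiapr_tc-12",
    "refer_seg/images/mscoco/images/train2014",
    "vlpart/paco/annotations", "vlpart/pascal_part/VOCdevkit",
    "MMR/MMR_train.json", "MMR/MMR_val.json",
]
missing = [p for p in expected if not (root / p).exists()]
print("All expected paths found." if not missing else f"Missing paths: {missing}")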
To train M2SA-7B and M2SA-13B, loading LLaVA's pre-trained weights is required. For M2SA-7B we use LLaVA-Lightning-7B-v1-1, merged from liuhaotian/LLaVA-Lightning-7B-delta-v1-1, and for M2SA-13B we use liuhaotian/llava-llama-2-13b-chat-lightning-preview.
Download the SAM ViT-H pre-trained weights from the link, and put them in ./vision_pretrained.
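For example, the 13B base weights can be fetched with huggingface_hub. This is a sketch only: local_dir is arbitrary, and the 7B weights must first be merged from the delta checkpoint as noted above.

# Hedged sketch: download the LLaVA 13B base weights used for M2SA-13B.
# local_dir is arbitrary; the 7B model must be merged from its delta weights separately.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="liuhaotian/llava-llama-2-13b-chat-lightning-preview",
    local_dir="./LLaVA/llava-llama-2-13b-chat-lightning-preview",
)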
deepspeed --include=localhost:0,1,2,3 --master_port=24999 train_ds.py \
--version="PATH_TO_LLaVA" \
--dataset_dir="./dataset/" \
--dataset="sem_seg||refer_seg||vqa||multi_part_reason_seg" \
--vision-tower="openai/clip-vit-large-patch14" \
--batch_size=2 \
--num_classes_per_sample=3 \
--num_classes_per_question=3 \
--use_expand_question_list \
--model_max_length 2048 \
--sample_rates="2,9,2,6" \
--exp_name="M2SA" \
--val_dataset="MultiPartReasonSeg|val" \
--val_json_name="MMR_val.json"
When training is finished, to get the full model weight:
cd ./runs/M2SA-7B/ckpt_model && python zero_to_fp32.py . ../pytorch_model.bin
Merge the LoRA weights of pytorch_model.bin and save the resulting model in the Hugging Face format to your desired path:
CUDA_VISIBLE_DEVICES="" python merge_lora_weights_and_save_hf_model.py \
--version="PATH_TO_LLaVA" \
--weight="PATH_TO_pytorch_model.bin" \
--save_path="PATH_TO_SAVED_MODEL"
For example:
CUDA_VISIBLE_DEVICES=0 python merge_lora_weights_and_save_hf_model.py \
--version="./LLaVA/LLaVA-Lightning-7B-v1-1" \
--weight="./runs/M2SA-7B/pytorch_model.bin" \
--save_path="M2SA-7B"
You can download M2SA-7B and M2SA-13B from Hugging Face.
deepspeed --include=localhost:0,1,2,3 --master_port=24999 train_ds.py \
--version="PATH_TO_M2SA_MODEL_Directory" \
--exp_name="M2SA-7B-val" \
--dataset_dir='./dataset' \
--val_dataset="MultiPartReasonSeg|val" \
--eval_only \
--val_json_name="MMR_val.json"
- Results on the MMR benchmark.
Methods | val (gIoU) | val (cIoU) | Obj (gIoU) | Obj (cIoU) | Part (gIoU) | Part (cIoU) | Obj & Part (gIoU) | Obj & Part (cIoU) |
---|---|---|---|---|---|---|---|---|
LISA-7B | 13.8 | 18.3 | 23.5 | 25.1 | 6.6 | 7.9 | 14.5 | 17.9 |
LISA-7Btr | 19.4 | 31.6 | 34.7 | 41.8 | 8.0 | 13.1 | 19.5 | 27.1 |
M2SA-7B | 27.8 | 48.6 | 41.0 | 55.6 | 13.5 | 27.0 | 30.9 | 46.8 |
LISA-Llama2-13B | 15.4 | 20.0 | 26.1 | 27.9 | 7.4 | 8.4 | 16.1 | 19.8 |
LISA-Llama2-13Btr | 22.3 | 33.4 | 40.2 | 45.2 | 10.7 | 16.4 | 23.0 | 29.2 |
M2SA-Llama2-13B | 28.4 | 49.1 | 42.3 | 57.6 | 13.6 | 27.2 | 31.6 | 47.6 |
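The gIoU and cIoU columns above follow the convention common in referring/reasoning segmentation (e.g., LISA): gIoU is the mean of per-sample IoUs, while cIoU accumulates intersections and unions over the whole split. A minimal sketch for binary masks:

# Hedged sketch of the gIoU / cIoU metrics reported above, for binary numpy masks.
import numpy as np

def giou_ciou(preds, gts, eps=1e-6):
    """preds, gts: equally long lists of binary (H, W) numpy arrays."""
    ious, inter_sum, union_sum = [], 0.0, 0.0
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / (union + eps))   # per-sample IoU
        inter_sum += inter
        union_sum += union
    giou = float(np.mean(ious))              # mean of per-sample IoUs
    ciou = inter_sum / (union_sum + eps)     # cumulative IoU over the split
    return giou, ciou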
- Results on the RefCOCOm benchmark. For a fair comparison with previous methods, the mIoU metric is adopted.
Methods | val-Part | val-Obj & Part | testA-Part | testA-Obj & Part | testB-Part | testB-Obj & Part |
---|---|---|---|---|---|---|
SeqTR | 13.9 | 28.2 | 12.1 | 22.8 | 18.1 | 34.7 |
CRIS | 10.6 | 25.4 | 10.1 | 21.2 | 12.9 | 30.0 |
LAVT | 15.3 | 29.9 | 13.2 | 24.4 | 18.7 | 35.5 |
X-Decoder | 16.2 | 29.5 | 13.6 | 23.6 | 20.3 | 33.8 |
SEEM | 16.1 | 29.4 | 13.6 | 23.4 | 20.4 | 33.9 |
UniRES | 19.6 | 34.3 | 16.4 | 27.8 | 25.2 | 41.7 |
LISA-7B | 21.3 | 34.3 | 18.5 | 28.6 | 25.7 | 40.1 |
M2SA-7B | 22.4 | 35.5 | 19.9 | 30.1 | 27.1 | 41.4 |
LISA-Llama2-13B | 22.1 | 35.2 | 19.4 | 29.7 | 27.2 | 41.6 |
M2SA-Llama2-13B | 24.5 | 37.3 | 21.9 | 31.9 | 28.5 | 42.7 |
If you find this project useful in your research, please consider citing:
@inproceedings{jangmmr,
title={MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation},
author={Jang, Donggon and Cho, Yucheol and Lee, Suin and Kim, Taehyeon and Kim, Daeshik},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025}
}
This codebase is built on LISA, LLaVA, and SAM. We thank the authors for sharing their code; their valuable work has greatly contributed to the development of our codebase.