Dimple

🤗 Model   |    💬 Demo: Chat with Dimple   |   📑 Paper   |    ✨ Code  

(Demo video)

💧 Dimple

Dimple is the first Discrete Diffusion Multimodal Large Language Model (DMLLM). It is trained with a hybrid paradigm that combines autoregressive and diffusion-based instruction tuning. The model architecture is similar to Qwen and LLaVA, but training follows an autoregressive-then-diffusion strategy:

  • Stage 1: Autoregressive fine-tuning for alignment and initial instruction tuning.
  • Stage 2: Diffusion-based fine-tuning for enhanced instruction-following capabilities (a minimal sketch of the two objectives follows this list).
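
To make the two stages concrete, the following is a minimal, self-contained sketch of the two objectives using toy tensors (our own illustration, not the authors' training code): Stage 1 applies a standard next-token loss, while Stage 2 masks a random subset of answer tokens and computes the loss only on the masked positions.

import torch
import torch.nn.functional as F

# Toy stand-ins for model outputs and answer tokens (illustration only).
vocab_size, seq_len = 32, 8
logits = torch.randn(1, seq_len, vocab_size)
targets = torch.randint(1, vocab_size, (1, seq_len))

# Stage 1: autoregressive fine-tuning -- next-token prediction, where the
# logits at position t are scored against the token at position t + 1.
ar_loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    targets[:, 1:].reshape(-1),
)

# Stage 2: diffusion-based fine-tuning -- a random subset of answer tokens is
# masked and predicted in parallel; the loss covers only the masked positions.
mask = torch.rand(1, seq_len) < 0.5
mask[:, 0] = True  # keep at least one masked position in this toy example
diffusion_loss = F.cross_entropy(logits[mask], targets[mask])

print(f"AR loss: {ar_loss.item():.3f}, diffusion loss: {diffusion_loss.item():.3f}")

In the real model, the Stage 2 logits come from a forward pass over the partially masked sequence; the sketch reuses a single logits tensor purely to stay short.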

Trained on the same dataset as LLaVA-NEXT, Dimple-7B surpasses LLaVA-NEXT-7B by 3.9%, demonstrating that diffusion-based multimodal large language models can match their autoregressive counterparts under a similar training budget.


🔍 Highlights

  • Hybrid Training: Combines autoregressive and diffusion training.
  • Diffusion Decoding: Supports confident decoding, random decoding, maskgit-style decoding, and entropy-based decoding (a toy confidence-based decoding step is sketched after this list).
  • Controllable Generation: Enables fine-grained control over format, structure, and length via structure priors.
  • Autoregressive-like Prefilling: Enhances inference speed using prefilling techniques.
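
As an illustration of what confident (maskgit-style) decoding does, here is a toy sketch under our own simplifications (random logits instead of a model, a made-up MASK_ID; not Dimple's decoding code): every masked position is predicted in parallel at each step, the most confident predictions are committed, and the rest stay masked for the next step.

import torch

MASK_ID = 0                                   # hypothetical mask-token id for this toy
vocab_size, seq_len, steps = 16, 12, 4
tokens = torch.full((seq_len,), MASK_ID)      # start from a fully masked response

for step in range(steps):
    masked = tokens == MASK_ID
    if not masked.any():
        break
    logits = torch.randn(seq_len, vocab_size)        # stand-in for a model forward pass
    logits[:, MASK_ID] = float("-inf")               # never predict the mask token itself
    conf, pred = logits.softmax(-1).max(-1)          # per-position confidence and argmax
    conf = conf.masked_fill(~masked, -1.0)           # committed slots are ineligible
    k = max(1, int(masked.sum()) // (steps - step))  # unmask roughly evenly across steps
    commit = conf.topk(k).indices                    # most confident masked positions
    tokens[commit] = pred[commit]

print(tokens)

Random decoding would pick the positions to commit uniformly, and entropy-based decoding would rank them by low entropy instead of maximum probability; the commit-and-remask loop itself stays the same.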

📊 Evaluation Results

| Benchmark | Dimple-7B (ours) | LLaVA-1.5-7B | LLaVA-NEXT-7B | Eagle-7B | Eagle2-9B | Qwen-VL-7B | Qwen2.5-VL-7B |
|---|---|---|---|---|---|---|---|
| Training Samples | 1.3M | 1.2M | 1.3M | 2.4M | 27.8M | 1.5B | - |
| Training Tokens | 0.8B | - | - | - | - | - | 2.6T |
| Base LLM | Dream (Qwen2.5) | Vicuna | Vicuna-1.5 | Vicuna | Qwen2.5 | Qwen | Qwen2.5 |
| GQA | 59.2 | 62.0 | 64.8 | 64.9 | - | 59.3 | - |
| MMBench (en test) | 74.6 | 64.3 | 68.7 | 68.4 | - | - | 83.5 |
| MME (Perception) | 1514 | 1510 | 1519 | 1528 | - | - | - |
| MME (Cognition) | 432 | - | 332 | - | - | - | - |
| MME (Total) | 1946 | - | 1851 | - | - | - | 2347 |
| POPE | 86.2 | 85.8 | 86.7 | 88.8 | - | - | - |
| MMMU (val) | 45.2 | - | 35.8 | 36.3 | 56.1 | - | 58.6 |
| SQA (img) | 77.1 | 66.8 | 72.8 | 70.0 | - | - | - |
| AI2D | 74.4 | - | 65.4 | - | 83.9 | 62.3 | 83.9 |
| ChartQA | 63.4 | - | 54.9 | 67.7 | 86.4 | 65.7 | 87.3 |
| TextVQA | 61.6 | - | 64.8 | - | 83.0 | - | - |
| OCRBench | 565 | - | 490 | 529 | - | - | - |
| MathVista (mini) | 42.3 | - | 33.0 | - | 63.8 | 37.0 | 68.2 |
| MMVet | 41.2 | 31.1 | 47.3 | - | 62.2 | - | 67.1 |

🛠️ Environment

Make sure your environment includes the following versions:

transformers==4.46.2
torch==2.5.1
accelerate==1.6.0
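
Assuming a fresh Python environment, the pins above can be installed with pip (pick the torch build matching your CUDA version if needed):

pip install transformers==4.46.2 torch==2.5.1 accelerate==1.6.0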

🚀 Inference Example

import torch
from transformers import AutoProcessor, AutoModel
import requests
from PIL import Image

model_name = "rp-yu/Dimple-7B"
# Dimple ships custom modeling code, so trust_remote_code=True is required
# for both the processor and the model.
processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True
)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
messages = [
    [{"role": "user", "content": [
        {"type": "image", "image": image_url},
        {"type": "text", "text": "Describe this image."}
    ]}],
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, add_vision_id=False
)
images = [
    Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
]

inputs = processor(
    text=text,
    images=images,
    videos=None,
    padding="longest",
    return_tensors="pt",
)

# input_ids is passed to diffusion_generate separately; the remaining
# processor outputs are forwarded as keyword arguments.
input_ids = inputs.pop("input_ids")
output = model.diffusion_generate(
    input_ids,
    max_new_tokens=64,
    output_history=True,
    return_dict_in_generate=True,
    steps=64,
    temperature=0.2,
    top_p=0.95,
    alg="origin",
    use_cache=True,
    alg_p_threshold=0.95,
    use_original_confidence=True,
    decoding_pipeline="dim",
    **inputs
)

# Strip the prompt tokens from each sequence before decoding the generation.
generations = [
    processor.tokenizer.decode(g[len(p):].cpu().tolist())
    for p, g in zip(input_ids, output.sequences)
]

for j in range(len(messages)):
    print("output:", j, generations[j].split(processor.tokenizer.eos_token)[0])

# output: 0 In the image, a woman wearing a shirt with a plaid and a dog are sitting together on a beach. The sun appears to be setting in the background, creating a warm and serene atmosphere.
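
The number of diffusion steps does not have to equal max_new_tokens; with fewer steps, several tokens are committed per step, trading some quality for speed. The variation below changes only steps in the call above; treat the exact setting and its effect as an expectation about this decoding style rather than a measured result.

# Hedged variation of the call above: fewer diffusion steps than new tokens,
# so multiple tokens are committed per step.
output = model.diffusion_generate(
    input_ids,
    max_new_tokens=64,
    steps=32,
    output_history=True,
    return_dict_in_generate=True,
    temperature=0.2,
    top_p=0.95,
    alg="origin",
    use_cache=True,
    alg_p_threshold=0.95,
    use_original_confidence=True,
    decoding_pipeline="dim",
    **inputs
)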

🏆 Evaluation

We evaluate Dimple using lmms-eval. The evaluation scripts are provided in the lmms-eval folder, and the evaluation commands for Dimple can be found at lmms-eval/examples/models/dimple.sh.

To run the evaluation, follow these steps:

  1. Install Dimple dependencies. Make sure all required packages for Dimple are installed; refer to the environment setup above for details.

  2. Install lmms-eval dependencies. Install the necessary dependencies for lmms-eval; these are listed in the lmms-eval repository.

  3. Set your OPENAI_API_KEY. Some tasks require OpenAI API access. Edit the file lmms-eval/examples/models/dimple.sh and replace OPENAI_API_KEY=MY_OPENAI_API_KEY with your actual API key.

Once all dependencies are installed and your API key is set, you can run the evaluation script directly:

sh lmms-eval/examples/models/dimple.sh

This will execute the evaluation pipeline for Dimple using the default configuration.


📢 Community

Feel free to join the Dimple Community for in-depth discussions and idea exchange!

📚 Citation

@misc{dimple,
      title={Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding}, 
      author={Runpeng Yu and Xinyin Ma and Xinchao Wang},
      year={2025},
      eprint={2505.16990},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.16990}, 
}
