
LaViDa: A Large Diffusion Language Model for Multimodal Understanding

[Paper] [arXiv] [Checkpoints] [Data] [Website]

(Figures: Model Architecture; Speedup)

Installation

conda create --name lavida python=3.13
conda activate lavida
pip install -e .[train]
cd eval
pip install -e .
cd ../
pip install trl==0.17.0 
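
Before downloading checkpoints, a quick sanity check can catch a broken PyTorch/CUDA install early. This is a minimal sketch, not part of the official setup; it only assumes PyTorch was pulled in as a dependency.

python -c "import torch; print(torch.__version__, 'CUDA:', torch.cuda.is_available())"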

Download Checkpoint

Please download the checkpoints from Hugging Face and organize them in the following structure:

<repo root>
--lavida-ckpts # create this folder via mkdir
   --lavida-llada-hd # jacklishufan/lavida-llada-v1.0-instruct
   --lavida-dream-hd # jacklishufan/lavida-dream-v1.0-instruct
   --lavida-llada-hd-fim  # jacklishufan/lavida-llada-1.0-fim
   --lavida-llada-hd-reason # hbXNov/lavida-llada-reason
   --lavida-llada-lowres  # jacklishufan/lavida-llada-1.0-lowres
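
One way to fetch the checkpoints is with the huggingface_hub CLI (a sketch, assuming huggingface-cli is installed; repeat for whichever checkpoints you need):

huggingface-cli download jacklishufan/lavida-llada-v1.0-instruct --local-dir lavida-ckpts/lavida-llada-hd
huggingface-cli download jacklishufan/lavida-dream-v1.0-instruct --local-dir lavida-ckpts/lavida-dream-hd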

Inference

Run the following scripts to perform standard inference and text infilling:

python predict.py
python predict_fim.py

Evaluation

Reproduce Main Evaluation Results

| Model | MME | MMMU | MMB | Latency (s/image) |
|---|---|---|---|---|
| LaViDa-Dream | 1463.5 | 42.6 | 73.8 | 1.13 |
| LaViDa-LLaDa | 1365.6 | 43.3 | 70.5 | 1.32 |
| MMaDa | 1410.7 | 30.2 | 68.5 | 3.93 |

(Speed measurements were conducted with generation length 32 and 16 steps.)

The evaluation scripts are under the eval folder. Use the following commands to reproduce the main results on MMMU:

bash eval/run.sh lavida-ckpts/lavida-llada-hd --tasks mmmu_val # for LaViDa-LLaDa
bash eval/run_dream.sh lavida-ckpts/lavida-dream-hd --tasks mmmu_val # for LaViDa-Dream

To reproduce results on other datasets, simply replace mmmu_val with the respective task name.
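
For example, to evaluate on MME instead (assuming the standard lmms-eval task name; check the task list under eval if your name differs):

bash eval/run.sh lavida-ckpts/lavida-llada-hd --tasks mme # MME for LaViDa-LLaDa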

Reproduce COCO Caption Results (Speed-Quality Tradeoff)

bash eval/run_coco.sh lavida-ckpts/lavida-llada-hd

| Model | KV Cache | CIDEr $\uparrow$ | Latency $\downarrow$ | NFE |
|---|---|---|---|---|
| LaViDa-LLaDa | off | 110.2 | 6.65 | 100% |
| LaViDa-LLaDa | on | 107.8 | 2.01 | 100% |
| LaViDa-LLaDa | off | 108.5 | 3.57 | 50% |
| LaViDa-LLaDa | on | 104.4 | 1.32 | 50% |
| LLaVA-1.6-7B (Baseline) | on | 96.7 | 1.67 | 100% |

We find that the low-resolution model is slightly faster than the HD model and has stronger performance on some tasks (e.g., COCO captioning). We provide its inference script as well:

bash eval/run_coco_lowres.sh lavida-ckpts/lavida-llada-lowres 

Training

Data Preparation

The expected data folder structure looks like the following:

<repo root>
--data
   --pretrain # LCS-558K
      -- images
      -- blip_laion_cc_sbu_558k.json
   --Open-LLaVA-NeXT
      -- ai2d
      -- ...
      -- open-llava-next 
   --infovqa-v1
   --VQAv2_train
  1. Download LCS-558K and place it in data/pretrain.
  2. Download all datasets from Open-LLaVA-NeXT and place them in data/Open-LLaVA-NeXT.
  3. Download the remaining datasets from our Hugging Face dataset. It contains three subfolders (a sketch of the layout commands follows this list):
     - infovqa-v1 -> put under data/
     - VQAv2_train -> put under data/
     - open-llava-next -> put under data/Open-LLaVA-NeXT/ and merge with the existing folder of the same name
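
A minimal sketch of organizing the downloads (the downloads/ source directory is a placeholder; adjust the paths to wherever you saved the data):

mkdir -p data/pretrain data/Open-LLaVA-NeXT
mv downloads/infovqa-v1 data/
mv downloads/VQAv2_train data/
cp -r downloads/open-llava-next/. data/Open-LLaVA-NeXT/open-llava-next/ # merge with the existing folder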

Launch Scripts

Pretrain (Stage 1) Scripts:
scripts/train/exps/cluster/pretrain_llada.sh
scripts/train/exps/cluster/pretrain_dream.sh

Finetune (Stage 2) Scripts:

scripts/train/exps/cluster/llada-hd-llada-s2.sh
scripts/train/exps/cluster/llada-hd-dream-s2.sh

To launch the finetuning scripts, you need to change the BASE_RUN_NAME variable in the shell scripts to the path of a Stage 1 checkpoint. If you want to launch Stage 2 training directly, we provide pretrained Stage 1 checkpoints at the Stage-1-LLaDa and Stage-1-Dream links.
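
A sketch of pointing the Stage 2 script at a Stage 1 checkpoint (it assumes BASE_RUN_NAME is assigned near the top of the script; the checkpoint path below is illustrative, substitute your own Stage 1 output or the downloaded Stage 1 checkpoint):

sed -i 's|^BASE_RUN_NAME=.*|BASE_RUN_NAME=lavida-ckpts/stage1-llada|' scripts/train/exps/cluster/llada-hd-llada-s2.sh # path is illustrative
bash scripts/train/exps/cluster/llada-hd-llada-s2.sh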

Common Questions

Why is the validation accuracy 0 during training?

We observed a bug with DeepSpeed ZeRO-3 that breaks inference during validation. Hence, if you want a training run with eval results logged to wandb, please use ZeRO-2.
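
To check which ZeRO stage a run uses (a sketch; the DeepSpeed config is typically referenced inside the launch script, so locate it and point it at a ZeRO-2 config instead):

grep -n "zero" scripts/train/exps/cluster/llada-hd-llada-s2.sh # locate the ZeRO config argument, then swap in a ZeRO-2 config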

Where is the reasoning data used in Stage 3?

It can be found in the Hugging Face collection.

How do I launch FIM training?

The script is scripts/train/exps/cluster/llada-hd-llada-s3-fim.sh

Acknowledgements

This repo is largely based on LLaVA-NeXT. We use lmms-eval for evaluations.
