[Paper] [Arxiv] [Checkpoints] [Data] [Website]
conda create --name lavida python=3.13
conda activate lavida
pip install -e .[train]
cd eval
pip install -e .
cd ../
pip install trl==0.17.0
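If the install succeeded, a quick sanity check (assuming the [train] extra pulled in PyTorch and transformers) is:

```python
# Minimal environment sanity check; adjust if your setup differs.
import torch
import transformers

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```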
Please download the checkpoints from Huggingface and organize them in the following structure (one way to script the download is sketched after the listing):
<repo root>
--lavida-ckpts # create this folder via mkdir
--lavida-llada-hd # jacklishufan/lavida-llada-v1.0-instruct
--lavida-dream-hd # jacklishufan/lavida-dream-v1.0-instruct
--lavida-llada-hd-fim # jacklishufan/lavida-llada-1.0-fim
--lavida-llada-hd-reason # hbXNov/lavida-llada-reason
--lavida-llada-lowres # jacklishufan/lavida-llada-1.0-lowres
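As an alternative to downloading by hand, the sketch below (assuming huggingface_hub is available in the environment) pulls each checkpoint into the layout above:

```python
# Sketch: download every checkpoint listed above into lavida-ckpts/.
from huggingface_hub import snapshot_download

CHECKPOINTS = {
    "lavida-llada-hd": "jacklishufan/lavida-llada-v1.0-instruct",
    "lavida-dream-hd": "jacklishufan/lavida-dream-v1.0-instruct",
    "lavida-llada-hd-fim": "jacklishufan/lavida-llada-1.0-fim",
    "lavida-llada-hd-reason": "hbXNov/lavida-llada-reason",
    "lavida-llada-lowres": "jacklishufan/lavida-llada-1.0-lowres",
}

for folder, repo_id in CHECKPOINTS.items():
    snapshot_download(repo_id=repo_id, local_dir=f"lavida-ckpts/{folder}")
```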
Run the following scripts to perform standard inference and text infilling:
python predict.py
python predict_fim.py
Model | MME | MMMU | MMB | Latency (s/image) |
---|---|---|---|---|
LaViDa-Dream | 1463.5 | 42.6 | 73.8 | 1.13 |
LaViDa-LLaDa | 1365.6 | 43.3 | 70.5 | 1.32 |
MMaDa | 1410.7 | 30.2 | 68.5 | 3.93 |
(speed measurement conducted with generation length=32 and steps=16)
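For intuition on why fewer steps lower the latency, here is a toy sketch (not LaViDa's actual decoder) of a masked-diffusion sampler: each step is one forward pass and unmasks roughly length/steps tokens, so length=32 with steps=16 finalizes 2 tokens per pass.

```python
# Toy masked-diffusion decoding loop; the "model" is a random stand-in.
import math
import random

MASK = "<mask>"
VOCAB = ["a", "b", "c", "d"]

def toy_forward(tokens):
    # Stand-in for one model forward pass: (prediction, confidence) for every masked slot.
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def sample(length=32, steps=16):
    tokens = [MASK] * length
    per_step = math.ceil(length / steps)  # 32 / 16 -> 2 tokens finalized per step
    for _ in range(steps):                # each iteration costs one forward pass
        preds = toy_forward(tokens)
        # Commit the highest-confidence predictions; everything else stays masked.
        ranked = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _) in ranked[:per_step]:
            tokens[i] = tok
    return tokens

print(sample())
```

Halving the number of steps halves the forward passes, which is presumably what the 50% NFE rows in the captioning table below correspond to.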
The evaluation scripts are under the eval folder. Please use the following scripts to reproduce the main results on MMMU:
bash eval/run.sh lavida-ckpts/lavida-llada-hd --tasks mmmu_val # for LaViDa-LLaDa
bash eval/run_dream.sh lavida-ckpts/lavida-dream-hd --tasks mmmu_val # for LaViDa-Dream
To reproduce results on other datasets, simply replace mmmu_val with the respective dataset name.
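To sweep several benchmarks with one checkpoint, a small wrapper like the following works (the task names here are assumptions following lmms-eval conventions; check the eval folder for the exact list):

```python
# Hypothetical convenience loop over benchmarks; it only shells out to eval/run.sh.
import subprocess

CKPT = "lavida-ckpts/lavida-llada-hd"
TASKS = ["mmmu_val", "mme", "mmbench_en_dev"]  # assumed lmms-eval task names

for task in TASKS:
    subprocess.run(["bash", "eval/run.sh", CKPT, "--tasks", task], check=True)
```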
bash eval/run_coco.sh lavida-ckpts/lavida-llada-hd
Model | KV Cache | CIDEr | Latency (s/image) | NFE |
---|---|---|---|---|
LaViDa-LLaDa | off | 110.2 | 6.65 | 100% |
LaViDa-LLaDa | on | 107.8 | 2.01 | 100% |
LaViDa-LLaDa | off | 108.5 | 3.57 | 50% |
LaViDa-LLaDa | on | 104.4 | 1.32 | 50% |
LLaVA-1.6-7B (Baseline) | on | 96.7 | 1.67 | 100% |
We find that the low-resolution model is slightly faster than the HD model and has stronger performance on some tasks (e.g., COCO captioning). We provide the inference script as well:
bash eval/run_coco_lowres.sh lavida-ckpts/lavida-llada-lowres
The expected data folder structure looks like the following:
<repo root>
--data
--pretrain # LCS-558K
-- images
-- blip_laion_cc_sbu_558k.json
--Open-LLaVA-NeXT
-- ai2d
-- ...
-- open-llava-next
--infovqa-v1
--VQAv2_train
- Download LCS-558K and place it in data/pretrain
- Download all datasets from Open-LLaVA-NeXT and place them in data/Open-LLaVA-NeXT
- Download the remaining datasets from Our Huggingface (see the sketch below). This dataset contains three subfolders:
infovqa-v1 -> put under data/
VQAv2_train -> put under data/
open-llava-next -> put under data/Open-LLaVA-NeXT/ and merge it with the existing folder of the same name
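One way to script the last step (the dataset repo id below is a placeholder; replace it with the actual repo from our Huggingface link):

```python
# Sketch: download the dataset repo and move its three subfolders into place.
# "<lavida-data-repo>" is a placeholder, not the real repo id.
import shutil
from huggingface_hub import snapshot_download

local = snapshot_download(repo_id="<lavida-data-repo>", repo_type="dataset",
                          local_dir="data/_lavida_download")

shutil.move(f"{local}/infovqa-v1", "data/infovqa-v1")
shutil.move(f"{local}/VQAv2_train", "data/VQAv2_train")
# open-llava-next must be merged into the existing folder of the same name.
shutil.copytree(f"{local}/open-llava-next", "data/Open-LLaVA-NeXT/open-llava-next",
                dirs_exist_ok=True)
```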
Pretrain (Stage 1) Scripts:
scripts/train/exps/cluster/pretrain_llada.sh
scripts/train/exps/cluster/pretrain_dream.sh
Finetune (Stage 2) Scripts:
scripts/train/exps/cluster/llada-hd-llada-s2.sh
scripts/train/exps/cluster/llada-hd-dream-s2.sh
To launch the finetuning scripts, you need to change the BASE_RUN_NAME variable in the shell scripts to the path of the Stage 1 checkpoints. If you want to directly launch Stage 2 training, we provide pretrained Stage 1 checkpoints under the links Stage-1-LLaDa and Stage-1-Dream.
We observed a bug with DeepSpeed ZeRO-3 that breaks inference during validation. Hence, if you want to start a training run with eval results logged to wandb, please use ZeRO-2.
It can be found in the Huggingface collection.
The training script is at scripts/train/exps/cluster/llada-hd-llada-s3-fim.sh.
This repo is largely based on LLaVA-NeXT. We use lmms-eval for evaluations.