
llaith-ai/BLIP3o

🌌 BLIP3-o

BLIP3-o is a unified multimodal model that combines the reasoning and instruction-following strengths of autoregressive models with the generative power of diffusion models. Unlike prior works that diffuse VAE features or raw pixels, BLIP3-o diffuses semantically rich CLIP image features, enabling a powerful and efficient architecture for both image understanding and generation.

📖 Arxiv

Update

  • [2025/05/16] 🔥 We’ve published a dataset of 20 million images with detailed captions (BLIP3o Pretrain Long Caption) and 4 million images with short captions (BLIP3o Pretrain Short Caption). All images and their captions are packed into tar archives, so no separate image URL downloads or manual unzipping are required.

  • [2025/05/16] 🔥 We’ve reorganized and cleaned up the repository to ensure a clear, well-structured codebase. Please give the training and inference scripts a try, and feel free to leave an issue if you run into any problems. We apologize for any confusion caused by our original codebase release.

✨ Highlights

  • Fully Open-Source: Fully open-source training data (Pretraining and Instruction Tuning), training recipe, model weights, code.
  • Unified Architecture: for both image understanding and generation.
  • CLIP Feature Diffusion: Directly diffuses semantic vision features for stronger alignment and performance.
  • State-of-the-art performance: across a wide range of image understanding and generation benchmarks.

Demo

You can try out BLIP3-o in your browser using our interactive Demo.

Install packages for training

conda create -n blip3o python=3.11 -y
conda activate blip3o
pip install --upgrade pip  setuptools
pip install -r requirements.txt

Model Checkpoint

BLIP3o-4B 4B

BLIP3o-8B 8B

Inference

You can download our checkpoint with

python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='BLIP3o/BLIP3o-Model', repo_type='model'))"

and run the inference code

python inference.py  /HF_model/checkpoint/path/
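The two commands above can also be chained from Python, so that the path returned by snapshot_download is passed straight to inference.py. This is only a sketch: the helper names inference_command and download_and_run are hypothetical, and it assumes inference.py takes the checkpoint directory as its single positional argument, as in the command shown above.

```python
import subprocess
import sys


def inference_command(checkpoint_dir, script="inference.py"):
    """Build the command line that runs the inference script on a checkpoint."""
    return [sys.executable, script, checkpoint_dir]


def download_and_run(repo_id="BLIP3o/BLIP3o-Model"):
    """Download the checkpoint snapshot, then run inference on it."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    # snapshot_download returns the local directory that holds the snapshot.
    checkpoint_dir = snapshot_download(repo_id=repo_id, repo_type="model")
    subprocess.run(inference_command(checkpoint_dir), check=True)
```

Calling download_and_run() from the repository root would download the checkpoint (or reuse the cached copy) and then start inference on it.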

Training

We include two scripts: slurm.sh for multi-node training on Slurm clusters, and run.sh for debugging.

For both slurm.sh and run.sh, you need to set (export) the Hugging Face home directory HF_HOME, the training data folder IMG_FOLDER, and the output folder for saved models OUTPUT_FOLDER.
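A small pre-flight check can catch a missing variable before a job is submitted. The sketch below is not part of the repository; the helper name missing_training_vars is hypothetical, and it only checks that the three variables named above are non-empty, not that the folders exist.

```python
import os

# The three variables the training scripts expect, as listed above.
REQUIRED_VARS = ("HF_HOME", "IMG_FOLDER", "OUTPUT_FOLDER")


def missing_training_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Running missing_training_vars() before launching slurm.sh or run.sh returns an empty list when everything is set, and the names of the missing variables otherwise.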

For our open-source model training, we combine the pretraining datasets (both long and short captions) with images from JourneyDB, which you can download separately. When training the diffusion transformer from scratch, we recommend using a large number of training steps along with a cosine annealing learning rate schedule that decays from 1×10⁻⁴ down to 1×10⁻⁵.
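To make the recommended schedule concrete, here is a minimal sketch of a cosine annealing curve between the two learning rates mentioned above. The function name and the total_steps parameter are assumptions for illustration; the sketch uses no warmup and is not taken from the training scripts.

```python
import math


def cosine_annealed_lr(step, total_steps, lr_max=1e-4, lr_min=1e-5):
    """Learning rate at `step` for a cosine decay from lr_max to lr_min.

    total_steps must be positive.
    """
    # Clamp progress to [0, 1] so steps past the end stay at lr_min.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

The rate starts at lr_max, passes the midpoint between the two values halfway through training, and ends at lr_min. In PyTorch, torch.optim.lr_scheduler.CosineAnnealingLR with eta_min=1e-5 produces the same shape.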

CLIP + Diffusion (Encoder + Decoder)

We also provide two CLIP + Diffusion (encoder + decoder) combinations:

[EVA-CLIP + SDXL]: The model checkpoint already includes the diffusion decoder (diffusion-decoder). The EVA-CLIP vision tower weights can be downloaded separately (EVA-CLIP), and the EVA-CLIP preprocessing is part of the training code (EVA-CLIP-preprocess).

[SigLIP + SANA]: [coming soon]

Supported Tasks

  • Text → Text
  • Image → Text (Image Understanding)
  • Text → Image (Image Generation)
  • Image → Image (Image Editing)
  • Multitask Training (mixed image generation and understanding training)

Supported Image Generation Methods

  • CLIP + MSE
  • CLIP + Flow Matching
  • VAE + Flow Matching
  • Transfusion, LMFusion
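The difference between an MSE objective and a flow matching objective on feature vectors can be sketched in a few lines. This is a generic, textbook-style formulation with a linear interpolation path, not the repository's implementation: all function names are hypothetical, features are plain Python lists, and the conditioning on the autoregressive model's output is omitted.

```python
import random


def mse_loss(pred, target):
    """Mean squared error between two equal-length feature vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def flow_matching_pair(x0, x1, t):
    """Build one training pair for a linear interpolation path.

    x0 is a noise sample, x1 the target feature vector, t a time in [0, 1].
    Returns the interpolated point x_t and the target velocity x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def flow_matching_loss(predict_velocity, x1, rng=random):
    """One-sample flow matching loss for a velocity predictor callable."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                        # uniform time in [0, 1)
    x_t, velocity = flow_matching_pair(x0, x1, t)
    # The model sees the noisy point and the time, and regresses the velocity.
    return mse_loss(predict_velocity(x_t, t), velocity)
```

With a plain MSE objective the model regresses the target features directly (mse_loss(prediction, x1)), which yields a single deterministic output. With flow matching the model learns a velocity field that transports noise to features, so different noise samples can produce different outputs.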

Supported Autoregressive Backbones

  • Qwen-2.5-VL
  • LLaMA 3

We suggest using Qwen-2.5-VL as the backbone; we are still fixing some tokenizer issues for LLaMA 3.

Supported Dataset Format

  • WebDataset
  • JSON

Data Loading

Most of our training data is stored in WebDataset format and loaded with Hugging Face datasets. To download the datasets:

Pretrain

You can download the datasets with

python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='BLIP3o/BLIP3o-Pretrain', repo_type='dataset'))"

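and then load them directly with the webdataset builder of Hugging Face datasets

from datasets import load_dataset
train_dataset = load_dataset("webdataset", data_files=data_files, split="train", num_proc=128)  # data_files: list of .tar shard paths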
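The load_dataset call above expects data_files to be defined already. One way to build it is sketched below, under the assumption that the downloaded snapshot contains .tar shards somewhere under its root folder. The helper names find_tar_shards and load_pretrain_dataset are hypothetical, and the default num_proc is an arbitrary smaller value than the 128 used above.

```python
import glob
import os


def find_tar_shards(folder):
    """Return a sorted list of .tar shard paths found anywhere under `folder`."""
    pattern = os.path.join(folder, "**", "*.tar")
    return sorted(glob.glob(pattern, recursive=True))


def load_pretrain_dataset(folder, num_proc=8):
    """Load all shards under `folder` with the `webdataset` builder."""
    # Imported lazily so find_tar_shards works without `datasets` installed.
    from datasets import load_dataset

    data_files = find_tar_shards(folder)
    if not data_files:
        raise FileNotFoundError(f"No .tar shards found under {folder}")
    return load_dataset("webdataset", data_files=data_files, split="train", num_proc=num_proc)
```

Passing the path printed by snapshot_download to load_pretrain_dataset ties the download and loading steps together; failing early on an empty shard list avoids a less obvious error later in training.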

BLIP3o-60k

Figure: Qualitative results of BLIP3-o.
