Yinqi Li1,2, Jiahe Zhao1,2, Hong Chang1,2, Ruibing Hou1, Shiguang Shan1,2, Xilin Chen1,2
1Institute of Computing Technology, Chinese Academy of Sciences
unCLIP provides an encode-decode framework for observing which image features CLIP disregards.
Our un2CLIP builds on this framework to improve CLIP by recapturing those disregarded features.
Clone this repository and create a conda environment with the following commands:
```bash
git clone git@github.com:LiYinqi/un2CLIP.git
cd un2CLIP
conda env create -f environment.yaml
conda activate un2clip
```
Our models are released on HuggingFace🤗.
| CLIP Model | Resolution | MMVP-VLM (Original) | MMVP-VLM (Ours) | Link |
|---|---|---|---|---|
| OpenAI CLIP ViT-L-14 | 224 | 19.3 | 32.6 | openai_vit_l_14_224.ckpt |
| OpenAI CLIP ViT-L-14 | 336 | 20.0 | 30.4 | openai_vit_l_14_336.ckpt |
| OpenCLIP ViT-H-14 | 224 | 28.9 | 36.3 | openclip_vit_h_14_224.ckpt |
| SigLIP ViT-SO-14 | 384 | 37.0 | 41.5 | siglip_vit_so_14_384.ckpt |
We assume the checkpoints are saved in the `./pretrained_models` directory with their original names.
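For reference, the checkpoints can be fetched from the HuggingFace Hub with `huggingface-cli`. The snippet below is a minimal sketch, not part of the official instructions: the repository id is a placeholder, so substitute the actual id from the checkpoint links above.

```bash
# Placeholder repository id -- replace with the actual HuggingFace repo from the links above.
HF_REPO="YOUR_HF_REPO_ID"

# Download the released checkpoints into ./pretrained_models, keeping their original names.
mkdir -p ./pretrained_models
huggingface-cli download "$HF_REPO" \
    openai_vit_l_14_224.ckpt openai_vit_l_14_336.ckpt \
    openclip_vit_h_14_224.ckpt siglip_vit_so_14_384.ckpt \
    --local-dir ./pretrained_models
```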
- Download the MMVP-VLM benchmark and place it in a local directory.
- Run the evaluation script for each CLIP model by specifying different `un2clip_ckpt_path` arguments (a loop over all released checkpoints is sketched after the example). For example, to evaluate OpenAI CLIP ViT-L-14 at 224 resolution, run:
```bash
python eval_mmvpvlm.py \
    --benchmark_dir "YOUR_MMVP_VLM_PATH" \
    --un2clip_ckpt_path "./pretrained_models/openai_vit_l_14_224.ckpt"
```
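To benchmark all four released checkpoints in one go, a simple shell loop like the one below should work, assuming the checkpoints sit in `./pretrained_models` under their original names and that `eval_mmvpvlm.py` infers the CLIP backbone from the checkpoint; if an explicit backbone flag is required, adjust accordingly.

```bash
# Evaluate each released un2CLIP checkpoint on MMVP-VLM in turn.
for ckpt in openai_vit_l_14_224 openai_vit_l_14_336 openclip_vit_h_14_224 siglip_vit_so_14_384; do
    python eval_mmvpvlm.py \
        --benchmark_dir "YOUR_MMVP_VLM_PATH" \
        --un2clip_ckpt_path "./pretrained_models/${ckpt}.ckpt"
done
```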
- Release model checkpoints.
- Release training code.
If you find this code or project useful, please consider giving a star⭐ or citing:
```bibtex
@article{li2025un2clip,
  title   = {{un$^2$CLIP}: Improving {CLIP}'s Visual Detail Capturing Ability via Inverting {unCLIP}},
  author  = {Yinqi Li and Jiahe Zhao and Hong Chang and Ruibing Hou and Shiguang Shan and Xilin Chen},
  year    = {2025},
  journal = {arXiv preprint arXiv:2505.24517}
}
```