Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning | Amit Peleg*, Naman Singh*, Matthias Hein | arXiv, 2025
```bash
conda create --name clic python=3.12
conda activate clic
conda install pytorch==2.2.2 torchvision==0.17.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r pip_reqs.txt
python -m spacy download en_core_web_sm
```
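Optionally, you can sanity-check the environment before running anything else. This is not part of the setup above, just a quick verification that PyTorch sees the GPU and that the spaCy model is installed:

```bash
# Optional sanity check: PyTorch version, CUDA availability, and the spaCy model
python -c "import torch; print(torch.__version__, 'CUDA available:', torch.cuda.is_available())"
python -c "import spacy; spacy.load('en_core_web_sm'); print('spaCy model OK')"
```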
```bash
bash eval.sh
```
- Choose the `architecture` in the bash file.
- Choose the `modelName` in the bash file.
  - For pre-trained non-CLIC models, use the `Pre-train key` from the table below.
  - For CLIC models, use the `CLIC FT-key` from the table below.
  - For evaluating your own checkpoints, use the `Pre-train key` from the training and add the argument `--load_pretrained_clip path/to/ckpt/folder` to the eval file (see the example below).
- Evaluation datasets (ImageNet, COCO, SugarCrepe, SugarCrepe++, etc.) need to be downloaded by the user.
- Make sure the evaluation dataset paths in `local_settings` are correct.
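As an illustration, the relevant settings in `eval.sh` might look like the sketch below; the exact variable syntax inside the script may differ, and the checkpoint path is a placeholder:

```bash
# Illustrative sketch only -- edit the corresponding variables inside eval.sh
architecture="ViT-B-32"   # model architecture
modelName="ViT-B-32"      # Pre-train key (see the table below) for a pre-trained non-CLIC model

# For your own checkpoint, keep the Pre-train key as modelName and append
# --load_pretrained_clip path/to/ckpt/folder to the evaluation command in eval.sh.

bash eval.sh
```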
Model name | Pre-train key | CLIC FT-key | CLIC-model HF-link |
---|---|---|---|
ViT-B-32-CogVLM | ViT-B-32 | HF-CLIC-ViT-B-32-224-CogVLM | HF-Link |
ViT-B-32-PixPr-RedCaps | ViT-B-32 | HF-CLIC-ViT-B-32-224-PixPr-RedCaps | HF-Link |
ViT-B-16-CogVLM | ViT-B-16 | HF-CLIC-ViT-B-16-224-CogVLM | HF-Link |
ViT-L-14-CogVLM | ViT-L-14 | HF-CLIC-ViT-L-14-224-CogVLM | HF-Link |
ViT-L-14-PixPr-RedCaps | ViT-L-14 | HF-CLIC-ViT-L-14-224-PixPr-RedCaps | HF-Link |
CLIPA-CogVLM | CLIPA | HF-CLIC-CLIPA-ViT-L-14-224-CogVLM | HF-Link |
CLIPA-PixPr-RedCaps | CLIPA | HF-CLIC-CLIPA-ViT-L-14-224-PixPr-RedCaps | HF-Link |
CLIPS-CogVLM | CLIPS | HF-CLIC-CLIPS-ViT-L-14-224-CogVLM | HF-Link |
CLIPS-PixPr-RedCaps | CLIPS | HF-CLIC-CLIPS-ViT-L-14-224-PixPr-RedCaps | HF-Link |
Note: with the correct key in the `modelName` variable in `eval.sh`, the models will be downloaded automatically.
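For example, to evaluate a fine-tuned CLIC model you would set `modelName` to the corresponding `CLIC FT-key` from the table above. The snippet below is a hedged sketch (the assignment syntax inside `eval.sh` may differ); the checkpoint is then fetched from Hugging Face automatically:

```bash
# Illustrative sketch: evaluate the CLIC ViT-B-32 model fine-tuned on CogVLM-relabelled data
architecture="ViT-B-32"
modelName="HF-CLIC-ViT-B-32-224-CogVLM"   # CLIC FT-key from the table above
bash eval.sh                              # the checkpoint is downloaded automatically
```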
We fine-tune different models with CLIC using:
- CogVLM-relabelled 1M LAION samples
- the RedCaps subset of the PixelProse dataset
The default location for the datasets is the `data` folder. You can change the location of each dataset in the `local_settings` file.
```bash
mkdir data

# Download the csv file with the image URLs
wget -O data/CLIC-CogVLM-relabelled-Laion.csv https://huggingface.co/datasets/nmndeep/CLIC-CogVLM-relabelled-Laion

# Download the 1M LAION subset and create a csv with the image locations
python -m assets.download_cogvlm

# Download the RedCaps images as described in https://huggingface.co/datasets/tomg-group-umd/pixelprose
python -m assets.download_redcaps

# Process the captions and create the csv file
# If you changed the default location, make sure to change the output path argument as well
python -m assets.create_dataset --input_file data/path/to/downloaded/csv/file.csv --output_file data/redcaps_pixelprose/redcaps_pixelprose.csv
```
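A quick optional check (not part of the original instructions) that the csv files exist where the commands above place them; note that the image-location csv produced by `assets.download_cogvlm` is written to a path defined by that script and is not listed here:

```bash
# Optional: confirm the csv files exist and peek at their first rows
for f in data/CLIC-CogVLM-relabelled-Laion.csv data/redcaps_pixelprose/redcaps_pixelprose.csv; do
    echo "== $f =="
    wc -l "$f"        # number of rows
    head -n 2 "$f"    # header and first entry
done
```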
- Change the `dataset` variable in `trigger_train.sh` to `laion_cogvlm` / `redcaps_pixelprose`.
- Change the `modelName` and `architecture` variables as desired in `trigger_train.sh` (see the example after the commands below).
  - For the `modelName`, use the `Pre-train key` from the table above.
- Make sure the csv file paths in `local_settings` are correct.
- You can run training without evaluation by adding the `--no_eval` argument to the training script.
```bash
# CLIC fine-tuning
bash trigger_train.sh
# NegCLIP training
bash trigger_train_negclip.sh
# Baseline training
bash trigger_train_baseline.sh
```
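As an illustration (the exact variable syntax inside `trigger_train.sh` may differ), the settings for CLIC fine-tuning of a ViT-B-32 on the CogVLM-relabelled data might look like:

```bash
# Illustrative sketch only -- edit the corresponding variables inside trigger_train.sh
dataset="laion_cogvlm"    # or "redcaps_pixelprose"
architecture="ViT-B-32"
modelName="ViT-B-32"      # Pre-train key from the table above

bash trigger_train.sh     # optionally add --no_eval to the training command inside the script
```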
This work uses code/models from:
If you find this repository useful, please consider citing our paper:
```bibtex
@article{peleg2025advancing,
  title={Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning},
  author={Peleg, Amit and Singh, Naman Deep and Hein, Matthias},
  journal={arXiv preprint arXiv:2505.24424},
  year={2025}
}
```