Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning | Amit Peleg*, Naman Singh*, Matthias Hein | arXiv, 2025
```bash
conda create --name clic python=3.12
conda activate clic
conda install pytorch==2.2.2 torchvision==0.17.2 pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r pip_reqs.txt
python -m spacy download en_core_web_sm
```
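Optionally, you can sanity-check the environment before running anything else. This is not part of the setup above, just a quick verification that PyTorch sees the GPU and that the spaCy model is installed:

```bash
# Optional sanity check: PyTorch version, CUDA availability, and the spaCy model
python -c "import torch; print(torch.__version__, 'CUDA available:', torch.cuda.is_available())"
python -c "import spacy; spacy.load('en_core_web_sm'); print('spaCy model OK')"
```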
```bash
bash eval.sh
```
- Choose the `architecture` in the bash file.
- Choose the `modelName` in the bash file.
  - For pre-trained non-CLIC models, use the `Pre-train key` from the table below.
  - For CLIC models, use the `CLIC FT-key` from the table below.
  - For evaluating your own checkpoints, use the `Pre-train key` from the training and add the argument `--load_pretrained_clip path/to/ckpt/folder` to the eval file (see the example below).
- Evaluation datasets (ImageNet, COCO, SugarCrepe, SugarCrepe++, etc.) need to be downloaded by the user.
- Make sure the evaluation dataset paths in `local_settings` are correct.
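As an illustration, the relevant settings in `eval.sh` might look like the sketch below; the exact variable syntax inside the script may differ, and the checkpoint path is a placeholder:

```bash
# Illustrative sketch only -- edit the corresponding variables inside eval.sh
architecture="ViT-B-32"   # model architecture
modelName="ViT-B-32"      # Pre-train key (see the table below) for a pre-trained non-CLIC model

# For your own checkpoint, keep the Pre-train key as modelName and append
# --load_pretrained_clip path/to/ckpt/folder to the evaluation command in eval.sh.

bash eval.sh
```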
Model name | Pre-train key | CLIC FT-key | CLIC-model HF-link |
---|---|---|---|
ViT-B-32-CogVLM | ViT-B-32 | HF-CLIC-ViT-B-32-224-CogVLM | HF-Link |
ViT-B-32-PixPr-RedCaps | ViT-B-32 | HF-CLIC-ViT-B-32-224-PixPr-RedCaps | HF-Link |
ViT-B-16-CogVLM | ViT-B-16 | HF-CLIC-ViT-B-16-224-CogVLM | HF-Link |
ViT-L-14-CogVLM | ViT-L-14 | HF-CLIC-ViT-L-14-224-CogVLM | HF-Link |
ViT-L-14-PixPr-RedCaps | ViT-L-14 | HF-CLIC-ViT-L-14-224-PixPr-RedCaps | HF-Link |
CLIPA-CogVLM | CLIPA | HF-CLIC-CLIPA-ViT-L-14-224-CogVLM | HF-Link |
CLIPA-PixPr-RedCaps | CLIPA | HF-CLIC-CLIPA-ViT-L-14-224-PixPr-RedCaps | HF-Link |
CLIPS-CogVLM | CLIPS | HF-CLIC-CLIPS-ViT-L-14-224-CogVLM | HF-Link |
CLIPS-PixPr-RedCaps | CLIPS | HF-CLIC-CLIPS-ViT-L-14-224-PixPr-RedCaps | HF-Link |
Note: with the correct key in the `modelName` variable in `eval.sh`, the models will be downloaded automatically.
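For example, to evaluate a fine-tuned CLIC model you would set `modelName` to the corresponding `CLIC FT-key` from the table above. The snippet below is a hedged sketch (the assignment syntax inside `eval.sh` may differ); the checkpoint is then fetched from Hugging Face automatically:

```bash
# Illustrative sketch: evaluate the CLIC ViT-B-32 model fine-tuned on CogVLM-relabelled data
architecture="ViT-B-32"
modelName="HF-CLIC-ViT-B-32-224-CogVLM"   # CLIC FT-key from the table above
bash eval.sh                              # the checkpoint is downloaded automatically
```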
We fine-tune different models with CLIC using:
- CogVLM-relabelled 1M LAION samples
- the RedCaps subset of the PixelProse dataset
The default location for the datasets is the `data` folder. You can change the location of each dataset in the `local_settings` file.
```bash
mkdir data

# Download the csv file with the image URLs
wget -O data/CLIC-CogVLM-relabelled-Laion.csv https://huggingface.co/datasets/nmndeep/CLIC-CogVLM-relabelled-Laion

# Download the 1M LAION subset and create a csv with the image locations
python -m assets.download_cogvlm

# Download the RedCaps images as described in https://huggingface.co/datasets/tomg-group-umd/pixelprose
python -m assets.download_redcaps

# Process the captions and create the csv file
# If you changed the default location, make sure to change the output path argument as well
python -m assets.create_dataset --input_file data/path/to/downloaded/csv/file.csv --output_file data/redcaps_pixelprose/redcaps_pixelprose.csv
```
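A quick optional check (not part of the original instructions) that the csv files exist where the commands above place them; note that the image-location csv produced by `assets.download_cogvlm` is written to a path defined by that script and is not listed here:

```bash
# Optional: confirm the csv files exist and peek at their first rows
for f in data/CLIC-CogVLM-relabelled-Laion.csv data/redcaps_pixelprose/redcaps_pixelprose.csv; do
    echo "== $f =="
    wc -l "$f"        # number of rows
    head -n 2 "$f"    # header and first entry
done
```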
- Change the `dataset` variable in `trigger_train.sh` to `laion_cogvlm` / `redcaps_pixelprose`.
- Change the `modelName` and `architecture` variables as desired in `trigger_train.sh` (see the example after the commands below).
  - For the `modelName`, use the `Pre-train key` from the table above.
- Make sure the csv file paths in `local_settings` are correct.
- You can run training without evaluation by adding the `--no_eval` argument to the training script.
```bash
# CLIC fine-tuning
bash trigger_train.sh
# NegCLIP training
bash trigger_train_negclip.sh
# Baseline training
bash trigger_train_baseline.sh
```
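As an illustration (the exact variable syntax inside `trigger_train.sh` may differ), the settings for CLIC fine-tuning of a ViT-B-32 on the CogVLM-relabelled data might look like:

```bash
# Illustrative sketch only -- edit the corresponding variables inside trigger_train.sh
dataset="laion_cogvlm"    # or "redcaps_pixelprose"
architecture="ViT-B-32"
modelName="ViT-B-32"      # Pre-train key from the table above

bash trigger_train.sh     # optionally add --no_eval to the training command inside the script
```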
This work uses code/models from:
If you find this repository useful, please consider citing our paper:
```bibtex
@article{peleg2025advancing,
  title={Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning},
  author={Peleg, Amit and Singh, Naman Deep and Hein, Matthias},
  journal={arXiv preprint arXiv:2505.24424},
  year={2025}
}
```