- (Unofficial) PyTorch implementation of [Training Vision Transformers for Image Retrieval](https://arxiv.org/abs/2102.05644) (El-Nouby et al., 2021).
- I have not yet reproduced the exact results reported in the paper (in particular, the differential entropy regularization has little effect on the In-shop and SOP datasets).
```bash
# Python 3.7
pip install -r requirements.txt
```
- See `scripts/train.*.sh` for ready-to-run training scripts. Example invocations:
```bash
# CUB-200-2011
python main.py \
    --model deit_small_distilled_patch16_224 \
    --max-iter 2000 \
    --dataset cub200 \
    --data-path /data/CUB_200_2011 \
    --rank 1 2 4 8 \
    --lambda-reg 0.7
```
```bash
# Stanford Online Products
python main.py \
    --model deit_small_distilled_patch16_224 \
    --max-iter 35000 \
    --dataset sop \
    --m 2 \
    --data-path /data/Stanford_Online_Products \
    --rank 1 10 100 1000 \
    --lambda-reg 0.7
```
```bash
# In-shop
python main.py \
    --model deit_small_distilled_patch16_224 \
    --max-iter 35000 \
    --dataset inshop \
    --data-path /data/In-shop \
    --m 2 \
    --rank 1 10 20 30 \
    --memory-ratio 0.2 \
    --device cuda:2 \
    --encoder-momentum 0.999 \
    --lambda-reg 0.7
```
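The `--rank` flag lists the cutoffs K at which Recall@K is reported (the columns of the results table below). As an illustration, here is a minimal sketch of the standard Recall@K computation on L2-normalized embeddings; `embeddings` and `labels` are hypothetical inputs, not names from this repo, and for In-shop the neighbors would be drawn from a separate gallery set rather than from the queries themselves:

```python
import torch


def recall_at_k(embeddings: torch.Tensor, labels: torch.Tensor, ks=(1, 10, 100, 1000)):
    """Recall@K: fraction of queries whose K nearest neighbors
    (excluding the query itself) contain at least one same-label item."""
    # Cosine similarity between all pairs of L2-normalized embeddings.
    embeddings = torch.nn.functional.normalize(embeddings, dim=1)
    sim = embeddings @ embeddings.t()
    sim.fill_diagonal_(float("-inf"))  # exclude self-matches
    max_k = min(max(ks), len(embeddings) - 1)
    # Labels of each query's nearest neighbors, ranked by similarity.
    knn_labels = labels[sim.topk(max_k, dim=1).indices]
    hits = knn_labels == labels.unsqueeze(1)
    return {k: hits[:, : min(k, max_k)].any(dim=1).float().mean().item() for k in ks}
```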
- IRT_O: off-the-shelf extraction of features from a ViT backbone pre-trained on ImageNet.
- IRT_L: fine-tuning the transformer with metric learning, in particular with a contrastive loss (see the sketch after this list).
- IRT_R: additionally regularizing the output feature space with a differential entropy regularizer to encourage uniformity (also covered in the sketch below).
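For reference, below is a minimal PyTorch sketch of the two losses involved: a margin-based contrastive loss on cosine similarity and the Kozachenko-Leonenko (KoLeo) differential entropy regularizer from the paper. The function names, the margin value, and the `(1 - lam) / lam` weighting (matching `--lambda-reg`) are illustrative assumptions, not this repo's exact code:

```python
import torch
import torch.nn.functional as F


def koleo_regularizer(z: torch.Tensor) -> torch.Tensor:
    """Kozachenko-Leonenko differential entropy estimator (up to constants):
    -mean(log of each embedding's distance to its nearest neighbor).
    Minimizing this spreads embeddings more uniformly over the sphere."""
    z = F.normalize(z, dim=1)
    dist = torch.cdist(z, z)           # pairwise L2 distances
    dist.fill_diagonal_(float("inf"))  # ignore self-distance
    nn_dist = dist.min(dim=1).values.clamp_min(1e-8)
    return -nn_dist.log().mean()


def contrastive_loss(z: torch.Tensor, labels: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
    """Margin-based contrastive loss on cosine similarity: pull same-label
    pairs together, push different-label pairs below the margin."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t()
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = (1.0 - sim)[same & ~eye]            # positive pairs, excluding self
    neg = (sim - margin).clamp_min(0.0)[~same]  # negatives above the margin
    return pos.sum() + neg.sum()  # or a mean, depending on convention


def irt_r_loss(z, labels, lam: float = 0.7):
    # Convex combination of the two terms; lam plays the role of --lambda-reg
    # (whether this repo uses exactly this weighting is an assumption).
    return (1 - lam) * contrastive_loss(z, labels) + lam * koleo_regularizer(z)
```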
Recall@K (%):

| Method | Backbone | SOP R@1 | R@10 | R@100 | R@1000 | CUB-200 R@1 | R@2 | R@4 | R@8 | In-Shop R@1 | R@10 | R@20 | R@30 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IRT_O | DeiT-S | 53.12 | 68.96 | 81.60 | 94.09 | 58.68 | 71.30 | 80.96 | 88.18 | 31.28 | 57.03 | 64.20 | 68.28 |
| IRT_L | DeiT-S | 83.56 | 93.29 | 97.23 | 99.03 | 73.68 | 82.58 | 88.77 | 92.71 | 93.09 | 98.28 | 98.74 | 99.02 |
| IRT_R | DeiT-S | 82.67 | 92.73 | 96.69 | 98.80 | 73.73 | 82.91 | 89.30 | 93.35 | 90.47 | 97.97 | 98.61 | 98.92 |
| IRT_R | DeiT-S† | 82.70 | 92.85 | 96.92 | 98.86 | 76.55 | 85.26 | 90.92 | 94.65 | 90.66 | 98.16 | 98.68 | 98.99 |

- †: model pre-trained via distillation from a convnet teacher trained on ImageNet-1k.
- El-Nouby, Alaaeldin, et al. "Training vision transformers for image retrieval." arXiv preprint arXiv:2102.05644 (2021).