DOI: 10.1145/3591106.3592266

Improving Image Encoders for General-Purpose Nearest Neighbor Search and Classification

Published: 12 June 2023

Abstract

Recent advances in computer vision research have led to large vision foundation models that generalize across a broad range of image domains and perform exceptionally well on various image-based tasks. However, content-based image-to-image retrieval is often overlooked in this context. This paper investigates the effectiveness of different vision foundation models on two challenging nearest neighbor search-based tasks: zero-shot retrieval and k-NN classification. We establish a benchmark for evaluating the performance of various vision encoders and their pre-training methods, and observe significant differences in performance across these models. Additionally, we propose a fine-tuning regime that improves zero-shot retrieval and k-NN classification by training on a combination of large publicly available datasets, without specializing in any data domain. Our results show that the retrained vision encoders generalize better across different search-based tasks and can be used as general-purpose embedding models for image retrieval.
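The abstract does not spell out the paper's exact evaluation protocol, but the core mechanic of embedding-based k-NN classification is standard: encode every gallery image into a vector, then classify a query by majority vote among its k nearest neighbors under cosine similarity. The sketch below illustrates this with NumPy; the function name, the toy 4-dimensional "embeddings," and the labels are all hypothetical stand-ins for a real encoder's output.

```python
import numpy as np

def knn_classify(query, gallery, labels, k=5):
    """Classify a query embedding by majority vote among its k nearest
    gallery embeddings under cosine similarity."""
    # L2-normalize so the dot product equals cosine similarity
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity to every gallery image
    top_k = np.argsort(-sims)[:k]     # indices of the k most similar images
    votes = [labels[i] for i in top_k]
    return max(set(votes), key=votes.count)

# Toy example: two well-separated classes in a 4-d embedding space
gallery = np.array([
    [1.0, 0.1, 0.0, 0.0],
    [0.9, 0.2, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.0, 0.8, 1.0],
])
labels = ["cat", "cat", "car", "car"]
print(knn_classify(np.array([0.95, 0.15, 0.05, 0.0]), gallery, labels, k=3))  # cat
```

Zero-shot retrieval uses the same normalized similarities without the voting step: rank the gallery by `sims` and return the top hits. This is why the paper can evaluate a single encoder on both tasks with one set of embeddings.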



Published In

ICMR '23: Proceedings of the 2023 ACM International Conference on Multimedia Retrieval
June 2023
694 pages
ISBN: 9798400701788
DOI: 10.1145/3591106
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Content-Based Image Retrieval
  2. Deep Learning
  3. Generalization in Nearest Neighbor-Based Tasks

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMR '23

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%


Article Metrics

  • Downloads (Last 12 months): 75
  • Downloads (Last 6 weeks): 9
Reflects downloads up to 01 Jan 2025


Cited By

  • Known-Item Search in Video: An Eye Tracking-Based Study. Proceedings of the 2024 International Conference on Multimedia Retrieval (2024), 311–319. https://doi.org/10.1145/3652583.3658119
  • Libro - Lifelog Search Browser. Proceedings of the 7th Annual ACM Workshop on the Lifelog Search Challenge (2024), 70–75. https://doi.org/10.1145/3643489.3661124
  • Evaluating Performance and Trends in Interactive Video Retrieval: Insights From the 12th VBS Competition. IEEE Access 12 (2024), 79342–79366. https://doi.org/10.1109/ACCESS.2024.3405638
  • Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment. Similarity Search and Applications (2024), 97–110. https://doi.org/10.1007/978-3-031-75823-2_9
  • Optimizing the Interactive Video Retrieval Tool Vibro for the Video Browser Showdown 2024. MultiMedia Modeling (2024), 364–371. https://doi.org/10.1007/978-3-031-53302-0_33
