
Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features

Published: 23 October 2023

Abstract

Given a query composed of a reference image and a relative caption, the goal of composed image retrieval is to retrieve images that are visually similar to the reference one while integrating the modifications expressed by the caption. Since recent research has demonstrated the efficacy of large-scale vision and language pre-trained (VLP) models on a variety of tasks, we rely on features from the OpenAI CLIP model to tackle this task. We first perform a task-oriented fine-tuning of both CLIP encoders using the element-wise sum of visual and textual features. Then, in a second stage, we train a Combiner network that learns to fuse the image and text features, integrating the bimodal information into combined features used to perform the retrieval. We use contrastive learning in both stages of training. Starting from the bare CLIP features as a baseline, experimental results show that the task-oriented fine-tuning and the carefully crafted Combiner network are highly effective and outperform more complex state-of-the-art approaches on FashionIQ and CIRR, two popular and challenging datasets for composed image retrieval. Code and pre-trained models are available at https://github.com/ABaldrati/CLIP4Cir.
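Since the abstract describes the method only at a high level, a compact sketch may help make the two stages concrete. The following PyTorch code is a minimal illustration: the RN50 backbone choice, the InfoNCE-style contrastive loss, the Combiner layout (concatenation, an MLP, and a residual sum of the two features), and all dimensions and hyperparameters are assumptions made for illustration, not the authors' exact implementation, which is available in the linked repository.

```python
# Minimal sketch of the two-stage recipe described in the abstract.
# Assumptions (not from the paper): RN50 backbone, InfoNCE-style loss,
# and the Combiner layout below. The authors' real implementation is at
# https://github.com/ABaldrati/CLIP4Cir.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)
model = model.float()  # keep everything in fp32 for simplicity

def contrastive_loss(query, target, temperature=0.07):
    """InfoNCE over the batch: the i-th composed query should match
    the i-th target image and no other image in the batch."""
    query = F.normalize(query, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = query @ target.t() / temperature
    labels = torch.arange(len(logits), device=logits.device)
    return F.cross_entropy(logits, labels)

# Stage 1: task-oriented fine-tuning of both CLIP encoders, with the
# composed query built as the element-wise sum of the reference-image
# and caption features (as stated in the abstract).
def stage1_loss(ref_images, captions, target_images):
    ref_feats = model.encode_image(ref_images)
    txt_feats = model.encode_text(clip.tokenize(captions).to(device))
    tgt_feats = model.encode_image(target_images)
    return contrastive_loss(ref_feats + txt_feats, tgt_feats)

# Stage 2: freeze the fine-tuned encoders and train a Combiner that
# learns how to fuse the two modalities. This layout is a plausible
# guess, not the paper's exact network.
class Combiner(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=2048):  # 1024 = RN50 dim
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, img_feats, txt_feats):
        fused = self.fuse(torch.cat([img_feats, txt_feats], dim=-1))
        # The residual sum keeps the stage-1 behaviour as a starting point.
        return F.normalize(fused + img_feats + txt_feats, dim=-1)
```

At retrieval time, the combined query feature would be compared by cosine similarity against pre-extracted CLIP features of the index images, and the top-ranked images returned.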


Cited By

  • (2024) Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives. In Proceedings of the 32nd ACM International Conference on Multimedia, 1632–1641. https://doi.org/10.1145/3664647.3680808
  • (2024) Negative-Sensitive Framework with Semantic Enhancement for Composed Image Retrieval. IEEE Transactions on Multimedia 26, 7608–7621. https://doi.org/10.1109/TMM.2024.3369898
  • (2024) BAMG: Text-Based Person Re-identification via Bottlenecks Attention and Masked Graph Modeling. In Computer Vision – ACCV 2024, 384–401. https://doi.org/10.1007/978-981-96-0966-6_23



Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 3
March 2024
665 pages
EISSN: 1551-6865
DOI: 10.1145/3613614
Editor: Abdulmotaleb El Saddik

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 October 2023
Online AM: 30 August 2023
Accepted: 20 August 2023
Revised: 23 May 2023
Received: 16 November 2022
Published in TOMM Volume 20, Issue 3


Author Tags

  1. multimodal retrieval
  2. combiner networks
  3. vision-language model

Qualifiers

  • Research-article

Funding Sources

  • European Commission under European Horizon 2020 Programme


Article Metrics

  • Downloads (last 12 months): 629
  • Downloads (last 6 weeks): 56
Reflects downloads up to 17 Jan 2025

