
Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features

Published: 23 October 2023

Abstract

Given a query composed of a reference image and a relative caption, the goal of composed image retrieval is to retrieve images that are visually similar to the reference one while integrating the modifications expressed by the caption. Since recent research has demonstrated the efficacy of large-scale vision and language pre-trained (VLP) models on a variety of tasks, we rely on features from the OpenAI CLIP model to tackle this task. We first perform a task-oriented fine-tuning of both CLIP encoders using the element-wise sum of visual and textual features. Then, in a second stage, we train a Combiner network that learns to fuse the image and text features, integrating the bimodal information into combined features used to perform the retrieval. We use contrastive learning in both stages of training. Starting from the bare CLIP features as a baseline, experimental results show that the task-oriented fine-tuning and the carefully crafted Combiner network are highly effective and outperform more complex state-of-the-art approaches on FashionIQ and CIRR, two popular and challenging datasets for composed image retrieval. Code and pre-trained models are available at https://github.com/ABaldrati/CLIP4Cir.
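Since the abstract describes the method only at a high level, a compact sketch may help make the two stages concrete. The following PyTorch code is a minimal illustration: the RN50 backbone choice, the InfoNCE-style contrastive loss, the Combiner layout (concatenation, an MLP, and a residual sum of the two features), and all dimensions and hyperparameters are assumptions made for illustration, not the authors' exact implementation, which is available in the linked repository.

```python
# Minimal sketch of the two-stage recipe described in the abstract.
# Assumptions (not from the paper): RN50 backbone, InfoNCE-style loss,
# and the Combiner layout below. The authors' real implementation is at
# https://github.com/ABaldrati/CLIP4Cir.
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)
model = model.float()  # keep everything in fp32 for simplicity

def contrastive_loss(query, target, temperature=0.07):
    """InfoNCE over the batch: the i-th composed query should match
    the i-th target image and no other image in the batch."""
    query = F.normalize(query, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = query @ target.t() / temperature
    labels = torch.arange(len(logits), device=logits.device)
    return F.cross_entropy(logits, labels)

# Stage 1: task-oriented fine-tuning of both CLIP encoders, with the
# composed query built as the element-wise sum of the reference-image
# and caption features (as stated in the abstract).
def stage1_loss(ref_images, captions, target_images):
    ref_feats = model.encode_image(ref_images)
    txt_feats = model.encode_text(clip.tokenize(captions).to(device))
    tgt_feats = model.encode_image(target_images)
    return contrastive_loss(ref_feats + txt_feats, tgt_feats)

# Stage 2: freeze the fine-tuned encoders and train a Combiner that
# learns how to fuse the two modalities. This layout is a plausible
# guess, not the paper's exact network.
class Combiner(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=2048):  # 1024 = RN50 dim
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, img_feats, txt_feats):
        fused = self.fuse(torch.cat([img_feats, txt_feats], dim=-1))
        # The residual sum keeps the stage-1 behaviour as a starting point.
        return F.normalize(fused + img_feats + txt_feats, dim=-1)
```

At retrieval time, the combined query feature would be compared by cosine similarity against pre-extracted CLIP features of the index images, and the top-ranked images returned.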


Cited By

  • (2024) Improving Composed Image Retrieval via Contrastive Learning with Scaling Positives and Negatives. In Proceedings of the 32nd ACM International Conference on Multimedia, 1632–1641. https://doi.org/10.1145/3664647.3680808
  • (2024) Negative-Sensitive Framework with Semantic Enhancement for Composed Image Retrieval. IEEE Transactions on Multimedia 26, 7608–7621. https://doi.org/10.1109/TMM.2024.3369898
  • (2024) BAMG: Text-Based Person Re-identification via Bottlenecks Attention and Masked Graph Modeling. In Computer Vision – ACCV 2024, 384–401. https://doi.org/10.1007/978-981-96-0966-6_23



Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 3
March 2024
665 pages
EISSN: 1551-6865
DOI: 10.1145/3613614
Editor: Abdulmotaleb El Saddik

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 October 2023
Online AM: 30 August 2023
Accepted: 20 August 2023
Revised: 23 May 2023
Received: 16 November 2022
Published in TOMM Volume 20, Issue 3


Author Tags

  1. multimodal retrieval
  2. combiner networks
  3. vision-language model

Qualifiers

  • Research-article

Funding Sources

  • European Commission under European Horizon 2020 Programme


Article Metrics

  • Downloads (last 12 months): 629
  • Downloads (last 6 weeks): 56
Reflects downloads up to 17 Jan 2025

