DOI: 10.1145/3469877.3493593

Conditioned Image Retrieval for Fashion using Contrastive Learning and CLIP-based Features

Published: 10 January 2022

Abstract

Building on recent advances in multimodal zero-shot representation learning, in this paper we explore the use of features obtained from the recent CLIP model to perform conditioned image retrieval. Starting from a reference image and an additive textual description of what the user wants relative to that image, we learn a Combiner network that understands the image content, integrates the textual description, and produces a combined feature used to perform the conditioned image retrieval. Starting from the bare CLIP features and a simple baseline, we show that a carefully crafted Combiner network, based on such multimodal features, is extremely effective and outperforms more complex state-of-the-art approaches on the popular FashionIQ dataset.
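The pipeline the abstract describes — project a CLIP image feature and a CLIP text feature, fuse them into one combined feature, and train with a contrastive objective against the target image feature — can be sketched as below. This is a minimal illustrative sketch, not the paper's exact architecture: the layer sizes, the concatenation-based fusion, and the InfoNCE-style loss with its temperature are assumptions, and random vectors stand in for real CLIP features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Combiner(nn.Module):
    """Hypothetical Combiner sketch: fuses a CLIP image feature with a
    CLIP text feature into a single joint feature used to rank candidate
    images by cosine similarity."""
    def __init__(self, clip_dim: int = 512, hidden_dim: int = 1024):
        super().__init__()
        self.image_proj = nn.Sequential(nn.Linear(clip_dim, hidden_dim), nn.ReLU())
        self.text_proj = nn.Sequential(nn.Linear(clip_dim, hidden_dim), nn.ReLU())
        self.mixer = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, clip_dim),
        )

    def forward(self, image_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # concatenate the two projected modalities and mix them
        fused = torch.cat([self.image_proj(image_feat), self.text_proj(text_feat)], dim=-1)
        # L2-normalise so retrieval reduces to a dot product (cosine similarity)
        return F.normalize(self.mixer(fused), dim=-1)

def contrastive_loss(query: torch.Tensor, targets: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Batch-wise contrastive (InfoNCE-style) loss: the i-th combined
    query should score highest against the i-th target image feature."""
    logits = query @ targets.t() / temperature
    labels = torch.arange(query.size(0))
    return F.cross_entropy(logits, labels)

# Toy usage: random unit vectors stand in for CLIP features of the
# reference image, the modifying text, and the target image.
combiner = Combiner()
img = F.normalize(torch.randn(4, 512), dim=-1)
txt = F.normalize(torch.randn(4, 512), dim=-1)
tgt = F.normalize(torch.randn(4, 512), dim=-1)
q = combiner(img, txt)          # combined features, shape (4, 512)
loss = contrastive_loss(q, tgt) # scalar training loss
```

At inference time, retrieval under this sketch is just `q @ gallery_features.t()` followed by a top-k sort, since all features live on the unit sphere.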





Published In

cover image ACM Conferences
MMAsia '21: Proceedings of the 3rd ACM International Conference on Multimedia in Asia
December 2021
508 pages
ISBN:9781450386074
DOI:10.1145/3469877
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. contrastive learning
  2. deep neural networks
  3. multimodal retrieval

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Funding Sources

  • NVIDIA
  • European Horizon 2020 Programme

Conference

MMAsia '21
Sponsor:
MMAsia '21: ACM Multimedia Asia
December 1 - 3, 2021
Gold Coast, Australia

Acceptance Rates

Overall Acceptance Rate 59 of 204 submissions, 29%


Article Metrics

  • Downloads (Last 12 months)97
  • Downloads (Last 6 weeks)7
Reflects downloads up to 29 Jan 2025


Cited By

  • (2024) Enhance Composed Image Retrieval via Multi-Level Collaborative Localization and Semantic Activeness Perception. IEEE Transactions on Multimedia 26, 916–928. DOI: 10.1109/TMM.2023.3273466. Online publication date: 1-Jan-2024.
  • (2024) Multi-Level Contrastive Learning For Hybrid Cross-Modal Retrieval. ICASSP 2024, 6390–6394. DOI: 10.1109/ICASSP48485.2024.10447444. Online publication date: 14-Apr-2024.
  • (2024) Image Retrieval with Composed Query by Multi-Scale Multi-Modal Fusion. ICASSP 2024, 5950–5954. DOI: 10.1109/ICASSP48485.2024.10446291. Online publication date: 14-Apr-2024.
  • (2024) Attribute-Guided Multi-Level Attention Network for Fine-Grained Fashion Retrieval. IEEE Access 12, 48068–48080. DOI: 10.1109/ACCESS.2024.3383785. Online publication date: 2024.
  • (2024) CAMIR: fine-tuning CLIP and multi-head cross-attention mechanism for multimodal image retrieval with sketch and text features. International Journal of Multimedia Information Retrieval 14(1). DOI: 10.1007/s13735-024-00352-6. Online publication date: 24-Dec-2024.
  • (2023) InDiReCT: Language-Guided Zero-Shot Deep Metric Learning for Images. WACV 2023, 1063–1072. DOI: 10.1109/WACV56688.2023.00112. Online publication date: Jan-2023.
  • (2023) Fashion recommendation based on style and social events. Multimedia Tools and Applications 82(24), 38217–38232. DOI: 10.1007/s11042-023-15290-4. Online publication date: 17-May-2023.
  • (2022) Conditioned and composed image retrieval combining and partially fine-tuning CLIP-based features. CVPRW 2022, 4955–4964. DOI: 10.1109/CVPRW56347.2022.00543. Online publication date: Jun-2022.
  • (2022) Effective conditioned and composed image retrieval combining CLIP-based features. CVPR 2022, 21434–21442. DOI: 10.1109/CVPR52688.2022.02080. Online publication date: Jun-2022.
