DOI: 10.1145/3469877.3493593

Conditioned Image Retrieval for Fashion using Contrastive Learning and CLIP-based Features

Published: 10 January 2022

Abstract

Building on recent advances in multimodal zero-shot representation learning, in this paper we explore the use of features obtained from the recent CLIP model to perform conditioned image retrieval. Starting from a reference image and an additive textual description of what the user wants relative to that image, we learn a Combiner network that understands the image content, integrates the textual description, and produces a combined feature used to perform the conditioned image retrieval. Starting from the bare CLIP features and a simple baseline, we show that a carefully crafted Combiner network, based on such multimodal features, is extremely effective and outperforms more complex state-of-the-art approaches on the popular FashionIQ dataset.
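The pipeline the abstract describes — project a CLIP image feature and a CLIP text feature, fuse them into one combined feature, and train with a contrastive objective against the target image feature — can be sketched as below. This is a minimal illustrative sketch, not the paper's exact architecture: the layer sizes, the concatenation-based fusion, and the InfoNCE-style loss with its temperature are assumptions, and random vectors stand in for real CLIP features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Combiner(nn.Module):
    """Hypothetical Combiner sketch: fuses a CLIP image feature with a
    CLIP text feature into a single joint feature used to rank candidate
    images by cosine similarity."""
    def __init__(self, clip_dim: int = 512, hidden_dim: int = 1024):
        super().__init__()
        self.image_proj = nn.Sequential(nn.Linear(clip_dim, hidden_dim), nn.ReLU())
        self.text_proj = nn.Sequential(nn.Linear(clip_dim, hidden_dim), nn.ReLU())
        self.mixer = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, clip_dim),
        )

    def forward(self, image_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # concatenate the two projected modalities and mix them
        fused = torch.cat([self.image_proj(image_feat), self.text_proj(text_feat)], dim=-1)
        # L2-normalise so retrieval reduces to a dot product (cosine similarity)
        return F.normalize(self.mixer(fused), dim=-1)

def contrastive_loss(query: torch.Tensor, targets: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Batch-wise contrastive (InfoNCE-style) loss: the i-th combined
    query should score highest against the i-th target image feature."""
    logits = query @ targets.t() / temperature
    labels = torch.arange(query.size(0))
    return F.cross_entropy(logits, labels)

# Toy usage: random unit vectors stand in for CLIP features of the
# reference image, the modifying text, and the target image.
combiner = Combiner()
img = F.normalize(torch.randn(4, 512), dim=-1)
txt = F.normalize(torch.randn(4, 512), dim=-1)
tgt = F.normalize(torch.randn(4, 512), dim=-1)
q = combiner(img, txt)          # combined features, shape (4, 512)
loss = contrastive_loss(q, tgt) # scalar training loss
```

At inference time, retrieval under this sketch is just `q @ gallery_features.t()` followed by a top-k sort, since all features live on the unit sphere.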





Published In

cover image ACM Conferences
MMAsia '21: Proceedings of the 3rd ACM International Conference on Multimedia in Asia
December 2021
508 pages
ISBN:9781450386074
DOI:10.1145/3469877
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. contrastive learning
  2. deep neural networks
  3. multimodal retrieval

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Funding Sources

  • NVIDIA
  • European Horizon 2020 Programme

Conference

MMAsia '21
Sponsor:
MMAsia '21: ACM Multimedia Asia
December 1 - 3, 2021
Gold Coast, Australia

Acceptance Rates

Overall Acceptance Rate 59 of 204 submissions, 29%


Article Metrics

  • Downloads (Last 12 months)97
  • Downloads (Last 6 weeks)7
Reflects downloads up to 29 Jan 2025


Cited By

  • (2024) Enhance Composed Image Retrieval via Multi-Level Collaborative Localization and Semantic Activeness Perception. IEEE Transactions on Multimedia 26, 916–928. DOI: 10.1109/TMM.2023.3273466. Online publication date: 1-Jan-2024.
  • (2024) Multi-Level Contrastive Learning For Hybrid Cross-Modal Retrieval. ICASSP 2024, 6390–6394. DOI: 10.1109/ICASSP48485.2024.10447444. Online publication date: 14-Apr-2024.
  • (2024) Image Retrieval with Composed Query by Multi-Scale Multi-Modal Fusion. ICASSP 2024, 5950–5954. DOI: 10.1109/ICASSP48485.2024.10446291. Online publication date: 14-Apr-2024.
  • (2024) Attribute-Guided Multi-Level Attention Network for Fine-Grained Fashion Retrieval. IEEE Access 12, 48068–48080. DOI: 10.1109/ACCESS.2024.3383785. Online publication date: 2024.
  • (2024) CAMIR: fine-tuning CLIP and multi-head cross-attention mechanism for multimodal image retrieval with sketch and text features. International Journal of Multimedia Information Retrieval 14(1). DOI: 10.1007/s13735-024-00352-6. Online publication date: 24-Dec-2024.
  • (2023) InDiReCT: Language-Guided Zero-Shot Deep Metric Learning for Images. WACV 2023, 1063–1072. DOI: 10.1109/WACV56688.2023.00112. Online publication date: Jan-2023.
  • (2023) Fashion recommendation based on style and social events. Multimedia Tools and Applications 82(24), 38217–38232. DOI: 10.1007/s11042-023-15290-4. Online publication date: 17-May-2023.
  • (2022) Conditioned and composed image retrieval combining and partially fine-tuning CLIP-based features. CVPRW 2022, 4955–4964. DOI: 10.1109/CVPRW56347.2022.00543. Online publication date: Jun-2022.
  • (2022) Effective conditioned and composed image retrieval combining CLIP-based features. CVPR 2022, 21434–21442. DOI: 10.1109/CVPR52688.2022.02080. Online publication date: Jun-2022.
