research-article

Comprehensive Linguistic-Visual Composition Network for Image Retrieval

Authors:

Liqiang Nie

SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 1369 - 1378

https://doi.org/10.1145/3404835.3462967

Published: 11 July 2021

Abstract

Composing text and image for image retrieval (CTI-IR) is a new yet challenging task in which the input query is not a conventional image or text but a composition, i.e., a reference image and its corresponding modification text. The key to CTI-IR lies in how to properly compose the multi-modal query to retrieve the target image. Pioneering studies mainly focus on composing the text with either the local visual descriptors or the global feature of the reference image. However, they overlook the fact that text modifications are diverse, ranging from concrete attribute changes, like "change it to long sleeves", to abstract visual property adjustments, e.g., "change the style to professional". Thus, emphasizing only the local or only the global feature of the reference image for query composition is insufficient. In light of this analysis, we propose a Comprehensive Linguistic-Visual Composition Network (CLVC-Net) for image retrieval. The core of CLVC-Net is its two composition modules, a fine-grained local-wise composition module and a fine-grained global-wise composition module, which together target comprehensive multi-modal composition. Additionally, a mutual enhancement module is designed to promote the local-wise and global-wise composition processes by forcing them to share knowledge with each other. Extensive experiments conducted on three real-world datasets demonstrate the superiority of our CLVC-Net. We have released the code to benefit other researchers.
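The abstract does not say how the mutual enhancement module makes the two composition branches share knowledge. One common approach, borrowed from deep mutual learning, is to have each branch score the same candidate images and then penalize disagreement between the two resulting ranking distributions with a symmetric KL divergence. The Python sketch below illustrates that idea only, under that assumption. The function names, the temperature parameter, and the example scores are all hypothetical and are not taken from the released CLVC-Net code.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw similarity scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mutual_enhancement_loss(local_scores, global_scores, temperature=1.0):
    """Symmetric KL between the two branches' ranking distributions.

    Assumption: each branch has already scored the same list of
    candidate target images for one composed (image + text) query.
    """
    p_local = softmax(local_scores, temperature)
    p_global = softmax(global_scores, temperature)
    return kl_divergence(p_local, p_global) + kl_divergence(p_global, p_local)


# Hypothetical similarity scores of one composed query against four candidates.
local_scores = [2.0, 0.5, 0.1, -1.0]
global_scores = [1.5, 1.0, 0.0, -0.5]
loss = mutual_enhancement_loss(local_scores, global_scores)
```

In a training loop, a term like this would typically be added to each branch's own ranking loss, so that the local-wise and global-wise branches are pulled toward agreeing on which candidates are plausible targets.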

Supplementary Material

MP4 File (SIGIR21_fp1020.mp4)

Presentation video


Index Terms

Comprehensive Linguistic-Visual Composition Network for Image Retrieval
1. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval
        Image search

Information & Contributors

Information

Published In

SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 2021

2998 pages

ISBN:9781450380379

DOI:10.1145/3404835

General Chairs: Fernando Diaz (Google), Chirag Shah (University of Washington), Torsten Suel (New York University)

Program Chairs: Pablo Castells (Universidad Autónoma de Madrid, Amazon), Rosie Jones (Spotify), Tetsuya Sakai (Waseda University)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

the National Natural Science Foundation of China
the Key R&D Program of Shandong (Major scientific and technological innovation projects)
new AI project towards the integration of education and industry in QLUT

Conference

SIGIR '21

Sponsor:

SIGIR

SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 11 - 15, 2021

Virtual Event, Canada

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
