DOI: 10.1145/3404835.3462967
research-article

Comprehensive Linguistic-Visual Composition Network for Image Retrieval

Published: 11 July 2021

Abstract

Composing text and image for image retrieval (CTI-IR) is a new yet challenging task in which the input query is not a conventional image or text but a composition, i.e., a reference image and its corresponding modification text. The key to CTI-IR lies in how to properly compose the multi-modal query to retrieve the target image. Pioneering studies mainly focus on composing the text with either the local visual descriptor or the global feature of the reference image. However, they overlook the fact that text modifications are diverse, ranging from concrete attribute changes, like "change it to long sleeves", to abstract visual property adjustments, e.g., "change the style to professional". Thus, simply emphasizing either the local or the global feature of the reference image for query composition is insufficient. In light of the above analysis, we propose a Comprehensive Linguistic-Visual Composition Network (CLVC-Net) for image retrieval. The core of CLVC-Net is its two composition modules, a fine-grained local-wise composition module and a fine-grained global-wise composition module, which together target comprehensive multi-modal composition. Additionally, a mutual enhancement module is designed to promote the local-wise and global-wise composition processes by forcing them to share knowledge with each other. Extensive experiments conducted on three real-world datasets demonstrate the superiority of CLVC-Net. We have released the code to benefit other researchers.
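
The abstract does not say how the two composition branches share knowledge, so the following is only a minimal sketch of one common mutual-learning formulation, not the authors' released implementation. It assumes each branch yields a composed-query embedding, scores a shared set of candidate target embeddings by dot product, and is penalized by a symmetric KL divergence between the two resulting ranking distributions. All names here (`mutual_enhancement_loss`, `local_query`, `global_query`, `candidates`, `temperature`) are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn raw similarity scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mutual_enhancement_loss(local_query, global_query, candidates, temperature=1.0):
    """Symmetric KL between the two branches' ranking distributions.

    local_query / global_query: hypothetical composed-query embeddings from
    the local-wise and global-wise branches (an assumption of this sketch).
    candidates: embeddings of candidate target images.
    """
    p_local = softmax([dot(local_query, c) for c in candidates], temperature)
    p_global = softmax([dot(global_query, c) for c in candidates], temperature)
    return kl_divergence(p_local, p_global) + kl_divergence(p_global, p_local)


if __name__ == "__main__":
    candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
    agree = mutual_enhancement_loss([1.0, 0.2], [1.0, 0.2], candidates)
    differ = mutual_enhancement_loss([1.0, 0.2], [0.2, 1.0], candidates)
    print(f"identical branches: {agree:.4f}")
    print(f"disagreeing branches: {differ:.4f}")
```

In this sketch the loss is zero when both branches rank the candidates identically and grows as their rankings diverge, which is one way to read "forcing them to share knowledge with each other". In training it would be added to each branch's own retrieval loss rather than used alone.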

Supplementary Material

MP4 File (SIGIR21_fp1020.mp4)
Presentation video


    Information & Contributors

    Information

    Published In

    SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2021
    2998 pages
    ISBN:9781450380379
    DOI:10.1145/3404835
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. image retrieval
    2. linguistic-visual composition
    3. mutual learning

    Qualifiers

    • Research-article

    Funding Sources

    • the National Natural Science Foundation of China
    • the Key R&D Program of Shandong (Major scientific and technological innovation projects)
    • new AI project towards the integration of education and industry in QLUT

    Conference

    SIGIR '21
    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 107
    • Downloads (Last 6 weeks): 7
    Reflects downloads up to 12 Dec 2024

    Citations

    Cited By

    • (2024) Cross-Modal Attention Preservation with Self-Contrastive Learning for Composed Query-Based Image Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 20:6 (1-22). DOI: 10.1145/3639469. Online publication date: 8-Mar-2024
    • (2024) Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (240-250). DOI: 10.1145/3626772.3657831. Online publication date: 10-Jul-2024
    • (2024) Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (229-239). DOI: 10.1145/3626772.3657727. Online publication date: 10-Jul-2024
    • (2024) Self-Training Boosted Multi-Factor Matching Network for Composed Image Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:5 (3665-3678). DOI: 10.1109/TPAMI.2023.3346434. Online publication date: May-2024
    • (2024) Enhance Composed Image Retrieval via Multi-Level Collaborative Localization and Semantic Activeness Perception. IEEE Transactions on Multimedia 26 (916-928). DOI: 10.1109/TMM.2023.3273466. Online publication date: 1-Jan-2024
    • (2024) Multimodal Composition Example Mining for Composed Query Image Retrieval. IEEE Transactions on Image Processing 33 (1149-1161). DOI: 10.1109/TIP.2024.3359062. Online publication date: 1-Feb-2024
    • (2024) Set of Diverse Queries With Uncertainty Regularization for Composed Image Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 34:10 (10494-10506). DOI: 10.1109/TCSVT.2024.3401006. Online publication date: Oct-2024
    • (2024) Multi-Grained Attention Network With Mutual Exclusion for Composed Query-Based Image Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 34:4 (2959-2972). DOI: 10.1109/TCSVT.2023.3306738. Online publication date: Apr-2024
    • (2024) MUST: An Effective and Scalable Framework for Multimodal Search of Target Modality. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (4747-4759). DOI: 10.1109/ICDE60146.2024.00361. Online publication date: 13-May-2024
    • (2024) Image Retrieval with Composed Query by Multi-Scale Multi-Modal Fusion. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (5950-5954). DOI: 10.1109/ICASSP48485.2024.10446291. Online publication date: 14-Apr-2024
