DOI: 10.1145/3581807.3581857

Semantic Maximum Relevance and Modal Alignment for Cross-Modal Retrieval

Published: 22 May 2023

Abstract

With the increasing abundance of multimedia data, research on mining the relationships between modalities to achieve fine-grained cross-modal retrieval has been emerging. In this paper, we propose a novel Semantic Maximum Relevance and Modal Alignment (SMR-MA) method for cross-modal retrieval. It uses a pre-trained model rich in image-text knowledge to extract features for each image and text, and further promotes information interaction between modalities within the same semantic category through a modal alignment module and a multi-layer perceptron with shared weights. In addition, the multi-modal embeddings are distributed on a normalized hypersphere, and an angular margin penalty is applied between the feature embeddings and the class weights in angular space to maximize the classification boundary, increasing intra-class similarity and inter-class difference simultaneously. Comprehensive experiments on three benchmark datasets demonstrate that the proposed method performs significantly better than state-of-the-art cross-modal retrieval methods.
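
To make the mechanism concrete, the sketch below shows one plausible reading of the two components the abstract describes: a single MLP with shared weights projecting pre-extracted image and text features into a common space, and an additive angular margin (ArcFace-style) applied between the normalized embeddings and class weights. This is an illustrative PyTorch sketch, not the authors' implementation; the layer sizes, scale s, and margin m are assumptions.

    # Illustrative sketch only; dimensions and hyperparameters (s, m) are assumed.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedMLP(nn.Module):
        """One MLP whose weights are shared by the image and text branches."""
        def __init__(self, in_dim=512, out_dim=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim)
            )

        def forward(self, x):
            return self.net(x)

    class AngularMarginLoss(nn.Module):
        """Additive angular margin between unit embeddings and unit class weights."""
        def __init__(self, dim=256, num_classes=10, s=30.0, m=0.5):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(num_classes, dim))
            self.s, self.m = s, m

        def forward(self, emb, labels):
            # Cosine of the angle between each embedding and each class weight.
            cos = F.normalize(emb) @ F.normalize(self.weight).t()
            theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
            # Add the margin m only at the ground-truth class, then rescale by s.
            one_hot = F.one_hot(labels, cos.size(1)).float()
            logits = self.s * torch.cos(theta + self.m * one_hot)
            return F.cross_entropy(logits, labels)

    # Both modalities pass through the same MLP, and both are penalized with the
    # angular margin loss, pulling same-category embeddings together on the
    # hypersphere while pushing different categories apart.
    mlp, criterion = SharedMLP(), AngularMarginLoss()
    img_feats = torch.randn(8, 512)  # e.g. features from a pre-trained encoder
    txt_feats = torch.randn(8, 512)
    labels = torch.randint(0, 10, (8,))
    loss = criterion(mlp(img_feats), labels) + criterion(mlp(txt_feats), labels)

Under this formulation, minimizing the loss shrinks the angle to the target class weight for both modalities while leaving non-target angles unpenalized, which matches the geometric effect the abstract claims: higher intra-class similarity and larger inter-class difference.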



    Published In

    ICCPR '22: Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition
    November 2022
    683 pages
    ISBN:9781450397056
    DOI:10.1145/3581807

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. CLIP
    2. Cross-Modal Retrieval
    3. Modal Alignment
    4. Semantic Relevance

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Development Foundation of the 54th Research Institute of China Electronics Technology Group Corporation
    • Natural Science Foundation of Guangxi
    • National Natural Science Foundation of China
    • Guilin Science and Technology Development Program

    Conference

    ICCPR 2022

