DOI: 10.1145/3664647.3681577
MM '24 Conference Proceedings · Research article

Partially Aligned Cross-modal Retrieval via Optimal Transport-based Prototype Alignment Learning

Published: 28 October 2024

Abstract

Supervised cross-modal retrieval (CMR) achieves excellent performance thanks to the semantic information provided by labels, which helps establish semantic correlations between samples from different modalities. In real-world scenarios, however, there often exists a large amount of unlabeled and unpaired multimodal training data, rendering existing methods infeasible. To address this issue, we propose a novel partially aligned cross-modal retrieval method called Optimal Transport-based Prototype Alignment Learning (OTPAL). Because directly establishing matching correlations between unannotated, unaligned cross-modal samples is computationally expensive, we instead establish matching correlations between samples and shared prototypes. Specifically, we employ the optimal transport algorithm to obtain cross-modal alignment information between samples and prototypes, and then minimize the distance between samples and their corresponding prototypes through a specially designed prototype alignment loss. As an extension, we also extensively investigate the influence of incomplete multimodal data on cross-modal retrieval performance under the partially aligned setting described above. To address this more challenging scenario, we propose a scalable prototype-based neighbor feature completion method, which better captures the correlations between incomplete samples and their neighbor samples through a cross-modal self-attention mechanism. Experimental results on four benchmark datasets show that our method obtains satisfactory accuracy and scalability in various real-world scenarios.
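The abstract's core idea of assigning samples to shared prototypes via optimal transport, then penalizing the sample-to-prototype distance, can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the uniform marginals, the cosine-distance cost, and the hyperparameters (`eps`, `n_iters`) are assumptions for the sketch.

```python
import numpy as np

def sinkhorn(cost, n_iters=50, eps=0.05):
    """Entropy-regularized optimal transport between uniform marginals.

    cost: (n_samples, n_prototypes) cost matrix.
    Returns the transport plan; each row gives a sample's soft
    assignment mass over the prototypes.
    """
    n, k = cost.shape
    K = np.exp(-cost / eps)            # Gibbs kernel
    r = np.ones(n) / n                 # uniform marginal over samples
    c = np.ones(k) / k                 # uniform marginal over prototypes
    v = np.ones(k) / k
    for _ in range(n_iters):
        u = r / (K @ v)                # enforce row (sample) marginals
        v = c / (K.T @ u)              # enforce column (prototype) marginals
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4))        # toy modality features
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
protos = rng.normal(size=(3, 4))       # toy shared prototypes
protos /= np.linalg.norm(protos, axis=1, keepdims=True)

cost = 1.0 - feats @ protos.T          # cosine distance as transport cost
plan = sinkhorn(cost)
assign = plan.argmax(axis=1)           # hard prototype assignment per sample
# prototype alignment loss: pull each sample toward its assigned prototype
loss = np.mean(np.sum((feats - protos[assign]) ** 2, axis=1))
```

In a training loop, `loss` would be minimized with respect to the feature encoders (and possibly the prototypes), while the Sinkhorn assignment is recomputed periodically without gradients.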



    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. optimal transport strategy
    2. partially aligned data
    3. prototype alignment learning
    4. robust cross-modal retrieval


    Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne, VIC, Australia

    Acceptance Rates

MM '24 paper acceptance rate: 1,150 of 4,385 submissions (26%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)
