DOI: 10.1145/3664647.3681577
MM '24 Conference Proceedings · Research article

Partially Aligned Cross-modal Retrieval via Optimal Transport-based Prototype Alignment Learning

Published: 28 October 2024

Abstract

Supervised cross-modal retrieval (CMR) achieves excellent performance thanks to the semantic information provided by labels, which helps establish semantic correlations between samples from different modalities. In real-world scenarios, however, there often exists a large amount of unlabeled and unpaired multimodal training data, rendering existing methods infeasible. To address this issue, we propose a novel partially aligned cross-modal retrieval method called Optimal Transport-based Prototype Alignment Learning (OTPAL). Because directly establishing matching correlations between unannotated, unaligned cross-modal samples is computationally expensive, we instead establish matching correlations between samples and shared prototypes. Specifically, we employ the optimal transport algorithm to obtain cross-modal alignment information between samples and prototypes, and then minimize the distance between samples and their corresponding prototypes through a specially designed prototype alignment loss. As an extension, we also extensively investigate the influence of incomplete multimodal data on cross-modal retrieval performance under the partially aligned setting described above. To address this more challenging scenario, we propose a scalable prototype-based neighbor feature completion method, which better captures the correlations between incomplete samples and their neighbor samples through a cross-modal self-attention mechanism. Experimental results on four benchmark datasets show that our method obtains satisfactory accuracy and scalability in various real-world scenarios.
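The abstract's core idea of assigning samples to shared prototypes via optimal transport, then penalizing the sample-to-prototype distance, can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the uniform marginals, the cosine-distance cost, and the hyperparameters (`eps`, `n_iters`) are assumptions for the sketch.

```python
import numpy as np

def sinkhorn(cost, n_iters=50, eps=0.05):
    """Entropy-regularized optimal transport between uniform marginals.

    cost: (n_samples, n_prototypes) cost matrix.
    Returns the transport plan; each row gives a sample's soft
    assignment mass over the prototypes.
    """
    n, k = cost.shape
    K = np.exp(-cost / eps)            # Gibbs kernel
    r = np.ones(n) / n                 # uniform marginal over samples
    c = np.ones(k) / k                 # uniform marginal over prototypes
    v = np.ones(k) / k
    for _ in range(n_iters):
        u = r / (K @ v)                # enforce row (sample) marginals
        v = c / (K.T @ u)              # enforce column (prototype) marginals
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4))        # toy modality features
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
protos = rng.normal(size=(3, 4))       # toy shared prototypes
protos /= np.linalg.norm(protos, axis=1, keepdims=True)

cost = 1.0 - feats @ protos.T          # cosine distance as transport cost
plan = sinkhorn(cost)
assign = plan.argmax(axis=1)           # hard prototype assignment per sample
# prototype alignment loss: pull each sample toward its assigned prototype
loss = np.mean(np.sum((feats - protos[assign]) ** 2, axis=1))
```

In a training loop, `loss` would be minimized with respect to the feature encoders (and possibly the prototypes), while the Sinkhorn assignment is recomputed periodically without gradients.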



    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. optimal transport strategy
    2. partially aligned data
    3. prototype alignment learning
    4. robust cross-modal retrieval


    Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne, VIC, Australia

    Acceptance Rates

MM '24 paper acceptance rate: 1,150 of 4,385 submissions (26%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)
