DOI: 10.1145/3581783.3612427
Research Article

Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval

Published: 27 October 2023

Abstract

Most existing cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts, e.g., CNNs for images and RNNs/Transformers for texts. This architectural discrepancy may induce different semantic distribution spaces, limit the interactions between images and texts, and ultimately lead to inferior image-text alignment. To fill this gap, inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures of both modalities with Transformers. Specifically, we design a cross-modal retrieval framework built purely on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module. With identical architectures, the two encoders produce representations with more similar characteristics for images and texts, making interaction and alignment between them much easier. In addition, to exploit rich semantics, we devise a hierarchical alignment scheme that explores multi-level correspondences between different layers of the image and text encoders. To evaluate the effectiveness of the proposed HAT, we conduct extensive experiments on two benchmark datasets, MSCOCO and Flickr30K. Experimental results demonstrate that HAT outperforms state-of-the-art baselines by a large margin. Specifically, on the two key tasks of image-to-text and text-to-image retrieval, HAT achieves relative Recall@1 improvements of 7.6% and 16.7% on MSCOCO, and 4.4% and 11.6% on Flickr30K, respectively. The code is available at https://github.com/LuminosityX/HAT.
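To make the described design concrete, below is a minimal, hypothetical PyTorch sketch of a two-stream Transformer setup with hierarchical alignment: both modalities use the same encoder architecture, and image-text similarity is aggregated across corresponding layers rather than computed only at the final layer. Layer counts, dimensions, mean-pooling, and the simple averaging of per-layer cosine similarities are illustrative assumptions and do not reproduce the authors' released implementation (see the GitHub repository above for that).

import torch
import torch.nn as nn
import torch.nn.functional as F

class StreamEncoder(nn.Module):
    # One Transformer stream; the same class is instantiated for both modalities,
    # which is the point of the unified two-stream design.
    def __init__(self, dim=256, depth=4, heads=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(depth)
        ])

    def forward(self, tokens):
        # Keep the output of every layer so alignment can be computed hierarchically.
        per_layer = []
        x = tokens
        for layer in self.layers:
            x = layer(x)
            per_layer.append(x.mean(dim=1))   # mean-pool tokens into one global vector per layer
        return per_layer                      # list of (batch, dim) tensors

class HierarchicalAlignment(nn.Module):
    # Average cosine-similarity matrices computed at corresponding layers of the two streams
    # (a simplified stand-in for the paper's hierarchical alignment module).
    def forward(self, img_feats, txt_feats):
        sims = []
        for v, t in zip(img_feats, txt_feats):
            v = F.normalize(v, dim=-1)
            t = F.normalize(t, dim=-1)
            sims.append(v @ t.t())            # (num_images, num_texts) similarities at this layer
        return torch.stack(sims).mean(dim=0)  # fuse multi-level correspondences

if __name__ == "__main__":
    img_tokens = torch.randn(8, 49, 256)      # e.g., 7x7 patch embeddings per image
    txt_tokens = torch.randn(8, 20, 256)      # e.g., 20 word embeddings per caption
    image_encoder, text_encoder = StreamEncoder(), StreamEncoder()
    similarity = HierarchicalAlignment()(image_encoder(img_tokens), text_encoder(txt_tokens))
    print(similarity.shape)                   # torch.Size([8, 8]) image-to-text similarity matrix

In this sketch, identical encoder classes for the two streams mirror the paper's claim that matching architectures yield representations with more similar characteristics, and returning every layer's output is what makes multi-level alignment possible.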

Supplemental Material

MP4 File
Presentation video: the paper is explained in detail in four parts: research background, research motivation, the proposed method, and final results.




Information

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cross-modal retrieval
  2. discrepancy in architectures
  3. hierarchical alignment transformers

Qualifiers

  • Research-article

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions (25%)

Article Metrics

  • Downloads (last 12 months): 230
  • Downloads (last 6 weeks): 21

Reflects downloads up to 10 Dec 2024

Cited By

  • (2024) Leveraging Weak Cross-Modal Guidance for Coherence Modelling via Iterative Learning. Proceedings of the 32nd ACM International Conference on Multimedia, 10.1145/3664647.3681677 (4630-4639). Online publication date: 28-Oct-2024.
  • (2024) MM-Forecast: A Multimodal Approach to Temporal Event Forecasting with Large Language Models. Proceedings of the 32nd ACM International Conference on Multimedia, 10.1145/3664647.3681593 (2776-2785). Online publication date: 28-Oct-2024.
  • (2024) MagicVFX: Visual Effects Synthesis in Just Minutes. Proceedings of the 32nd ACM International Conference on Multimedia, 10.1145/3664647.3681516 (8238-8246). Online publication date: 28-Oct-2024.
  • (2024) Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia, 10.1145/3664647.3680886 (8296-8305). Online publication date: 28-Oct-2024.
  • (2024) MPT: Multi-grained Prompt Tuning for Text-Video Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia, 10.1145/3664647.3680839 (1206-1214). Online publication date: 28-Oct-2024.
  • (2024) Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia, 10.1145/3664647.3680731 (5260-5269). Online publication date: 28-Oct-2024.
  • (2024) Event Traffic Forecasting with Sparse Multimodal Data. Proceedings of the 32nd ACM International Conference on Multimedia, 10.1145/3664647.3680706 (8855-8864). Online publication date: 28-Oct-2024.
  • (2024) Align and Retrieve: Composition and Decomposition Learning in Image Retrieval With Text Feedback. IEEE Transactions on Multimedia, 10.1109/TMM.2024.3417694, Vol. 26 (9936-9948). Online publication date: 1-Jan-2024.
  • (2024) Text-to-Image Vehicle Re-Identification: Multi-Scale Multi-View Cross-Modal Alignment Network and a Unified Benchmark. IEEE Transactions on Intelligent Transportation Systems, 10.1109/TITS.2023.3348599, Vol. 25, 7 (7673-7686). Online publication date: 16-Jan-2024.
  • (2024) Ump: Unified Modality-Aware Prompt Tuning for Text-Video Retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 10.1109/TCSVT.2024.3429192, Vol. 34, 11 (11954-11964). Online publication date: Nov-2024.
