Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval

Published: 16 February 2022

Abstract

Cross-modal retrieval between texts and videos has received consistent research interest in the multimedia community. Existing studies typically learn a joint embedding space in which the distance between text and video representations is measured. In common practice, the video representation is constructed by feeding clips into 3D convolutional neural networks to extract coarse-grained global visual features. In addition, several studies have attempted to align local objects in the video with the text. However, these representations share a common drawback: they neglect the rich fine-grained relation features that capture spatial-temporal object interactions, which benefit the mapping of textual entities in real-world retrieval systems. To tackle this problem, we propose the adversarial multi-grained embedding network (AME-Net), a novel cross-modal retrieval framework that adopts both fine-grained local relation features and coarse-grained global features to bridge the text and video modalities. On top of this visual representation, we further integrate an adversarial learning strategy into AME-Net to narrow the domain gap between text and video representations. In summary, we contribute AME-Net with an adversarial learning strategy for learning a better joint embedding space, and experimental results on the MSR-VTT and YouCook2 datasets demonstrate that our framework consistently outperforms state-of-the-art methods.
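The article's implementation is not reproduced on this page, but the mechanics the abstract describes can be sketched concretely: a video encoder that fuses a coarse-grained global clip feature with a fine-grained object-relation feature into one joint-space embedding, trained with a ranking loss plus an adversarial modality discriminator. The PyTorch sketch below is a minimal illustration of those ideas under our own assumptions; every module name, dimension, and hyperparameter here (MultiGrainedVideoEncoder, embed_dim=512, margin=0.2, the gradient-reversal trick) is a hypothetical choice for illustration, not the authors' actual architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; negates and scales gradients in the
    # backward pass (the standard gradient-reversal trick for adversarial
    # feature alignment).
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class MultiGrainedVideoEncoder(nn.Module):
    # Hypothetical fusion module: combines a coarse-grained global clip
    # feature (e.g., from a 3D CNN) with a fine-grained object-relation
    # feature into a single joint-space embedding. Dimensions are assumed.
    def __init__(self, global_dim=2048, relation_dim=1024, embed_dim=512):
        super().__init__()
        self.global_fc = nn.Linear(global_dim, embed_dim)
        self.relation_fc = nn.Linear(relation_dim, embed_dim)
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, global_feat, relation_feat):
        g = F.relu(self.global_fc(global_feat))
        r = F.relu(self.relation_fc(relation_feat))
        v = self.fuse(torch.cat([g, r], dim=-1))
        return F.normalize(v, dim=-1)  # unit norm, so dot products are cosines

class ModalityDiscriminator(nn.Module):
    # Classifies whether an embedding came from the text or video branch;
    # the gradient-reversal layer trains both encoders to fool it.
    def __init__(self, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 2))

    def forward(self, z, lambd=1.0):
        return self.net(GradReverse.apply(z, lambd))

def triplet_ranking_loss(text_emb, video_emb, margin=0.2):
    # Hinge-based bidirectional ranking loss over in-batch negatives
    # (VSE-style); matched text-video pairs lie on the diagonal.
    sim = text_emb @ video_emb.t()
    pos = sim.diag().unsqueeze(1)
    cost_t2v = (margin + sim - pos).clamp(min=0)
    cost_v2t = (margin + sim.t() - pos).clamp(min=0)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return (cost_t2v.masked_fill(mask, 0).sum()
            + cost_v2t.masked_fill(mask, 0).sum())

During training, the ranking loss pulls matched text-video pairs together in the joint space, while the discriminator's classification loss, propagated through the reversed gradients, pushes the two modalities toward statistically indistinguishable distributions; this is one standard way to realize the narrowing of the domain gap that the abstract describes.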

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 2
    May 2022, 494 pages
    ISSN: 1551-6857
    EISSN: 1551-6865
    DOI: 10.1145/3505207

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 16 February 2022
    Accepted: 01 August 2021
    Revised: 01 June 2021
    Received: 01 February 2021
    Published in TOMM Volume 18, Issue 2

    Author Tags

    1. Multi-grained fusion
    2. spatial-temporal object relationships
    3. text-video retrieval

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Key R&D Program of China
    • National Natural Science Foundation of China
    • Shanghai Pujiang Program
