Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval

Published: 16 February 2022

Abstract

Cross-modal retrieval between texts and videos has received consistent research interest in the multimedia community. Existing studies typically learn a joint embedding space in which the distance between text and video representations is measured. In common practice, the video representation is constructed by feeding clips into 3D convolutional neural networks to extract coarse-grained global visual features. In addition, several studies have attempted to align local objects in the video with the text. However, these representations share a common drawback: they neglect the rich fine-grained relation features that capture spatial-temporal object interactions, which benefit the mapping of textual entities in real-world retrieval systems. To tackle this problem, we propose the adversarial multi-grained embedding network (AME-Net), a novel cross-modal retrieval framework that adopts both fine-grained local relation features and coarse-grained global features to bridge the text and video modalities. On top of this visual representation, we further integrate an adversarial learning strategy into AME-Net to narrow the domain gap between text and video representations. In summary, we contribute AME-Net with an adversarial learning strategy for learning a better joint embedding space, and experimental results on the MSR-VTT and YouCook2 datasets demonstrate that our framework consistently outperforms state-of-the-art methods.
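The article's implementation is not reproduced on this page, but the mechanics the abstract describes can be sketched concretely: a video encoder that fuses a coarse-grained global clip feature with a fine-grained object-relation feature into one joint-space embedding, trained with a ranking loss plus an adversarial modality discriminator. The PyTorch sketch below is a minimal illustration of those ideas under our own assumptions; every module name, dimension, and hyperparameter here (MultiGrainedVideoEncoder, embed_dim=512, margin=0.2, the gradient-reversal trick) is a hypothetical choice for illustration, not the authors' actual architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; negates and scales gradients in the
    # backward pass (the standard gradient-reversal trick for adversarial
    # feature alignment).
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class MultiGrainedVideoEncoder(nn.Module):
    # Hypothetical fusion module: combines a coarse-grained global clip
    # feature (e.g., from a 3D CNN) with a fine-grained object-relation
    # feature into a single joint-space embedding. Dimensions are assumed.
    def __init__(self, global_dim=2048, relation_dim=1024, embed_dim=512):
        super().__init__()
        self.global_fc = nn.Linear(global_dim, embed_dim)
        self.relation_fc = nn.Linear(relation_dim, embed_dim)
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, global_feat, relation_feat):
        g = F.relu(self.global_fc(global_feat))
        r = F.relu(self.relation_fc(relation_feat))
        v = self.fuse(torch.cat([g, r], dim=-1))
        return F.normalize(v, dim=-1)  # unit norm, so dot products are cosines

class ModalityDiscriminator(nn.Module):
    # Classifies whether an embedding came from the text or video branch;
    # the gradient-reversal layer trains both encoders to fool it.
    def __init__(self, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 2))

    def forward(self, z, lambd=1.0):
        return self.net(GradReverse.apply(z, lambd))

def triplet_ranking_loss(text_emb, video_emb, margin=0.2):
    # Hinge-based bidirectional ranking loss over in-batch negatives
    # (VSE-style); matched text-video pairs lie on the diagonal.
    sim = text_emb @ video_emb.t()
    pos = sim.diag().unsqueeze(1)
    cost_t2v = (margin + sim - pos).clamp(min=0)
    cost_v2t = (margin + sim.t() - pos).clamp(min=0)
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return (cost_t2v.masked_fill(mask, 0).sum()
            + cost_v2t.masked_fill(mask, 0).sum())

During training, the ranking loss pulls matched text-video pairs together in the joint space, while the discriminator's classification loss, propagated through the reversed gradients, pushes the two modalities toward statistically indistinguishable distributions; this is one standard way to realize the narrowing of the domain gap that the abstract describes.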

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 18, Issue 2
    May 2022, 494 pages
    ISSN: 1551-6857
    EISSN: 1551-6865
    DOI: 10.1145/3505207

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 16 February 2022
    Accepted: 01 August 2021
    Revised: 01 June 2021
    Received: 01 February 2021
    Published in TOMM Volume 18, Issue 2

    Author Tags

    1. Multi-grained fusion
    2. spatial-temporal object relationships
    3. text-video retrieval

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Key R&D Program of China
    • National Natural Science Foundation of China
    • Shanghai Pujiang Program
