
Cascade Cross-modal Attention Network for Video Actor and Action Segmentation from a Sentence

Published: 17 October 2021 | DOI: 10.1145/3474085.3475534

Abstract

In this paper, we address the problem of selectively segmenting an actor and its action in a video clip given a sentence description. The main challenge is to match the local semantic features of the video with heterogeneous textual features. A common language-processing choice in previous works is a bi-LSTM with self-attention, which fixes the attention over the sentence and ignores the characteristics of the specific video, so the sentence attention can end up mismatched with the video's most discriminative features. The algorithm proposed in this paper instead lets the sentence learn the most discriminative features of the video, remarkably improving the accuracy of matching and segmentation. Specifically, we propose a cascade cross-modal attention that leverages visual features from two perspectives to attend to the language from coarse to fine, generating discriminative vision-aware language features. Moreover, equipping our framework with contrastive learning and a designed hard negative mining strategy helps the proposed network identify the positive sample among a large number of negatives, further improving performance. To demonstrate the effectiveness of our approach, we conduct experiments on two datasets, A2D Sentences and J-HMDB Sentences. Experimental results show that our method significantly outperforms recent state-of-the-art methods.
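To make the coarse-to-fine idea concrete, the sketch below shows one plausible PyTorch rendering of a two-stage cascade in which visual features from two perspectives attend to the word features. It is a minimal illustration of the mechanism the abstract describes, not the authors' implementation: the module names, layer sizes, and exact cascade wiring are all assumptions.

# Minimal PyTorch sketch of vision-attends-language cross-modal attention,
# applied in two cascaded stages (coarse, then fine). Illustrative only;
# dimensions and wiring are assumptions, not the paper's code.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """One stage: a visual query re-weights the per-word language features."""

    def __init__(self, vis_dim: int, lang_dim: int, hid_dim: int = 256):
        super().__init__()
        self.q = nn.Linear(vis_dim, hid_dim)    # project visual query
        self.k = nn.Linear(lang_dim, hid_dim)   # project word keys
        self.v = nn.Linear(lang_dim, lang_dim)  # project word values

    def forward(self, vis_feat, word_feats):
        # vis_feat:   (B, vis_dim)      pooled visual feature (one perspective)
        # word_feats: (B, T, lang_dim)  per-word language features
        q = self.q(vis_feat).unsqueeze(1)                  # (B, 1, H)
        k = self.k(word_feats)                             # (B, T, H)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        return (attn @ self.v(word_feats)).squeeze(1)      # (B, lang_dim)


class CascadeAttention(nn.Module):
    """Coarse-to-fine cascade: two visual perspectives attend the sentence."""

    def __init__(self, vis_dim: int = 2048, lang_dim: int = 300):
        super().__init__()
        self.coarse = CrossModalAttention(vis_dim, lang_dim)
        self.fine = CrossModalAttention(vis_dim + lang_dim, lang_dim)

    def forward(self, global_vis, local_vis, word_feats):
        # Stage 1: a coarse visual view attends the sentence.
        coarse_lang = self.coarse(global_vis, word_feats)
        # Stage 2: a finer visual view, conditioned on the coarse result,
        # re-attends the sentence to sharpen the language features.
        fine_query = torch.cat([local_vis, coarse_lang], dim=-1)
        return self.fine(fine_query, word_feats)  # vision-aware language feature

Here global_vis could be, for example, a pooled clip-level feature and local_vis a frame- or region-level feature; the paper's actual choice of the two visual perspectives is not specified on this page.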
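Similarly, the following is a minimal sketch of a contrastive objective with hard negative mining, assuming an in-batch InfoNCE-style loss in which each sample keeps only its top-k hardest negatives. The paper's exact mining strategy is not reproduced on this page, so the mining rule and hyperparameters here are illustrative.

# Sketch of contrastive learning with in-batch hard negative mining for
# matching vision-aware language features to visual features. The top-k
# mining rule and the temperature are illustrative assumptions.
import torch
import torch.nn.functional as F


def contrastive_loss(vis, lang, temperature=0.07, top_k_neg=8):
    # vis, lang: (B, D) embeddings; pair i is the positive match.
    vis = F.normalize(vis, dim=-1)
    lang = F.normalize(lang, dim=-1)
    sim = vis @ lang.t() / temperature              # (B, B) similarity matrix
    pos = sim.diagonal()                            # positive-pair scores

    # Hard negative mining: mask out the positives on the diagonal, then
    # keep only the top-k most similar non-matching pairs per sample
    # instead of all B-1 negatives.
    diag = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    negs = sim.masked_fill(diag, float('-inf'))
    hard_negs, _ = negs.topk(min(top_k_neg, sim.size(0) - 1), dim=-1)

    # InfoNCE over the positive (index 0) and its hardest negatives.
    logits = torch.cat([pos.unsqueeze(1), hard_negs], dim=1)
    targets = torch.zeros(sim.size(0), dtype=torch.long, device=sim.device)
    return F.cross_entropy(logits, targets)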




Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cascade structure
  2. contrastive learning
  3. multi-modal attention
  4. vision and language

Qualifiers

  • Research-article

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions, 25%


Article Metrics

  • Downloads (last 12 months): 22
  • Downloads (last 6 weeks): 3
Reflects downloads up to 15 Jan 2025.

Cited By
  • (2024) Dual-path Collaborative Generation Network for Emotional Video Captioning. Proceedings of the 32nd ACM International Conference on Multimedia, 496-505. DOI: 10.1145/3664647.3681603
  • (2024) Improving radiology report generation with multi-grained abnormality prediction. Neurocomputing 600, 128122. DOI: 10.1016/j.neucom.2024.128122
  • (2024) Fully Transformer-Equipped Architecture for end-to-end Referring Video Object Segmentation. Information Processing & Management 61(1), 103566. DOI: 10.1016/j.ipm.2023.103566
  • (2023) Temporal Dynamic Concept Modeling Network for Explainable Video Event Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 19(6), 1-22. DOI: 10.1145/3568312
  • (2023) Weakly Supervised Text-based Actor-Action Video Segmentation by Clip-level Multi-instance Learning. ACM Transactions on Multimedia Computing, Communications, and Applications 19(1), 1-22. DOI: 10.1145/3514250
  • (2022) Multi-Attention Network for Compressed Video Referring Object Segmentation. Proceedings of the 30th ACM International Conference on Multimedia, 4416-4425. DOI: 10.1145/3503161.3547761
  • (2022) Learning Linguistic Association Towards Efficient Text-Video Retrieval. Computer Vision - ECCV 2022, 254-270. DOI: 10.1007/978-3-031-20059-5_15
