
Cascade Cross-modal Attention Network for Video Actor and Action Segmentation from a Sentence

Published: 17 October 2021 | DOI: 10.1145/3474085.3475534

Abstract

In this paper, we address the problem of selectively segmenting an actor and its action in a video clip given a sentence description. The main challenge is to match the local semantic features of the video with heterogeneous textual features. A common language-processing choice in previous works is a bi-LSTM with self-attention, which fixes the attention over the sentence and ignores the characteristics of the specific video, so the sentence attention can end up mismatched with the video's most discriminative features. The algorithm proposed in this paper instead lets the sentence learn the most discriminative features of the video, remarkably improving the accuracy of matching and segmentation. Specifically, we propose a cascade cross-modal attention that leverages visual features from two perspectives to attend to the language from coarse to fine, generating discriminative vision-aware language features. Moreover, equipping our framework with contrastive learning and a designed hard negative mining strategy helps the proposed network identify the positive sample among a large number of negatives, further improving performance. To demonstrate the effectiveness of our approach, we conduct experiments on two datasets, A2D Sentences and J-HMDB Sentences. Experimental results show that our method significantly outperforms recent state-of-the-art methods.
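To make the coarse-to-fine idea concrete, the sketch below shows one plausible PyTorch rendering of a two-stage cascade in which visual features from two perspectives attend to the word features. It is a minimal illustration of the mechanism the abstract describes, not the authors' implementation: the module names, layer sizes, and exact cascade wiring are all assumptions.

# Minimal PyTorch sketch of vision-attends-language cross-modal attention,
# applied in two cascaded stages (coarse, then fine). Illustrative only;
# dimensions and wiring are assumptions, not the paper's code.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """One stage: a visual query re-weights the per-word language features."""

    def __init__(self, vis_dim: int, lang_dim: int, hid_dim: int = 256):
        super().__init__()
        self.q = nn.Linear(vis_dim, hid_dim)    # project visual query
        self.k = nn.Linear(lang_dim, hid_dim)   # project word keys
        self.v = nn.Linear(lang_dim, lang_dim)  # project word values

    def forward(self, vis_feat, word_feats):
        # vis_feat:   (B, vis_dim)      pooled visual feature (one perspective)
        # word_feats: (B, T, lang_dim)  per-word language features
        q = self.q(vis_feat).unsqueeze(1)                  # (B, 1, H)
        k = self.k(word_feats)                             # (B, T, H)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        return (attn @ self.v(word_feats)).squeeze(1)      # (B, lang_dim)


class CascadeAttention(nn.Module):
    """Coarse-to-fine cascade: two visual perspectives attend the sentence."""

    def __init__(self, vis_dim: int = 2048, lang_dim: int = 300):
        super().__init__()
        self.coarse = CrossModalAttention(vis_dim, lang_dim)
        self.fine = CrossModalAttention(vis_dim + lang_dim, lang_dim)

    def forward(self, global_vis, local_vis, word_feats):
        # Stage 1: a coarse visual view attends the sentence.
        coarse_lang = self.coarse(global_vis, word_feats)
        # Stage 2: a finer visual view, conditioned on the coarse result,
        # re-attends the sentence to sharpen the language features.
        fine_query = torch.cat([local_vis, coarse_lang], dim=-1)
        return self.fine(fine_query, word_feats)  # vision-aware language feature

Here global_vis could be, for example, a pooled clip-level feature and local_vis a frame- or region-level feature; the paper's actual choice of the two visual perspectives is not specified on this page.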
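Similarly, the following is a minimal sketch of a contrastive objective with hard negative mining, assuming an in-batch InfoNCE-style loss in which each sample keeps only its top-k hardest negatives. The paper's exact mining strategy is not reproduced on this page, so the mining rule and hyperparameters here are illustrative.

# Sketch of contrastive learning with in-batch hard negative mining for
# matching vision-aware language features to visual features. The top-k
# mining rule and the temperature are illustrative assumptions.
import torch
import torch.nn.functional as F


def contrastive_loss(vis, lang, temperature=0.07, top_k_neg=8):
    # vis, lang: (B, D) embeddings; pair i is the positive match.
    vis = F.normalize(vis, dim=-1)
    lang = F.normalize(lang, dim=-1)
    sim = vis @ lang.t() / temperature              # (B, B) similarity matrix
    pos = sim.diagonal()                            # positive-pair scores

    # Hard negative mining: mask out the positives on the diagonal, then
    # keep only the top-k most similar non-matching pairs per sample
    # instead of all B-1 negatives.
    diag = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    negs = sim.masked_fill(diag, float('-inf'))
    hard_negs, _ = negs.topk(min(top_k_neg, sim.size(0) - 1), dim=-1)

    # InfoNCE over the positive (index 0) and its hardest negatives.
    logits = torch.cat([pos.unsqueeze(1), hard_negs], dim=1)
    targets = torch.zeros(sim.size(0), dtype=torch.long, device=sim.device)
    return F.cross_entropy(logits, targets)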




Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN: 9781450386517
DOI: 10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cascade structure
  2. contrastive learning
  3. multi-modal attention
  4. vision and language

Qualifiers

  • Research-article

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions, 25%


Article Metrics

  • Downloads (last 12 months): 22
  • Downloads (last 6 weeks): 3
Reflects downloads up to 15 Jan 2025.

Cited By
  • (2024) Dual-path Collaborative Generation Network for Emotional Video Captioning. Proceedings of the 32nd ACM International Conference on Multimedia, 496-505. DOI: 10.1145/3664647.3681603
  • (2024) Improving radiology report generation with multi-grained abnormality prediction. Neurocomputing 600, 128122. DOI: 10.1016/j.neucom.2024.128122
  • (2024) Fully Transformer-Equipped Architecture for end-to-end Referring Video Object Segmentation. Information Processing & Management 61(1), 103566. DOI: 10.1016/j.ipm.2023.103566
  • (2023) Temporal Dynamic Concept Modeling Network for Explainable Video Event Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 19(6), 1-22. DOI: 10.1145/3568312
  • (2023) Weakly Supervised Text-based Actor-Action Video Segmentation by Clip-level Multi-instance Learning. ACM Transactions on Multimedia Computing, Communications, and Applications 19(1), 1-22. DOI: 10.1145/3514250
  • (2022) Multi-Attention Network for Compressed Video Referring Object Segmentation. Proceedings of the 30th ACM International Conference on Multimedia, 4416-4425. DOI: 10.1145/3503161.3547761
  • (2022) Learning Linguistic Association Towards Efficient Text-Video Retrieval. Computer Vision - ECCV 2022, 254-270. DOI: 10.1007/978-3-031-20059-5_15
