HERO: HiErarchical spatio-tempoRal reasOning with Contrastive Action Correspondence for End-to-End Video Object Grounding

Published: 10 October 2022

Abstract

Video Object Grounding (VOG) is the task of associating spatial object regions in a video with a descriptive natural language query. This challenging vision-language task requires constructing the correct cross-modal correspondence and modeling the appropriate spatio-temporal context of the query video and caption, so that the specified objects can be localized accurately. In this paper, we tackle this task with a novel framework called HiErarchical spatio-tempoRal reasOning (HERO) with contrastive action correspondence. We study the VOG task from two aspects that prior works have overlooked: (1) Contrastive Action Correspondence-aware Retrieval. Observing that fine-grained video semantics (e.g., multiple actions) are not fully aligned with the annotated language query (e.g., a single action), we first introduce weakly-supervised contrastive learning that classifies video frames as action-consistent or action-independent according to the video-caption action semantic correspondence. This design builds the fine-grained cross-modal correspondence needed for more accurate subsequent grounding. (2) Hierarchical Spatio-temporal Modeling Improvement. While transformer-based VOG models have shown their potential in modeling sequential modalities (i.e., video and caption), existing evidence also indicates that transformers are insensitive to spatio-temporal locality. Motivated by this, we carefully design hierarchical reasoning layers that decouple fully connected multi-head attention and remove redundant, interfering correlations. Furthermore, our proposed pyramid and shifted alignment mechanisms effectively improve the cross-modal use of information from neighboring spatial regions and temporal frames. Extensive experiments show that HERO outperforms existing techniques, achieving significant improvements on two benchmark datasets.
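To make the first idea concrete, here is a minimal, hypothetical PyTorch sketch of a contrastive action-correspondence objective in the spirit of the abstract, not the authors' implementation: frames are pseudo-labeled as action-consistent or action-independent purely from their similarity to the caption's action embedding, so no frame-level annotation is assumed. The encoders feeding this function, the top-k pseudo-labeling rule, and the temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def action_contrastive_loss(frame_feats: torch.Tensor,
                            action_feat: torch.Tensor,
                            tau: float = 0.07) -> torch.Tensor:
    """Weakly-supervised contrastive loss over video frames.

    frame_feats: (T, D) per-frame visual features (hypothetical encoder output).
    action_feat: (D,) pooled embedding of the caption's action phrase.
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    action_feat = F.normalize(action_feat, dim=-1)
    sim = frame_feats @ action_feat / tau        # (T,) frame-action similarities

    # Weak supervision: pseudo-label the most action-similar half of the
    # frames as action-consistent positives; the rest act as
    # action-independent negatives. The 50% split is an illustrative choice.
    k = max(1, frame_feats.size(0) // 2)
    pos_idx = sim.topk(k).indices

    # InfoNCE-style objective: pull pseudo-positive frames toward the
    # action embedding and push the remaining frames away.
    log_prob = sim.log_softmax(dim=0)
    return -log_prob[pos_idx].mean()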

Supplementary Material

MP4 File (meeting.mp4)
Presentation video




Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN:9781450392037
DOI:10.1145/3503161

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. multi-head self attention
  2. multi-modal video object grounding
  3. weak supervision

Qualifiers

  • Research-article

Funding Sources

  • Key R & D Projects of the Ministry of Science and Technology
  • Zhejiang Natural Science Foundation
  • National Key R&D Program of China under Grant
  • Program of Zhejiang Province Science and Technology
  • NSFC

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Cited By

  • (2024) Hierarchical Debiasing and Noisy Correction for Cross-domain Video Tube Retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia, 9271-9280. DOI: 10.1145/3664647.3681632
  • (2024) Cross-modal Observation Hypothesis Inference. In Proceedings of the 32nd ACM International Conference on Multimedia, 466-475. DOI: 10.1145/3664647.3681591
  • (2024) Cognitive Traffic Accident Anticipation. IEEE Intelligent Transportation Systems Magazine 16(5), 17-32. DOI: 10.1109/MITS.2024.3378460
  • (2023) Deconfounded Multimodal Learning for Spatio-temporal Video Grounding. In Proceedings of the 31st ACM International Conference on Multimedia, 7521-7529. DOI: 10.1145/3581783.3613822
  • (2023) Efficient Spatio-Temporal Video Grounding with Semantic-Guided Feature Decomposition. In Proceedings of the 31st ACM International Conference on Multimedia, 4867-4876. DOI: 10.1145/3581783.3612441
  • (2023) Video Entailment via Reaching a Structure-Aware Cross-modal Consensus. In Proceedings of the 31st ACM International Conference on Multimedia, 4240-4249. DOI: 10.1145/3581783.3612345
  • (2023) Unsupervised Domain Adaptation for Video Object Grounding with Cascaded Debiasing Learning. In Proceedings of the 31st ACM International Conference on Multimedia, 3807-3816. DOI: 10.1145/3581783.3612314
  • (2023) Quantitatively Measuring and Contrastively Exploring Heterogeneity for Domain Generalization. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2189-2200. DOI: 10.1145/3580305.3599481
  • (2023) Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2551-2562. DOI: 10.1109/ICCV51070.2023.00241
  • (2023) Learning Trajectory-Word Alignments for Video-Language Tasks. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2504-2514. DOI: 10.1109/ICCV51070.2023.00237
