More Web Proxy on the site http://driver.im/

research-article

Video Relation Detection with Spatio-Temporal Graph

Authors:

Yueting Zhuang,

Jun XiaoAuthors Info & Claims

MM '19: Proceedings of the 27th ACM International Conference on Multimedia

Pages 84 - 93

https://doi.org/10.1145/3343031.3351058

Published: 15 October 2019 Publication History

Abstract

What we perceive from visual content are not only collections of objects but the interactions between them. Visual relations, denoted by the triplet <subject, predicate, object>, could convey a wealth of information for visual understanding. Different from static images and because of the additional temporal channel, dynamic relations in videos are often correlated in both spatial and temporal dimensions, which make the relation detection in videos a more complex and challenging task. In this paper, we abstract videos into fully-connected spatial-temporal graphs. We pass message and conduct reasoning in these 3D graphs with a novel VidVRD model using graph convolution network. Our model can take advantage of spatial-temporal contextual cues to make better predictions on objects as well as their dynamic relationships. Furthermore, an online association method with a siamese network is proposed for accurate relation instances association. By combining our model (VRD-GCN) and the proposed association method, our framework for video relation detection achieves the best performance in the latest benchmarks. We validate our approach on benchmark ImageNet-VidVRD dataset. The experimental results show that our framework outperforms the state-of-the-art by a large margin and a series of ablation studies demonstrate our method's effectiveness.

References

[1]

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2015. Deep Compositional Question Answering with Neural Module Networks. CoRR, Vol. abs/1511.02799 (2015).

[2]

Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. CoRR, Vol. abs/1607.06450 (2016).

[3]

Luca Bertinetto, Jack Valmadre, Jo a o F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. 2016. Fully-Convolutional Siamese Networks for Object Tracking. In Computer Vision - ECCV 2016 Workshops - Amsterdam, The Netherlands, October 8--10 and 15--16, 2016, Proceedings, Part II. 850--865.

[4]

Alex Bewley, ZongYuan Ge, Lionel Ott, Fabio Tozeto Ramos, and Ben Upcroft. 2016. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing, ICIP 2016, Phoenix, AZ, USA, September 25--28, 2016 . 3464--3468.

[5]

David S. Bolme, J. Ross Beveridge, Bruce A. Draper, and Yui Man Lui. 2010. Visual object tracking using adaptive correlation filters. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13--18 June 2010. 2544--2550.

[6]

Long Chen, Hanwang Zhang, Jun Xiao, Xiangnan He, Shiliang Pu, and Shih-Fu Chang. 2019. Counterfactual Critic Multi-Agent Training for Scene Graph Generation. In ICCV .

[7]

Zhiyong Cui, Kristian Henrickson, Ruimin Ke, and Yinhai Wang. 2018. High-Order Graph Convolutional Recurrent Neural Network: A Deep Learning Framework for Network-Scale Traffic Learning and Forecasting. CoRR, Vol. abs/1802.07007 (2018).

[8]

Martin Danelljan, Gustav H"a ger, Fahad Shahbaz Khan, and Michael Felsberg. 2014. Accurate Scale Estimation for Robust Visual Tracking. In British Machine Vision Conference, BMVC 2014, Nottingham, UK, September 1--5, 2014 .

[9]

Michaë l Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5--10, 2016, Barcelona, Spain . 3837--3845.

[10]

Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, and Bryan Russell. 2017. ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21--26, 2017. 3165--3174.

[11]

William L. Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4--9 December 2017, Long Beach, CA, USA. 1025--1035.

[12]

Anfeng He, Chong Luo, Xinmei Tian, and Wenjun Zeng. 2018. A Twofold Siamese Network for Real-Time Object Tracking. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18--22, 2018. 4834--4843.

[13]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27--30, 2016 . 770--778.

[14]

Mikael Henaff, Joan Bruna, and Yann LeCun. 2015. Deep Convolutional Networks on Graph-Structured Data. CoRR, Vol. abs/1506.05163 (2015).

[15]

Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24--26, 2017, Conference Track Proceedings .

[16]

Yikang Li, Wanli Ouyang, Bolei Zhou, Kun Wang, and Xiaogang Wang. 2017. Scene Graph Generation from Objects, Phrases and Region Captions. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22--29, 2017. 1270--1279.

[17]

Cewu Lu, Ranjay Krishna, Michael S. Bernstein, and Fei-Fei Li. 2016. Visual Relationship Detection with Language Priors. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part I. 852--869.

[18]

Kenneth Marino, Ruslan Salakhutdinov, and Abhinav Gupta. 2017. The More You Know: Using Knowledge Graphs for Image Classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21--26, 2017. 20--28.

[19]

Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Jan Svoboda, and Michael M. Bronstein. 2017. Geometric Deep Learning on Graphs and Manifolds Using Mixture Model CNNs. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21--26, 2017. 5425--5434.

[20]

Julia Peyre, Ivan Laptev, Cordelia Schmid, and Josef Sivic. 2017. Weakly-Supervised Learning of Visual Relations. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22--29, 2017 . 5189--5198.

[21]

Afshin Rahimi, Trevor Cohn, and Timothy Baldwin. 2018. Semi-supervised User Geolocation via Graph Convolutional Networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15--20, 2018, Volume 1: Long Papers . 2009--2019.

[22]

Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7--12, 2015, Montreal, Quebec, Canada. 91--99.

Digital Library

[23]

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, Vol. 115, 3 (2015), 211--252.

Digital Library

[24]

Mohammad Amin Sadeghi and Ali Farhadi. 2011. Recognition using visual phrases. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20--25 June 2011. 1745--1752.

Digital Library

[25]

Victor Garcia Satorras and Joan Bruna Estrach. 2018. Few-Shot Learning with Graph Neural Networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings .

[26]

Xindi Shang, Tongwei Ren, Jingfan Guo, Hanwang Zhang, and Tat-Seng Chua. 2017. Video Visual Relation Detection. In Proceedings of the 2017 ACM on Multimedia Conference, MM 2017, Mountain View, CA, USA, October 23--27, 2017 . 1300--1308.

Digital Library

[27]

Abhinav Shrivastava, Abhinav Gupta, and Ross B. Girshick. 2016. Training Region-Based Object Detectors with Online Hard Example Mining. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27--30, 2016. 761--769.

[28]

Damien Teney, Lingqiao Liu, and Anton van den Hengel. 2017. Graph-Structured Representations for Visual Question Answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21--26, 2017. 3233--3241.

[29]

Jack Valmadre, Luca Bertinetto, Jo a o F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. 2017. End-to-End Representation Learning for Correlation Filter Based Tracking. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21--26, 2017. 5000--5008.

[30]

Rianne van den Berg, Thomas N. Kipf, and Max Welling. 2017. Graph Convolutional Matrix Completion. CoRR, Vol. abs/1706.02263 (2017).

[31]

Heng Wang and Cordelia Schmid. 2013. Action Recognition with Improved Trajectories. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1--8, 2013 . 3551--3558.

[32]

Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-Local Neural Networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18--22, 2018. 7794--7803.

[33]

Xiaolong Wang and Abhinav Gupta. 2018. Videos as Space-Time Region Graphs. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8--14, 2018, Proceedings, Part V. 413--431.

[34]

Nicolai Wojke, Alex Bewley, and Dietrich Paulus. 2017. Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing, ICIP 2017, Beijing, China, September 17--20, 2017. 3645--3649.

[35]

Danfei Xu, Yuke Zhu, Christopher B. Choy, and Li Fei-Fei. 2017. Scene Graph Generation by Iterative Message Passing. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21--26, 2017. 3097--3106.

[36]

Ning Xu, An-An Liu, Yongkang Wong, Yongdong Zhang, Weizhi Nie, Yuting Su, and Mohan Kankanhalli. 2018. Dual-stream recurrent neural network for video captioning. IEEE Transactions on Circuits and Systems for Video Technology (2018).

[37]

Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. 2018. Graph R-CNN for Scene Graph Generation. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8--14, 2018, Proceedings, Part I. 690--706.

[38]

Guojun Yin, Lu Sheng, Bin Liu, Nenghai Yu, Xiaogang Wang, Jing Shao, and Chen Change Loy. 2018. Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8--14, 2018, Proceedings, Part III. 330--347.

[39]

Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19--23, 2018 . 974--983.

Digital Library

[40]

Ruichi Yu, Ang Li, Vlad I. Morariu, and Larry S. Davis. 2017. Visual Relationship Detection with Internal and External Linguistic Knowledge Distillation. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22--29, 2017 . 1068--1076.

[41]

Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. 2018. Neural Motifs: Scene Graph Parsing With Global Context. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18--22, 2018. 5831--5840.

[42]

Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, and Tat-Seng Chua. 2017. Visual Translation Embedding Network for Visual Relation Detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21--26, 2017. 3107--3115.

[43]

Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. 2018. Temporal Relational Reasoning in Videos. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8--14, 2018, Proceedings, Part I. 831--846.

[44]

Bohan Zhuang, Lingqiao Liu, Chunhua Shen, and Ian D. Reid. 2017. Towards Context-Aware Interaction Recognition for Visual Relationship Detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22--29, 2017. 589--598.

Cited By

Wu ZGao JXu CCai JKankanhalli MPrabhakaran BBoll SSubramanian RZheng LSingh VCesar PXie LXu D(2024)Open-Vocabulary Video Scene Graph Generation via Union-aware Semantic AlignmentProceedings of the 32nd ACM International Conference on Multimedia10.1145/3664647.3681061(8566-8575)Online publication date: 28-Oct-2024
https://dl.acm.org/doi/10.1145/3664647.3681061
Jiang XZheng CXu XLiu BZheng WZhang HHe SCai JKankanhalli MPrabhakaran BBoll SSubramanian RZheng LSingh VCesar PXie LXu D(2024)VrdONE: One-stage Video Visual Relation DetectionProceedings of the 32nd ACM International Conference on Multimedia10.1145/3664647.3680833(1437-1446)Online publication date: 28-Oct-2024
https://dl.acm.org/doi/10.1145/3664647.3680833
Wei MChen LJi WYue XZimmermann R(2024)In Defense of Clip-Based Video Relation DetectionIEEE Transactions on Image Processing10.1109/TIP.2024.337993533(2759-2769)Online publication date: 2024
https://doi.org/10.1109/TIP.2024.3379935
Show More Cited By

Index Terms

Video Relation Detection with Spatio-Temporal Graph
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision tasks
        Activity recognition and understanding

Recommendations

Video Visual Relation Detection
MM '17: Proceedings of the 25th ACM international conference on Multimedia

As a bridge to connect vision and language, visual relations between objects in the form of relation triplet $łangle subject,predicate,object\rangle$, such as "person-touch-dog'' and "cat-above-sofa'', provide a more comprehensive visual content ...
Visual Relation Detection with Multi-Level Attention
MM '19: Proceedings of the 27th ACM International Conference on Multimedia

Visual relations, which describe various types of interactions between two objects in the image, can provide critical information for comprehensive semantic understanding of the image. Multiple cues related to the objects can contribute to visual ...
Video Relation Detection with Trajectory-aware Multi-modal Features
MM '20: Proceedings of the 28th ACM International Conference on Multimedia

Video relation detection problem refers to the detection of the relationship between different objects in videos, such as spatial relationship and action relationship. In this paper, we present video relation detection with trajectory-aware multi-modal ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

MM '19: Proceedings of the 27th ACM International Conference on Multimedia

October 2019

2794 pages

ISBN:9781450368896

DOI:10.1145/3343031

General Chairs:
Laurent Amsaleg
CNRS-IRISA, France
,
Benoit Huet
EURECOM, France
,
Martha Larson
Radboud University and TU Delft (Netherlands)
,
Program Chairs:
Guillaume Gravier
CNRS-IRISA, France
,
Hayley Hung
Delft University of Technology Netherlands
,
Chong-Wah Ngo
City University of Hong Kong Hong Kong
,
Wei Tsang Ooi
National University of Singapore Singapore

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 October 2019

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China
Zhejiang Natural Science Foundation
National Key Research and Development Program of China
the Fundamental Research Funds for the Central Universities and Chinese Knowledge Center for Engineering Sciences and Technology

Conference

MM '19

Sponsor:

SIGMM

MM '19: The 27th ACM International Conference on Multimedia

October 21 - 25, 2019

Nice, France

Acceptance Rates

MM '19 Paper Acceptance Rate 252 of 936 submissions, 27%;

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

55
Total Citations
View Citations
1,254
Total Downloads

Downloads (Last 12 months)74
Downloads (Last 6 weeks)5

Reflects downloads up to 22 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Wu ZGao JXu CCai JKankanhalli MPrabhakaran BBoll SSubramanian RZheng LSingh VCesar PXie LXu D(2024)Open-Vocabulary Video Scene Graph Generation via Union-aware Semantic AlignmentProceedings of the 32nd ACM International Conference on Multimedia10.1145/3664647.3681061(8566-8575)Online publication date: 28-Oct-2024
https://dl.acm.org/doi/10.1145/3664647.3681061
Jiang XZheng CXu XLiu BZheng WZhang HHe SCai JKankanhalli MPrabhakaran BBoll SSubramanian RZheng LSingh VCesar PXie LXu D(2024)VrdONE: One-stage Video Visual Relation DetectionProceedings of the 32nd ACM International Conference on Multimedia10.1145/3664647.3680833(1437-1446)Online publication date: 28-Oct-2024
https://dl.acm.org/doi/10.1145/3664647.3680833
Wei MChen LJi WYue XZimmermann R(2024)In Defense of Clip-Based Video Relation DetectionIEEE Transactions on Image Processing10.1109/TIP.2024.337993533(2759-2769)Online publication date: 2024
https://doi.org/10.1109/TIP.2024.3379935
Pu TChen TWu HLu YLin L(2024)Spatial–Temporal Knowledge-Embedded Transformer for Video Scene Graph GenerationIEEE Transactions on Image Processing10.1109/TIP.2023.334565233(556-568)Online publication date: 2024
https://doi.org/10.1109/TIP.2023.3345652
Zhang GTang YZhang CZheng XZhao Y(2024)Entity Dependency Learning Network With Relation Prediction for Video Visual Relation DetectionIEEE Transactions on Circuits and Systems for Video Technology10.1109/TCSVT.2024.343743734:12(12425-12436)Online publication date: Dec-2024
https://doi.org/10.1109/TCSVT.2024.3437437
Qian RFu ZLiu XZhang KLv ZLan X(2024)Video Visual Relation Detection Based on Trajectory Fusion2024 International Joint Conference on Neural Networks (IJCNN)10.1109/IJCNN60899.2024.10650663(1-9)Online publication date: 30-Jun-2024
https://doi.org/10.1109/IJCNN60899.2024.10650663
Khandelwal A(2024)FloCoDe: Unbiased Dynamic Scene Graph Generation with Temporal Consistency and Correlation Debiasing2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)10.1109/CVPRW63382.2024.00258(2516-2526)Online publication date: 17-Jun-2024
https://doi.org/10.1109/CVPRW63382.2024.00258
Wu THe RWu GWang L(2024)SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)10.1109/CVPR52733.2024.01754(18537-18546)Online publication date: 16-Jun-2024
https://doi.org/10.1109/CVPR52733.2024.01754
Li HZhu GZhang LJiang YDang YHou HShen PZhao XShah SBennamoun M(2024)Scene Graph Generation: A comprehensive surveyNeurocomputing10.1016/j.neucom.2023.127052566(127052)Online publication date: Jan-2024
https://doi.org/10.1016/j.neucom.2023.127052
He YGan MMa Q(2024)Online video visual relation detection with hierarchical multi-modal fusionMultimedia Tools and Applications10.1007/s11042-023-15310-383:24(65707-65727)Online publication date: 18-Jan-2024
https://doi.org/10.1007/s11042-023-15310-3
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents