DOI: 10.1145/3503161.3548324
Research Article

Dynamic Scene Graph Generation via Temporal Prior Inference

Published: 10 October 2022

Abstract

Real-world videos are composed of complex actions with inherent temporal continuity (e.g., "person-touching-bottle" is usually followed by "person-holding-bottle"). In this work, we propose a novel method, Temporal Prior Inference (TPI), to mine such temporal continuity for dynamic scene graph generation (DSGG). In contrast to current DSGG methods, which capture the temporal dependence of each video individually by refining representations, we make the first attempt to explore temporal continuity by extracting the co-occurrence patterns of action categories across the full variety of videos in the Action Genome (AG) dataset. These inherent patterns are organized as Temporal Prior Knowledge (TPK), which serves as prior knowledge for the model's learning and inference. Given this prior knowledge, human-object relationships in the current frame can be effectively inferred from adjacent frames via the robust Temporal Prior Inference algorithm at negligible computational cost. Specifically, to efficiently guide the generation of temporally consistent dynamic scene graphs, we incorporate temporal prior inference into a DSGG framework by introducing frame enhancement, a continuity loss, and fast inference. The proposed model-agnostic strategies significantly boost the performance of existing state-of-the-art models on the Action Genome dataset, achieving 69.7 and 72.6 for R@10 and R@20 on PredCLS. In addition, fast inference reduces inference time by 41% with an acceptable drop in R@10 (from 69.7 to 66.8).
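The page does not include code, but the idea of extracting temporal prior knowledge from predicate co-occurrence across frames can be illustrated with a small sketch. The input format, function name, and smoothing scheme below are assumptions for illustration, not the authors' implementation:

```python
from collections import defaultdict

def build_temporal_prior(sequences, smoothing=1e-6):
    """Estimate P(next predicate | current predicate) from labeled videos.

    `sequences` is a hypothetical input format: one predicate-label sequence
    per tracked human-object pair, e.g. ["touching", "holding", "holding"].
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        # Count transitions between predicates in adjacent frames
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1.0
    prior = {}
    for cur, nxts in counts.items():
        total = sum(nxts.values())
        # Normalize counts into a smoothed transition distribution
        prior[cur] = {
            nxt: (c + smoothing) / (total + smoothing * len(nxts))
            for nxt, c in nxts.items()
        }
    return prior

# Toy data: "touching" is always followed by "holding" here
prior = build_temporal_prior([
    ["touching", "holding", "holding"],
    ["touching", "holding", "drinking"],
])
```

A table like `prior` could then bias a model's predictions for the current frame toward transitions that are frequent in the training videos, which is the kind of co-occurrence pattern the abstract describes.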

Supplementary Material

MP4 File (MM22_fp2536.mp4)
This video presents the main content of the paper "Dynamic Scene Graph Generation via Temporal Prior Inference". First, we concisely analyze the motivation of the paper. Second, we present the method used to solve the problem, i.e., the Temporal Prior Inference (TPI) algorithm. Then, building on the TPI algorithm, the Temporal Prior Frame Enhancement (TPFE), Temporal Prior Continuity Loss (TPCL), and Temporal Prior Fast Inference (TPFI) modules are proposed to improve the accuracy of the generated scene graphs. Finally, based on the experimental results, we analyze the function of each module in detail and demonstrate the effectiveness of the proposed method.
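As a rough, hedged sketch of what a continuity-style objective in the spirit of TPCL might compute (the interface, distribution format, and cross-entropy formulation are hypothetical, not taken from the paper):

```python
import math

def continuity_loss(prev_probs, cur_probs, prior, eps=1e-12):
    """Hypothetical continuity-style objective (not the paper's TPCL code).

    prev_probs / cur_probs: predicate -> probability for one human-object
    pair in two adjacent frames. `prior`: predicate -> transition
    probabilities estimated from training data. The previous frame's
    distribution is propagated through the prior, and the loss is the
    cross-entropy between that expectation and the current prediction.
    """
    expected = {}
    for p_prev, w in prev_probs.items():
        for p_cur, t in prior.get(p_prev, {}).items():
            expected[p_cur] = expected.get(p_cur, 0.0) + w * t
    # Cross-entropy H(expected, cur_probs); eps guards against log(0)
    return -sum(w * math.log(cur_probs.get(p, 0.0) + eps)
                for p, w in expected.items())
```

Under this sketch, a prediction that agrees with the prior-propagated expectation incurs near-zero loss, while predictions that ignore the temporal prior are penalized.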





Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN:9781450392037
DOI:10.1145/3503161

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. dynamic scene graph generation
  2. temporal prior knowledge
  3. vision and language


Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Article Metrics

  • Downloads (Last 12 months)182
  • Downloads (Last 6 weeks)28
Reflects downloads up to 12 Dec 2024


Cited By

  • (2024) Caption-Aware Multimodal Relation Extraction with Mutual Information Maximization. Proceedings of the 32nd ACM International Conference on Multimedia, 1148-1157. DOI: 10.1145/3664647.3681219. Online publication date: 28 Oct 2024.
  • (2024) Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation. IEEE Transactions on Image Processing, Vol. 33, 556-568. DOI: 10.1109/TIP.2023.3345652. Online publication date: 1 Jan 2024.
  • (2024) Dynamic Scene Graph Generation with Unified Temporal Modeling. 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME57554.2024.10687612. Online publication date: 15 Jul 2024.
  • (2024) FloCoDe: Unbiased Dynamic Scene Graph Generation with Temporal Consistency and Correlation Debiasing. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2516-2526. DOI: 10.1109/CVPRW63382.2024.00258. Online publication date: 17 Jun 2024.
  • (2024) OED: Towards One-stage End-to-End Dynamic Scene Graph Generation. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 27938-27947. DOI: 10.1109/CVPR52733.2024.02639. Online publication date: 16 Jun 2024.
  • (2023) Iterative Learning with Extra and Inner Knowledge for Long-tail Dynamic Scene Graph Generation. Proceedings of the 31st ACM International Conference on Multimedia, 4707-4715. DOI: 10.1145/3581783.3612430. Online publication date: 26 Oct 2023.
  • (2023) Prior Knowledge-driven Dynamic Scene Graph Generation with Causal Inference. Proceedings of the 31st ACM International Conference on Multimedia, 4877-4885. DOI: 10.1145/3581783.3612249. Online publication date: 26 Oct 2023.
  • (2023) Improving Scene Graph Generation with Superpixel-Based Interaction Learning. Proceedings of the 31st ACM International Conference on Multimedia, 1809-1820. DOI: 10.1145/3581783.3611889. Online publication date: 26 Oct 2023.
  • (2023) Unbiased Scene Graph Generation in Videos. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 22803-22813. DOI: 10.1109/CVPR52729.2023.02184. Online publication date: Jun 2023.
