DOI: 10.1145/3372278.3390742
HLVU: A New Challenge to Test Deep Understanding of Movies the Way Humans do

Published: 08 June 2020

Abstract

In this paper we propose a new evaluation challenge and direction in the area of High-Level Video Understanding. The proposed challenge is designed to test automatic video analysis and understanding: how accurately systems can comprehend a movie in terms of actors, entities, events, and the relationships among them. A pilot High-Level Video Understanding (HLVU) dataset of open-source movies was collected, and human assessors built a knowledge graph representing each movie. A set of queries will be derived from each knowledge graph to test systems on retrieving relationships among actors, as well as on reasoning about and retrieving non-visual concepts. The objective is to benchmark whether a computer system can "understand" non-explicit but obvious relationships the same way humans do when they watch the same movies. This is a long-standing problem that is being addressed in the text domain, and this project moves similar research to the video domain. Work of this nature is foundational to future video analytics and video understanding technologies. It can be of interest to streaming services and broadcasters hoping to provide more intuitive ways for their customers to interact with and consume video content.
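The abstract's central artifact is a per-movie knowledge graph of entities and relationships, from which relationship queries are derived. As a minimal illustrative sketch (not the authors' implementation; the entity names, relation names, and class design here are hypothetical), such a graph and a relationship query might look like:

```python
from collections import defaultdict

class MovieKG:
    """A toy knowledge graph: directed, labeled edges between movie entities."""

    def __init__(self):
        # adjacency: subject -> list of (relation, object) pairs
        self.edges = defaultdict(list)

    def add(self, subject, relation, obj):
        self.edges[subject].append((relation, obj))

    def relations_between(self, subject, obj):
        """Return all relations asserted from subject to obj."""
        return [rel for rel, o in self.edges[subject] if o == obj]

kg = MovieKG()
kg.add("Anna", "sister_of", "Ben")    # explicit, visually grounded relation
kg.add("Anna", "distrusts", "Clara")  # non-visual relation a viewer infers
kg.add("Clara", "works_for", "Ben")

# A query of the kind the challenge describes: what links Anna to Clara?
print(kg.relations_between("Anna", "Clara"))  # ['distrusts']
```

A system under evaluation would have to recover edges like `distrusts` from the video itself, which is what makes the non-visual relations the hard part of the benchmark.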




    Published In

    ICMR '20: Proceedings of the 2020 International Conference on Multimedia Retrieval
    June 2020, 605 pages
    ISBN: 9781450370875
    DOI: 10.1145/3372278
    © 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. information retrieval
    2. multimedia
    3. video ontology
    4. video understanding

    Qualifiers

    • Research-article

    Conference

    ICMR '20

    Acceptance Rates

    Overall Acceptance Rate 254 of 830 submissions, 31%


    Cited By

    • (2024) A Survey of Video Datasets for Grounded Event Understanding. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 7314-7327. DOI: 10.1109/CVPRW63382.2024.00727. Online publication date: 17-Jun-2024.
    • (2023) Deep Video Understanding with Video-Language Model. Proceedings of the 31st ACM International Conference on Multimedia, 9551-9555. DOI: 10.1145/3581783.3612863. Online publication date: 26-Oct-2023.
    • (2023) A Hierarchical Deep Video Understanding Method with Shot-Based Instance Search and Large Language Model. Proceedings of the 31st ACM International Conference on Multimedia, 9425-9429. DOI: 10.1145/3581783.3612838. Online publication date: 26-Oct-2023.
    • (2023) The ACM Multimedia 2023 Deep Video Understanding Grand Challenge. Proceedings of the 31st ACM International Conference on Multimedia, 9606-9609. DOI: 10.1145/3581783.3612829. Online publication date: 26-Oct-2023.
    • (2023) Shifted GCN-GAT and Cumulative-Transformer based Social Relation Recognition for Long Videos. Proceedings of the 31st ACM International Conference on Multimedia, 67-76. DOI: 10.1145/3581783.3612175. Online publication date: 26-Oct-2023.
    • (2023) A Deep Understanding Video Q&A System for Film Education in Acting Department. 2023 International Conference on Intelligent Education and Intelligent Research (IEIR), 1-7. DOI: 10.1109/IEIR59294.2023.10391232. Online publication date: 5-Nov-2023.
    • (2023) ROAD-R: the autonomous driving dataset with logical requirements. Machine Learning 112:9, 3261-3291. DOI: 10.1007/s10994-023-06322-z. Online publication date: 1-May-2023.
    • (2022) Leveraging Text Representation and Face-head Tracking for Long-form Multimodal Semantic Relation Understanding. Proceedings of the 30th ACM International Conference on Multimedia, 7215-7219. DOI: 10.1145/3503161.3551610. Online publication date: 10-Oct-2022.
    • (2022) Unified QA-aware Knowledge Graph Generation Based on Multi-modal Modeling. Proceedings of the 30th ACM International Conference on Multimedia, 7185-7189. DOI: 10.1145/3503161.3551604. Online publication date: 10-Oct-2022.
    • (2022) Multimodal Analysis for Deep Video Understanding with Video Language Transformer. Proceedings of the 30th ACM International Conference on Multimedia, 7165-7169. DOI: 10.1145/3503161.3551600. Online publication date: 10-Oct-2022.
