
Cross-Modal Video Moment Retrieval with Spatial and Language-Temporal Attention

Published: 05 June 2019
DOI: 10.1145/3323873.3325019

Abstract

Given an untrimmed video and a description query, temporal moment retrieval aims to localize the temporal segment within the video that best matches the textual query. Existing studies predominantly employ coarse frame-level features as the visual representation, obscuring the specific details that may provide critical cues for localizing the desired moment. We propose SLTA (short for "Spatial and Language-Temporal Attention") to address this missing-detail issue. Specifically, SLTA takes advantage of object-level local features and attends to the most relevant ones (e.g., the local features for "girl" and "cup") via spatial attention. It then encodes the sequence of local features across consecutive frames to capture the interactions among these objects (e.g., the interaction "pour" involving the two objects). Meanwhile, a language-temporal attention emphasizes query keywords based on moment context information. The two attention sub-networks therefore recognize the most relevant objects and interactions in the video while simultaneously highlighting the keywords in the query. Extensive experiments on the TACoS, Charades-STA, and DiDeMo datasets demonstrate the effectiveness of our model compared to state-of-the-art methods.
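As a concrete illustration of the two attention sub-networks described above, the sketch below implements generic spatial attention over object-level features and language-temporal attention over word embeddings in PyTorch. It is a minimal sketch of the general technique only, not the authors' implementation: the module names (SpatialAttention, LanguageTemporalAttention), the feature dimensions, and the LSTM used to encode object interactions across frames are all assumptions made here for illustration.

```python
# Minimal sketch of spatial and language-temporal attention (NOT the
# authors' code); all names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialAttention(nn.Module):
    """Attend over the K object-level local features of each frame,
    conditioned on the sentence embedding, and pool them per frame."""

    def __init__(self, obj_dim, query_dim, hidden=256):
        super().__init__()
        self.proj_obj = nn.Linear(obj_dim, hidden)
        self.proj_query = nn.Linear(query_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, objs, query):
        # objs: (B, T, K, obj_dim); query: (B, query_dim)
        q = self.proj_query(query)[:, None, None, :]            # (B, 1, 1, H)
        e = self.score(torch.tanh(self.proj_obj(objs) + q))     # (B, T, K, 1)
        alpha = F.softmax(e, dim=2)                             # over objects
        return (alpha * objs).sum(dim=2)                        # (B, T, obj_dim)


class LanguageTemporalAttention(nn.Module):
    """Re-weight the word embeddings with moment-context information,
    yielding a context-aware query representation."""

    def __init__(self, word_dim, ctx_dim, hidden=256):
        super().__init__()
        self.proj_word = nn.Linear(word_dim, hidden)
        self.proj_ctx = nn.Linear(ctx_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, words, ctx):
        # words: (B, L, word_dim); ctx: (B, ctx_dim)
        c = self.proj_ctx(ctx)[:, None, :]                      # (B, 1, H)
        e = self.score(torch.tanh(self.proj_word(words) + c))   # (B, L, 1)
        beta = F.softmax(e, dim=1)                              # over words
        return (beta * words).sum(dim=1)                        # (B, word_dim)


# Toy usage: batch of 2 moments, 8 frames each, 5 detected objects per frame.
objs = torch.randn(2, 8, 5, 512)    # e.g. region features from a detector
query = torch.randn(2, 300)         # sentence-level embedding
words = torch.randn(2, 12, 300)     # per-word embeddings (e.g. GloVe)

frame_feats = SpatialAttention(512, 300)(objs, query)          # (2, 8, 512)
# An LSTM over the attended frames models object interactions over time.
interactions, _ = nn.LSTM(512, 256, batch_first=True)(frame_feats)
query_vec = LanguageTemporalAttention(300, 256)(words, interactions[:, -1])
print(frame_feats.shape, interactions.shape, query_vec.shape)
```

Under these assumptions, the spatial attention scores each of the K detected objects per frame against the sentence embedding before pooling, while the language-temporal attention re-weights the words using the encoded moment context, mirroring the roles the abstract assigns to the two sub-networks.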




    Published In

ICMR '19: Proceedings of the 2019 on International Conference on Multimedia Retrieval
June 2019, 427 pages
ISBN: 9781450367653
DOI: 10.1145/3323873

Publisher

Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. deep learning
    2. multimedia retrieval

    Qualifiers

    • Research-article

Conference

ICMR '19

    Acceptance Rates

    Overall Acceptance Rate 254 of 830 submissions, 31%


    Cited By

• (2024) M2DCapsN: Multimodal, Multichannel, and Dual-Step Capsule Network for Natural Language Moment Localization. IEEE Transactions on Neural Networks and Learning Systems 35(8), 11448-11462. DOI: 10.1109/TNNLS.2023.3261927
• (2024) Temporally Language Grounding With Multi-Modal Multi-Prompt Tuning. IEEE Transactions on Multimedia 26, 3366-3377. DOI: 10.1109/TMM.2023.3310282
• (2024) Modality-Aware Heterogeneous Graph for Joint Video Moment Retrieval and Highlight Detection. IEEE Transactions on Circuits and Systems for Video Technology 34(9), 8896-8911. DOI: 10.1109/TCSVT.2024.3389024
• (2024) The Deep Learning-Based Semantic Cross-Modal Moving-Object Moment Retrieval System. 2024 10th International Conference on Applied System Innovation (ICASI), 134-136. DOI: 10.1109/ICASI60819.2024.10547990
• (2024) Deep Learning for Video Localization. Deep Learning for Video Understanding, 39-68. DOI: 10.1007/978-3-031-57679-9_4
• (2023) Explainable Activity Recognition in Videos using Deep Learning and Tractable Probabilistic Models. ACM Transactions on Interactive Intelligent Systems 13(4), 1-32. DOI: 10.1145/3626961
• (2023) Mixup-Augmented Temporally Debiased Video Grounding with Content-Location Disentanglement. Proceedings of the 31st ACM International Conference on Multimedia, 4450-4459. DOI: 10.1145/3581783.3612401
• (2023) A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach. ACM Transactions on Multimedia Computing, Communications, and Applications 19(6), 1-23. DOI: 10.1145/3565573
• (2023) A Survey on Video Moment Localization. ACM Computing Surveys 55(9), 1-37. DOI: 10.1145/3556537
• (2023) Progressive Localization Networks for Language-Based Moment Localization. ACM Transactions on Multimedia Computing, Communications, and Applications 19(2), 1-21. DOI: 10.1145/3543857
