Research article
DOI: 10.1145/3461353.3461386

Divided Caption Model with Global Attention

Published: 04 September 2021

Abstract

Dense video captioning is an emerging task that aims at both locating and describing all events in a video. We identify and tackle two challenges of this task: 1) the limitation of attending only to local features; and 2) the severely degraded descriptions and increased training complexity caused by redundant information. In this paper, we propose a new divided caption model in which two different attention mechanisms rectify the captioning process within a unified framework. First, we employ a global attention mechanism to encode video features in the proposal module, which yields better temporal boundaries. Second, we design a bidirectional long short-term memory (LSTM) network with a common-attention mechanism that effectively balances 3D convolutional neural network (C3D) features and globally attended video content in the caption module, generating coherent natural-language descriptions. In addition, we divide the forward and backward video features of an event into segments to alleviate the degraded descriptions and increased complexity. Extensive experiments demonstrate the competitive performance of the proposed Divided Caption Model with Global Attention (DCM-GA) against state-of-the-art methods on the ActivityNet Captions dataset.
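For illustration, the sketch below shows one plausible reading of the caption module described above in PyTorch: per-event C3D clip features are attended globally, encoded with a bidirectional LSTM, and fused into a context vector that conditions an LSTM word decoder. The class names, dimensions, and exact fusion scheme are assumptions made for this sketch; they are not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a caption module that fuses
# per-clip C3D features with a globally attended context vector.
# All layer sizes and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GlobalAttention(nn.Module):
    """Attend over all clip features of an event to build a global context."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, hidden_dim)
        self.query = nn.Parameter(torch.randn(hidden_dim))

    def forward(self, clip_feats):                          # (B, T, feat_dim)
        keys = torch.tanh(self.score(clip_feats))           # (B, T, hidden_dim)
        attn = F.softmax(keys @ self.query, dim=1)          # (B, T)
        return (attn.unsqueeze(-1) * clip_feats).sum(dim=1)  # (B, feat_dim)


class DividedCaptionSketch(nn.Module):
    """Bi-LSTM over event-segment C3D features, fused with a global context,
    feeding an LSTM decoder that emits one word per step (teacher forcing)."""

    def __init__(self, feat_dim=500, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.global_attn = GlobalAttention(feat_dim, hidden_dim)
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        self.fuse = nn.Linear(2 * hidden_dim + feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, segment_feats, captions):
        # segment_feats: (B, T, feat_dim) C3D features of one event segment
        # captions:      (B, L) ground-truth token ids
        global_ctx = self.global_attn(segment_feats)        # (B, feat_dim)
        enc, _ = self.encoder(segment_feats)                 # (B, T, 2*hidden)
        ctx = torch.tanh(self.fuse(
            torch.cat([enc.mean(dim=1), global_ctx], dim=-1)))  # (B, hidden)

        h = torch.zeros_like(ctx)
        c = torch.zeros_like(ctx)
        logits = []
        for t in range(captions.size(1)):
            word = self.embed(captions[:, t])                # (B, embed_dim)
            h, c = self.decoder(torch.cat([word, ctx], dim=-1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                    # (B, L, vocab)


if __name__ == "__main__":
    model = DividedCaptionSketch()
    feats = torch.randn(2, 16, 500)          # 2 events, 16 clips each
    caps = torch.randint(0, 10000, (2, 12))  # 12-token captions
    print(model(feats, caps).shape)          # torch.Size([2, 12, 10000])
```

In this reading, dividing an event's forward and backward features into segments would simply mean calling the module on shorter slices of `segment_feats`, which keeps each decoding pass shorter and reduces the redundant content the decoder must attend to.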



Published In

ICIAI '21: Proceedings of the 2021 5th International Conference on Innovation in Artificial Intelligence
March 2021, 246 pages
ISBN: 9781450388634
DOI: 10.1145/3461353

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. Bidirectional LSTM
        2. Global Attention
        3. Video Caption


