Generating Video Descriptions with Topic Guidance

Research article. DOI: 10.1145/3078971.3079000

Published: 06 June 2017

Abstract

Generating video descriptions in natural language (a.k.a. video captioning) is a more challenging task than image captioning, as videos are intrinsically more complicated than images in two aspects. First, videos cover a broader range of topics, such as news, music, and sports. Second, multiple topics can coexist in the same video. In this paper, we propose a novel caption model, the topic-guided model (TGM), which exploits topic information to generate topic-oriented descriptions for videos in the wild. In addition to predefined topics, i.e., category tags crawled from the web, we also mine topics in a data-driven way from the training captions using an unsupervised topic mining model. We show that the data-driven topics reflect a better topic schema than the predefined topics. To predict topics for test videos, we treat the topic mining model as a teacher that trains a student topic prediction model, which exploits all modalities in the video, especially the speech modality. We propose a series of caption models that exploit topic guidance, either implicitly, by using topics as input features to generate topic-related words, or explicitly, by modifying the decoder weights with topics so that the decoder functions as an ensemble of topic-aware language decoders. Comprehensive experiments on MSR-VTT, currently the largest video captioning dataset, demonstrate the effectiveness of our topic-guided model, which significantly surpasses the winning performance in the 2016 MSR Video to Language Challenge.
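The abstract describes two ways the TGM injects topic information into the caption decoder: implicitly, by feeding the predicted topic distribution as an extra input feature, and explicitly, by mixing topic-specific decoder weights so the decoder acts as an ensemble of topic-aware language decoders. The sketch below illustrates both mechanisms. It is a minimal illustration, not the authors' implementation: all dimensions and variable names are assumptions, the exact weight-mixing scheme is a guess at the general idea, and a plain tanh recurrent cell stands in for the paper's LSTM decoder.

```python
# Minimal sketch of implicit vs. explicit topic guidance in a recurrent
# caption decoder. Illustrative only: sizes, names, and the mixing scheme
# are assumptions, not the paper's actual architecture.
import numpy as np

K = 20   # number of mined topics (assumed)
E = 300  # word embedding size (assumed)
H = 256  # decoder hidden size (assumed)

rng = np.random.default_rng(0)

def implicit_step(word_emb, topic_dist, h, W_in, W_rec):
    """Implicit guidance: append the topic distribution to the decoder
    input so topic-related words become more probable."""
    x = np.concatenate([word_emb, topic_dist])   # [E + K]
    return np.tanh(W_in @ x + W_rec @ h)         # next hidden state [H]

def explicit_step(word_emb, topic_dist, h, W_topics, W_rec):
    """Explicit guidance: keep one set of input weights per topic and mix
    them by the topic distribution, i.e., an ensemble of topic-aware
    language decoders collapsed into one weighted decoder."""
    W_mixed = np.tensordot(topic_dist, W_topics, axes=1)  # [H, E]
    return np.tanh(W_mixed @ word_emb + W_rec @ h)

# Toy usage with random parameters (one decoding step).
topic_dist = rng.dirichlet(np.ones(K))   # predicted topic distribution [K]
word_emb   = rng.standard_normal(E)      # embedding of the previous word
h          = np.zeros(H)                 # previous hidden state
W_in       = rng.standard_normal((H, E + K)) * 0.01
W_rec      = rng.standard_normal((H, H)) * 0.01
W_topics   = rng.standard_normal((K, H, E)) * 0.01  # one weight matrix per topic

print(implicit_step(word_emb, topic_dist, h, W_in, W_rec).shape)      # (256,)
print(explicit_step(word_emb, topic_dist, h, W_topics, W_rec).shape)  # (256,)
```

In the explicit variant, using a soft topic distribution rather than a hard topic label keeps the mixture differentiable and lets several topics contribute at once, which matches the abstract's observation that multiple topics can coexist in the same video.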



Information

Published In

ICMR '17: Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval
June 2017
524 pages
ISBN:9781450347013
DOI:10.1145/3078971
  • General Chairs: Bogdan Ionescu, Nicu Sebe
  • Program Chairs: Jiashi Feng, Martha Larson, Rainer Lienhart, Cees Snoek
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 June 2017


Author Tags

  1. data-driven topics
  2. multi-modalities
  3. teacher-student learning
  4. video captioning

Qualifiers

  • Research-article

Funding Sources

  • National Key Research and Development Plan

Conference

ICMR '17

Acceptance Rates

ICMR '17 paper acceptance rate: 33 of 95 submissions (35%)
Overall acceptance rate: 254 of 830 submissions (31%)


Article Metrics

  • Downloads (last 12 months): 17
  • Downloads (last 6 weeks): 9

Reflects downloads up to 04 Jan 2025

Cited By

  • (2024) Video captioning – a survey. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-18886-6. Online publication date: 9-Apr-2024.
  • (2021) Chinese description of videos incorporating multimodal features and attention mechanism. Proceedings of the 2021 5th International Conference on Innovation in Artificial Intelligence, pages 49-54. DOI: 10.1145/3461353.3461361. Online publication date: 5-Mar-2021.
  • (2021) Topic-based Video Analysis. ACM Computing Surveys, 54(6):1-34. DOI: 10.1145/3459089. Online publication date: 13-Jul-2021.
  • (2021) SibNet: Sibling Convolutional Encoder for Video Captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(9):3259-3272. DOI: 10.1109/TPAMI.2019.2940007. Online publication date: 1-Sep-2021.
  • (2021) Event-Centric Hierarchical Representation for Dense Video Captioning. IEEE Transactions on Circuits and Systems for Video Technology, 31(5):1890-1900. DOI: 10.1109/TCSVT.2020.3014606. Online publication date: 1-May-2021.
  • (2021) Text Synopsis Generation for Egocentric Videos. 2020 25th International Conference on Pattern Recognition (ICPR), pages 4252-4259. DOI: 10.1109/ICPR48806.2021.9412111. Online publication date: 10-Jan-2021.
  • (2020) Domain-Specific Semantics Guided Approach to Video Captioning. 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1576-1585. DOI: 10.1109/WACV45572.2020.9093344. Online publication date: Mar-2020.
  • (2019) Generating Video Descriptions With Latent Topic Guidance. IEEE Transactions on Multimedia, 21(9):2407-2418. DOI: 10.1109/TMM.2019.2896515. Online publication date: Sep-2019.
  • (2019) A Tale of Two Modalities for Video Captioning. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3708-3712. DOI: 10.1109/ICCVW.2019.00459. Online publication date: Oct-2019.
  • (2019) Learning Disentangled Representation in Latent Stochastic Models: A Case Study with Image Captioning. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4010-4014. DOI: 10.1109/ICASSP.2019.8683370. Online publication date: May-2019.
