Generating Video Descriptions with Topic Guidance

Research article. DOI: 10.1145/3078971.3079000

Published: 06 June 2017

Abstract

Generating video descriptions in natural language (a.k.a. video captioning) is a more challenging task than image captioning, as videos are intrinsically more complicated than images in two aspects. First, videos cover a broader range of topics, such as news, music, and sports. Second, multiple topics can coexist in the same video. In this paper, we propose a novel caption model, the topic-guided model (TGM), which exploits topic information to generate topic-oriented descriptions for videos in the wild. In addition to predefined topics, i.e., category tags crawled from the web, we also mine topics in a data-driven way from the training captions using an unsupervised topic mining model. We show that the data-driven topics reflect a better topic schema than the predefined topics. To predict topics for test videos, we treat the topic mining model as a teacher that trains a student topic prediction model, which exploits all modalities in the video, especially the speech modality. We propose a series of caption models that exploit topic guidance, either implicitly, by using topics as input features to generate topic-related words, or explicitly, by modifying the decoder weights with topics so that the decoder functions as an ensemble of topic-aware language decoders. Comprehensive experiments on MSR-VTT, currently the largest video captioning dataset, demonstrate the effectiveness of our topic-guided model, which significantly surpasses the winning performance in the 2016 MSR Video to Language Challenge.
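The abstract describes two ways the TGM injects topic information into the caption decoder: implicitly, by feeding the predicted topic distribution as an extra input feature, and explicitly, by mixing topic-specific decoder weights so the decoder acts as an ensemble of topic-aware language decoders. The sketch below illustrates both mechanisms. It is a minimal illustration, not the authors' implementation: all dimensions and variable names are assumptions, the exact weight-mixing scheme is a guess at the general idea, and a plain tanh recurrent cell stands in for the paper's LSTM decoder.

```python
# Minimal sketch of implicit vs. explicit topic guidance in a recurrent
# caption decoder. Illustrative only: sizes, names, and the mixing scheme
# are assumptions, not the paper's actual architecture.
import numpy as np

K = 20   # number of mined topics (assumed)
E = 300  # word embedding size (assumed)
H = 256  # decoder hidden size (assumed)

rng = np.random.default_rng(0)

def implicit_step(word_emb, topic_dist, h, W_in, W_rec):
    """Implicit guidance: append the topic distribution to the decoder
    input so topic-related words become more probable."""
    x = np.concatenate([word_emb, topic_dist])   # [E + K]
    return np.tanh(W_in @ x + W_rec @ h)         # next hidden state [H]

def explicit_step(word_emb, topic_dist, h, W_topics, W_rec):
    """Explicit guidance: keep one set of input weights per topic and mix
    them by the topic distribution, i.e., an ensemble of topic-aware
    language decoders collapsed into one weighted decoder."""
    W_mixed = np.tensordot(topic_dist, W_topics, axes=1)  # [H, E]
    return np.tanh(W_mixed @ word_emb + W_rec @ h)

# Toy usage with random parameters (one decoding step).
topic_dist = rng.dirichlet(np.ones(K))   # predicted topic distribution [K]
word_emb   = rng.standard_normal(E)      # embedding of the previous word
h          = np.zeros(H)                 # previous hidden state
W_in       = rng.standard_normal((H, E + K)) * 0.01
W_rec      = rng.standard_normal((H, H)) * 0.01
W_topics   = rng.standard_normal((K, H, E)) * 0.01  # one weight matrix per topic

print(implicit_step(word_emb, topic_dist, h, W_in, W_rec).shape)      # (256,)
print(explicit_step(word_emb, topic_dist, h, W_topics, W_rec).shape)  # (256,)
```

In the explicit variant, using a soft topic distribution rather than a hard topic label keeps the mixture differentiable and lets several topics contribute at once, which matches the abstract's observation that multiple topics can coexist in the same video.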



Information

Published In

ICMR '17: Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval
June 2017
524 pages
ISBN:9781450347013
DOI:10.1145/3078971
  • General Chairs: Bogdan Ionescu, Nicu Sebe
  • Program Chairs: Jiashi Feng, Martha Larson, Rainer Lienhart, Cees Snoek
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 June 2017


Author Tags

  1. data-driven topics
  2. multi-modalities
  3. teacher-student learning
  4. video captioning

Qualifiers

  • Research-article

Funding Sources

  • National Key Research and Development Plan

Conference

ICMR '17

Acceptance Rates

ICMR '17 paper acceptance rate: 33 of 95 submissions (35%)
Overall acceptance rate: 254 of 830 submissions (31%)


Article Metrics

  • Downloads (last 12 months): 17
  • Downloads (last 6 weeks): 9

Reflects downloads up to 04 Jan 2025

Cited By

  • (2024) Video captioning – a survey. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-18886-6. Online publication date: 9-Apr-2024.
  • (2021) Chinese description of videos incorporating multimodal features and attention mechanism. Proceedings of the 2021 5th International Conference on Innovation in Artificial Intelligence, pages 49-54. DOI: 10.1145/3461353.3461361. Online publication date: 5-Mar-2021.
  • (2021) Topic-based Video Analysis. ACM Computing Surveys, 54(6):1-34. DOI: 10.1145/3459089. Online publication date: 13-Jul-2021.
  • (2021) SibNet: Sibling Convolutional Encoder for Video Captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(9):3259-3272. DOI: 10.1109/TPAMI.2019.2940007. Online publication date: 1-Sep-2021.
  • (2021) Event-Centric Hierarchical Representation for Dense Video Captioning. IEEE Transactions on Circuits and Systems for Video Technology, 31(5):1890-1900. DOI: 10.1109/TCSVT.2020.3014606. Online publication date: 1-May-2021.
  • (2021) Text Synopsis Generation for Egocentric Videos. 2020 25th International Conference on Pattern Recognition (ICPR), pages 4252-4259. DOI: 10.1109/ICPR48806.2021.9412111. Online publication date: 10-Jan-2021.
  • (2020) Domain-Specific Semantics Guided Approach to Video Captioning. 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1576-1585. DOI: 10.1109/WACV45572.2020.9093344. Online publication date: Mar-2020.
  • (2019) Generating Video Descriptions With Latent Topic Guidance. IEEE Transactions on Multimedia, 21(9):2407-2418. DOI: 10.1109/TMM.2019.2896515. Online publication date: Sep-2019.
  • (2019) A Tale of Two Modalities for Video Captioning. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3708-3712. DOI: 10.1109/ICCVW.2019.00459. Online publication date: Oct-2019.
  • (2019) Learning Disentangled Representation in Latent Stochastic Models: A Case Study with Image Captioning. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4010-4014. DOI: 10.1109/ICASSP.2019.8683370. Online publication date: May-2019.
