
Less Is More: Picking Informative Frames for Video Captioning

Published: 08 September 2018

Abstract

In the video captioning task, the best results have been achieved by attention-based models that associate salient visual components with sentences in the video. However, existing studies follow a common procedure that performs frame-level appearance and motion modeling on frames sampled at equal intervals, which may introduce redundant visual information, sensitivity to content noise, and unnecessary computation cost. We propose PickNet, a plug-and-play module that picks informative frames for video captioning. Built on a standard encoder-decoder framework, the network is trained sequentially with a reinforcement-learning-based procedure, where the reward of each frame-picking action is designed to maximize visual diversity and minimize the discrepancy between the generated caption and the ground truth. A rewarded candidate is selected, and the corresponding latent representation of the encoder-decoder is updated for future picks. This procedure continues until the end of the video sequence. Consequently, a compact subset of frames can be selected to represent the visual content and perform video captioning without performance degradation. Experimental results show that our model achieves competitive performance on popular benchmarks while using only 6–8 frames.
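
The sequential frame-picking procedure described in the abstract can be pictured with a short sketch. The snippet below is purely illustrative and is not the authors' implementation: pick_probability stands in for PickNet's policy output, diversity_reward for the visual-diversity term, and language_reward is a stub for the caption-discrepancy term (which in the paper would come from comparing the generated caption against the ground truth); random vectors replace real CNN frame features.

import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def pick_probability(frame_feat, picked_feats):
    # Hypothetical stand-in for PickNet's policy: prefer frames that are
    # dissimilar to the ones already picked.
    if not picked_feats:
        return 0.9
    return float(np.clip(1.0 - max(cosine(frame_feat, f) for f in picked_feats), 0.05, 0.95))

def diversity_reward(frame_feat, picked_feats):
    # Visual-diversity term: reward picks that are far from the current selection.
    if not picked_feats:
        return 1.0
    return 1.0 - max(cosine(frame_feat, f) for f in picked_feats)

def language_reward(picked_feats):
    # Placeholder for the caption-discrepancy term (e.g. a CIDEr-style match
    # between the generated caption and the ground truth); stubbed to 0 here.
    return 0.0

frames = rng.normal(size=(30, 128))        # fake per-frame appearance features

picked, log_probs, rewards = [], [], []
for feat in frames:                        # scan the video frame by frame
    p = pick_probability(feat, picked)
    take = rng.random() < p                # sample the binary pick action
    log_probs.append(np.log(p if take else 1.0 - p))
    if take:
        rewards.append(diversity_reward(feat, picked) + language_reward(picked + [feat]))
        picked.append(feat)
    else:
        rewards.append(0.0)

# REINFORCE-style surrogate: log-probability of each action weighted by its reward.
policy_loss = -sum(lp * r for lp, r in zip(log_probs, rewards))
print(f"picked {len(picked)} of {len(frames)} frames, surrogate loss = {policy_loss:.3f}")

In the actual model, the surrogate loss would be backpropagated into PickNet's parameters and the encoder-decoder's latent state would be updated with each accepted frame; the sketch only shows the shape of the sequential decision loop under those assumptions.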




Published In

Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIII
Springer-Verlag, Berlin, Heidelberg, September 2018, 843 pages
ISBN: 978-3-030-01260-1
DOI: 10.1007/978-3-030-01261-8


