
Less Is More: Picking Informative Frames for Video Captioning

Published: 08 September 2018

Abstract

In the video captioning task, the best results have been achieved by attention-based models that associate salient visual components with sentences in the video. However, existing studies follow a common procedure that performs frame-level appearance and motion modeling on frames sampled at equal intervals, which may introduce redundant visual information, sensitivity to content noise, and unnecessary computation cost. We propose PickNet, a plug-and-play module that picks informative frames for video captioning. Built on a standard encoder-decoder framework, the network is trained sequentially with a reinforcement-learning-based procedure, where the reward of each frame-picking action is designed to maximize visual diversity and minimize the discrepancy between the generated caption and the ground truth. A rewarded candidate is selected, and the corresponding latent representation of the encoder-decoder is updated for future picks. This procedure continues until the end of the video sequence. Consequently, a compact subset of frames can be selected to represent the visual content and perform video captioning without performance degradation. Experimental results show that our model achieves competitive performance on popular benchmarks while using only 6–8 frames.
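
The sequential frame-picking procedure described in the abstract can be pictured with a short sketch. The snippet below is purely illustrative and is not the authors' implementation: pick_probability stands in for PickNet's policy output, diversity_reward for the visual-diversity term, and language_reward is a stub for the caption-discrepancy term (which in the paper would come from comparing the generated caption against the ground truth); random vectors replace real CNN frame features.

import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def pick_probability(frame_feat, picked_feats):
    # Hypothetical stand-in for PickNet's policy: prefer frames that are
    # dissimilar to the ones already picked.
    if not picked_feats:
        return 0.9
    return float(np.clip(1.0 - max(cosine(frame_feat, f) for f in picked_feats), 0.05, 0.95))

def diversity_reward(frame_feat, picked_feats):
    # Visual-diversity term: reward picks that are far from the current selection.
    if not picked_feats:
        return 1.0
    return 1.0 - max(cosine(frame_feat, f) for f in picked_feats)

def language_reward(picked_feats):
    # Placeholder for the caption-discrepancy term (e.g. a CIDEr-style match
    # between the generated caption and the ground truth); stubbed to 0 here.
    return 0.0

frames = rng.normal(size=(30, 128))        # fake per-frame appearance features

picked, log_probs, rewards = [], [], []
for feat in frames:                        # scan the video frame by frame
    p = pick_probability(feat, picked)
    take = rng.random() < p                # sample the binary pick action
    log_probs.append(np.log(p if take else 1.0 - p))
    if take:
        rewards.append(diversity_reward(feat, picked) + language_reward(picked + [feat]))
        picked.append(feat)
    else:
        rewards.append(0.0)

# REINFORCE-style surrogate: log-probability of each action weighted by its reward.
policy_loss = -sum(lp * r for lp, r in zip(log_probs, rewards))
print(f"picked {len(picked)} of {len(frames)} frames, surrogate loss = {policy_loss:.3f}")

In the actual model, the surrogate loss would be backpropagated into PickNet's parameters and the encoder-decoder's latent state would be updated with each accepted frame; the sketch only shows the shape of the sequential decision loop under those assumptions.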




Published In

Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIII
Springer-Verlag, Berlin, Heidelberg, September 2018, 843 pages
ISBN: 978-3-030-01260-1
DOI: 10.1007/978-3-030-01261-8


