
PosCap: Boosting Video Captioning with Part-of-Speech Guidance

Published: 09 November 2024

Abstract

Video captioning aims to automatically generate textual descriptions of video content, enhancing the accessibility, comprehension, and searchability of videos. Recent advances in deep learning, particularly in object recognition and encoder-decoder architectures, have significantly propelled the field forward. However, existing models may generate incomplete or grammatically incorrect sentences, particularly for non-visual words such as prepositions and conjunctions. To alleviate this problem, we introduce PosCap, a part-of-speech (POS)-assisted video captioning model. Leveraging POS information as prior knowledge, PosCap enhances word prediction by attending to distinct multimodal features when generating different types of words. Specifically, we introduce a POS prediction module that predicts the POS of the next word. The predicted POS information guides the attention mechanism to better integrate information from different modalities, thereby improving word generation. Experimental results on two benchmark datasets demonstrate that PosCap outperforms existing methods in generating coherent and grammatically correct video descriptions.
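The abstract describes two components: a POS prediction module that predicts the part of speech of the next word, and an attention mechanism whose multimodal fusion is guided by that prediction. The paper's implementation is not reproduced here; the snippet below is only a minimal illustrative sketch of the general idea, assuming PyTorch, a decoder hidden state, two already-attended modality features (appearance and motion), and a small coarse POS tag set. All names, dimensions, and the number of POS classes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PosGuidedFusion(nn.Module):
    """Sketch: fuse appearance and motion features under a predicted POS distribution."""

    def __init__(self, hidden_dim=512, feat_dim=512, num_pos=5):
        super().__init__()
        self.pos_head = nn.Linear(hidden_dim, num_pos)   # predicts POS of the next word
        self.gate = nn.Linear(hidden_dim + num_pos, 2)   # per-modality mixing weights
        self.out = nn.Linear(feat_dim, hidden_dim)       # projects fused context back to decoder size

    def forward(self, dec_state, appearance_feat, motion_feat):
        # dec_state: (B, hidden_dim); appearance_feat, motion_feat: (B, feat_dim)
        pos_dist = F.softmax(self.pos_head(dec_state), dim=-1)                       # (B, num_pos)
        gates = F.softmax(self.gate(torch.cat([dec_state, pos_dist], dim=-1)), dim=-1)  # (B, 2)
        fused = gates[:, 0:1] * appearance_feat + gates[:, 1:2] * motion_feat        # (B, feat_dim)
        return self.out(fused), pos_dist

if __name__ == "__main__":
    fusion = PosGuidedFusion()
    h = torch.randn(4, 512)  # hypothetical decoder hidden states for a batch of 4
    ctx, pos = fusion(h, torch.randn(4, 512), torch.randn(4, 512))
    print(ctx.shape, pos.shape)  # torch.Size([4, 512]) torch.Size([4, 5])
```

In a full model, the predicted `pos_dist` could additionally be supervised with POS tags obtained by tagging the reference captions, so that the gating reflects the grammatical role of the word being generated.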




Published In

Pattern Recognition and Computer Vision: 7th Chinese Conference, PRCV 2024, Urumqi, China, October 18–20, 2024, Proceedings, Part X
Oct 2024
584 pages
ISBN: 978-981-97-8791-3
DOI: 10.1007/978-981-97-8792-0

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 09 November 2024

Author Tags

  1. Video captioning
  2. Attention mechanism
  3. Part-of-speech

Qualifiers

  • Article
