
PosCap: Boosting Video Captioning with Part-of-Speech Guidance

Published: 09 November 2024

Abstract

Video captioning aims to automatically generate textual descriptions of video content, enhancing the accessibility, comprehension, and searchability of videos. Recent advances in deep learning, particularly in object recognition and encoder-decoder architectures, have significantly propelled the field forward. However, existing models may generate incomplete or grammatically incorrect sentences, particularly for non-visual words such as prepositions and conjunctions. To alleviate this problem, we introduce PosCap, a part-of-speech (POS)-assisted video captioning model. Leveraging POS information as prior knowledge, PosCap enhances word prediction by attending to distinct multimodal features when generating different types of words. Specifically, we introduce a POS prediction module that predicts the POS of the next word. The predicted POS information guides the attention mechanism to better integrate information from different modalities, thereby improving word generation. Experimental results on two benchmark datasets demonstrate that PosCap outperforms existing methods in generating coherent and grammatically correct video descriptions.
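The abstract describes two components: a POS prediction module that predicts the part of speech of the next word, and an attention mechanism whose multimodal fusion is guided by that prediction. The paper's implementation is not reproduced here; the snippet below is only a minimal illustrative sketch of the general idea, assuming PyTorch, a decoder hidden state, two already-attended modality features (appearance and motion), and a small coarse POS tag set. All names, dimensions, and the number of POS classes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PosGuidedFusion(nn.Module):
    """Sketch: fuse appearance and motion features under a predicted POS distribution."""

    def __init__(self, hidden_dim=512, feat_dim=512, num_pos=5):
        super().__init__()
        self.pos_head = nn.Linear(hidden_dim, num_pos)   # predicts POS of the next word
        self.gate = nn.Linear(hidden_dim + num_pos, 2)   # per-modality mixing weights
        self.out = nn.Linear(feat_dim, hidden_dim)       # projects fused context back to decoder size

    def forward(self, dec_state, appearance_feat, motion_feat):
        # dec_state: (B, hidden_dim); appearance_feat, motion_feat: (B, feat_dim)
        pos_dist = F.softmax(self.pos_head(dec_state), dim=-1)                       # (B, num_pos)
        gates = F.softmax(self.gate(torch.cat([dec_state, pos_dist], dim=-1)), dim=-1)  # (B, 2)
        fused = gates[:, 0:1] * appearance_feat + gates[:, 1:2] * motion_feat        # (B, feat_dim)
        return self.out(fused), pos_dist

if __name__ == "__main__":
    fusion = PosGuidedFusion()
    h = torch.randn(4, 512)  # hypothetical decoder hidden states for a batch of 4
    ctx, pos = fusion(h, torch.randn(4, 512), torch.randn(4, 512))
    print(ctx.shape, pos.shape)  # torch.Size([4, 512]) torch.Size([4, 5])
```

In a full model, the predicted `pos_dist` could additionally be supervised with POS tags obtained by tagging the reference captions, so that the gating reflects the grammatical role of the word being generated.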




Published In

Pattern Recognition and Computer Vision: 7th Chinese Conference, PRCV 2024, Urumqi, China, October 18–20, 2024, Proceedings, Part X
Oct 2024
584 pages
ISBN: 978-981-97-8791-3
DOI: 10.1007/978-981-97-8792-0

Publisher

Springer-Verlag

Berlin, Heidelberg

Publication History

Published: 09 November 2024

Author Tags

  1. Video captioning
  2. Attention mechanism
  3. Part-of-speech

Qualifiers

  • Article
