DOI: 10.1145/3461353.3461361
Research article

Chinese description of videos incorporating multimodal features and attention mechanism

Published: 04 September 2021

Abstract

Video description is a hot topic at the intersection of computer vision and natural language processing and has made remarkable progress in recent years. Most research, however, targets English descriptions; work on Chinese description remains scarce. This paper explores the generation of Chinese descriptions of videos and proposes a model that, on top of the general encoder-decoder framework, introduces three complementary modal features and a temporal attention mechanism. Combined with an appropriate Chinese preprocessing method, the optimized model further improves the richness and accuracy of the generated Chinese descriptions, providing a useful reference for future research on multilingual video description. We evaluated the proposed model on an expanded Chinese corpus of the standard English dataset MSVD (Microsoft Research Video Description corpus) and studied processing methods specific to Chinese description generation. Experimental results show that the best METEOR score obtained by the proposed model is 6.6% higher than the previous best result on MSVD's Chinese corpus, and the model also achieves competitive results in the English setting.
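The paper itself supplies the architectural details; as a rough, hypothetical illustration of what "temporal attention" means in an encoder-decoder video captioner, the sketch below scores each per-frame (or fused multimodal) feature vector against the current decoder state and returns an attention-weighted context vector. All function names, dimensions, and the score form (a common additive-attention variant) are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def temporal_attention(frame_feats, decoder_state, W_f, W_h, w):
    """Soft temporal attention over video frames (illustrative sketch).

    frame_feats   : (T, D) per-frame or fused multimodal features
    decoder_state : (H,)   current decoder hidden state
    W_f (A, D), W_h (A, H), w (A,) : learned projections (random here)

    Returns the context vector for this decoding step and the
    attention weights over the T frames.
    """
    # Additive score e_t = w^T tanh(W_f v_t + W_h h) for each frame t
    scores = np.tanh(frame_feats @ W_f.T + decoder_state @ W_h.T) @ w  # (T,)
    alpha = softmax(scores)                                            # (T,)
    context = alpha @ frame_feats                                      # (D,)
    return context, alpha

# Toy usage with random weights (no training involved)
rng = np.random.default_rng(0)
T, D, H, A = 8, 16, 32, 24   # frames, feature dim, hidden dim, attention dim
feats = rng.standard_normal((T, D))
h = rng.standard_normal(H)
ctx, alpha = temporal_attention(feats, h,
                                rng.standard_normal((A, D)),
                                rng.standard_normal((A, H)),
                                rng.standard_normal(A))
```

At each decoding step the decoder conditions on a differently weighted summary of the frames, which is what lets the model attend to the temporally relevant segment while emitting each word.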



Published In

ICIAI '21: Proceedings of the 2021 5th International Conference on Innovation in Artificial Intelligence
March 2021, 246 pages
ISBN: 9781450388634
DOI: 10.1145/3461353
Publisher

Association for Computing Machinery, New York, NY, United States


        Author Tags

        1. Attention mechanism
        2. Multimodal features
        3. Natural language processing
        4. Video Chinese description

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Conference

        ICIAI 2021

