DOI: 10.1145/3461353.3461361
Research article

Chinese description of videos incorporating multimodal features and attention mechanism

Published: 04 September 2021

Abstract

Video description is a hot topic at the intersection of computer vision and natural language processing and has made remarkable progress in recent years. Most research, however, targets English descriptions; work on Chinese description remains scarce. This paper explores the generation of Chinese descriptions of videos and proposes a model that, on top of the general encoder-decoder framework, introduces three complementary modal features and a temporal attention mechanism. Combined with an appropriate Chinese preprocessing method, the optimized model further improves the richness and accuracy of the generated Chinese descriptions, providing a useful reference for future research on multilingual video description. We evaluated the proposed model on an expanded Chinese corpus of the standard English dataset MSVD (Microsoft Research Video Description corpus) and studied processing methods specific to Chinese description generation. Experimental results show that the best METEOR score obtained by the proposed model is 6.6% higher than the previous best result on MSVD's Chinese corpus, and the model also achieves competitive results in the English setting.
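The paper itself supplies the architectural details; as a rough, hypothetical illustration of what "temporal attention" means in an encoder-decoder video captioner, the sketch below scores each per-frame (or fused multimodal) feature vector against the current decoder state and returns an attention-weighted context vector. All function names, dimensions, and the score form (a common additive-attention variant) are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def temporal_attention(frame_feats, decoder_state, W_f, W_h, w):
    """Soft temporal attention over video frames (illustrative sketch).

    frame_feats   : (T, D) per-frame or fused multimodal features
    decoder_state : (H,)   current decoder hidden state
    W_f (A, D), W_h (A, H), w (A,) : learned projections (random here)

    Returns the context vector for this decoding step and the
    attention weights over the T frames.
    """
    # Additive score e_t = w^T tanh(W_f v_t + W_h h) for each frame t
    scores = np.tanh(frame_feats @ W_f.T + decoder_state @ W_h.T) @ w  # (T,)
    alpha = softmax(scores)                                            # (T,)
    context = alpha @ frame_feats                                      # (D,)
    return context, alpha

# Toy usage with random weights (no training involved)
rng = np.random.default_rng(0)
T, D, H, A = 8, 16, 32, 24   # frames, feature dim, hidden dim, attention dim
feats = rng.standard_normal((T, D))
h = rng.standard_normal(H)
ctx, alpha = temporal_attention(feats, h,
                                rng.standard_normal((A, D)),
                                rng.standard_normal((A, H)),
                                rng.standard_normal(A))
```

At each decoding step the decoder conditions on a differently weighted summary of the frames, which is what lets the model attend to the temporally relevant segment while emitting each word.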



Published In

ICIAI '21: Proceedings of the 2021 5th International Conference on Innovation in Artificial Intelligence
March 2021, 246 pages
ISBN: 9781450388634
DOI: 10.1145/3461353
Publisher

Association for Computing Machinery, New York, NY, United States


        Author Tags

        1. Attention mechanism
        2. Multimodal features
        3. Natural language processing
        4. Video Chinese description

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Conference

        ICIAI 2021

