Image Description Model Fusing word2vec and Attention Mechanism

Computer Science ›› 2019, Vol. 46 ›› Issue (4): 268-273. doi: 10.11896/j.issn.1002-137X.2019.04.042

DENG Zhen-rong^1,2, ZHANG Bao-jun¹, JIANG Zhou-qin¹, HUANG Wen-ming^1,2

School of Computer and Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China¹
Guangxi Colleges and Universities Key Laboratory of Cloud Computing and Complex Systems, Guilin, Guangxi 541004, China²

Received: 2018-06-03 Online: 2019-04-15 Published: 2019-04-23
Corresponding author: DENG Zhen-rong (born 1977), female, master's degree, research fellow, master's supervisor. Her main research interests are computer software architecture and computer vision. E-mail: 799349175@qq.com
About the authors: ZHANG Bao-jun (born 1992), male, master's student; his main research interests are computer vision and deep learning. JIANG Zhou-qin (born 1994), female, master's student; her main research interests are computer vision and machine learning. HUANG Wen-ming (born 1963), male, professor, master's supervisor; his main research interests are big data processing and graphics and image processing.
Funding:
This work was supported by the Project of the Guangxi Colleges and Universities Key Laboratory of Cloud Computing and Complex Systems (yf17106), the Natural Science Foundation of Guangxi (2018GXNSFAA138132), and the Graduate Innovation Project of Guilin University of Electronic Technology (2018YJCX55).

Abstract

Abstract: To address the low overall quality of the sentences generated in current image description tasks, an image description model fusing word2vec and an attention mechanism is proposed. In the encoding stage, the word2vec model is used to vectorize the description text so as to strengthen the correlation between words; the VGGNet19 network is used to extract image features, and an attention mechanism is integrated into the image features so that the model can highlight the corresponding image features when generating a word at each time step. In the decoding stage, a GRU network is used as the language generation model for the image description task, to improve both the training efficiency of the model and the quality of the generated sentences. Experimental results on two public data sets, Flickr8k and Flickr30k, show that under the same training environment the GRU model reduces training time by 1/3 compared with the LSTM model, and that the performance of the proposed model on the BLEU and METEOR evaluation metrics improves significantly.

Key words: Attention mechanism, GRU model, Image description, word2vec
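
The abstract names the components of the decoder but gives no equations, so the following is only a minimal sketch of one decoding step, assuming a common soft-attention form over region features and a textbook GRU update. All dimensions, parameter names (`W_f`, `W_h`, `v`, the keys of `p`), and the random weights and inputs are hypothetical stand-ins; in the described model the word vector would come from word2vec and the region features from VGGNet19.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

def soft_attention(features, h, W_f, W_h, v):
    """features: (L, D) region features; h: (H,) previous decoder state."""
    scores = np.tanh(features @ W_f + h @ W_h) @ v  # one score per region, (L,)
    alpha = softmax(scores)                         # attention weights, sum to 1
    context = alpha @ features                      # weighted feature vector, (D,)
    return context, alpha

def gru_step(x, h, p):
    """One GRU update; gate conventions differ slightly between libraries."""
    z = sigmoid(x @ p["Wz"] + h @ p["Uz"] + p["bz"])             # update gate
    r = sigmoid(x @ p["Wr"] + h @ p["Ur"] + p["br"])             # reset gate
    h_cand = np.tanh(x @ p["Wh"] + (r * h) @ p["Uh"] + p["bh"])  # candidate state
    return (1.0 - z) * h + z * h_cand

# Assumed sizes: 196 regions of 512 channels, 128-d word vectors,
# 256-d hidden state, 64-d attention space (none are from the paper).
rng = np.random.default_rng(0)
L, D, E, H, A = 196, 512, 128, 256, 64

def init(*shape):
    return 0.01 * rng.standard_normal(shape)

features = rng.standard_normal((L, D))  # stand-in for CNN region features
word_vec = rng.standard_normal(E)       # stand-in for a word2vec embedding
h = np.zeros(H)                         # initial decoder state

W_f, W_h, v = init(D, A), init(H, A), init(A)
p = {
    "Wz": init(E + D, H), "Uz": init(H, H), "bz": np.zeros(H),
    "Wr": init(E + D, H), "Ur": init(H, H), "br": np.zeros(H),
    "Wh": init(E + D, H), "Uh": init(H, H), "bh": np.zeros(H),
}

context, alpha = soft_attention(features, h, W_f, W_h, v)
x = np.concatenate([word_vec, context])  # word vector joined with attended features
h_new = gru_step(x, h, p)
print(alpha.shape, context.shape, h_new.shape)
```

In a full decoder, `h_new` would feed a softmax layer over the vocabulary to pick the next word, and the step would repeat with the new state. A GRU has two gates where an LSTM has three plus a separate cell state, which is one plausible reason for the shorter training time reported in the abstract.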

CLC Number:

TP391.41

DENG Zhen-rong, ZHANG Bao-jun, JIANG Zhou-qin, HUANG Wen-ming. Image Description Model Fusing Word2vec and Attention Mechanism[J]. Computer Science, 2019, 46(4): 268-273. https://doi.org/10.11896/j.issn.1002-137X.2019.04.042

