Computer Science (计算机科学), 2019, Vol. 46, Issue (4): 268-273. doi: 10.11896/j.issn.1002-137X.2019.04.042
DENG Zhen-rong (邓珍荣)1,2, ZHANG Bao-jun (张宝军)1, JIANG Zhou-qin (蒋周琴)1, HUANG Wen-ming (黄文明)1,2
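Abstract: To address the generally low quality of the sentences produced in current image captioning tasks, this paper proposes an image captioning model that combines word2vec with an attention mechanism. In the encoding stage, the word2vec model is used to vectorize the caption text, which strengthens the relatedness between words. The VGGNet19 network is used to extract image features, and an attention mechanism is fused into those features so that, at each time step, the model can emphasize the image features corresponding to the word being generated. In the decoding stage, a GRU network serves as the language generation model for the captioning task, with the aim of improving both training efficiency and the quality of the generated sentences. Experiments on two public datasets, Flickr8k and Flickr30k, show that under the same training environment the GRU model needs one third less training time than the LSTM model, and that the proposed model achieves significantly better performance on the BLEU and METEOR metrics.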
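The encode-attend-decode step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not specify the form of the attention function, so additive soft attention is assumed here, and the GRU update follows one common convention. All dimensions, weight names (`W_a`, `U_a`, `v_a`, `params`), and random values are hypothetical; in the actual model the region features would come from VGGNet19 and the word vector from a trained word2vec model.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(regions, h, W_a, U_a, v_a):
    """Assumed additive attention: score each image region against state h."""
    Uh = matvec(U_a, h)
    scores = []
    for a in regions:
        Wa = matvec(W_a, a)
        scores.append(sum(v * math.tanh(p + q) for v, p, q in zip(v_a, Wa, Uh)))
    alpha = softmax(scores)  # one weight per region, summing to 1
    dim = len(regions[0])
    context = [sum(alpha[i] * regions[i][d] for i in range(len(regions)))
               for d in range(dim)]
    return alpha, context

def gru_step(x, h, p):
    """One GRU update with input x and previous hidden state h."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b)
            for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], rh))]
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (hypothetical); real VGGNet19 feature maps are far larger.
rng = random.Random(0)
n_regions, feat_dim, embed_dim, hid_dim, att_dim = 4, 6, 5, 3, 4
regions = [[rng.random() for _ in range(feat_dim)] for _ in range(n_regions)]
word_vec = [rng.uniform(-1, 1) for _ in range(embed_dim)]  # stand-in for word2vec
h = [0.0] * hid_dim

W_a = rand_matrix(att_dim, feat_dim, rng)
U_a = rand_matrix(att_dim, hid_dim, rng)
v_a = [rng.uniform(-0.1, 0.1) for _ in range(att_dim)]
in_dim = embed_dim + feat_dim
params = {k: rand_matrix(hid_dim, in_dim if k[0] == "W" else hid_dim, rng)
          for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}

# One decoding step: attend over regions, then feed [word; context] to the GRU.
alpha, context = soft_attention(regions, h, W_a, U_a, v_a)
h_next = gru_step(word_vec + context, h, params)
```

In a full captioning model, `h_next` would be projected onto the vocabulary to predict the next word, and the loop would repeat with the new hidden state. A GRU has two gates where an LSTM has three plus a separate cell state, which is consistent with the shorter training time the abstract reports.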