Pseudo-3D Attention Transfer Network with Content-aware Strategy for Image Captioning

Published: 08 August 2019

Abstract

In this article, we propose a novel Pseudo-3D Attention Transfer network with Content-aware Strategy (P3DAT-CAS) for the image captioning task. Our model is composed of three parts: the Pseudo-3D Attention (P3DA) network, the P3DA-based Transfer (P3DAT) network, and the Content-aware Strategy (CAS). First, we propose P3DA to take full advantage of the three-dimensional (3D) information in convolutional feature maps and capture finer details. Most existing attention-based models extract only a 2D spatial representation from convolutional feature maps to decide which regions deserve more attention. However, convolutional feature maps are 3D, and different channel features detect diverse semantic attributes associated with images. P3DA therefore combines 2D spatial maps with 1D semantic-channel attributes to generate more informative captions. Second, we design the transfer network to retain and transfer key attention information from previous time steps. Traditional attention-based approaches use only the current attention information to predict words directly, whereas the transfer network learns long-term attention dependencies and explores global modeling patterns. Finally, we present CAS to provide a more relevant and distinctive caption for each image. A captioning model trained by maximum likelihood estimation may generate captions that are only weakly correlated with the image content, leaving a cross-modal gap between vision and linguistics; CAS helps convey the salient visual content accurately. P3DAT-CAS is evaluated on Flickr30k and MSCOCO, where it achieves performance competitive with state-of-the-art models.
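To make the P3DA idea concrete, the following is a minimal NumPy sketch of attention that weights a convolutional feature map along both its spatial (H×W) axis and its channel (C) axis before pooling it into a context vector. The function name `pseudo_3d_attention`, the projection matrices `W_s` and `W_c`, and the simple multiplicative fusion of the two weightings are illustrative assumptions for exposition, not the authors' exact formulation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def pseudo_3d_attention(feats, h, W_s, W_c):
    """Hypothetical spatial + channel attention over a conv feature map.

    feats: (C, H*W) flattened convolutional feature map
    h:     (D,)     decoder hidden state
    W_s:   (D, C)   projects h for scoring the H*W spatial locations
    W_c:   (D, H*W) projects h for scoring the C channels
    """
    alpha = softmax((h @ W_s) @ feats)   # (H*W,) 2D spatial weights
    beta = softmax(feats @ (h @ W_c))    # (C,)   1D semantic-channel weights
    # Re-weight the 3D map along both axes, then pool over locations.
    weighted = feats * beta[:, None] * alpha[None, :]
    return weighted.sum(axis=1)          # (C,) context vector for the decoder

# Toy usage with random features and a random decoder state.
rng = np.random.default_rng(0)
C, H, W, D = 512, 7, 7, 256
ctx = pseudo_3d_attention(rng.standard_normal((C, H * W)),
                          rng.standard_normal(D),
                          rng.standard_normal((D, C)),
                          rng.standard_normal((D, H * W)))
print(ctx.shape)  # (512,)
</ANTARTIFACTLINK>```

The softmax fusion keeps the weights normalized along each axis independently; in the full P3DAT model the attention information additionally passes through the transfer network, so that earlier attention decisions can influence later word predictions.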

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 3
August 2019
331 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3352586
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 August 2019
Accepted: 01 April 2019
Revised: 01 March 2019
Received: 01 May 2018
Published in TOMM Volume 15, Issue 3

Author Tags

  1. Image captioning
  2. content-aware strategy
  3. pseudo-3D attention network
  4. pseudo-3D attention transfer network
  5. transfer network

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • Natural Science Foundation of Guangdong
  • National Key R&D Program of China
  • Science and Technology Program of Guangzhou
  • Fundamental Research Funds for the Central Universities of China

Cited By

  • (2024) NPoSC-A3: A novel part of speech clues-aware adaptive attention mechanism for image captioning. Engineering Applications of Artificial Intelligence 131, 107732. DOI: 10.1016/j.engappai.2023.107732
  • (2023) Image Captioning With Novel Topics Guidance and Retrieval-Based Topics Re-Weighting. IEEE Transactions on Multimedia 25, 5984--5999. DOI: 10.1109/TMM.2022.3202690
  • (2023) IAC-ReCAM: Two-dimensional attention modulation and category label guidance for weakly supervised semantic segmentation. Image and Vision Computing 136, 104738. DOI: 10.1016/j.imavis.2023.104738
  • (2021) A Multi-instance Multi-label Dual Learning Approach for Video Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 17, 2s, 1--18. DOI: 10.1145/3446792
  • (2021) Attention-Aware Pseudo-3-D Convolutional Neural Network for Hyperspectral Image Classification. IEEE Transactions on Geoscience and Remote Sensing 59, 9, 7790--7802. DOI: 10.1109/TGRS.2020.3038212
  • (2021) Online Multi-Granularity Distillation for GAN Compression. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 6773--6783. DOI: 10.1109/ICCV48922.2021.00672
  • (2019) Visual Skeleton and Reparative Attention for Part-of-Speech image captioning system. Computer Vision and Image Understanding, 102819. DOI: 10.1016/j.cviu.2019.102819
