Pseudo-3D Attention Transfer Network with Content-aware Strategy for Image Captioning

Published: 08 August 2019


In this article, we propose a novel Pseudo-3D Attention Transfer network with Content-aware Strategy (P3DAT-CAS) for the image captioning task. Our model is composed of three parts: the Pseudo-3D Attention (P3DA) network, the P3DA-based Transfer (P3DAT) network, and the Content-aware Strategy (CAS). First, we propose P3DA to take full advantage of the three-dimensional (3D) information in convolutional feature maps and capture more details. Most existing attention-based models extract only the 2D spatial representation from convolutional feature maps to decide which areas deserve more attention. However, convolutional feature maps are 3D, and different channel features can detect diverse semantic attributes associated with images. P3DA combines 2D spatial maps with 1D semantic-channel attributes to generate more informative captions. Second, we design the transfer network to maintain and transfer the key previous attention information. Traditional attention-based approaches use only the current attention information to predict words directly, whereas the transfer network is able to learn long-term attention dependencies and explore global modeling patterns. Finally, we present CAS to provide a more relevant and distinct caption for each image. A captioning model trained by maximum likelihood estimation may generate captions that correlate only weakly with the image contents, resulting in a cross-modal gap between vision and language. CAS helps convey the meaningful visual contents accurately. P3DAT-CAS is evaluated on Flickr30k and MSCOCO, where it achieves very competitive performance among state-of-the-art models.
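
To make the idea of combining 2D spatial attention with 1D channel attention concrete, the following is a minimal NumPy sketch. It is not the P3DA network from the article: the abstract does not give the scoring functions, so the dot-product scores, the mean-pooled channel descriptor, and the names `spatial_channel_attention`, `w_s`, and `w_c` are all assumptions chosen for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_channel_attention(feats, hidden, w_s, w_c):
    """Illustrative spatial + channel attention (hypothetical, not the paper's P3DA).

    feats:  (C, H, W) convolutional feature map
    hidden: (D,) decoder state used as the attention query
    w_s:    (D, C) projection for the spatial query (assumed)
    w_c:    (D, C) projection for the channel query (assumed)
    """
    C, H, W = feats.shape
    flat = feats.reshape(C, H * W)            # (C, N) with N = H * W locations

    # 2D spatial attention: one weight per location.
    q_s = hidden @ w_s                        # (C,)
    alpha = softmax(q_s @ flat)               # (N,)

    # 1D channel attention: one weight per channel, scored on the
    # mean activation of that channel (an assumed descriptor).
    q_c = hidden @ w_c                        # (C,)
    beta = softmax(q_c * flat.mean(axis=1))   # (C,)

    # Combine: reweight channels, then take the attention-weighted
    # sum over locations to get a single context vector.
    context = (beta[:, None] * flat) @ alpha  # (C,)
    return context, alpha.reshape(H, W), beta


# Example call with random inputs.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4, 4))
hidden = rng.standard_normal(16)
w_s = rng.standard_normal((16, 8))
w_c = rng.standard_normal((16, 8))
context, alpha, beta = spatial_channel_attention(feats, hidden, w_s, w_c)
```

In this sketch the spatial weights `alpha` say where to look and the channel weights `beta` say which feature channels to emphasize. A caption decoder would consume `context` at each word-generation step.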




Published In

ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 15, Issue 3
August 2019
331 pages
Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 August 2019
Accepted: 01 April 2019
Revised: 01 March 2019
Received: 01 May 2018
Published in TOMM Volume 15, Issue 3


Author Tags

  1. Image captioning
  2. content-aware strategy
  3. pseudo-3D attention network
  4. pseudo-3D attention transfer network
  5. transfer network


Funding Sources

  • National Natural Science Foundation of China
  • Natural Science Foundation of Guangdong
  • National Key R&D Program of China
  • Science and Technology Program of Guangzhou
  • Fundamental Research Funds for the Central Universities of China


