Abstract
Mining richer visual features and analyzing contextual information in an image for the decoding stage remain challenging problems in image captioning. Some recent works draw on external knowledge bases to obtain additional semantic relationships between objects by constructing a scene graph; however, pre-training the scene graph is time-consuming, and these manually defined relationships may not be comprehensive. In this paper, a hierarchical decoding with latent context method is proposed for image captioning, which analyzes visual context information and decodes multi-level visual features hierarchically to produce more accurate caption words. A novel Latent Context Generation Network (LCGN) infers latent relationships between objects without any external knowledge and, at the same time, constructs a context vector for each object that carries rich neighbor information. A graph convolutional network with attention then aggregates this latent context, combining object features with their context vectors to obtain high-level context features. Finally, hierarchical decoding based on a Triple Long Short-Term Memory (Tri-LSTM) decodes global, local, and object features in turn, gradually analyzing the image from the whole scene to local regions to individual objects. Experiments on the MSCOCO dataset show that the proposed method achieves highly competitive results in image captioning and outperforms most CNN-RNN architectures.
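The abstract compresses three technical steps: inferring latent relations between detected objects, aggregating per-object context with attention, and decoding with a three-level Tri-LSTM. The PyTorch sketch below is only a minimal illustration of the latter two ideas under stated assumptions; the module names, dimensions, and the dot-product attention formulation are hypothetical stand-ins, not the authors' implementation.

# Hypothetical sketch of (1) building a neighbor-aware context vector per
# object via attention, a stand-in for the paper's LCGN plus attentive GCN,
# and (2) a three-level "Tri-LSTM" decoder that conditions on global, local,
# and object features in sequence. All names and sizes are illustrative.
import torch
import torch.nn as nn

class LatentContextAggregator(nn.Module):
    """Attends over all objects to infer latent relation weights from
    feature similarity (assumption), without a pre-trained scene graph."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, obj):                              # obj: (B, N, D)
        attn = torch.softmax(
            self.q(obj) @ self.k(obj).transpose(1, 2) / obj.size(-1) ** 0.5,
            dim=-1)                                      # (B, N, N) relations
        ctx = attn @ self.v(obj)                         # context vectors
        return obj + ctx                                 # fuse object + context

class TriLSTMDecoder(nn.Module):
    """Hierarchical decoder: LSTM-1 sees the global feature, LSTM-2 an
    attended local feature, LSTM-3 an attended object feature, moving
    from the whole image toward individual objects."""
    def __init__(self, dim, vocab):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.lstm1 = nn.LSTMCell(2 * dim, dim)           # word + global feature
        self.lstm2 = nn.LSTMCell(2 * dim, dim)           # h1 + local feature
        self.lstm3 = nn.LSTMCell(2 * dim, dim)           # h2 + object feature
        self.out = nn.Linear(dim, vocab)

    def attend(self, query, feats):                      # dot-product attention
        w = torch.softmax((feats @ query.unsqueeze(-1)).squeeze(-1), dim=-1)
        return (w.unsqueeze(-1) * feats).sum(1)          # (B, D)

    def step(self, word, g, local, obj, states):
        (h1, c1), (h2, c2), (h3, c3) = states
        h1, c1 = self.lstm1(torch.cat([self.embed(word), g], -1), (h1, c1))
        h2, c2 = self.lstm2(torch.cat([h1, self.attend(h1, local)], -1), (h2, c2))
        h3, c3 = self.lstm3(torch.cat([h2, self.attend(h2, obj)], -1), (h3, c3))
        return self.out(h3), ((h1, c1), (h2, c2), (h3, c3))

In this reading, LSTM-1 conditions each step on the pooled global feature, LSTM-2 refines it with attended local (grid) features, and LSTM-3 commits to a word after attending over the context-enriched object features, mirroring the whole-to-local-to-object progression described above. How the global, local, and object features are extracted (e.g., CNN pooling and Faster R-CNN regions) is assumed, not taken from the paper.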
Data Availability
The data that support the findings of this study are openly available in MSCOCO at https://cocodataset.org, reference number [50].
References
Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: A neural image caption generator. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 3156–3164. https://doi.org/10.1109/CVPR.2015.7298935
Zhang J, Peng Y (2020) Video captioning with object-aware spatio-temporal correlation and aggregation. IEEE Trans Image Process 29:6209–6222
Lu J, Xiong C, Parikh D, Socher R (2017) Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 3242–3250. https://doi.org/10.1109/CVPR.2017.345
Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 6077–6086
Yang L, Tang KD, Yang J, Li L (2017) Dense captioning with joint inference and visual context. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 1978–1987
Kim D, Choi J, Oh T, Kweon IS (2019) Dense relational captioning: Triple-stream networks for relationship-based captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 6271–6280
Zhang J, Peng Y (2019) Hierarchical vision-language alignment for video captioning. In: Proceedings of the 25th international conference on multimedia modeling, MMM, vol 11295, pp 42–54
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: Proceedings of the 3rd international conference on learning representations, ICLR. http://arxiv.org/abs/1409.1556
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 770–778. https://doi.org/10.1109/CVPR.2016.90
Ren S, He K, Girshick RB, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems 28, NIPS, pp 91–99
Jia X, Gavves E, Fernando B, Tuytelaars T (2015) Guiding the long-short term memory model for image caption generation. In: Proceedings of the IEEE international conference on computer vision, ICCV, pp 2407–2415. https://doi.org/10.1109/ICCV.2015.277
Guo Y, Liu Y, de Boer MHT, Liu L, Lew MS (2018) A dual prediction network for image captioning. In: Proceedings of the IEEE international conference on multimedia and expo, ICME, pp 1–6. https://doi.org/10.1109/ICME.2018.8486491
Xu K, Ba J, Kiros R, Cho K, Courville AC, Salakhutdinov R, Zemel RS, Bengio Y (2015) Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of the 32nd international conference on machine learning, ICML, pp 2048–2057. http://proceedings.mlr.press/v37/xuc15.html
You Q, Jin H, Wang Z, Fang C, Luo J (2016) Image captioning with semantic attention. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 4651–4659. https://doi.org/10.1109/CVPR.2016.503
Chen L, Zhang H, Xiao J, Nie L, Shao J, Liu W, Chua T (2017) SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning. In: Proceedings of IEEE conference on computer vision and pattern recognition, CVPR, pp 6298–6306. https://doi.org/10.1109/CVPR.2017.667
Chen S, Zhao Q (2018) Boosted attention: leveraging human attention for image captioning. In: Proceedings of the European conference on computer vision, ECCV, pp 72–88. https://doi.org/10.1007/978-3-030-01252-6_5
Wu Q, Shen C, Wang P, Dick AR, van den Hengel A (2018) Image captioning and visual question answering based on attributes and external knowledge. IEEE Trans Pattern Anal Mach Intell 40(6):1367–1381. https://doi.org/10.1109/TPAMI.2017.2708709
Wang J, Wang W, Wang L, Wang Z, Feng DD, Tan T (2020) Learning visual relationship and context-aware attention for image captioning. Pattern Recogn 98:107075. https://doi.org/10.1016/j.patcog.2019.107075
Yao T, Pan Y, Li Y, Mei T (2018) Exploring visual relationship for image captioning. In: Proceedings of the European conference on computer vision, ECCV, pp 711–727. https://doi.org/10.1007/978-3-030-01264-9_42
Guo L, Liu J, Tang J, Li J, Luo W, Lu H (2019) Aligning linguistic words and visual semantic units for image captioning. In: Proceedings of the 27th ACM international conference on multimedia, MM, pp 765–773. https://doi.org/10.1145/3343031.3350943
Yang J, Lu J, Lee S, Batra D, Parikh D (2018) Graph R-CNN for scene graph generation. In: Proceedings of the European conference on computer vision, ECCV, pp 690–706. https://doi.org/10.1007/978-3-030-01246-5_41
Xu D, Zhu Y, Choy CB, Fei-Fei L (2017) Scene graph generation by iterative message passing. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 3097–3106. https://doi.org/10.1109/CVPR.2017.330
Hamilton WL, Ying Z, Leskovec J (2017) Inductive representation learning on large graphs. In: Advances in neural information processing systems 30, NIPS, pp 1024–1034
Xiao X, Wang L, Ding K, Xiang S, Pan C (2019) Deep hierarchical encoder-decoder network for image captioning. IEEE Trans Multimed 21(11):2942–2956. https://doi.org/10.1109/TMM.2019.2915033
Zhang J, Peng Y (2019) Object-aware aggregation with bidirectional temporal graph for video captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 8327–8336
Wei Y, Wang L, Cao H, Shao M, Wu C (2020) Multi-attention generative adversarial network for image captioning. Neurocomputing 387:91–99. https://doi.org/10.1016/j.neucom.2019.12.073
Zhang Z, Wu Q, Wang Y, Chen F (2021) Exploring region relationships implicitly: image captioning with visual relationship attention. Image Vis Comput 109:104146
Zhang Z, Wu Q, Wang Y, Chen F (2019) High-quality image captioning with fine-grained and semantic-guided visual attention. IEEE Trans Multimed 21(7):1681–1693. https://doi.org/10.1109/TMM.2018.2888822
Wang S, Lan L, Zhang X, Dong G, Luo Z (2019) Cascade semantic fusion for image captioning. IEEE Access 7:66680–66688. https://doi.org/10.1109/ACCESS.2019.2917979
Yao T, Pan Y, Li Y, Qiu Z, Mei T (2017) Boosting image captioning with attributes. In: Proceedings of IEEE international conference on computer vision, ICCV, pp 4904–4912. https://doi.org/10.1109/ICCV.2017.524
Guan Z, Liu K, Ma Y, Xu Q, Ji T (2018) Middle-level attribute-based language retouching for image caption generation. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing, ICASSP, pp 3081–3085
Li X, Yuan A, Lu X (2019) Vision-to-language tasks based on attributes and attention mechanism. IEEE Trans Cybern 51(2):1–14
Li Y, Ouyang W, Zhou B, Wang K, Wang X (2017) Scene graph generation from objects, phrases and region captions. In: Proceedings of IEEE international conference on computer vision, ICCV, pp 1270–1279. https://doi.org/10.1109/ICCV.2017.142
Zellers R, Yatskar M, Thomson S, Choi Y (2018) Neural motifs: scene graph parsing with global context. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 5831–5840. http://openaccess.thecvf.com/content_cvpr_2018/html/Zellers_Neural_Motifs_Scene_CVPR_2018_paper.html
Yang X, Tang K, Zhang H, Cai J (2019) Auto-encoding scene graphs for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 10685–10694. http://openaccess.thecvf.com/content_CVPR_2019/html/Yang_Auto-Encoding_Scene_Graphs_for_Image_Captioning_CVPR_2019_paper.html
Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L, Shamma DA, Bernstein MS, Fei-Fei L (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. Int J Comput Vis 123(1):32–73. https://doi.org/10.1007/s11263-016-0981-7
Wu L, Xu M, Wang J, Perry SW (2020) Recall what you see continually using GridLSTM in image captioning. IEEE Trans Multimed 22(3):808–818. https://doi.org/10.1109/TMM.2019.2931815
Qin Y, Du J, Zhang Y, Lu H (2019) Look back and predict forward in image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 8367–8375. http://openaccess.thecvf.com/content_CVPR_2019/html/Qin_Look_Back_and_Predict_Forward_in_Image_Captioning_CVPR_2019_paper.html
Wei Y, Wu C, Jia ZY, Hu XF, Shi H (2021) Past is important: improved image captioning by looking back in time. Signal Process Image Commun 94(8):116183
Gao L, Li X, Song J, Shen HT (2020) Hierarchical LSTMs with adaptive attention for visual captioning. IEEE Trans Pattern Anal Mach Intell 42(5):1112–1131
Rennie SJ, Marcheret E, Mroueh Y, Ross J, Goel V (2017) Self-critical sequence training for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 1179–1195. https://doi.org/10.1109/CVPR.2017.131
Liu S, Zhu Z, Ye N, Guadarrama S, Murphy K (2017) Improved image captioning via policy gradient optimization of SPIDEr. In: Proceedings of the IEEE international conference on computer vision, ICCV, pp 873–881. https://doi.org/10.1109/ICCV.2017.100
Vedantam R, Zitnick CL, Parikh D (2015) CIDEr: consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 4566–4575. https://doi.org/10.1109/CVPR.2015.7299087
Anderson P, Fernando B, Johnson M, Gould S (2016) SPICE: semantic propositional image caption evaluation. In: Proceedings of the European conference on computer vision, ECCV, pp 382–398. https://doi.org/10.1007/978-3-319-46454-1_24
Xu N, Zhang H, Liu A-A, Nie W, Su Y, Nie J, Zhan Y (2020) Multi-level policy and reward-based deep reinforcement learning framework for image captioning. IEEE Trans Multimed 22(5):1372–1383
Wu J, Chen T, Wu H, Yang Z, Lin L (2020) Fine-grained image captioning with global-local discriminative objective. IEEE Trans Multimed 99:2413–2427
Guo L, Liu J, Lu S, Lu H (2020) Show, tell, and polish: ruminant decoding for image captioning. IEEE Trans Multimed 22(8):2149–2162
Velickovic P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y (2018) Graph attention networks. In: Proceedings of the 6th international conference on learning representations, ICLR. https://openreview.net/forum?id=rJXMpikCZ
Kipf TN, Welling M (2017) Semi-supervised classification with graph convolutional networks. In: Proceedings of the 5th international conference on learning representations, ICLR. https://openreview.net/forum?id=SJU4ayYgl
Lin T, Maire M, Belongie SJ, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context. In: Proceedings of the European conference on computer vision, ECCV, pp 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 3128–3137. https://doi.org/10.1109/CVPR.2015.7298932
Papineni K, Roukos S, Ward T, Zhu W (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics, ACL, pp 311–318. https://www.aclweb.org/anthology/P02-1040/
Denkowski MJ, Lavie A (2014) Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the 9th workshop on statistical machine translation, WMT, pp 376–380. https://doi.org/10.3115/v1/w14-3348
Lin C-Y (2004) ROUGE: a package for automatic evaluation of summaries. In: Text summarization branches out, association for computational linguistics, Barcelona, Spain, pp 74–81. https://www.aclweb.org/anthology/W04-1013
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the 3rd international conference on learning representations, ICLR. http://arxiv.org/abs/1412.6980
Gu J, Cai J, Wang G, Chen T (2018) Stack-captioning: coarse-to-fine learning for image captioning. In: Proceedings of the 32nd association for the advancement of artificial intelligence, AAAI, pp 6837–6844. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16465
Acknowledgements
This work was supported by the Shanghai Science and Technology Program “Distributed and generative few-shot algorithm and theory research” under Grant 20511100600 and the Natural Science Foundation of Shanghai “Research on image sentiment analysis and expression based on human vision and cognitive psychology” under Grant 22ZR1418400.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, J., Xie, Y., Li, K. et al. Hierarchical decoding with latent context for image captioning. Neural Comput & Applic 35, 2429–2442 (2023). https://doi.org/10.1007/s00521-022-07726-z