Abstract
Mining richer visual features and analyzing contextual information in an image for the decoding stage remain challenging problems in image captioning. Some recent works draw on external knowledge bases to obtain additional semantic relationships between objects by constructing a scene graph; however, pre-training the scene graph is time-consuming, and these manually defined relationships may not be comprehensive. In this paper, a hierarchical decoding with latent context method is proposed for image captioning, which analyzes visual context information and decodes multi-level visual features hierarchically to produce more accurate caption words. A novel Latent Context Generation Network (LCGN) infers latent relationships between objects without any external knowledge and, at the same time, constructs a context vector for each object that carries rich neighbor information. A graph convolutional network with attention then aggregates this latent context, combining object features with their context vectors to obtain high-level context features. Finally, hierarchical decoding based on a Triple Long Short-Term Memory (Tri-LSTM) decodes global, local, and object features in turn, gradually analyzing the image from the whole scene to local regions to individual objects. Experiments on the MSCOCO dataset show that the proposed method achieves highly competitive results in image captioning and outperforms most CNN-RNN architectures.
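The abstract compresses three technical steps: inferring latent relations between detected objects, aggregating per-object context with attention, and decoding with a three-level Tri-LSTM. The PyTorch sketch below is only a minimal illustration of the latter two ideas under stated assumptions; the module names, dimensions, and the dot-product attention formulation are hypothetical stand-ins, not the authors' implementation.

# Hypothetical sketch of (1) building a neighbor-aware context vector per
# object via attention, a stand-in for the paper's LCGN plus attentive GCN,
# and (2) a three-level "Tri-LSTM" decoder that conditions on global, local,
# and object features in sequence. All names and sizes are illustrative.
import torch
import torch.nn as nn

class LatentContextAggregator(nn.Module):
    """Attends over all objects to infer latent relation weights from
    feature similarity (assumption), without a pre-trained scene graph."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, obj):                              # obj: (B, N, D)
        attn = torch.softmax(
            self.q(obj) @ self.k(obj).transpose(1, 2) / obj.size(-1) ** 0.5,
            dim=-1)                                      # (B, N, N) relations
        ctx = attn @ self.v(obj)                         # context vectors
        return obj + ctx                                 # fuse object + context

class TriLSTMDecoder(nn.Module):
    """Hierarchical decoder: LSTM-1 sees the global feature, LSTM-2 an
    attended local feature, LSTM-3 an attended object feature, moving
    from the whole image toward individual objects."""
    def __init__(self, dim, vocab):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.lstm1 = nn.LSTMCell(2 * dim, dim)           # word + global feature
        self.lstm2 = nn.LSTMCell(2 * dim, dim)           # h1 + local feature
        self.lstm3 = nn.LSTMCell(2 * dim, dim)           # h2 + object feature
        self.out = nn.Linear(dim, vocab)

    def attend(self, query, feats):                      # dot-product attention
        w = torch.softmax((feats @ query.unsqueeze(-1)).squeeze(-1), dim=-1)
        return (w.unsqueeze(-1) * feats).sum(1)          # (B, D)

    def step(self, word, g, local, obj, states):
        (h1, c1), (h2, c2), (h3, c3) = states
        h1, c1 = self.lstm1(torch.cat([self.embed(word), g], -1), (h1, c1))
        h2, c2 = self.lstm2(torch.cat([h1, self.attend(h1, local)], -1), (h2, c2))
        h3, c3 = self.lstm3(torch.cat([h2, self.attend(h2, obj)], -1), (h3, c3))
        return self.out(h3), ((h1, c1), (h2, c2), (h3, c3))

In this reading, LSTM-1 conditions each step on the pooled global feature, LSTM-2 refines it with attended local (grid) features, and LSTM-3 commits to a word after attending over the context-enriched object features, mirroring the whole-to-local-to-object progression described above. How the global, local, and object features are extracted (e.g., CNN pooling and Faster R-CNN regions) is assumed, not taken from the paper.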
Data Availability
The data that support the findings of this study are openly available in MSCOCO at https://cocodataset.org, reference number [50].
References
Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: A neural image caption generator. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 3156–3164. https://doi.org/10.1109/CVPR.2015.7298935
Zhang J, Peng Y (2020) Video captioning with object-aware spatio-temporal correlation and aggregation. IEEE Trans Image Process 29:6209–6222
Lu J, Xiong C, Parikh D, Socher R (2017) Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 3242–3250. https://doi.org/10.1109/CVPR.2017.345
Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 6077–6086
Yang L, Tang KD, Yang J, Li L (2017) Dense captioning with joint inference and visual context. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 1978–1987
Kim D, Choi J, Oh T, Kweon IS (2019) Dense relational captioning: Triple-stream networks for relationship-based captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 6271–6280
Zhang J, Peng Y (2019) Hierarchical vision-language alignment for video captioning. In: Proceedings of the 25th international conference on multimedia modeling, MMM, vol 11295, pp 42–54
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: Proceedings of the 3rd international conference on learning representations, ICLR. http://arxiv.org/abs/1409.1556
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 770–778. https://doi.org/10.1109/CVPR.2016.90
Ren S, He K, Girshick RB, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems 28, NIPS, pp 91–99
Jia X, Gavves E, Fernando B, Tuytelaars T (2015) Guiding the long-short term memory model for image caption generation. In: Proceedings of the IEEE international conference on computer vision, ICCV, pp 2407–2415. https://doi.org/10.1109/ICCV.2015.277
Guo Y, Liu Y, de Boer MHT, Liu L, Lew MS (2018) A dual prediction network for image captioning. In: Proceedings of the IEEE international conference on multimedia and expo, ICME, pp 1–6. https://doi.org/10.1109/ICME.2018.8486491
Xu K, Ba J, Kiros R, Cho K, Courville AC, Salakhutdinov R, Zemel RS, Bengio Y (2015) Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of the 32nd international conference on machine learning, ICML, pp 2048–2057. http://proceedings.mlr.press/v37/xuc15.html
You Q, Jin H, Wang Z, Fang C, Luo J (2016) Image captioning with semantic attention. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 4651–4659. https://doi.org/10.1109/CVPR.2016.503
Chen L, Zhang H, Xiao J, Nie L, Shao J, Liu W, Chua T (2017) SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning. In: Proceedings of IEEE conference on computer vision and pattern recognition, CVPR, pp 6298–6306. https://doi.org/10.1109/CVPR.2017.667
Chen S, Zhao Q (2018) Boosted attention: leveraging human attention for image captioning. In: Proceedings of the European conference on computer vision, ECCV, pp 72–88. https://doi.org/10.1007/978-3-030-01252-6_5
Wu Q, Shen C, Wang P, Dick AR, van den Hengel A (2018) Image captioning and visual question answering based on attributes and external knowledge. IEEE Trans Pattern Anal Mach Intell 40(6):1367–1381. https://doi.org/10.1109/TPAMI.2017.2708709
Wang J, Wang W, Wang L, Wang Z, Feng DD, Tan T (2020) Learning visual relationship and context-aware attention for image captioning. Pattern Recogn 98:107075. https://doi.org/10.1016/j.patcog.2019.107075
Yao T, Pan Y, Li Y, Mei T (2018) Exploring visual relationship for image captioning. In: Proceedings of the European conference on computer vision, ECCV, pp 711–727. https://doi.org/10.1007/978-3-030-01264-9_42
Guo L, Liu J, Tang J, Li J, Luo W, Lu H (2019) Aligning linguistic words and visual semantic units for image captioning. In: Proceedings of the 27th ACM international conference on multimedia, MM, pp 765–773. https://doi.org/10.1145/3343031.3350943
Yang J, Lu J, Lee S, Batra D, Parikh D (2018) Graph R-CNN for scene graph generation. In: Proceedings of the European conference on computer vision, ECCV, pp 690–706. https://doi.org/10.1007/978-3-030-01246-5_41
Xu D, Zhu Y, Choy CB, Fei-Fei L (2017) Scene graph generation by iterative message passing. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 3097–3106. https://doi.org/10.1109/CVPR.2017.330
Hamilton WL, Ying Z, Leskovec J (2017) Inductive representation learning on large graphs. In: Advances in neural information processing systems 30, NIPS, pp 1024–1034
Xiao X, Wang L, Ding K, Xiang S, Pan C (2019) Deep hierarchical encoder-decoder network for image captioning. IEEE Trans Multimed 21(11):2942–2956. https://doi.org/10.1109/TMM.2019.2915033
Zhang J, Peng Y (2019) Object-aware aggregation with bidirectional temporal graph for video captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 8327–8336
Wei Y, Wang L, Cao H, Shao M, Wu C (2020) Multi-attention generative adversarial network for image captioning. Neurocomputing 387:91–99. https://doi.org/10.1016/j.neucom.2019.12.073
Zhang Z, Wu Q, Wang Y, Chen F (2021) Exploring region relationships implicitly: image captioning with visual relationship attention. Image Vis Comput 109:104146
Zhang Z, Wu Q, Wang Y, Chen F (2019) High-quality image captioning with fine-grained and semantic-guided visual attention. IEEE Trans Multimed 21(7):1681–1693. https://doi.org/10.1109/TMM.2018.2888822
Wang S, Lan L, Zhang X, Dong G, Luo Z (2019) Cascade semantic fusion for image captioning. IEEE Access 7:66680–66688. https://doi.org/10.1109/ACCESS.2019.2917979
Yao T, Pan Y, Li Y, Qiu Z, Mei T (2017) Boosting image captioning with attributes. In: Proceedings of IEEE international conference on computer vision, ICCV, pp 4904–4912. https://doi.org/10.1109/ICCV.2017.524
Guan Z, Liu K, Ma Y, Xu Q, Ji T (2018) Middle-level attribute-based language retouching for image caption generation. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing, ICASSP, pp 3081–3085
Li X, Yuan A, Lu X (2019) Vision-to-language tasks based on attributes and attention mechanism. IEEE Trans Cybern 51(2):1–14
Li Y, Ouyang W, Zhou B, Wang K, Wang X (2017) Scene graph generation from objects, phrases and region captions. In: Proceedings of IEEE international conference on computer vision, ICCV, pp 1270–1279. https://doi.org/10.1109/ICCV.2017.142
Zellers R, Yatskar M, Thomson S, Choi Y (2018) Neural motifs: scene graph parsing with global context. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 5831–5840. http://openaccess.thecvf.com/content_cvpr_2018/html/Zellers_Neural_Motifs_Scene_CVPR_2018_paper.html
Yang X, Tang K, Zhang H, Cai J (2019) Auto-encoding scene graphs for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 10685–10694. http://openaccess.thecvf.com/content_CVPR_2019/html/Yang_Auto-Encoding_Scene_Graphs_for_Image_Captioning_CVPR_2019_paper.html
Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L, Shamma DA, Bernstein MS, Fei-Fei L (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. Int J Comput Vis 123(1):32–73. https://doi.org/10.1007/s11263-016-0981-7
Wu L, Xu M, Wang J, Perry SW (2020) Recall what you see continually using GridLSTM in image captioning. IEEE Trans Multimed 22(3):808–818. https://doi.org/10.1109/TMM.2019.2931815
Qin Y, Du J, Zhang Y, Lu H (2019) Look back and predict forward in image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 8367–8375. http://openaccess.thecvf.com/content_CVPR_2019/html/Qin_Look_Back_and_Predict_Forward_in_Image_Captioning_CVPR_2019_paper.html
Wei Y, Wu C, Jia ZY, Hu XF, Shi H (2021) Past is important: improved image captioning by looking back in time. Signal Process Image Commun 94(8):116183
Gao L, Li X, Song J, Shen HT (2020) Hierarchical LSTMs with adaptive attention for visual captioning. IEEE Trans Pattern Anal Mach Intell 42(5):1112–1131
Rennie SJ, Marcheret E, Mroueh Y, Ross J, Goel V (2017) Self-critical sequence training for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 1179–1195. https://doi.org/10.1109/CVPR.2017.131
Liu S, Zhu Z, Ye N, Guadarrama S, Murphy K (2017) Improved image captioning via policy gradient optimization of SPIDEr. In: Proceedings of the IEEE international conference on computer vision, ICCV, pp 873–881. https://doi.org/10.1109/ICCV.2017.100
Vedantam R, Zitnick CL, Parikh D (2015) CIDEr: consensus-based image description evaluation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 4566–4575. https://doi.org/10.1109/CVPR.2015.7299087
Anderson P, Fernando B, Johnson M, Gould S (2016) SPICE: semantic propositional image caption evaluation. In: Proceedings of the European conference on computer vision, ECCV, pp 382–398. https://doi.org/10.1007/978-3-319-46454-1_24
Xu N, Zhang H, Liu A-A, Nie W, Su Y, Nie J, Zhan Y (2020) Multi-level policy and reward-based deep reinforcement learning framework for image captioning. IEEE Trans Multimed 22(5):1372–1383
Wu J, Chen T, Wu H, Yang Z, Lin L (2020) Fine-grained image captioning with global-local discriminative objective. IEEE Trans Multimed 99:2413–2427
Guo L, Liu J, Lu S, Lu H (2020) Show, tell, and polish: ruminant decoding for image captioning. IEEE Trans Multimed 22(8):2149–2162
Velickovic P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y (2018) Graph attention networks. In: Proceedings of the 6th international conference on learning representations, ICLR. https://openreview.net/forum?id=rJXMpikCZ
Kipf TN, Welling M (2017) Semi-supervised classification with graph convolutional networks. In: Proceedings of the 5th international conference on learning representations, ICLR. https://openreview.net/forum?id=SJU4ayYgl
Lin T, Maire M, Belongie SJ, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft COCO: common objects in context. In: Proceedings of the European conference on computer vision, ECCV, pp 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, CVPR, pp 3128–3137. https://doi.org/10.1109/CVPR.2015.7298932
Papineni K, Roukos S, Ward T, Zhu W (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics, ACL, pp 311–318. https://www.aclweb.org/anthology/P02-1040/
Denkowski MJ, Lavie A (2014) Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the 9th workshop on statistical machine translation, WMT, pp 376–380. https://doi.org/10.3115/v1/w14-3348
Lin C-Y (2004) ROUGE: a package for automatic evaluation of summaries. In: Text summarization branches out, association for computational linguistics, Barcelona, Spain, pp 74–81. https://www.aclweb.org/anthology/W04-1013
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the 3rd international conference on learning representations, ICLR. http://arxiv.org/abs/1412.6980
Gu J, Cai J, Wang G, Chen T (2018) Stack-captioning: coarse-to-fine learning for image captioning. In: Proceedings of the 32nd association for the advancement of artificial intelligence, AAAI, pp 6837–6844. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16465
Acknowledgements
This work was supported by the Shanghai Science and Technology Program “Distributed and generative few-shot algorithm and theory research” under Grant 20511100600 and the Natural Science Foundation of Shanghai “Research on image sentiment analysis and expression based on human vision and cognitive psychology” under Grant 22ZR1418400.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, J., Xie, Y., Li, K. et al. Hierarchical decoding with latent context for image captioning. Neural Comput & Applic 35, 2429–2442 (2023). https://doi.org/10.1007/s00521-022-07726-z