Multilevel attention and relation network based image captioning model

Published: 16 September 2022

Abstract

The aim of the image captioning task is to understand various semantic concepts, such as objects and their relationships, in an image and to combine them into a natural language description. It therefore requires an algorithm that understands the visual content of a given image and translates it into a sequence of output words. In this paper, a Local Relation Network (LRN) is designed over the objects and image regions; it not only discovers the relationships between objects and image regions but also generates significant context-based features for every region in the image. In addition, a multilevel attention approach is used to focus on a given image region and its related image regions, enhancing the image representation capability of the proposed method. Finally, a variant of the traditional long short-term memory (LSTM) network with an attention mechanism is employed, which focuses on relevant contextual information, spatial locations, and deep visual features. With these measures, the proposed model encodes an image in an improved way, giving the model significant cues and thus leading to improved caption generation. Extensive experiments have been performed on three benchmark datasets: Flickr30k, MSCOCO, and Nocaps. On Flickr30k, the model obtains 31.2 BLEU@4, 23.5 METEOR, 51.5 ROUGE, 65.6 CIDEr, and 17.2 SPICE. On MSCOCO, it attains 42.4 BLEU@4, 29.4 METEOR, 59.7 ROUGE, 125.7 CIDEr, and 23.2 SPICE. The overall CIDEr score on the Nocaps dataset is 114.3. These scores clearly show the superiority of the proposed method over existing methods.
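
To make the architecture concrete, below is a minimal PyTorch sketch of the two ideas the abstract describes: a local relation module that enriches each region feature by attending over its pairwise relations with the other regions, and an attention-based LSTM decoding step that weighs region features against the current hidden state. Module names, dimensions, and all implementation details here are illustrative assumptions, not the authors' actual code.

```python
# Illustrative sketch only; hypothetical module names and dimensions,
# not the paper's implementation. Region features are assumed to come
# from an object detector such as Faster R-CNN, shape (B, R, D).
import torch
import torch.nn as nn

class LocalRelationModule(nn.Module):
    """Enrich each region feature with a relation-weighted summary of
    the other regions (scaled dot-product relations, residual fusion)."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, regions):                        # regions: (B, R, D)
        q, k, v = self.query(regions), self.key(regions), self.value(regions)
        scores = q @ k.transpose(1, 2) / regions.size(-1) ** 0.5  # (B, R, R)
        context = torch.softmax(scores, dim=-1) @ v    # relation-aware context
        return regions + context                       # residual fusion

class AttentionLSTMStep(nn.Module):
    """One decoding step: additive attention over region features,
    conditioned on the LSTM hidden state, then an LSTMCell update."""
    def __init__(self, dim, embed_dim, hidden_dim):
        super().__init__()
        self.att_region = nn.Linear(dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.cell = nn.LSTMCell(embed_dim + dim, hidden_dim)

    def forward(self, word_emb, regions, h, c):
        # Attention logits over regions: (B, R, 1)
        e = self.att_score(torch.tanh(
            self.att_region(regions) + self.att_hidden(h).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)                # attention weights
        context = (alpha * regions).sum(dim=1)         # (B, D) visual context
        h, c = self.cell(torch.cat([word_emb, context], dim=-1), (h, c))
        return h, c, alpha

# Tiny smoke test with random features.
B, R, D, E, H = 2, 36, 512, 300, 512
regions = LocalRelationModule(D)(torch.randn(B, R, D))
step = AttentionLSTMStep(D, E, H)
h, c = torch.zeros(B, H), torch.zeros(B, H)
h, c, alpha = step(torch.randn(B, E), regions, h, c)
```

In a complete captioner, the relation-enhanced region features would feed the decoder at every time step, with the next-word distribution read off a linear layer over the hidden state h.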

    Published In

Multimedia Tools and Applications, Volume 82, Issue 7
Mar 2023, 1554 pages

    Publisher

    Kluwer Academic Publishers

    United States

    Publication History

    Published: 16 September 2022
    Accepted: 05 September 2022
    Revision received: 26 July 2022
    Received: 06 April 2022

    Author Tags

    1. Relation network
    2. Semantic
    3. Attention
    4. Encoder-decoder

    Qualifiers

    • Research-article

    Cited By

• (2024) GVA: guided visual attention approach for automatic image caption generation. Multimedia Systems 30(1). https://doi.org/10.1007/s00530-023-01249-w. Online publication date: 29-Jan-2024.
