Abstract
Image captioning aims to generate natural-language descriptions of the main content of images. Recently, Transformers with self-attention have been widely adopted for this task: attention helps the encoder produce image region features and guides caption generation in the decoder. However, the vanilla decoder relies on plain self-attention, which yields captions with weak semantic information and incomplete sentence logic. In this paper, we propose a novel attention block, the Multi-Keys attention block, which fully strengthens the relevance between explicit and implicit semantic information. Technically, the Multi-Keys attention block first concatenates the key vector and the value vector and feeds the result into both an explicit channel and an implicit channel; a “related value” carrying richer semantic information is then generated by applying element-wise multiplication to the two channels. Moreover, to improve sentence logic, a reverse key vector carrying a second information flow is residually connected to the final attention result. Applying the Multi-Keys attention block to the sentence decoder of the Transformer yields the Multi-Keys Transformer (MKTrans). Experiments show that MKTrans achieves a CIDEr score of 138.6 on the MS COCO “Karpathy” offline test split, and that the proposed Multi-Keys attention block and MKTrans model are more effective than state-of-the-art methods.
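The block described in the abstract can be sketched roughly as follows. This is a minimal illustrative reading, not the paper's implementation: the projection matrices (`w_exp`, `w_imp`, `w_rev`) and the exact wiring of the channels and the residual are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_keys_attention(q, k, v, w_exp, w_imp, w_rev):
    """Hedged sketch of the Multi-Keys attention block.

    q, k, v: (n, d) query/key/value matrices.
    w_exp, w_imp: (2d, d) projections for the explicit/implicit channels.
    w_rev: (d, d) projection producing the "reverse key".
    All parameter names are assumptions made for this sketch.
    """
    kv = np.concatenate([k, v], axis=-1)       # concatenate key and value vectors
    explicit = kv @ w_exp                      # explicit channel
    implicit = kv @ w_imp                      # implicit channel
    related_value = explicit * implicit        # element-wise product -> "related value"
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))       # standard scaled dot-product weights
    reverse_key = k @ w_rev                    # second information flow
    return attn @ related_value + reverse_key  # residual connection of the reverse key

rng = np.random.default_rng(0)
n, d = 4, 8
out = multi_keys_attention(rng.normal(size=(n, d)), rng.normal(size=(n, d)),
                           rng.normal(size=(n, d)),
                           rng.normal(size=(2 * d, d)), rng.normal(size=(2 * d, d)),
                           rng.normal(size=(d, d)))
print(out.shape)  # (4, 8)
```

The element-wise product fuses the two channels so that the values attended over already mix explicit and implicit information, while the residual reverse-key path gives the output a second route that bypasses the attention weights.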
Data Availability
The datasets generated and/or analyzed during the current study are available in the MS COCO repository, https://cocodataset.org/.
Ethics declarations
Ethical Approval
This article does not contain any studies with human participants performed by any of the authors.
Conflict of Interest
The authors declare no competing interests.
Cite this article
Yang, Z., Li, H., Ouyang, R. et al. Multi-Keys Attention Network for Image Captioning. Cogn Comput 16, 1061–1072 (2024). https://doi.org/10.1007/s12559-023-10231-7