Abstract
Developing artificial emotional intelligence for machines has become an active topic in human-computer interaction, especially for educational robots. With the success of generative models, emotion-driven content generation, e.g., image captioning, has emerged as a new research problem, enabling robots to produce content better aligned with human communication habits. However, existing image captioning techniques overlook emotional factors, which leads to stiff and mechanical outputs. This paper proposes an emotion-oriented image caption generation method that aims to reduce the gap between model outputs and human perception by introducing text sentiment analysis. Specifically, a masked language model is used to generate candidate textual sequences. A pre-trained CLIP model is then introduced to ensure that the generated descriptions match the visual content of the images. Finally, a text sentiment analysis model is integrated into the proposed framework to enhance emotional expression. Experiments show that, compared with existing techniques, the captions generated by this approach align better with the actual semantic content of the images.
J. Li contributed equally with the first author.
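The abstract describes fusing three signals when selecting each caption word: fluency from a masked language model, visual faithfulness from CLIP, and agreement with a target sentiment from a sentiment classifier. The sketch below illustrates that fusion with plain Python; the weights, the scoring functions, and the toy scores are all illustrative assumptions, not values from the paper.

```python
import math

def combined_score(mlm_logp, clip_sim, sent_prob,
                   alpha=0.4, beta=0.4, gamma=0.2):
    """Weighted fusion of the three signals described in the abstract.

    mlm_logp:  log-probability of the candidate under the masked LM (fluency)
    clip_sim:  CLIP image-text similarity in [0, 1] (visual faithfulness)
    sent_prob: probability of the target sentiment from the sentiment model
    The weights alpha/beta/gamma are illustrative, not taken from the paper.
    """
    return alpha * mlm_logp + beta * clip_sim + gamma * math.log(sent_prob + 1e-9)

def pick_candidate(candidates):
    """candidates: list of (word, mlm_logp, clip_sim, sent_prob) tuples.

    Returns the word with the highest combined score.
    """
    return max(candidates, key=lambda c: combined_score(*c[1:]))[0]

# Toy example: three candidate words for one masked position,
# scored with made-up numbers.
cands = [
    ("dog",     -1.0, 0.80, 0.50),  # fluent and grounded, but neutral
    ("monster", -4.0, 0.20, 0.90),  # emotional, but mismatched with the image
    ("puppy",   -1.2, 0.78, 0.85),  # balances all three criteria
]
print(pick_candidate(cands))  # -> puppy
```

In this hypothetical setup the fused score favors "puppy", which trades a small fluency loss for a large sentiment gain, mirroring the paper's goal of keeping captions visually faithful while enhancing emotional expression.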
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. 62206162, 62102308, and 61977044), the Young Talent Fund of the Xi’an Association for Science and Technology (No. 959202313048), and the Ministry of Education’s Cooperative Education Project (Grant No. 202102591018).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Zhang, X., Li, J., Xu, M., Li, L., Guo, L., Song, Y. (2025). Sentiment Caption Generation from Visual Scene Using Pre-trained Language Model. In: Lan, X., Mei, X., Jiang, C., Zhao, F., Tian, Z. (eds) Intelligent Robotics and Applications. ICIRA 2024. Lecture Notes in Computer Science, vol 15206. Springer, Singapore. https://doi.org/10.1007/978-981-96-0792-1_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-96-0791-4
Online ISBN: 978-981-96-0792-1