Abstract
Multimodal sentiment analysis in social media has attracted considerable research interest. Existing methods rely mainly on mining global/local information in the image and fusing it with text information, while overlooking the semantic information inherent in the text. To address this, the Emotion-label Guiding and Similarity Reasoning Network (EGSRNet) is proposed. It introduces emotion-label guided text features to extract hidden semantic information, improves local image-text interaction, and achieves a deeper understanding of image-text pairs by incorporating contextual information. Specifically, an Image-Text Feature Extraction module fully extracts global and local-entity image-text features to improve the utilization of vital features, and an emotion label is introduced to strengthen the deep semantic representation of the text. A Local-Entity Similarity Reasoning module based on the attention mechanism is then designed to explicitly compute the similarity between text and local-entity image features, capturing image-text correlations and enabling full cross-modal interaction. Finally, multimodal interaction is completed by combining the global image-text context, and data- and label-based contrastive learning is introduced to further improve performance. Experimental results show that the proposed model outperforms baseline methods on three public datasets.
Supported by the National Natural Science Foundation of China under Grant 62162065, the Joint Special Project Research Foundation of Yunnan Province (202401BF070001-023), the Yunnan Fundamental Research Projects (202201AT070167), and the Yunnan University Research Innovation Project for Recommended Exempt Postgraduates (TM-23236964).
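As a rough illustration of the attention-based local-entity similarity reasoning described in the abstract, the following PyTorch sketch shows one way a module could explicitly compute text-region similarities and fuse the attended regions back into the text representation. The class name, feature dimensions, and the concatenation-based fusion are assumptions made for illustration, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of attention-based
# local-entity similarity reasoning between text tokens and image regions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalEntitySimilarityReasoning(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # projects text tokens (queries)
        self.k_proj = nn.Linear(dim, dim)   # projects region features (keys)
        self.v_proj = nn.Linear(dim, dim)   # projects region features (values)
        self.out = nn.Linear(2 * dim, dim)  # fuses text with attended regions

    def forward(self, text_feats, region_feats):
        # text_feats:   (B, Lt, dim) emotion-label guided text token features
        # region_feats: (B, Lr, dim) local-entity image (region) features
        q = self.q_proj(text_feats)
        k = self.k_proj(region_feats)
        v = self.v_proj(region_feats)
        # Explicit text-region similarity, then attention over regions.
        sim = torch.matmul(q, k.transpose(1, 2)) / q.size(-1) ** 0.5  # (B, Lt, Lr)
        attn = F.softmax(sim, dim=-1)
        attended = torch.matmul(attn, v)                              # (B, Lt, dim)
        # Concatenate each text token with its attended region context and fuse.
        fused = self.out(torch.cat([text_feats, attended], dim=-1))
        return fused, sim

# Example: 2 samples, 20 text tokens, 36 detected regions, 768-d features.
if __name__ == "__main__":
    module = LocalEntitySimilarityReasoning(dim=768)
    t = torch.randn(2, 20, 768)
    r = torch.randn(2, 36, 768)
    fused, sim = module(t, r)
    print(fused.shape, sim.shape)  # (2, 20, 768) and (2, 20, 36)
```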
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Zhan, C., Qian, W., Liu, P. (2025). EGSRNet: Emotion-Label Guiding and Similarity Reasoning Network for Multimodal Sentiment Analysis. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15035. Springer, Singapore. https://doi.org/10.1007/978-981-97-8620-6_25
DOI: https://doi.org/10.1007/978-981-97-8620-6_25
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8619-0
Online ISBN: 978-981-97-8620-6