Abstract
With the explosion of multi-modal data, multi-modal sentiment analysis (MSA) has emerged and attracted widespread attention. Unfortunately, conventional multi-modal research relies on large-scale datasets: collecting and annotating such datasets is challenging and resource-intensive, and training on them further increases the research cost. In contrast, few-shot MSA (FMSA), which has been proposed recently, requires only a few samples for training and is therefore more practical and realistic. Prompt-based methods have been explored for FMSA, but they do not sufficiently consider or exploit the specific information carried by the visual modality. We therefore propose a graph-structured, vision-enhanced prompt-based model that better exploits visual information for fusion and collaboration when encoding and optimizing prompt representations. Specifically, we first design an aggregation-based multi-modal attention module. Then, building on this module and biaffine attention, we construct a syntax-semantic dual-channel graph convolutional network that optimizes the encoding of learnable prompts by modeling vision-enhanced information in both semantic and syntactic knowledge. Finally, we propose a collaboration-based optimization module built on a collaborative attention mechanism, which employs visual information to collaboratively optimize prompt representations. Extensive experiments on both coarse-grained and fine-grained MSA datasets demonstrate that our model significantly outperforms the baseline models.
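The paper's implementation is not reproduced here; the sketch below is only a minimal PyTorch illustration of the kind of pipeline the abstract describes: a syntactic channel and a semantic channel (the latter built from biaffine attention scores) refine the concatenated text and learnable prompt representations through graph convolutions, after which a collaborative cross-attention step lets visual features optimize the prompt tokens. All class names, tensor shapes, the gated channel fusion, and the single-layer design are illustrative assumptions, not the authors' architecture.

```python
# Illustrative sketch only (not the authors' released code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNLayer(nn.Module):
    """One graph-convolution layer: H' = ReLU(A_norm @ H @ W)."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, h, adj):
        # Row-normalize the adjacency so each node averages its neighbours.
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return F.relu(self.linear(torch.bmm(adj, h)))


class DualChannelPromptRefiner(nn.Module):
    """Hypothetical syntax/semantic dual-channel GCN + visual co-attention."""
    def __init__(self, dim):
        super().__init__()
        self.syn_gcn = GCNLayer(dim)               # syntactic channel
        self.sem_gcn = GCNLayer(dim)               # semantic channel
        self.biaffine = nn.Bilinear(dim, dim, 1)   # scores for the semantic graph
        # dim is assumed divisible by the number of attention heads.
        self.co_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_h, prompt_h, syn_adj, vis_h):
        # text_h: (B, T, D) token states; prompt_h: (B, P, D) learnable prompts
        # syn_adj: (B, T+P, T+P) dependency-based adjacency; vis_h: (B, R, D)
        h = torch.cat([text_h, prompt_h], dim=1)   # (B, T+P, D)
        n, d = h.size(1), h.size(-1)
        # Semantic adjacency from biaffine scores between all node pairs.
        left = h.unsqueeze(2).expand(-1, -1, n, -1).reshape(-1, d)
        right = h.unsqueeze(1).expand(-1, n, -1, -1).reshape(-1, d)
        sem_adj = torch.sigmoid(self.biaffine(left, right)).view(-1, n, n)
        # Two graph channels, then a gated fusion of their outputs.
        h_syn = self.syn_gcn(h, syn_adj)
        h_sem = self.sem_gcn(h, sem_adj)
        g = torch.sigmoid(self.gate(torch.cat([h_syn, h_sem], dim=-1)))
        h = g * h_syn + (1 - g) * h_sem
        # Collaborative optimization: prompt positions attend to visual regions.
        prompts = h[:, -prompt_h.size(1):]
        refined, _ = self.co_attn(prompts, vis_h, vis_h)
        return prompts + refined                    # (B, P, D)
```

As a usage example under these assumptions, with `dim=128`, ten text tokens, four prompt tokens, and 49 visual region features, `DualChannelPromptRefiner(128)(text_h, prompt_h, syn_adj, vis_h)` returns refined prompt embeddings of shape (B, 4, 128), which would presumably replace the original prompt tokens before the prompt-based prediction step.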
Data availability
We have provided the official websites accompanying the datasets; the complete data will be made available upon reasonable request.
Acknowledgments
The study was funded by the National Natural Science Foundation of China (Grant no. 61672144). Correspondence should be addressed to Baiyou Qiao.
Ethics declarations
Conflict of interest
We declare that there are no known competing financial interests or personal relationships that could have influenced the results presented in this manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhou, Z., Qiao, B., Feng, H. et al. Attention-optimized vision-enhanced prompt learning for few-shot multi-modal sentiment analysis. Neural Comput & Applic 36, 21091–21105 (2024). https://doi.org/10.1007/s00521-024-10297-w
DOI: https://doi.org/10.1007/s00521-024-10297-w