Abstract
Multimodal sentiment analysis leverages multiple modalities, including text, audio, and video, to determine human sentiment tendencies, which is significant for fields such as intention understanding and opinion analysis. However, multimodal sentiment analysis faces two critical challenges: one is how to effectively extract and integrate information from the various modalities, which is important for narrowing the heterogeneity gap among them; the other is how to overcome information forgetting when modelling long sequences, which causes significant information loss and adversely affects modality fusion. To address these issues, this paper proposes a multimodal heterogeneous fusion network based on graph convolutional neural networks (HFNGC). A shared convolutional aggregation mechanism is used to bridge the semantic gap among modalities and reduce the noise caused by modality heterogeneity. In addition, the model applies Dynamic Routing to convert modality features into graph structures. By learning semantic information in the graph representation space, our model improves its capability to capture long-range dependencies. Furthermore, the model integrates complementary information among modalities and explores intra- and inter-modal interactions during the fusion stage. To validate the effectiveness of our model, we conduct experiments on two benchmark datasets. The experimental results demonstrate that our method outperforms existing methods, exhibiting strong generalisation capability and high competitiveness.
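To make the idea in the abstract concrete, the sketch below shows a minimal graph-convolution-based fusion of text, audio, and video features: per-modality 1x1 convolutions project heterogeneous features into a shared space, each time step becomes a graph node, and one graph-convolution layer aggregates across nodes before a read-out regresses the sentiment score. This is a hypothetical illustration only, not the authors' HFNGC; the module names, feature dimensions, and the fully connected adjacency are assumptions.

```python
# Hypothetical sketch (not the authors' HFNGC implementation): minimal
# graph-convolution-based fusion of text/audio/video sequences.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleGCNFusion(nn.Module):
    def __init__(self, d_text=300, d_audio=74, d_video=35, d_shared=64):
        super().__init__()
        # Shared-space projection: a 1x1 convolution per modality maps the
        # heterogeneous feature dimensions into one common dimension.
        self.proj_t = nn.Conv1d(d_text, d_shared, kernel_size=1)
        self.proj_a = nn.Conv1d(d_audio, d_shared, kernel_size=1)
        self.proj_v = nn.Conv1d(d_video, d_shared, kernel_size=1)
        # One graph-convolution weight matrix applied to the node features.
        self.gcn_weight = nn.Linear(d_shared, d_shared)
        self.regressor = nn.Linear(d_shared, 1)  # sentiment score, e.g. in [-3, 3]

    def forward(self, text, audio, video):
        # Each input: (batch, seq_len, feat_dim); Conv1d expects channels first.
        nodes = torch.cat(
            [
                self.proj_t(text.transpose(1, 2)).transpose(1, 2),
                self.proj_a(audio.transpose(1, 2)).transpose(1, 2),
                self.proj_v(video.transpose(1, 2)).transpose(1, 2),
            ],
            dim=1,
        )  # (batch, total_steps, d_shared): every time step becomes a graph node
        n = nodes.size(1)
        # Toy adjacency: fully connected graph with self-loops, row-normalised.
        adj = torch.ones(n, n, device=nodes.device) / n
        # One GCN layer: aggregate neighbouring nodes, then transform.
        hidden = F.relu(self.gcn_weight(adj @ nodes))
        pooled = hidden.mean(dim=1)  # graph read-out by mean pooling
        return self.regressor(pooled).squeeze(-1)


if __name__ == "__main__":
    model = SimpleGCNFusion()
    t = torch.randn(2, 20, 300)  # toy text features (e.g. GloVe-sized)
    a = torch.randn(2, 20, 74)   # toy audio features (e.g. COVAREP-sized)
    v = torch.randn(2, 20, 35)   # toy video features (e.g. Facet-sized)
    print(model(t, a, v).shape)  # torch.Size([2])
```

In the paper's actual design the graph is built via Dynamic Routing rather than a fixed fully connected adjacency; the fixed adjacency here only keeps the sketch short and self-contained.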
Availability of data and materials
In this work, we have used two publicly available datasets, CMU-MOSI and CMU-MOSEI, both of which are available at https://github.com/A2Zadeh/CMU-MultimodalSDK.
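For reference, one possible way to obtain the pre-extracted CMU-MOSI features is through the CMU-MultimodalSDK linked above. The recipe attribute (cmu_mosi.highlevel) and the alignment key ("glove_vectors") below are taken from the SDK's README and are assumptions that may differ across SDK versions.

```python
# Hedged example: downloading pre-extracted CMU-MOSI features with the
# CMU-MultimodalSDK; recipe and key names may vary between SDK versions.
from mmsdk import mmdatasdk

# Download the pre-extracted feature set for CMU-MOSI into ./cmumosi/.
dataset = mmdatasdk.mmdataset(mmdatasdk.cmu_mosi.highlevel, "cmumosi/")

# Word-align the audio and visual streams to the textual modality.
dataset.align("glove_vectors")
```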
Acknowledgements
The authors would like to acknowledge the funding from the Open Project Program of Shanghai Key Laboratory of Data Science (No. 2020090600004), as well as the resources and technical support provided by the High Performance Computing Center of Shanghai University and the Shanghai Engineering Research Center of Intelligent Computing System (No. 19DZ2252600).
Funding
This study was supported by the Open Project Program of Shanghai Key Laboratory of Data Science (No. 2020090600004) and the High Performance Computing Center of Shanghai University, and Shanghai Engineering Research Center of Intelligent Computing System (No. 19DZ2252600).
Author information
Authors and Affiliations
Contributions
Tong Zhao: Conceptualization of this study, Methodology, Software, Writing - Original Draft. Junjie Peng: Conceptualization of this study, Writing - Review & Editing, Supervision. Yansong Huang: Formal analysis, Visualization. Lan Wang: Validation, Investigation. Huiran Zhang: Conceptualization of this study, Resources. Zesu Cai: Conceptualization of this study, Writing - Review & Editing.
Corresponding author
Ethics declarations
Ethics approval
This article is original and has not been submitted to more than one journal for simultaneous consideration.
Consent to participate
All authors approved this article before submission, including the names and order of authors.
Consent for publication
The authors agreed with the content and gave explicit consent to submit.
Competing interests
The authors declare that they have no conflict of interest related to this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhao, T., Peng, J., Huang, Y. et al. A graph convolution-based heterogeneous fusion network for multimodal sentiment analysis. Appl Intell 53, 30455–30468 (2023). https://doi.org/10.1007/s10489-023-05151-w