Abstract
Multimodal sentiment analysis (MSA) is a hot topic in deep learning research. Despite the progress made in previous studies, existing methods based on the multimodal transformer (MulT) tend to be computationally expensive: they adopt six crossmodal transformers to extract the representation sequences, making no attempt to simplify the complicated computational procedure and ignoring the unequal contributions of the individual modalities. In addition, these methods cannot effectively model long-range dependencies, which degrades performance. To address these problems, we propose a modality-squeeze transformer with an attentional recurrent graph capsule network (MST-ARGCN) for MSA. It first squeezes the three modalities through low-rank fusion to obtain a multimodal fused vector. It then uses only one crossmodal transformer, with the multimodal fused vector as the source modality and the text as the target modality, to extract the representation sequence for the subsequent networks, which greatly reduces the number of network parameters. Furthermore, the ARGCN is introduced to strengthen the learning of long-range dependencies during the outer-loop graph aggregation stage, further improving performance. We evaluate our model on the CMU-MOSEI and CMU-MOSI datasets, and the experimental results show that it achieves competitive performance with low computational complexity.
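To make the pipeline described above easier to picture, the following is a minimal PyTorch-style sketch of the two central ideas: low-rank fusion that squeezes the text, audio, and visual features into a single fused vector, and a single crossmodal attention block that takes the fused vector as the source and the text sequence as the target. All module names, dimensions, and the rank are illustrative assumptions rather than the paper's actual implementation; classic low-rank fusion also appends a constant 1 to each modality vector, which is omitted here for brevity.

```python
import torch
import torch.nn as nn

class ModalitySqueeze(nn.Module):
    """Low-rank fusion of text/audio/visual features into one fused vector
    (simplified sketch; dimensions and rank are illustrative assumptions)."""
    def __init__(self, dim_t, dim_a, dim_v, dim_out, rank=4):
        super().__init__()
        # One rank-factor tensor per modality.
        self.factor_t = nn.Parameter(torch.randn(rank, dim_t, dim_out) * 0.01)
        self.factor_a = nn.Parameter(torch.randn(rank, dim_a, dim_out) * 0.01)
        self.factor_v = nn.Parameter(torch.randn(rank, dim_v, dim_out) * 0.01)

    def forward(self, t, a, v):
        # t, a, v: (batch, dim_*) pooled unimodal features
        ft = torch.einsum('bd,rdo->rbo', t, self.factor_t)
        fa = torch.einsum('bd,rdo->rbo', a, self.factor_a)
        fv = torch.einsum('bd,rdo->rbo', v, self.factor_v)
        # Element-wise product fuses the modalities; summing over the rank
        # dimension gives the low-rank approximation of the full tensor fusion.
        return (ft * fa * fv).sum(dim=0)            # (batch, dim_out)

class SingleCrossmodalBlock(nn.Module):
    """One crossmodal attention block: the fused vector serves as the source
    (keys/values) while the text sequence serves as the target (queries)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_seq, fused_vec):
        # text_seq: (batch, seq_len, dim); fused_vec: (batch, dim)
        src = fused_vec.unsqueeze(1)                # (batch, 1, dim)
        out, _ = self.attn(query=text_seq, key=src, value=src)
        return self.norm(text_seq + out)            # residual connection + norm
```

With dim_out matched to the text feature dimension, a single block of this kind stands in for the six pairwise crossmodal transformers used by MulT, which is the source of the parameter reduction mentioned in the abstract.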
Data Availability Statement
The datasets are openly available in a public repository.
Author contributions
CH contributed to the methodology, the experiments, and the main manuscript writing. JL contributed to the supervision and to manuscript review and revision. XL contributed to manuscript review and revision. ML and HH contributed to the curation of data resources.
Funding
This work was supported by the National Key Research and Development Program of China (No. 2021YFC2801001) and the Major Research Plan of the National Social Science Foundation of China (No. 20&ZD130).
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hu, C., Liu, J., Li, X. et al. MST-ARGCN: modality-squeeze transformer with attentional recurrent graph capsule network for multimodal sentiment analysis. J Supercomput 81, 86 (2025). https://doi.org/10.1007/s11227-024-06588-7