Abstract
Multimodal sentiment analysis (MSA) is a hot topic in deep learning research. Despite the progress made in previous studies, existing methods based on the multimodal transformer (MulT) tend to be computationally expensive: they adopt six crossmodal transformers to extract the representation sequences, making no attempt to simplify the complicated computational procedure and ignoring the unequal contributions of the individual modalities. In addition, these methods cannot effectively model long-range dependencies, which degrades performance. To address these problems, we propose a modality-squeeze transformer with an attentional recurrent graph capsule network (MST-ARGCN) for MSA. It first squeezes the three modalities through low-rank fusion to obtain a multimodal fused vector. It then uses only one crossmodal transformer, with the multimodal fused vector as the source modality and the text as the target modality, to extract the representation sequence for the subsequent networks, which greatly reduces the number of network parameters. Furthermore, the ARGCN is introduced to strengthen the learning of long-range dependencies during the outer-loop graph aggregation stage, further improving performance. We evaluate our model on the CMU-MOSEI and CMU-MOSI datasets, and the experimental results show that it achieves competitive performance with low computational complexity.
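To make the pipeline described above easier to picture, the following is a minimal PyTorch-style sketch of the two central ideas: low-rank fusion that squeezes the text, audio, and visual features into a single fused vector, and a single crossmodal attention block that takes the fused vector as the source and the text sequence as the target. All module names, dimensions, and the rank are illustrative assumptions rather than the paper's actual implementation; classic low-rank fusion also appends a constant 1 to each modality vector, which is omitted here for brevity.

```python
import torch
import torch.nn as nn

class ModalitySqueeze(nn.Module):
    """Low-rank fusion of text/audio/visual features into one fused vector
    (simplified sketch; dimensions and rank are illustrative assumptions)."""
    def __init__(self, dim_t, dim_a, dim_v, dim_out, rank=4):
        super().__init__()
        # One rank-factor tensor per modality.
        self.factor_t = nn.Parameter(torch.randn(rank, dim_t, dim_out) * 0.01)
        self.factor_a = nn.Parameter(torch.randn(rank, dim_a, dim_out) * 0.01)
        self.factor_v = nn.Parameter(torch.randn(rank, dim_v, dim_out) * 0.01)

    def forward(self, t, a, v):
        # t, a, v: (batch, dim_*) pooled unimodal features
        ft = torch.einsum('bd,rdo->rbo', t, self.factor_t)
        fa = torch.einsum('bd,rdo->rbo', a, self.factor_a)
        fv = torch.einsum('bd,rdo->rbo', v, self.factor_v)
        # Element-wise product fuses the modalities; summing over the rank
        # dimension gives the low-rank approximation of the full tensor fusion.
        return (ft * fa * fv).sum(dim=0)            # (batch, dim_out)

class SingleCrossmodalBlock(nn.Module):
    """One crossmodal attention block: the fused vector serves as the source
    (keys/values) while the text sequence serves as the target (queries)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_seq, fused_vec):
        # text_seq: (batch, seq_len, dim); fused_vec: (batch, dim)
        src = fused_vec.unsqueeze(1)                # (batch, 1, dim)
        out, _ = self.attn(query=text_seq, key=src, value=src)
        return self.norm(text_seq + out)            # residual connection + norm
```

With dim_out matched to the text feature dimension, a single block of this kind stands in for the six pairwise crossmodal transformers used by MulT, which is the source of the parameter reduction mentioned in the abstract.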
Data Availability Statement
The datasets are openly available in a public repository.
Author contributions
CH contributed to the methodology, the experiments, and the main manuscript writing. JL contributed to the supervision and to manuscript review and revision. XL contributed to manuscript review and revision. ML and HH contributed to the curation of data resources.
Funding
This work was supported by the National Key Research and Development Program of China (No. 2021YFC2801001) and the Major Research Plan of the National Social Science Foundation of China (No. 20&ZD130).
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hu, C., Liu, J., Li, X. et al. MST-ARGCN: modality-squeeze transformer with attentional recurrent graph capsule network for multimodal sentiment analysis. J Supercomput 81, 86 (2025). https://doi.org/10.1007/s11227-024-06588-7