
MST-ARGCN: modality-squeeze transformer with attentional recurrent graph capsule network for multimodal sentiment analysis

  • Published:
The Journal of Supercomputing

Abstract

Multimodal sentiment analysis (MSA) is a hot topic in deep learning research. Despite the progress made in previous studies, existing methods based on the multimodal transformer (MulT) tend to be extremely expensive: they adopt six crossmodal transformers to extract representation sequences, rarely simplify these complicated computational procedures, and do not account for the unequal contributions of the individual modalities. In addition, these methods cannot effectively model long-range dependencies, which degrades performance. To address these problems, we propose a modality-squeeze transformer with an attentional recurrent graph capsule network (MST-ARGCN) for MSA. It first squeezes the three modalities through low-rank fusion to obtain a multimodal fused vector. It then uses only one crossmodal transformer, taking the multimodal fused vector as the source modality and the text as the target modality, to extract the representation sequence for subsequent networks, which greatly reduces the number of network parameters. In addition, the ARGCN is introduced to strengthen the learning of long-range dependencies during the outer-loop graph aggregation stage, further improving performance. We evaluate our model on the CMU-MOSEI and CMU-MOSI datasets. The experimental results show that our model achieves competitive performance with low computational complexity.
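To make the pipeline described in the abstract more concrete, below is a minimal sketch, not the authors' implementation, of the modality-squeeze step and the single crossmodal transformer block: the three unimodal feature vectors are squeezed by low-rank fusion into one multimodal fused vector, which then serves as the source (key/value) stream of a single cross-attention block whose target (query) stream is the text sequence. All module names, dimensions, and the rank hyperparameter are illustrative assumptions, and the ARGCN stage is not sketched here.

import torch
import torch.nn as nn


class LowRankFusion(nn.Module):
    # Squeezes text/audio/vision feature vectors into one fused vector via
    # low-rank factors (assumed form, in the spirit of low-rank multimodal fusion).
    def __init__(self, d_text, d_audio, d_vision, d_out, rank=4):
        super().__init__()
        # One rank-stacked projection per modality; the +1 accounts for the
        # constant 1 appended to each modality vector (bias trick).
        self.f_text = nn.Parameter(torch.randn(rank, d_text + 1, d_out) * 0.02)
        self.f_audio = nn.Parameter(torch.randn(rank, d_audio + 1, d_out) * 0.02)
        self.f_vision = nn.Parameter(torch.randn(rank, d_vision + 1, d_out) * 0.02)
        self.rank_weights = nn.Parameter(torch.randn(1, rank) * 0.02)
        self.bias = nn.Parameter(torch.zeros(1, d_out))

    def forward(self, t, a, v):
        ones = torch.ones(t.size(0), 1, device=t.device)
        t1 = torch.cat([t, ones], dim=-1)
        a1 = torch.cat([a, ones], dim=-1)
        v1 = torch.cat([v, ones], dim=-1)
        # Each matmul yields (rank, batch, d_out); the modalities are fused by
        # an elementwise product, then the rank dimension is mixed away.
        fused = torch.matmul(t1, self.f_text) \
            * torch.matmul(a1, self.f_audio) \
            * torch.matmul(v1, self.f_vision)
        out = torch.matmul(self.rank_weights, fused.permute(1, 0, 2)).squeeze(1)
        return out + self.bias  # (batch, d_out)


class CrossmodalBlock(nn.Module):
    # One cross-attention block: the text sequence (target/query) attends to the
    # fused multimodal vector (source/key-value), standing in for the six
    # crossmodal transformers of MulT with a single one (illustrative design).
    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_seq, fused_vec):
        src = fused_vec.unsqueeze(1)  # (batch, 1, d_model) source sequence
        out, _ = self.attn(query=text_seq, key=src, value=src)
        return self.norm(text_seq + out)  # residual connection + layer norm


if __name__ == "__main__":
    # Hypothetical feature sizes (e.g. GloVe text, COVAREP audio, Facet vision).
    batch, seq_len = 2, 20
    d_t, d_a, d_v, d_model = 300, 74, 35, 64
    squeeze = LowRankFusion(d_t, d_a, d_v, d_model)
    cross = CrossmodalBlock(d_model)
    text_seq = torch.randn(batch, seq_len, d_model)  # projected text sequence (placeholder)
    t, a, v = torch.randn(batch, d_t), torch.randn(batch, d_a), torch.randn(batch, d_v)
    fused = squeeze(t, a, v)        # (batch, d_model) multimodal fused vector
    rep = cross(text_seq, fused)    # (batch, seq_len, d_model) representation sequence
    print(rep.shape)

In the full model, the resulting representation sequence would then be passed to the ARGCN stage for graph capsule aggregation before the final sentiment prediction.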




Data Availability Statement

The datasets are openly available in a public repository.


Authors' contributions

CH contributed to the methodology, experiments and the main manuscript writing. JL contributed to the supervision and the manuscript review and revision. XL contributed to the manuscript review and revision. ML and HH contributed to the curation of data resources.

Funding

This work is supported by the National Key Research and Development Program of China (No. 2021YFC2801001) and the Major Research Plan of the National Social Science Foundation of China (No. 20&ZD130).

Author information


Corresponding author

Correspondence to Jin Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Hu, C., Liu, J., Li, X. et al. MST-ARGCN: modality-squeeze transformer with attentional recurrent graph capsule network for multimodal sentiment analysis. J Supercomput 81, 86 (2025). https://doi.org/10.1007/s11227-024-06588-7


  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11227-024-06588-7
