A context-sensitive multi-tier deep learning framework for multimodal sentiment analysis

Published in: Multimedia Tools and Applications

Abstract

Sentiment Analysis (SA) is one of the most appealing multidisciplinary research areas in Artificial Intelligence (AI). Owing to the intricate and complementary interactions between modalities, Multimodal Sentiment Analysis (MSA) is an extremely challenging task with a wide range of applications. Numerous deep learning models and techniques have been proposed for multimodal sentiment analysis, but they do not investigate the explicit context of words and cannot model the diverse components of a sentence; hence, the full potential of such diverse data has not been explored. In this research, a Context-Sensitive Multi-Tier Deep Learning Framework (CS-MDF) is proposed for sentiment analysis on multimodal data. The CS-MDF uses a three-tier architecture to extract context-sensitive information. The first tier extracts unimodal features from the utterances, ignoring context at this stage: a Convolutional Neural Network (CNN) extracts text-based features, since CNNs are particularly good at identifying local patterns and dependencies; a 3D-CNN extracts visual features; and the open-Source Media Interpretation by Large feature-space Extraction (openSMILE) toolkit extracts audio features. The second tier takes the features produced by the first tier and extracts context-sensitive unimodal features using a Bi-directional Gated Recurrent Unit (BiGRU), which captures inter-utterance links and uncovers contextual evidence. The outputs of the second tier are combined and passed to the third tier, which fuses the features from the different modalities and trains a single BiGRU that produces the final classification, applying the BiGRU to the sequential data so as to exploit the advantages of each modality and capture their interdependencies. Experimental results on six real-life datasets (the Flickr Images, Multi-View Sentiment Analysis, Getty Images, Balanced Twitter for Sentiment Analysis, and CMU-MOSI datasets) show that the proposed CS-MDF model achieves better performance than ten state-of-the-art approaches, as validated by the F1 score, precision, accuracy, and recall metrics. An ablation study carried out on the proposed framework demonstrates the viability of the design, and the GradCAM visualization technique is applied to visualize the aligned input image-text pairs learned by the CS-MDF model.
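To make the three-tier data flow concrete, below is a minimal PyTorch sketch of the architecture the abstract describes. It is an illustrative reconstruction, not the authors' implementation: the layer sizes, the 384-dimensional openSMILE vectors, the per-utterance feature shapes, and the concatenation-based fusion are all assumptions of this sketch. A small TextCNN module illustrates the tier-1 text branch; the 3D-CNN visual features and openSMILE audio features are assumed to be precomputed per utterance.

```python
# Minimal sketch of the CS-MDF three-tier flow in PyTorch.
# Layer sizes and the concatenation-based fusion are illustrative
# assumptions, not the configuration reported in the paper.
import torch
import torch.nn as nn


class TextCNN(nn.Module):
    """Tier 1 (text): 1D convolution over word embeddings of one utterance."""

    def __init__(self, embed_dim=300, n_filters=100, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)

    def forward(self, x):                     # x: (batch, seq_len, embed_dim)
        h = torch.relu(self.conv(x.transpose(1, 2)))
        return self.pool(h).squeeze(-1)       # (batch, n_filters)


class ContextBiGRU(nn.Module):
    """Tiers 2 and 3: BiGRU over the sequence of per-utterance features."""

    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                     # x: (batch, n_utterances, in_dim)
        out, _ = self.gru(x)
        return out                            # (batch, n_utterances, 2*hidden)


class CSMDFSketch(nn.Module):
    """Tier 2 per modality, then tier-3 fusion and classification."""

    def __init__(self, text_dim=100, video_dim=128, audio_dim=384, n_classes=2):
        super().__init__()
        self.text_ctx = ContextBiGRU(text_dim)    # context-sensitive text
        self.video_ctx = ContextBiGRU(video_dim)  # context-sensitive video
        self.audio_ctx = ContextBiGRU(audio_dim)  # context-sensitive audio
        self.fusion = ContextBiGRU(3 * 128)       # tier 3: fused BiGRU
        self.classifier = nn.Linear(2 * 64, n_classes)

    def forward(self, text_feats, video_feats, audio_feats):
        # Each input holds tier-1 unimodal features with shape
        # (batch, n_utterances, modality_dim); the video features would
        # come from a 3D-CNN and the audio features from openSMILE.
        t = self.text_ctx(text_feats)
        v = self.video_ctx(video_feats)
        a = self.audio_ctx(audio_feats)
        fused = torch.cat([t, v, a], dim=-1)        # (batch, n_utt, 3*128)
        return self.classifier(self.fusion(fused))  # per-utterance logits


# Demo with random tier-1 features: 2 videos, 10 utterances each.
model = CSMDFSketch()
logits = model(torch.randn(2, 10, 100), torch.randn(2, 10, 128),
               torch.randn(2, 10, 384))
print(logits.shape)                           # torch.Size([2, 10, 2])
```

The bidirectional GRU in tiers two and three lets each utterance representation be conditioned on both the preceding and the following utterances, which is the "context-sensitive" behaviour the abstract emphasizes.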



Code availability

Not Applicable

Availability of data and materials

Not Applicable


Acknowledgements

This work was supported by the National Research Foundation (NRF) of Korea under Grant 2020R1A2C1012196.

Author information

Contributions

Conceptualization: Jothi Prakash V and Arul Antran Vijay S; Methodology: Jothi Prakash V and Arul Antran Vijay S; Formal analysis and investigation: GaneshKumar P, Anand Paul and Anand Nayyar; Writing - original draft preparation: Jothi Prakash V and Arul Antran Vijay S; Writing - review and editing: GaneshKumar P, Anand Paul and Anand Nayyar.

Corresponding authors

Correspondence to Arul Antran Vijay S or Jothi Prakash V.

Ethics declarations

Competing interests

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

P, G.K., S, A.A.V., V, J.P. et al. A context-sensitive multi-tier deep learning framework for multimodal sentiment analysis. Multimed Tools Appl 83, 54249–54278 (2024). https://doi.org/10.1007/s11042-023-17601-1

