Abstract
One of the most appealing multidisciplinary research areas in Artificial Intelligence (AI) is Sentiment Analysis (SA). Owing to the intricate and complementary interactions among several modalities, Multimodal Sentiment Analysis (MSA) is a highly challenging task with a wide range of applications. Numerous deep learning models and techniques have been proposed for multimodal sentiment analysis, but they do not investigate the explicit context of words and cannot model the diverse components of a sentence; hence, the full potential of such diverse data has not been explored. In this research, a Context-Sensitive Multi-Tier Deep Learning Framework (CS-MDF) is proposed for sentiment analysis on multimodal data. CS-MDF uses a three-tier architecture to extract context-sensitive information. The first tier extracts unimodal features from the utterances, using a Convolutional Neural Network (CNN) for text-based features, a 3D-CNN for visual features, and the open-Source Media Interpretation by Large feature-space Extraction (openSMILE) toolkit for audio features; this tier ignores context-sensitive information while determining the features. CNNs are suitable for text data because they are particularly effective at identifying local patterns and dependencies. The second tier takes the features extracted from the first tier and uses a Bi-directional Gated Recurrent Unit (BiGRU) to extract context-sensitive unimodal features, comprehending inter-utterance links and uncovering contextual evidence. The outputs of the second tier are combined and passed to the third tier, which fuses the features from the different modalities and trains a single BiGRU that provides the final classification. This method applies the BiGRU to sequential data processing, exploiting the advantages of all modalities and capturing their interdependencies. Experimental results obtained on six real-life datasets (Flickr Images dataset, Multi-View Sentiment Analysis dataset, Getty Images dataset, Balanced Twitter for Sentiment Analysis dataset, CMU-MOSI dataset) show that the proposed CS-MDF model achieves better performance than ten state-of-the-art approaches, as validated by F1 score, precision, accuracy, and recall. An ablation study of the proposed framework demonstrates the viability of the design, and the Grad-CAM visualization technique is applied to visualize the aligned input image-text pairs learned by the proposed CS-MDF model.
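To make the tiered design easier to picture, the following is a minimal, hypothetical PyTorch sketch of tiers two and three. The class names (ContextTier, CSMDFSketch), feature dimensions, and hidden sizes are illustrative assumptions rather than the authors' reported configuration, and tier-one extraction (CNN text features, 3D-CNN visual features, openSMILE audio features) is assumed to have been performed offline, yielding one feature vector per utterance and modality.

# Hypothetical sketch of CS-MDF tiers 2-3; dimensions are assumptions.
import torch
import torch.nn as nn

class ContextTier(nn.Module):
    """Tier 2: BiGRU over the sequence of utterance-level unimodal features."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.bigru = nn.GRU(in_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):            # x: (batch, num_utterances, in_dim)
        out, _ = self.bigru(x)       # out: (batch, num_utterances, 2 * hidden_dim)
        return out

class CSMDFSketch(nn.Module):
    """Tiers 2-3 only: tier-1 text, visual and audio features are precomputed."""
    def __init__(self, text_dim, visual_dim, audio_dim, hidden_dim, num_classes):
        super().__init__()
        self.text_ctx = ContextTier(text_dim, hidden_dim)
        self.visual_ctx = ContextTier(visual_dim, hidden_dim)
        self.audio_ctx = ContextTier(audio_dim, hidden_dim)
        # Tier 3: a single BiGRU over the concatenated context-sensitive features
        self.fusion = nn.GRU(6 * hidden_dim, hidden_dim, batch_first=True,
                             bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, text, visual, audio):
        fused = torch.cat([self.text_ctx(text),
                           self.visual_ctx(visual),
                           self.audio_ctx(audio)], dim=-1)
        out, _ = self.fusion(fused)
        return self.classifier(out)   # utterance-level sentiment logits

# Usage example: a batch of 8 videos, each with 20 utterances.
model = CSMDFSketch(text_dim=300, visual_dim=128, audio_dim=384,
                    hidden_dim=64, num_classes=2)
logits = model(torch.randn(8, 20, 300), torch.randn(8, 20, 128),
               torch.randn(8, 20, 384))
print(logits.shape)                   # torch.Size([8, 20, 2])

In this sketch, each modality is first contextualized independently (tier two) before a shared BiGRU models cross-modal interdependencies over the utterance sequence (tier three), mirroring the framework's separation of unimodal context modeling from multimodal fusion.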
Code availability
Not Applicable
Availability of data and materials
Not Applicable
Acknowledgements
This work is supported by the National Research Foundation (NRF) of Korea (Grant number: 2020R1A2C1012196).
Author information
Authors and Affiliations
Contributions
Conceptualization: Jothi Prakash V and Arul Antran Vijay S; Methodology: Jothi Prakash V and Arul Antran Vijay S; Formal analysis and investigation: GaneshKumar P, Anand Paul and Anand Nayyar; Writing - original draft preparation: Jothi Prakash V and Arul Antran Vijay S; Writing - review and editing: GaneshKumar P, Anand Paul and Anand Nayyar.
Corresponding authors
Ethics declarations
Competing interests
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
P, G.K., S, A.A.V., V, J.P. et al. A context-sensitive multi-tier deep learning framework for multimodal sentiment analysis. Multimed Tools Appl 83, 54249–54278 (2024). https://doi.org/10.1007/s11042-023-17601-1
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-023-17601-1