
AMSA: Adaptive Multimodal Learning for Sentiment Analysis

Published: 24 February 2023

Abstract

Efficient emotion recognition has attracted extensive research interest and enables new applications in many fields, such as human-computer interaction, disease diagnosis, and service robots. Although existing sentiment analysis work relying on sensors or unimodal methods performs well in simple contexts such as business recommendation and facial expression recognition, it falls far below expectations in complex scenes involving sarcasm, disdain, and metaphor. In this article, we propose a novel two-stage multimodal learning framework, called AMSA, which adaptively learns the correlation and complementarity between modalities for dynamic fusion, yielding more stable and precise sentiment analysis results. Specifically, in the first stage, a multiscale attention model with a slice positioning scheme is proposed to obtain stable sentiment quintuplets from images, text, and speech. In the second stage, a Transformer-based self-adaptive network is proposed to flexibly assign weights for multimodal fusion and to update the parameters of the loss function through compensation iteration. To quickly locate key regions for efficient affective computing, a patch-based selection scheme iteratively removes redundant information through a novel loss function before fusion. Extensive experiments have been conducted on both weakly machine-labeled and manually annotated datasets: our self-made Video-SA as well as CMU-MOSEI and CMU-MOSI. The results demonstrate the superiority of our approach over baselines.
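To make the second-stage idea concrete, the sketch below shows one way a Transformer-based fusion module can assign adaptive weights to text, image, and speech embeddings before prediction. This is a minimal illustrative example, not the authors' implementation: the module names, feature dimensions, and the softmax gating design are assumptions introduced here for clarity.

```python
# Illustrative sketch of adaptive multimodal fusion (assumed design, not AMSA's code).
import torch
import torch.nn as nn


class AdaptiveFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # One learnable gate per modality token; softmax turns gates into fusion weights.
        self.gate = nn.Linear(dim, 1)
        self.regressor = nn.Linear(dim, 1)  # sentiment intensity score

    def forward(self, text_feat, image_feat, audio_feat):
        # Each input: (batch, dim) unimodal embedding from a first-stage encoder.
        tokens = torch.stack([text_feat, image_feat, audio_feat], dim=1)  # (B, 3, dim)
        fused = self.encoder(tokens)                    # cross-modal context via self-attention
        weights = torch.softmax(self.gate(fused), dim=1)  # (B, 3, 1) adaptive modality weights
        pooled = (weights * fused).sum(dim=1)           # weighted fusion over modalities
        return self.regressor(pooled)


if __name__ == "__main__":
    model = AdaptiveFusion()
    t, v, a = (torch.randn(8, 256) for _ in range(3))
    print(model(t, v, a).shape)  # torch.Size([8, 1])
```

In this toy version, the per-modality weights are recomputed for every sample, so the contribution of each modality can shift with the input, which is the behavior the abstract attributes to the self-adaptive fusion stage.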


Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 3s
June 2023, 270 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3582887
Editor: Abdulmotaleb El Saddik

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 February 2023
Online AM: 01 December 2022
Accepted: 20 November 2022
Revised: 06 October 2022
Received: 17 July 2022
Published in TOMM Volume 19, Issue 3s

Author Tags

  1. Sentiment analysis
  2. multimodal fusion
  3. self-adaptive mechanism
  4. Transformer
  5. patch-based selection

Qualifiers

  • Research-article

Funding Sources

  • Natural Science Foundation of China
