DOI: 10.1145/3503161.3547754
research-article

Disentangled Representation Learning for Multimodal Emotion Recognition

Published: 10 October 2022

Abstract

Multimodal emotion recognition aims to identify human emotions from text, audio, and visual modalities. Previous methods either explore correlations between different modalities or design sophisticated fusion strategies. However, a serious problem is that distribution gaps and information redundancy often exist across heterogeneous modalities, so the learned multimodal representations may be unrefined. Motivated by these observations, we propose a Feature-Disentangled Multimodal Emotion Recognition (FDMER) method, which learns common and private feature representations for each modality. Specifically, we design common and private encoders to project each modality into modality-invariant and modality-specific subspaces, respectively. The modality-invariant subspace aims to explore the commonality among different modalities and sufficiently reduce the distribution gap. The modality-specific subspaces attempt to enhance diversity and capture the unique characteristics of each modality. A modality discriminator is then introduced to guide the parameter learning of the common and private encoders in an adversarial manner. We impose modality consistency and disparity constraints by designing tailored losses for the above subspaces. Furthermore, we present a cross-modal attention fusion module that learns adaptive weights for obtaining effective multimodal representations. The final representation is used for different downstream tasks. Experimental results show that FDMER outperforms state-of-the-art methods on two multimodal emotion recognition benchmarks. Moreover, we further verify the effectiveness of our model via experiments on the multimodal humor detection task.
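
As a concrete illustration of the disentanglement idea described above, the following is a minimal PyTorch-style sketch, not the authors' released code: it assumes one shared common encoder, per-modality private encoders, and a modality discriminator trained through gradient reversal; all class names, layer sizes, and feature dimensions are placeholders.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass, negated gradient in the backward pass --
    # a standard trick for training encoders adversarially against a discriminator.
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

class DisentangledEncoders(nn.Module):
    # Hypothetical common/private encoders for text, audio, and visual features.
    # A single shared common encoder projects every modality into a
    # modality-invariant subspace; each modality keeps its own private encoder
    # for the modality-specific subspace. A modality discriminator on the common
    # features, trained through gradient reversal, pushes the common encoder to
    # discard modality identity.
    def __init__(self, in_dims, hid_dim):
        super().__init__()
        self.proj = nn.ModuleDict({m: nn.Linear(d, hid_dim) for m, d in in_dims.items()})
        self.common = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU())
        self.private = nn.ModuleDict(
            {m: nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.ReLU()) for m in in_dims})
        self.discriminator = nn.Linear(hid_dim, len(in_dims))

    def forward(self, inputs):
        common, private, logits = {}, {}, {}
        for m, x in inputs.items():
            h = self.proj[m](x)
            common[m] = self.common(h)        # modality-invariant features
            private[m] = self.private[m](h)   # modality-specific features
            logits[m] = self.discriminator(GradReverse.apply(common[m]))
        return common, private, logits

# Usage with made-up utterance-level feature sizes (batch of 8).
dims = {"text": 768, "audio": 74, "visual": 35}
enc = DisentangledEncoders(dims, hid_dim=128)
feats = {m: torch.randn(8, d) for m, d in dims.items()}
common, private, modality_logits = enc(feats)

The tailored consistency and disparity losses mentioned in the abstract (for example, pulling common features from different modalities together while keeping private features distinct from them) would be added on top of these outputs; they are omitted from this sketch.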

Supplementary Material

MP4 File (MM22-fp0048.mp4)
Recently, multimodal emotion recognition has become an active research area with essential applications. Previous methods either explore correlations between different modalities or design fusion strategies. However, the inherent heterogeneity across modalities often introduces information redundancy and a distribution gap, resulting in learned representations that may be unrefined. In this paper, we propose a feature-disentangled multimodal emotion recognition method that deals with modality heterogeneity by learning two distinct representations. The first is the common representation, which aims to explore the commonality and reduce the distribution gap. The second is the private representation, which attempts to enhance diversity and capture the unique characteristics of each modality. We also present a cross-modal attention fusion module that learns adaptive weights for obtaining effective representations. Experimental results show that our method outperforms previous methods on three benchmarks.
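
The cross-modal attention fusion module is described only at a high level here, so the sketch below shows one plausible reading, assuming standard multi-head attention over stacked modality representations followed by a learned softmax weighting; the class name, head count, and dimensions are illustrative rather than the paper's exact design.

import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    # Hypothetical fusion module: modality representations attend to each other
    # with multi-head attention, then a learned softmax weighting combines the
    # attended vectors into a single multimodal representation.
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.weight_proj = nn.Linear(dim, 1)   # adaptive per-modality weight

    def forward(self, reps):
        # reps: (batch, n_modalities, dim) -- one row per modality representation.
        attended, _ = self.attn(reps, reps, reps)                    # cross-modal attention
        weights = torch.softmax(self.weight_proj(attended), dim=1)   # (batch, M, 1)
        return (weights * attended).sum(dim=1)                       # fused: (batch, dim)

# Usage: fuse three 128-d modality representations for a batch of 8 utterances.
fusion = CrossModalAttentionFusion(dim=128)
reps = torch.stack([torch.randn(8, 128) for _ in range(3)], dim=1)
fused = fusion(reps)   # shape: (8, 128)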

Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN: 9781450392037
DOI: 10.1145/3503161
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 October 2022

Author Tags

  1. adversarial learning
  2. disentangled representation learning
  3. emotion recognition
  4. multimodal fusion

Qualifiers

  • Research-article

Funding Sources

  • Shanghai Municipal Science and Technology Major Project
  • National Natural Science Foundation of China
  • National Key R&D Program of China

Conference

MM '22

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (Last 12 months): 841
  • Downloads (Last 6 weeks): 105
Reflects downloads up to 05 Mar 2025

Cited By

  • (2025) A Social Media Dataset and H-GNN-Based Contrastive Learning Scheme for Multimodal Sentiment Analysis. Applied Sciences, 15(2): 636. DOI: 10.3390/app15020636. Online publication date: 10-Jan-2025
  • (2025) Fcdnet: Fuzzy Cognition-Based Dynamic Fusion Network for Multimodal Sentiment Analysis. IEEE Transactions on Fuzzy Systems, 33(1): 3-14. DOI: 10.1109/TFUZZ.2024.3407739. Online publication date: Jan-2025
  • (2025) Adaptive Multimodal Graph Integration Network for Multimodal Sentiment Analysis. IEEE Transactions on Audio, Speech and Language Processing, 33: 23-36. DOI: 10.1109/TASLP.2024.3507576. Online publication date: 2025
  • (2025) Multi-Level Contrastive Learning: Hierarchical Alleviation of Heterogeneity in Multimodal Sentiment Analysis. IEEE Transactions on Affective Computing, 16(1): 207-222. DOI: 10.1109/TAFFC.2024.3423671. Online publication date: Jan-2025
  • (2025) Dynamic Causal Disentanglement Model for Dialogue Emotion Detection. IEEE Transactions on Affective Computing, 16(1): 1-14. DOI: 10.1109/TAFFC.2024.3406710. Online publication date: Jan-2025
  • (2025) Decoupled Feature and Self-Knowledge Distillation for Speech Emotion Recognition. IEEE Access, 13: 33275-33285. DOI: 10.1109/ACCESS.2025.3542948. Online publication date: 2025
  • (2025) Global distilling framework with cognitive gravitation for multimodal emotion recognition. Neurocomputing, 622: 129306. DOI: 10.1016/j.neucom.2024.129306. Online publication date: Mar-2025
  • (2025) SDDA: A progressive self-distillation with decoupled alignment for multimodal image–text classification. Neurocomputing, 614: 128794. DOI: 10.1016/j.neucom.2024.128794. Online publication date: Jan-2025
  • (2025) DCCMA-Net: Disentanglement-based cross-modal clues mining and aggregation network for explainable multimodal fake news detection. Information Processing & Management, 62(4): 104089. DOI: 10.1016/j.ipm.2025.104089. Online publication date: Jul-2025
  • (2025) Decoupled cross-attribute correlation network for multimodal sentiment analysis. Information Fusion, 117: 102897. DOI: 10.1016/j.inffus.2024.102897. Online publication date: May-2025