research-article

Adversarial alignment and graph fusion via information bottleneck for multimodal emotion recognition in conversations

Published: 18 October 2024

Abstract

With the rapid development of social media and human–computer interaction, multimodal emotion recognition in conversations (MERC) has received widespread research attention. The MERC task is to extract and fuse complementary semantic information from different modalities in order to classify the speaker's emotion. However, existing feature fusion methods usually map the features of each modality directly into the same feature space for information fusion, which cannot eliminate the heterogeneity between modalities and makes it more difficult to learn the subsequent emotion class boundaries. In addition, existing graph contrastive learning methods obtain consistent feature representations by maximizing the mutual information between multiple views, which may lead to overfitting. To tackle these problems, we propose a novel Adversarial Alignment and Graph Fusion via Information Bottleneck (AGF-IB) method for multimodal emotion recognition in conversations. Firstly, we feed the video, audio, and text features into multi-layer perceptrons (MLPs) to map them into separate feature spaces. Secondly, we build a generator and a discriminator for each of the three modalities and use adversarial representation learning to enable information interaction between modalities and eliminate inter-modal heterogeneity. Thirdly, we introduce graph contrastive representation learning to capture intra-modal and inter-modal complementary semantic information and to learn intra-class and inter-class boundary information of the emotion categories. Furthermore, instead of maximizing the mutual information (MI) between multiple views, we use information bottleneck theory to minimize the MI between views. Specifically, we construct a graph structure for each of the three modalities and perform contrastive representation learning on nodes with different emotions within the same modality and on nodes with the same emotion across different modalities, which improves the representation ability of the nodes. Finally, we use an MLP to classify the speaker's emotion. Extensive experiments show that AGF-IB improves emotion recognition accuracy on the IEMOCAP and MELD datasets. Furthermore, since AGF-IB is a general multimodal fusion and contrastive learning method, it can be applied to other multimodal tasks, e.g., humor detection, in a plug-and-play manner.
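
The adversarial alignment step described above can be illustrated with a short sketch. The following is a minimal PyTorch-style sketch, not the authors' implementation: it assumes pre-extracted utterance features for text, audio, and video, maps each modality into a shared space with a modality-specific MLP (the "generator"), and trains a modality discriminator adversarially so that the aligned features stop carrying modality identity. All module names, feature dimensions, and the use of a single shared discriminator with a uniform-target generator loss are illustrative assumptions.

```python
# Minimal sketch (assumed details, not the paper's code): adversarial alignment of
# text / audio / video utterance features into a shared space.
import torch
import torch.nn as nn

class ModalityEncoder(nn.Module):
    """Modality-specific MLP ('generator') mapping raw features to a shared space."""
    def __init__(self, in_dim: int, shared_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, shared_dim))

    def forward(self, x):
        return self.net(x)

class ModalityDiscriminator(nn.Module):
    """Predicts which modality an aligned feature came from (3-way classification)."""
    def __init__(self, shared_dim: int = 128, num_modalities: int = 3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(shared_dim, 64), nn.ReLU(), nn.Linear(64, num_modalities))

    def forward(self, z):
        return self.net(z)

# Assumed input dimensions for pre-extracted text / audio / video features.
enc_t, enc_a, enc_v = ModalityEncoder(768), ModalityEncoder(100), ModalityEncoder(512)
disc = ModalityDiscriminator()
ce = nn.CrossEntropyLoss()

opt_enc = torch.optim.Adam(
    list(enc_t.parameters()) + list(enc_a.parameters()) + list(enc_v.parameters()), lr=1e-4)
opt_disc = torch.optim.Adam(disc.parameters(), lr=1e-4)

x_t, x_a, x_v = torch.randn(32, 768), torch.randn(32, 100), torch.randn(32, 512)  # one toy batch

for step in range(5):
    z_t, z_a, z_v = enc_t(x_t), enc_a(x_a), enc_v(x_v)
    z = torch.cat([z_t, z_a, z_v], dim=0)
    labels = torch.cat([torch.full((32,), m, dtype=torch.long) for m in range(3)])  # true modality ids

    # 1) Discriminator step: learn to tell the modalities apart.
    d_loss = ce(disc(z.detach()), labels)
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # 2) Encoder ('generator') step: fool the discriminator so the shared space no
    #    longer carries modality identity, reducing inter-modal heterogeneity.
    #    Soft uniform targets are one common choice (needs PyTorch >= 1.10).
    logits = disc(z)
    uniform = torch.full_like(logits, 1.0 / 3)
    g_loss = nn.functional.cross_entropy(logits, uniform)
    opt_enc.zero_grad(); g_loss.backward(); opt_enc.step()
```

In a full MERC pipeline, the aligned features would then feed the graph construction, contrastive learning, and final MLP classifier described in the abstract; those stages are omitted here.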

Highlights

A multimodal emotion recognition architecture based on adversarial alignment and graph fusion is proposed.
A cross-modal feature alignment method with adversarial learning is designed to eliminate inter-modal heterogeneity.
A graph contrastive learning method via information bottleneck is proposed to enhance multimodal semantic association (a loss sketch follows this list).
Our method can be applied to other multimodal tasks in a plug-and-play manner, e.g., humor detection.
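
As a rough illustration of the third highlight, the sketch below shows one way a graph contrastive term can be combined with an information-bottleneck-style regularizer: node embeddings sharing an emotion label (within and across modalities) are pulled together, while a variational KL term compresses each node's representation rather than maximizing cross-view mutual information. This is an assumed, simplified formulation rather than the paper's exact objective; all function names, shapes, and the trade-off weight are illustrative.

```python
# Minimal sketch (assumed formulation, not the paper's exact objective):
# supervised cross-modal contrastive term + a variational IB-style compression term.
import torch
import torch.nn.functional as F

def supervised_contrastive(z: torch.Tensor, emotions: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Pull together node embeddings that share an emotion label, push apart the rest.

    z: (N, D) node embeddings from the per-modality graphs (all modalities stacked).
    emotions: (N,) integer emotion labels.
    """
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / tau                       # (N, N) scaled cosine similarities
    mask = emotions.unsqueeze(0) == emotions.unsqueeze(1)
    mask.fill_diagonal_(False)                  # exclude self-pairs from the positives
    logits = sim - torch.eye(len(z)) * 1e9      # remove self-similarity from the denominator
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_count = mask.sum(dim=1).clamp(min=1)
    return -(log_prob * mask).sum(dim=1).div(pos_count).mean()

def ib_compression(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL(q(z|x) || N(0, I)): a standard variational surrogate that limits how much
    information the node representation keeps about its input view."""
    return 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1).mean()

# Toy usage with assumed shapes: 3 modalities x 16 utterances, 128-d embeddings.
mu, logvar = torch.randn(48, 128), torch.zeros(48, 128)
z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # reparameterized node embeddings
emotions = torch.randint(0, 6, (16,)).repeat(3)          # same labels across the 3 views
beta = 1e-3                                              # illustrative trade-off weight
loss = supervised_contrastive(z, emotions) + beta * ib_compression(mu, logvar)
```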


Published In

Information Fusion, Volume 112, Issue C, December 2024, 818 pages

Publisher

Elsevier Science Publishers B.V., Netherlands


Author Tags

1. Adversarial representation learning
2. Feature fusion
3. Graph contrastive representation learning
4. Multimodal emotion recognition in conversations
5. Information bottleneck
