
A multimodal emotion recognition method based on multiple fusion of audio-visual modalities

Published: 22 May 2024
DOI: 10.1145/3638682.3638698

Abstract

Human emotions are expressed through multiple channels, including speech, facial expressions, and body language. Multimodal emotion recognition, however, faces recurring challenges: data from different modalities are difficult to align, their features are highly heterogeneous, and each modality contributes unevenly to the final prediction. To address the feature heterogeneity and the poor fusion quality that arise when modalities are merged naively, this paper proposes an end-to-end emotion recognition model that takes facial expressions and speech as input. The model incorporates two distinct modal fusion mechanisms whose core idea is to model the interaction between the two modalities, letting the features of each adapt to the other before being re-fused within the architecture for a better fusion result. Experiments show that both mechanisms significantly improve performance over previous methods, and the model remains effective when one of the modalities is missing.
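The abstract describes cross-modal interaction in which audio and facial-expression features adapt to each other before being re-fused, but it does not specify the layers involved. The following is a minimal, hypothetical sketch of one common way to realize such bidirectional interaction with cross-attention in PyTorch; the module name CrossModalFusion, the 256-dimensional features, and the eight-class output are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Audio features attend to visual features and vice versa, then the
    mutually adapted features are pooled and re-fused for classification."""

    def __init__(self, dim=256, num_heads=4, num_classes=8):
        super().__init__()
        # Bidirectional cross-attention: each modality queries the other.
        self.audio_to_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        # Second-stage "re-fusion" over the pooled, adapted features.
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_classes)
        )

    def forward(self, audio_feat, visual_feat):
        # audio_feat: (batch, T_audio, dim); visual_feat: (batch, T_visual, dim)
        a_adapted, _ = self.audio_to_visual(audio_feat, visual_feat, visual_feat)
        v_adapted, _ = self.visual_to_audio(visual_feat, audio_feat, audio_feat)
        a = self.norm_a(audio_feat + a_adapted).mean(dim=1)  # temporal pooling
        v = self.norm_v(visual_feat + v_adapted).mean(dim=1)
        return self.classifier(torch.cat([a, v], dim=-1))

if __name__ == "__main__":
    model = CrossModalFusion()
    audio = torch.randn(2, 100, 256)  # e.g. frame-level speech embeddings
    video = torch.randn(2, 30, 256)   # e.g. per-frame facial-expression embeddings
    print(model(audio, video).shape)  # torch.Size([2, 8])

In a sketch like this, substituting a learned placeholder sequence for an absent stream would let the same fusion head run with a single modality, which loosely mirrors the robustness to missing modalities reported in the abstract; the paper's actual mechanism may differ.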


Published In

VSIP '23: Proceedings of the 2023 5th International Conference on Video, Signal and Image Processing
November 2023
237 pages
ISBN: 9798400709272
DOI: 10.1145/3638682

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Deep Learning
  2. Feature Extraction
  3. Feature Fusion
  4. Multimodal Sentiment Analysis

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

VSIP 2023

