A multi-modal deep learning system for Arabic emotion recognition

  • Published:
International Journal of Speech Technology

Abstract

Emotion analysis is divided into emotion detection, where the system determines whether an emotional state is present, and emotion recognition, where the system identifies the label of that emotion. In this paper, we present a multimodal system for emotion detection and recognition on an Arabic dataset. We first evaluated audio and visual data as unimodal systems, and then examined the impact of integrating both information sources into one model. We also examined the effect of gender identification on performance. Our results show that identifying the speaker's gender beforehand increases the performance of emotion recognition, especially for models that rely on audio data. Comparing the audio-based system with the visual-based system demonstrates that each model performs better for specific emotion labels: 70% of the angry samples were predicted correctly by the audio model, compared with 63% by the visual model, whereas the accuracy for the surprise class was 40.6% with the audio model and 56.2% with the visual model. Combining both modalities improves accuracy. Our final multimodal system achieved 75% on the emotion detection task and 60.11% on the emotion recognition task; these results are among the top results reported in this field and the first to focus on Arabic content. The novelty of this work lies in exploiting deep learning and multimodal models for emotion analysis and in applying them to a natural audio-video dataset of Arabic-speaking persons.
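To make the fusion idea concrete, the sketch below shows one way a feature-level fusion of audio and visual representations, together with a gender flag, could be wired up in PyTorch. It is a minimal illustration, not the authors' architecture: the feature dimensions, layer sizes, emotion label set, and gender encoding are assumptions made only for this example.

```python
# Illustrative sketch only: a minimal feature-level fusion model in PyTorch.
# Not the authors' exact architecture; dimensions, labels, and the gender
# encoding below are assumptions made for demonstration.
import torch
import torch.nn as nn

EMOTIONS = ["angry", "happy", "sad", "surprise"]  # assumed label set


class AudioBranch(nn.Module):
    """Maps a precomputed audio feature vector (e.g., MFCC statistics) to an embedding."""
    def __init__(self, in_dim=40, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, emb_dim), nn.ReLU())

    def forward(self, x):
        return self.net(x)


class VisualBranch(nn.Module):
    """Maps a precomputed facial embedding (e.g., from a CNN) to an embedding."""
    def __init__(self, in_dim=512, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, emb_dim), nn.ReLU())

    def forward(self, x):
        return self.net(x)


class MultimodalEmotionClassifier(nn.Module):
    """Concatenates audio and visual embeddings plus a gender flag, then classifies."""
    def __init__(self, n_classes=len(EMOTIONS)):
        super().__init__()
        self.audio = AudioBranch()
        self.visual = VisualBranch()
        self.head = nn.Sequential(nn.Linear(64 + 64 + 1, 64), nn.ReLU(),
                                  nn.Linear(64, n_classes))

    def forward(self, audio_feats, visual_feats, gender_flag):
        fused = torch.cat([self.audio(audio_feats),
                           self.visual(visual_feats),
                           gender_flag], dim=1)
        return self.head(fused)  # raw logits; pair with CrossEntropyLoss for training


if __name__ == "__main__":
    model = MultimodalEmotionClassifier()
    audio = torch.randn(8, 40)                     # batch of 8 audio feature vectors
    visual = torch.randn(8, 512)                   # batch of 8 facial embeddings
    gender = torch.randint(0, 2, (8, 1)).float()   # 0 = female, 1 = male (assumed encoding)
    logits = model(audio, visual, gender)
    print(logits.shape)                            # torch.Size([8, 4])
```

The abstract's observation that the audio and visual models excel on different emotions (angry versus surprise) is exactly what motivates fusion of this kind: the concatenated representation lets the classifier lean on whichever modality is more informative for each class.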


References

  • Abdel-Hamid, L. (2020). Egyptian Arabic speech emotion recognition using prosodic, spectral and wavelet features. Speech Communication, 122, 19–30.

  • Al-Azani, S., & El-Alfy, E. S. M. (2017). Hybrid deep learning for sentiment polarity determination of Arabic microblogs. In International conference on neural information processing (pp. 491–500). Springer.

  • Alhumoud, S. O., Altuwaijri, M. I., Albuhairi, T. M., & Alohaideb, W. M. (2015). Survey on Arabic sentiment analysis in twitter. International Science Index, 9(1), 364–368.

  • Bal, E., Harden, E., Lamb, D., Van Hecke, A. V., Denver, J. W., & Porges, S. W. (2010). Emotion recognition in children with autism spectrum disorders: Relations to eye gaze and autonomic state. Journal of Autism and Developmental Disorders, 40(3), 358–370.

  • Bänziger, T., Grandjean, D., & Scherer, K. R. (2009). Emotion recognition from expressions in face, voice, and body: The Multimodal Emotion Recognition Test (MERT). Emotion, 9(5), 691.

  • Bänziger, T., & Scherer, K. R. (2010). Introducing the Geneva multimodal emotion portrayal (gemep) corpus. Blueprint for Affective Computing: A Sourcebook, 2010, 271–294.

  • Brave, S., & Nass, C. (2009). Emotion in human-computer interaction. Human-Computer Interaction Fundamentals, 20094635, 53–68.

  • Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee, C. M., Kazemzadeh, A., & Narayanan, S. (2004). Analysis of emotion recognition using facial expressions, speech and multimodal information. In Proceedings of the 6th international conference on multimodal interfaces (pp. 205–211).

  • Busso, C., Parthasarathy, S., Burmania, A., AbdelWahab, M., Sadoughi, N., & Provost, E. M. (2016). MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception. IEEE Transactions on Affective Computing, 8(1), 67–80.

  • Buyukyilmaz, M., & Cibikdiken, A. O. (2016). Voice gender recognition using deep learning. In 2016 international conference on modeling, simulation and optimization technologies and applications (MSOTA2016). Atlantis Press.

  • Cao, H., Cooper, D. G., Keutmann, M. K., Gur, R. C., Nenkova, A., & Verma, R. (2014). CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing, 5(4), 377–390.

  • Castellano, G., Kessous, L., & Caridakis, G. (2008). Emotion recognition through multiple modalities: face, body gesture, speech. In Affect and emotion in human-computer interaction (pp. 92–103). Springer.

  • Chen, H. B. (1998). Detection and transmission of facial expression for low speed web-based teaching (Doctoral dissertation, Thesis for Degree of Bachelor of Engineering, National University of Singapore).

  • Colnerič, N., & Demšar, J. (2018). Emotion recognition on Twitter: Comparative study and training a unison model. IEEE Transactions on Affective Computing, 11(3), 433–446.

  • De Silva, L. C., & Ng, P. C. (2000, March). Bimodal emotion recognition. In Proceedings fourth IEEE international conference on automatic face and gesture recognition (Cat. No. PR00580) (pp. 332–335). IEEE.

  • Dhall, A., Goecke, R., Lucey, S., & Gedeon, T. (2011). Acted facial expressions in the wild database. Australian National University, Canberra, Australia, Technical Report TR-CS-11, 2, 1.‏

  • Dupuis, K., & Pichora-Fuller, M. K. (2010). Toronto emotional speech set (TESS). University of Toronto.

  • Ebrahimi Kahou, S., Michalski, V., Konda, K., Memisevic, R., & Pal, C. (2015). Recurrent neural networks for emotion recognition in video. In Proceedings of the 2015 ACM international conference on multimodal interaction (pp. 467–474).

  • Ekman, P. (1992a). Are there basic emotions? Psychological Review, 99(3), 550–553.

  • Ekman, P. (1992b). An argument for basic emotions. Cognition & Emotion, 6(3–4), 169–200.

  • Engelmann, J. B., & Pogosyan, M. (2013). Emotion perception across cultures: The role of cognitive mechanisms. Frontiers in Psychology, 4, 118.

  • Eyben, F., Scherer, K. R., Schuller, B. W., Sundberg, J., André, E., Busso, C., & Truong, K. P. (2015). The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2), 190–202.

  • Grimm, M., Kroschel, K., & Narayanan, S. (2008, June). The Vera am Mittag German audio-visual emotional speech database. In 2008 IEEE international conference on multimedia and expo (pp. 865–868). IEEE.

  • Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., et al. (2014). Deep speech: Scaling up end-to-end speech recognition. arXiv preprint, arXiv:1412.5567

  • Hifny, Y., & Ali, A. (2019). Efficient Arabic emotion recognition using deep neural networks. In Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP-2019) (pp. 6710–6714).

  • Horvat, M., Popović, S., & Cosić, K. (2013). Multimedia stimuli databases usage patterns: A survey report. In The 36th international convention on information and communication technology, electronics and microelectronics (MIPRO) (pp. 993–997). IEEE.‏

  • Huang, Y., Tian, K., Wu, A., & Zhang, G. (2019). Feature fusion methods research based on deep belief networks for speech emotion recognition under noise condition. Journal of Ambient Intelligence and Humanized Computing, 10(5), 1787–1798.

  • Jack, R. E., Blais, C., Scheepers, C., Schyns, P. G., & Caldara, R. (2009). Cultural confusions show that facial expressions are not universal. Current Biology, 19(18), 1543–1548.

  • Jackson, P., & Haq, S. (2014). Surrey audio-visual expressed emotion (savee) database. University of Surrey.

  • Kadiri, S. R., Gangamohan, P., Mittal, V. K., & Yegnanarayana, B. (2014, December). Naturalistic audio-visual emotion database. In Proceedings of the 11th international conference on natural language processing (pp. 206–213).‏

  • Kahou, S. E., Bouthillier, X., Lamblin, P., Gulcehre, C., Michalski, V., Konda, K., & Ferrari, R. C. (2016). Emonets: Multimodal deep learning approaches for emotion recognition in video. Journal on Multimodal User Interfaces, 10(2), 99–111.

  • Kang, D., & Park, Y. (2014). Review-based measurement of customer satisfaction in mobile service: Sentiment analysis and VIKOR approach. Expert Systems with Applications, 41(4), 1041–1050.

  • Kanjo, E., Al-Husain, L., & Chamberlain, A. (2015). Emotions in context: Examining pervasive affective sensing systems, applications, and analyses. Personal and Ubiquitous Computing, 19(7), 1197–1212.

  • Kao, E. C. C., Liu, C. C., Yang, T. H., Hsieh, C. T., & Soo, V. W. (2009). Towards text-based emotion detection a survey and possible improvements. In 2009 International conference on information management and engineering (pp. 70–74). IEEE.

  • Kemper, T. D. (1981). Social constructionist and positivist approaches to the sociology of emotions. American Journal of Sociology, 87(2), 336–362.

  • Khasawneh, R. T., Wahsheh, H. A., Alsmadi, I. M., & Al-Kabi, M. N. (2015). Arabic sentiment polarity identification using a hybrid approach. In 2015 6th international conference on information and communication systems (ICICS) (pp. 148–153). IEEE.

  • Kim, Y., Moon, J., Sung, N. J., & Hong, M. (2019). Correlation between selected gait variables and emotion using virtual reality. Journal of Ambient Intelligence and Humanized Computing. https://doi.org/10.1007/s12652-019-01456-2

  • Klatt, D. H., & Klatt, L. C. (1990). Analysis, synthesis, and perception of voice quality variations among female and male talkers. The Journal of the Acoustical Society of America, 87(2), 820–857.

  • Klaylat, S., Osman, Z., Hamandi, L., & Zantout, R. (2018). Emotion recognition in Arabic speech. Analog Integrated Circuits and Signal Processing, 96(2), 337–351.

  • Koelstra, S., Muhl, C., Soleymani, M., Lee, J. S., Yazdani, A., Ebrahimi, T., & Patras, I. (2011). Deap: A database for emotion analysis; using physiological signals. IEEE Transactions on Affective Computing, 3(1), 18–31.

  • Kołakowska, A., Landowska, A., Szwoch, M., Szwoch, W., & Wrobel, M. R. (2014). Emotion recognition and its applications. In Human-computer systems interaction: Backgrounds and applications 3 (pp. 51–62). Springer.

  • Koolagudi, S. G., Maity, S., Kumar, V. A., Chakrabarti, S., & Rao, K. S. (2009). IITKGP-SESC: speech database for emotion analysis. In International conference on contemporary computing (pp. 485–492). Springer.

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1097–1105.

  • Legge, J. (1885). The sacred books of China, the texts of Confucianism. Translated by James Legge. Oxford: Clarendon Press.

  • Li, M., Han, K. J., & Narayanan, S. (2013). Automatic speaker age and gender recognition using acoustic and prosodic level information fusion. Computer Speech & Language, 27(1), 151–167.

  • Liu, M., Wang, R., Li, S., Shan, S., Huang, Z., & Chen, X. (2014). Combining multiple kernel methods on riemannian manifold for emotion recognition in the wild. In Proceedings of the 16th international conference on multimodal interaction (pp. 494–501). https://doi.org/10.1145/2663204.2666274.

  • Liu, Y., Sourina, O., & Nguyen, M. K. (2011). Real-time EEG-based emotion recognition and its applications. In Transactions on computational science XII (pp. 256–277). Springer.

  • Livingstone, S. R., & Russo, F. A. (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE, 13(5), e0196391.

  • Martin, O., Kotsia, I., Macq, B., & Pitas, I. (2006). The eNTERFACE'05 audio-visual emotion database. In 22nd international conference on data engineering workshops (ICDEW'06) (pp. 8–8). https://doi.org/10.1109/ICDEW.2006.145.

  • Mattila, A. S., & Enz, C. A. (2002). The role of emotions in service encounters. Journal of Service Research, 4(4), 268–277.

  • McKeown, G., Valstar, M., Cowie, R., Pantic, M., & Schroder, M. (2011). The semaine database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Transactions on Affective Computing, 3(1), 5–17.

  • Meddeb, M., Karray, H., & Alimi, A. M. (2015). Speech emotion recognition based on Arabic features. In 2015 15th international conference on intelligent systems design and applications (ISDA) (pp. 46–51). IEEE.‏

  • Najar, D., & Mesfar, S. (2017). Opinion mining and sentiment analysis for Arabic on-line texts: Application on the political domain. International Journal of Speech Technology, 20(3), 575–585.

  • Paleari, M., Huet, B., & Chellali, R. (2010, July). Towards multimodal emotion recognition: a new approach. In Proceedings of the ACM international conference on image and video retrieval (pp. 174–181).

  • Parmar, D. N., & Mehta, B. B. (2014). Face recognition methods & applications. arXiv preprint arXiv:1403.0485.

  • Petrushin, V. (1999). Emotion in speech: Recognition and application to call centers. In Proceedings of artificial neural networks in engineering, pp. 7–10.

  • Petrushin, V. A. (2000). Emotion recognition in speech signal: experimental study, development, and application. In Sixth international conference on spoken language processing.

  • Plutchik, R. (1984). Emotions: A general psychoevolutionary theory. In K. R. Scherer & P. Ekman (Eds.), Approaches to emotion (pp. 197–219). Erlbaum.

  • Plutchik, R. (2001). The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. American Scientist, 89(4), 344–350.

  • Ranganathan, H., Chakraborty, S., & Panchanathan, S. (2016). Multimodal emotion recognition using deep learning architectures. In 2016 IEEE winter conference on applications of computer vision (WACV) (pp. 1–9). IEEE.

  • Ringeval, F., Sonderegger, A., Sauer, J., & Lalanne, D. (2013). Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In 2013 10th IEEE international conference and workshops on automatic face and gesture recognition (FG) (pp. 1–8). IEEE.‏

  • Sawada, L. O., Mano, L. Y., Neto, J. R. T., & Ueyama, J. (2019). A module-based framework to emotion recognition by speech: A case study in clinical simulation. Journal of Ambient Intelligence and Humanized Computing. https://doi.org/10.1007/s12652-019-01280-8

  • Shaqra, F. A., Duwairi, R., & Al-Ayyoub, M. (2019a). Recognizing emotion from speech based on age and gender using hierarchical models. Procedia Computer Science, 151, 37–44.

  • Shaqra, F. A., Duwairi, R., & Al-Ayyoub, M. (2019b, August). The audio-visual Arabic dataset for natural emotions. In 2019 7th international conference on future internet of things and cloud (FiCloud) (pp. 324–329). IEEE.

  • Soleymani, M., Chanel, G., Kierkels, J. J., & Pun, T. (2008). Affective characterization of movie scenes based on multimedia content analysis and user's physiological emotional responses. In 2008 Tenth IEEE international symposium on multimedia (pp. 228–235). IEEE.‏

  • Soleymani, M., Lichtenauer, J., Pun, T., & Pantic, M. (2011). A multimodal database for affect recognition and implicit tagging. IEEE Transactions on Affective Computing, 3(1), 42–55.

  • Suarez, M. T., Cu, J., & Sta, M. (2012). Building a multimodal laughter database for emotion recognition. In LREC, (pp. 2347–2350).‏

  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, (pp. 1–9).‏

  • Titze, I. R. (1989). Physiologic and acoustic differences between male and female voices. The Journal of the Acoustical Society of America, 85(4), 1699–1707.

  • Tokuno, S., Tsumatori, G., Shono, S., Takei, E., Yamamoto, T., Suzuki, G., & Shimura, M. (2011). Usage of emotion recognition in military health care. In 2011 defense science research conference and expo (DSR) (pp. 1–5). IEEE.

  • Tzirakis, P., Trigeorgis, G., Nicolaou, M. A., Schuller, B. W., & Zafeiriou, S. (2017). End-to-end multimodal emotion recognition using deep neural networks. IEEE Journal of Selected Topics in Signal Processing, 11(8), 1301–1309.

  • Wang, Y. (2019). Multimodal emotion recognition algorithm based on edge network emotion element compensation and data fusion. Personal and Ubiquitous Computing, 23(3–4), 383–392.

  • Wu, C. H., Lin, J. C., & Wei, W. L. (2014). Survey on audiovisual emotion recognition: Databases, features, and data fusion strategies. APSIPA Transactions on Signal and Information Processing. https://doi.org/10.1017/ATSIP.2014.11

  • Xie, B., Sidulova, M., & Park, C. H. (2021). Robust multimodal emotion recognition from conversation with transformer-based crossmodality fusion. Sensors, 21(14), 4913.

  • Yu, Z., & Zhang, C. (2015). Image based static facial expression recognition with multiple deep network learning. In Proceedings of the 2015 ACM on international conference on multimodal interaction (pp. 435–442).‏

  • Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In European conference on computer vision (pp. 818–833). Springer.‏

  • Zeng, Z., Pantic, M., Roisman, G. I., & Huang, T. S. (2008). A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1), 39–58.

  • Song, Z. (2008). An assessment of James Legge's translation of culturally-loaded words in the Book of Rites. Journal of Sanming University, (pp. 301–30).

Acknowledgements

The AVANEmo dataset used in this manuscript was developed by the same authors; it was published in a conference paper, which is cited in the current manuscript as Shaqra et al. (2019b).

Author information

Corresponding author

Correspondence to Rehab Duwairi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Abu Shaqra, F., Duwairi, R. & Al-Ayyoub, M. A multi-modal deep learning system for Arabic emotion recognition. Int J Speech Technol 26, 123–139 (2023). https://doi.org/10.1007/s10772-022-09981-w

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10772-022-09981-w

Keywords
