Abstract
Emotion analysis is divided into emotion detection, where the system detects whether an emotional state is present, and emotion recognition, where the system identifies the label of the emotion. In this paper, we present a multimodal system for emotion detection and recognition using an Arabic dataset. We evaluated the performance of audio and visual data as unimodal systems and then examined the impact of integrating both information sources into one model. We also examined the effect of gender identification on performance. Our results show that identifying the speaker's gender beforehand increases the performance of emotion recognition, especially for models that rely on audio data. Comparing the audio-based system with the visual-based system demonstrates that each model performs better for specific emotional labels: 70% of the angry labels were predicted correctly by the audio model, compared with 63% by the visual model, whereas the accuracy for the surprise class was 40.6% with the audio model and 56.2% with the visual model. Combining both modalities improves accuracy. The final multimodal system achieved 75% accuracy for the emotion detection task and 60.11% for the emotion recognition task; these results are among the top achieved in this field and the first to focus on Arabic content. Specifically, the novelty of this work lies in exploiting deep learning and multimodal models for emotion analysis and applying them to a natural audio-visual dataset of Arabic-speaking persons.
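The system summarized above follows a late-fusion design: separate audio and visual models whose information is combined in a single classifier. The following minimal sketch illustrates that idea only; the feature dimensions (e.g., an 88-dimensional acoustic vector such as eGeMAPS and a 512-dimensional visual embedding), layer sizes, and six-class output are illustrative assumptions, not the authors' exact architecture.

```python
# Illustrative late-fusion sketch (an assumption, not the paper's exact model):
# two unimodal branches whose outputs are concatenated and classified jointly.
import torch
import torch.nn as nn

class LateFusionEmotionClassifier(nn.Module):
    def __init__(self, audio_dim=88, visual_dim=512, hidden=128, num_classes=6):
        super().__init__()
        # Unimodal branches over pre-extracted audio and visual features
        self.audio_branch = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.visual_branch = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
        # Fusion head: concatenate branch outputs, then predict the emotion label
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_classes)
        )

    def forward(self, audio_feats, visual_feats):
        fused = torch.cat([self.audio_branch(audio_feats),
                           self.visual_branch(visual_feats)], dim=-1)
        return self.classifier(fused)

# Toy usage: a batch of 4 utterances with random feature vectors.
model = LateFusionEmotionClassifier()
logits = model(torch.randn(4, 88), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 6])
```

A gender-aware variant, as studied in the paper, would add a preliminary gender-identification step and train or select such a classifier per gender.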
References
Abdel-Hamid, L. (2020). Egyptian Arabic speech emotion recognition using prosodic, spectral and wavelet features. Speech Communication, 122, 19–30.
Al-Azani, S., & El-Alfy, E. S. M. (2017). Hybrid deep learning for sentiment polarity determination of Arabic microblogs. In International conference on neural information processing (pp. 491–500). Springer.
Alhumoud, S. O., Altuwaijri, M. I., Albuhairi, T. M., & Alohaideb, W. M. (2015). Survey on Arabic sentiment analysis in twitter. International Science Index, 9(1), 364–368.
Bal, E., Harden, E., Lamb, D., Van Hecke, A. V., Denver, J. W., & Porges, S. W. (2010). Emotion recognition in children with autism spectrum disorders: Relations to eye gaze and autonomic state. Journal of Autism and Developmental Disorders, 40(3), 358–370.
Bänziger, T., Grandjean, D., & Scherer, K. R. (2009). Emotion recognition from expressions in face, voice, and body: The Multimodal Emotion Recognition Test (MERT). Emotion, 9(5), 691.
Bänziger, T., & Scherer, K. R. (2010). Introducing the Geneva multimodal emotion portrayal (GEMEP) corpus. Blueprint for Affective Computing: A Sourcebook, 2010, 271–294.
Brave, S., & Nass, C. (2009). Emotion in human-computer interaction. Human-Computer Interaction Fundamentals, 53–68.
Busso, C., Deng, Z., Yildirim, S., Bulut, M., Lee, C. M., Kazemzadeh, A., & Narayanan, S. (2004). Analysis of emotion recognition using facial expressions, speech and multimodal information. In Proceedings of the 6th international conference on multimodal interfaces (pp. 205–211).
Busso, C., Parthasarathy, S., Burmania, A., AbdelWahab, M., Sadoughi, N., & Provost, E. M. (2016). MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception. IEEE Transactions on Affective Computing, 8(1), 67–80.
Buyukyilmaz, M., & Cibikdiken, A. O. (2016). Voice gender recognition using deep learning. In 2016 international conference on modeling, simulation and optimization technologies and applications (MSOTA2016). Atlantis Press.
Cao, H., Cooper, D. G., Keutmann, M. K., Gur, R. C., Nenkova, A., & Verma, R. (2014). CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Transactions on Affective Computing, 5(4), 377–390.
Castellano, G., Kessous, L., & Caridakis, G. (2008). Emotion recognition through multiple modalities: face, body gesture, speech. In Affect and emotion in human-computer interaction (pp. 92–103). Springer.
Chen, H. B. (1998). Detection and transmission of facial expression for low speed web-based teaching (Bachelor of Engineering thesis, National University of Singapore).
Colnerič, N., & Demšar, J. (2018). Emotion recognition on twitter: Comparative study and training a unison model. IEEE Transactions on Affective Computing, 11(3), 433–446.
De Silva, L. C., & Ng, P. C. (2000, March). Bimodal emotion recognition. In Proceedings fourth IEEE international conference on automatic face and gesture recognition (Cat. No. PR00580) (pp. 332–335). IEEE.
Dhall, A., Goecke, R., Lucey, S., & Gedeon, T. (2011). Acted facial expressions in the wild database. Australian National University, Canberra, Australia, Technical Report TR-CS-11, 2, 1.
Dupuis, K., & Pichora-Fuller, M. K. (2010). Toronto emotional speech set (TESS). University of Toronto.
Ebrahimi Kahou, S., Michalski, V., Konda, K., Memisevic, R., & Pal, C. (2015). Recurrent neural networks for emotion recognition in video. In Proceedings of the 2015 ACM international conference on multimodal interaction (pp. 467–474).
Ekman, P. (1992a). Are there basic emotions? Psychological Review, 99(3), 550–553.
Ekman, P. (1992b). An argument for basic emotions. Cognition & Emotion, 6(3–4), 169–200.
Engelmann, J. B., & Pogosyan, M. (2013). Emotion perception across cultures: The role of cognitive mechanisms. Frontiers in Psychology, 4, 118.
Eyben, F., Scherer, K. R., Schuller, B. W., Sundberg, J., André, E., Busso, C., & Truong, K. P. (2015). The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2), 190–202.
Grimm, M., Kroschel, K., & Narayanan, S. (2008, June). The Vera am Mittag German audio-visual emotional speech database. In 2008 IEEE international conference on multimedia and expo (pp. 865–868). IEEE.
Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., et al. (2014). Deep speech: Scaling up end-to-end speech recognition. arXiv preprint, arXiv:1412.5567
Hifny, Y., & Ali, A. (2019). Efficient Arabic emotion recognition using deep neural networks. In Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP-2019) (pp. 6710–6714).
Horvat, M., Popović, S., & Cosić, K. (2013). Multimedia stimuli databases usage patterns: A survey report. In The 36th international convention on information and communication technology, electronics and microelectronics (MIPRO) (pp. 993–997). IEEE.
Huang, Y., Tian, K., Wu, A., & Zhang, G. (2019). Feature fusion methods research based on deep belief networks for speech emotion recognition under noise condition. Journal of Ambient Intelligence and Humanized Computing, 10(5), 1787–1798.
Jack, R. E., Blais, C., Scheepers, C., Schyns, P. G., & Caldara, R. (2009). Cultural confusions show that facial expressions are not universal. Current Biology, 19(18), 1543–1548.
Jackson, P., & Haq, S. (2014). Surrey audio-visual expressed emotion (savee) database. University of Surrey.
Kadiri, S. R., Gangamohan, P., Mittal, V. K., & Yegnanarayana, B. (2014, December). Naturalistic audio-visual emotion database. In Proceedings of the 11th international conference on natural language processing (pp. 206–213).
Kahou, S. E., Bouthillier, X., Lamblin, P., Gulcehre, C., Michalski, V., Konda, K., & Ferrari, R. C. (2016). Emonets: Multimodal deep learning approaches for emotion recognition in video. Journal on Multimodal User Interfaces, 10(2), 99–111.
Kang, D., & Park, Y. (2014). Review-based measurement of customer satisfaction in mobile service: Sentiment analysis and VIKOR approach. Expert Systems with Applications, 41(4), 1041–1050.
Kanjo, E., Al-Husain, L., & Chamberlain, A. (2015). Emotions in context: Examining pervasive affective sensing systems, applications, and analyses. Personal and Ubiquitous Computing, 19(7), 1197–1212.
Kao, E. C. C., Liu, C. C., Yang, T. H., Hsieh, C. T., & Soo, V. W. (2009). Towards text-based emotion detection a survey and possible improvements. In 2009 International conference on information management and engineering (pp. 70–74). IEEE.
Kemper, T. D. (1981). Social constructionist and positivist approaches to the sociology of emotions. American Journal of Sociology, 87(2), 336–362.
Khasawneh, R. T., Wahsheh, H. A., Alsmadi, I. M., & Al-Kabi, M. N. (2015). Arabic sentiment polarity identification using a hybrid approach. In 2015 6th international conference on information and communication systems (ICICS) (pp. 148–153). IEEE.
Kim, Y., Moon, J., Sung, N. J., & Hong, M. (2019). Correlation between selected gait variables and emotion using virtual reality. Journal of Ambient Intelligence and Humanized Computing. https://doi.org/10.1007/s12652-019-01456-2
Klatt, D. H., & Klatt, L. C. (1990). Analysis, synthesis, and perception of voice quality variations among female and male talkers. The Journal of the Acoustical Society of America, 87(2), 820–857.
Klaylat, S., Osman, Z., Hamandi, L., & Zantout, R. (2018). Emotion recognition in Arabic speech. Analog Integrated Circuits and Signal Processing, 96(2), 337–351.
Koelstra, S., Muhl, C., Soleymani, M., Lee, J. S., Yazdani, A., Ebrahimi, T., & Patras, I. (2011). Deap: A database for emotion analysis; using physiological signals. IEEE Transactions on Affective Computing, 3(1), 18–31.
Kołakowska, A., Landowska, A., Szwoch, M., Szwoch, W., & Wrobel, M. R. (2014). Emotion recognition and its applications. In Human-computer systems interaction: Backgrounds and applications 3 (pp. 51–62). Springer.
Koolagudi, S. G., Maity, S., Kumar, V. A., Chakrabarti, S., & Rao, K. S. (2009). IITKGP-SESC: speech database for emotion analysis. In International conference on contemporary computing (pp. 485–492). Springer.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 1097–1105.
Legge, J. (Trans.). (1885). The sacred books of China: The texts of Confucianism. Clarendon Press.
Li, M., Han, K. J., & Narayanan, S. (2013). Automatic speaker age and gender recognition using acoustic and prosodic level information fusion. Computer Speech & Language, 27(1), 151–167.
Liu, M., Wang, R., Li, S., Shan, S., Huang, Z., & Chen, X. (2014). Combining multiple kernel methods on riemannian manifold for emotion recognition in the wild. In Proceedings of the 16th international conference on multimodal interaction (pp. 494–501). https://doi.org/10.1145/2663204.2666274.
Liu, Y., Sourina, O., & Nguyen, M. K. (2011). Real-time EEG-based emotion recognition and its applications. In Transactions on computational science XII (pp. 256–277). Springer.
Livingstone, S. R., & Russo, F. A. (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE, 13(5), e0196391.
Martin, O., Kotsia, I., Macq, B., & Pitas, I. (2006). The eNTERFACE'05 audio-visual emotion database. In 22nd international conference on data engineering workshops (ICDEW'06) (pp. 8–8). https://doi.org/10.1109/ICDEW.2006.145.
Mattila, A. S., & Enz, C. A. (2002). The role of emotions in service encounters. Journal of Service Research, 4(4), 268–277.
McKeown, G., Valstar, M., Cowie, R., Pantic, M., & Schroder, M. (2011). The semaine database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Transactions on Affective Computing, 3(1), 5–17.
Meddeb, M., Karray, H., & Alimi, A. M. (2015). Speech emotion recognition based on Arabic features. In 2015 15th international conference on intelligent systems design and applications (ISDA) (pp. 46–51). IEEE.
Najar, D., & Mesfar, S. (2017). Opinion mining and sentiment analysis for Arabic on-line texts: Application on the political domain. International Journal of Speech Technology, 20(3), 575–585.
Paleari, M., Huet, B., & Chellali, R. (2010, July). Towards multimodal emotion recognition: a new approach. In Proceedings of the ACM international conference on image and video retrieval (pp. 174–181).
Parmar, D. N., & Mehta, B. B. (2014). Face recognition methods & applications. arXiv preprint arXiv:1403.0485.
Petrushin, V. (1999). Emotion in speech: Recognition and application to call centers. In Proceedings of artificial neural networks in engineering (pp. 7–10).
Petrushin, V. A. (2000). Emotion recognition in speech signal: experimental study, development, and application. In Sixth international conference on spoken language processing.
Plutchik, R. (1984). Emotions: A general psychoevolutionary theory. In K. R. Scherer & P. Ekman (Eds.), Approaches to emotion (pp. 197–219). Erlbaum.
Plutchik, R. (2001). The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. American Scientist, 89(4), 344–350.
Ranganathan, H., Chakraborty, S., & Panchanathan, S. (2016). Multimodal emotion recognition using deep learning architectures. In 2016 IEEE winter conference on applications of computer vision (WACV) (pp. 1–9). IEEE.
Ringeval, F., Sonderegger, A., Sauer, J., & Lalanne, D. (2013). Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In 2013 10th IEEE international conference and workshops on automatic face and gesture recognition (FG) (pp. 1–8). IEEE.
Sawada, L. O., Mano, L. Y., Neto, J. R. T., & Ueyama, J. (2019). A module-based framework to emotion recognition by speech: A case study in clinical simulation. Journal of Ambient Intelligence and Humanized Computing. https://doi.org/10.1007/s12652-019-01280-8
Shaqra, F. A., Duwairi, R., & Al-Ayyoub, M. (2019a). Recognizing emotion from speech based on age and gender using hierarchical models. Procedia Computer Science, 151, 37–44.
Shaqra, F. A., Duwairi, R., & Al-Ayyoub, M. (2019b, August). The Audio-Visual Arabic Dataset for Natural Emotions. In 2019 7th international conference on future internet of things and cloud (FiCloud) (pp. 324–329). IEEE.
Soleymani, M., Chanel, G., Kierkels, J. J., & Pun, T. (2008). Affective characterization of movie scenes based on multimedia content analysis and user's physiological emotional responses. In 2008 Tenth IEEE international symposium on multimedia (pp. 228–235). IEEE.
Soleymani, M., Lichtenauer, J., Pun, T., & Pantic, M. (2011). A multimodal database for affect recognition and implicit tagging. IEEE Transactions on Affective Computing, 3(1), 42–55.
Suarez, M. T., Cu, J., & Sta, M. (2012). Building a multimodal laughter database for emotion recognition. In LREC, (pp. 2347–2350).
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, (pp. 1–9).
Titze, I. R. (1989). Physiologic and acoustic differences between male and female voices. The Journal of the Acoustical Society of America, 85(4), 1699–1707.
Tokuno, S., Tsumatori, G., Shono, S., Takei, E., Yamamoto, T., Suzuki, G., & Shimura, M. (2011). Usage of emotion recognition in military health care. In 2011 defense science research conference and expo (DSR) (pp. 1–5). IEEE.
Tzirakis, P., Trigeorgis, G., Nicolaou, M. A., Schuller, B. W., & Zafeiriou, S. (2017). End-to-end multimodal emotion recognition using deep neural networks. IEEE Journal of Selected Topics in Signal Processing, 11(8), 1301–1309.
Wang, Y. (2019). Multimodal emotion recognition algorithm based on edge network emotion element compensation and data fusion. Personal and Ubiquitous Computing, 23(3–4), 383–392.
Wu, C. H., Lin, J. C., & Wei, W. L. (2014). Survey on audiovisual emotion recognition: Databases, features, and data fusion strategies. APSIPA Transactions on Signal and Information Processing. https://doi.org/10.1017/ATSIP.2014.11
Xie, B., Sidulova, M., & Park, C. H. (2021). Robust multimodal emotion recognition from conversation with transformer-based crossmodality fusion. Sensors, 21(14), 4913.
Yu, Z., & Zhang, C. (2015). Image based static facial expression recognition with multiple deep network learning. In Proceedings of the 2015 ACM on international conference on multimodal interaction (pp. 435–442).
Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In European conference on computer vision (pp. 818–833). Springer.
Zeng, Z., Pantic, M., Roisman, G. I., & Huang, T. S. (2008). A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1), 39–58.
Song, Z. (2008). An assessment of James Legge's translation of culturally-loaded words in the Book of Rites. Journal of Sanming University, 301–30.
Acknowledgements
The AVANEmo dataset used in this manuscript was developed by the same authors; it was published in a conference paper and is cited in the current manuscript as (Shaqra et al. 2019b).
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Abu Shaqra, F., Duwairi, R. & Al-Ayyoub, M. A multi-modal deep learning system for Arabic emotion recognition. Int J Speech Technol 26, 123–139 (2023). https://doi.org/10.1007/s10772-022-09981-w