Abstract
The current study focuses on human emotion recognition from speech, and in particular on multilingual speech emotion recognition using Japanese, English, and German emotional corpora. The proposed method exploits conditional random fields (CRF) classifiers in a two-level classification scheme. Specifically, in the first level, the language spoken is identified, and in the second level, speech emotion recognition is carried out using emotion models specific to the identified language. In both levels, CRF classifiers fed with acoustic features are applied. The CRF classifier is a popular probabilistic method for structured prediction and is widely applied in natural language processing, computer vision, and bioinformatics. In the current study, the use of CRF for speech emotion recognition when limited training data are available is experimentally investigated. The results obtained show the effectiveness of CRF when only a small amount of training data is available and methods based on deep neural networks (DNNs) are less effective. Furthermore, the proposed method is compared with two popular classifiers, namely support vector machines (SVM) and probabilistic linear discriminant analysis (PLDA), and achieved higher accuracy than both. For the classification of four emotions (i.e., neutral, happy, angry, sad), the proposed method based on CRF achieved classification rates of 93.8% for English, 95.0% for German, and 88.8% for Japanese. These results are very promising, and superior to the results reported in other similar studies on multilingual or even monolingual speech emotion recognition.
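The abstract does not describe an implementation, so the following is only a minimal sketch of the two-level scheme under stated assumptions: the third-party sklearn-crfsuite package, synthetic frame-level feature dictionaries standing in for the real acoustic features, and a majority vote over frame predictions as the utterance-level decision rule. None of these choices are confirmed by the paper.

```python
# Minimal sketch of the two-level CRF scheme (illustrative only; the
# paper does not specify a library, feature set, or decision rule).
import random
from collections import Counter

import sklearn_crfsuite  # pip install sklearn-crfsuite

LANGUAGES = ["en", "de", "ja"]
EMOTIONS = ["neutral", "happy", "angry", "sad"]

def make_utterance(lang, emo, n_frames=20):
    """Toy stand-in for a sequence of per-frame acoustic features;
    CRFsuite expects one feature dict per frame."""
    li, ei = LANGUAGES.index(lang), EMOTIONS.index(emo)
    return [{"f_lang": random.gauss(3.0 * li, 0.5),
             "f_emo": random.gauss(3.0 * ei, 0.5)}
            for _ in range(n_frames)]

def train_crf(utterances, labels):
    # Label every frame of an utterance with the utterance-level label.
    y = [[lab] * len(u) for u, lab in zip(utterances, labels)]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                               max_iterations=50)
    crf.fit(utterances, y)
    return crf

def classify(crf, utterance):
    # Utterance-level decision by majority vote over frame predictions.
    frame_labels = crf.predict_single(utterance)
    return Counter(frame_labels).most_common(1)[0][0]

random.seed(0)
train = [(make_utterance(l, e), l, e)
         for l in LANGUAGES for e in EMOTIONS for _ in range(5)]

# Level 1: a single CRF identifies the spoken language.
lang_crf = train_crf([u for u, l, _ in train], [l for _, l, _ in train])

# Level 2: one language-specific emotion CRF per language.
emo_crfs = {lang: train_crf([u for u, l, _ in train if l == lang],
                            [e for _, l, e in train if l == lang])
            for lang in LANGUAGES}

# Two-level decision on an unseen utterance.
utt = make_utterance("de", "angry")
lang = classify(lang_crf, utt)            # first level: language ID
emotion = classify(emo_crfs[lang], utt)   # second level: emotion model
print(lang, emotion)
```

The point the sketch illustrates is the routing structure: each second-level emotion model is trained only on data from its own language, so the first-level language decision determines which emotion model scores the utterance.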
Copyright information
© 2023 Springer Nature Switzerland AG
About this paper
Cite this paper
Heracleous, P., Yasuda, K., Yoneyama, A. (2023). Multilingual Speech Emotion Recognition on Japanese, English, and German. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13452. Springer, Cham. https://doi.org/10.1007/978-3-031-24340-0_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24339-4
Online ISBN: 978-3-031-24340-0