Abstract
The current study focuses on human emotion recognition from speech, and in particular on multilingual speech emotion recognition using Japanese, English, and German emotional corpora. The proposed method exploits conditional random fields (CRF) classifiers in a two-level classification scheme. Specifically, in the first level, the language spoken is identified, and in the second level, speech emotion recognition is carried out using emotion models specific to the identified language. In both levels, CRF classifiers fed with acoustic features are applied. The CRF classifier is a popular probabilistic method for structured prediction and is widely applied in natural language processing, computer vision, and bioinformatics. In the current study, the use of CRF for speech emotion recognition when limited training data are available is experimentally investigated. The results obtained show the effectiveness of CRF when only a small amount of training data is available and methods based on deep neural networks (DNNs) are less effective. Furthermore, the proposed method is compared with two popular classifiers, namely support vector machines (SVM) and probabilistic linear discriminant analysis (PLDA), and achieved higher accuracy than both. For the classification of four emotions (i.e., neutral, happy, angry, sad), the proposed method based on CRF achieved classification rates of 93.8% for English, 95.0% for German, and 88.8% for Japanese. These results are very promising, and superior to the results reported in other similar studies on multilingual or even monolingual speech emotion recognition.
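The abstract does not describe an implementation, so the following is only a minimal sketch of the two-level scheme under stated assumptions: the third-party sklearn-crfsuite package, synthetic frame-level feature dictionaries standing in for the real acoustic features, and a majority vote over frame predictions as the utterance-level decision rule. None of these choices are confirmed by the paper.

```python
# Minimal sketch of the two-level CRF scheme (illustrative only; the
# paper does not specify a library, feature set, or decision rule).
import random
from collections import Counter

import sklearn_crfsuite  # pip install sklearn-crfsuite

LANGUAGES = ["en", "de", "ja"]
EMOTIONS = ["neutral", "happy", "angry", "sad"]

def make_utterance(lang, emo, n_frames=20):
    """Toy stand-in for a sequence of per-frame acoustic features;
    CRFsuite expects one feature dict per frame."""
    li, ei = LANGUAGES.index(lang), EMOTIONS.index(emo)
    return [{"f_lang": random.gauss(3.0 * li, 0.5),
             "f_emo": random.gauss(3.0 * ei, 0.5)}
            for _ in range(n_frames)]

def train_crf(utterances, labels):
    # Label every frame of an utterance with the utterance-level label.
    y = [[lab] * len(u) for u, lab in zip(utterances, labels)]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                               max_iterations=50)
    crf.fit(utterances, y)
    return crf

def classify(crf, utterance):
    # Utterance-level decision by majority vote over frame predictions.
    frame_labels = crf.predict_single(utterance)
    return Counter(frame_labels).most_common(1)[0][0]

random.seed(0)
train = [(make_utterance(l, e), l, e)
         for l in LANGUAGES for e in EMOTIONS for _ in range(5)]

# Level 1: a single CRF identifies the spoken language.
lang_crf = train_crf([u for u, l, _ in train], [l for _, l, _ in train])

# Level 2: one language-specific emotion CRF per language.
emo_crfs = {lang: train_crf([u for u, l, _ in train if l == lang],
                            [e for _, l, e in train if l == lang])
            for lang in LANGUAGES}

# Two-level decision on an unseen utterance.
utt = make_utterance("de", "angry")
lang = classify(lang_crf, utt)            # first level: language ID
emotion = classify(emo_crfs[lang], utt)   # second level: emotion model
print(lang, emotion)
```

The point the sketch illustrates is the routing structure: each second-level emotion model is trained only on data from its own language, so the first-level language decision determines which emotion model scores the utterance.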
Copyright information
© 2023 Springer Nature Switzerland AG
About this paper
Cite this paper
Heracleous, P., Yasuda, K., Yoneyama, A. (2023). Multilingual Speech Emotion Recognition on Japanese, English, and German. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13452. Springer, Cham. https://doi.org/10.1007/978-3-031-24340-0_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24339-4
Online ISBN: 978-3-031-24340-0