Multi-Label Speech Emotion Recognition via Inter-Class Difference Loss Under Response Residual Network
Abstract
Speech emotion recognition has long been a challenging task because of differences in how emotions are expressed and perceived. In current supervised speech emotion recognition systems, soft labels overcome the drawback of hard labels, which discard annotation variability and the subjectivity of emotion perception; however, soft labels reflect the perceptions of only a few annotators and therefore still carry high statistical error. To address this issue, this paper redefines the training target and designs a novel loss function, termed the inter-class difference loss, which enables the network to adaptively learn an emotion distribution over all utterances. This loss not only constrains each negative-class probability to be smaller than the positive-class probability, but also pushes the negative-class probabilities toward zero. To make the speech emotion recognition system more efficient, this paper proposes an end-to-end network, called the response residual network (R-ResNet), which combines a ResNet for feature extraction with an emotion response module for data augmentation and variable-length data processing. The experimental results not only demonstrate the strong performance of our approach but also confirm that ambiguous utterances carry emotional characteristics. A further interesting finding is that, on the unbalanced dataset, placing batch normalization (BN) after the residual addition performs better than placing it before the addition.
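As a concrete illustration of the idea sketched in the abstract, the snippet below is a minimal PyTorch sketch of one way an inter-class difference loss could be written, assuming softmax class probabilities and multi-hot annotator labels with at least one positive class per utterance. The margin, the negative-class weight, and the exact penalty terms are assumptions made for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def inter_class_difference_loss(logits, multi_hot, margin=0.0, neg_weight=1.0):
    """Illustrative multi-label loss in the spirit of the abstract.

    Two hedged assumptions (the paper's exact formulation is not given here):
      1. every negative-class probability should stay below the smallest
         positive-class probability (a margin ranking term), and
      2. negative-class probabilities should be pushed toward zero
         (an L1-style penalty on the negative probabilities).

    logits:    (batch, num_classes) raw network outputs
    multi_hot: (batch, num_classes) 1 for emotions the annotators chose, else 0
    """
    probs = torch.softmax(logits, dim=-1)

    pos_mask = multi_hot.bool()
    neg_mask = ~pos_mask

    # Smallest positive-class probability per utterance.
    pos_min = probs.masked_fill(neg_mask, 1.0).min(dim=-1, keepdim=True).values

    # Term 1: penalize negatives that rise above (pos_min - margin).
    ranking = F.relu(probs - pos_min + margin).masked_fill(pos_mask, 0.0)

    # Term 2: push negative-class probabilities toward zero.
    shrink = probs.masked_fill(pos_mask, 0.0)

    return (ranking.sum(dim=-1) + neg_weight * shrink.sum(dim=-1)).mean()
```

In practice the margin and the weight on the shrinkage term would be tuned on a validation set; the sketch simply makes the two constraints stated in the abstract explicit. Similarly, the residual-block sketch below illustrates the BN-after-addition placement that the abstract reports working better on the unbalanced dataset; the layer sizes and the single-BN layout are assumptions, not the R-ResNet architecture itself.

```python
import torch.nn as nn

class ResidualBlockBNAfterAdd(nn.Module):
    """Residual block with batch normalization applied after the skip
    addition (the placement reported to work better on the unbalanced
    dataset); `ch` is a hypothetical channel count."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1, bias=False)
        self.bn_after = nn.BatchNorm2d(ch)  # normalizes conv output + shortcut
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.conv2(out)
        # BN after addition: add the shortcut first, then normalize.
        # The contrasted variant would apply BN to `out` before `out + x`.
        return self.relu(self.bn_after(out + x))
```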
Publication Information
Published: 01 January 2023, IEEE Press. Research article.
1520-9210 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.