
Multi-Label Speech Emotion Recognition via Inter-Class Difference Loss Under Response Residual Network

Published: 01 January 2023

Abstract

Speech emotion recognition has always been a challenging task due to differences in how emotion is expressed and perceived. In current supervised speech emotion recognition systems, the soft label overcomes the disadvantage of the hard label, which discards annotation variability and the subjectivity of emotion perception; however, the soft label considers only the emotion perceptions of a few annotators and therefore still carries high statistical error. To address this issue, this paper redefines the target and designs a novel loss function (denoted the inter-class difference loss), which enables the network to adaptively learn an emotion distribution over all utterances. This not only restricts each negative class probability to be less than the positive class probability, but also pushes the negative class probabilities toward zero. To make the speech emotion recognition system more efficient, this paper proposes an end-to-end network, called the response residual network (R-ResNet), which incorporates a ResNet for feature extraction together with an emotion response module for data augmentation and variable-length data processing. Finally, the experimental results not only demonstrate the advanced performance of our work, but also confirm that ambiguous utterances contain emotional characteristics. In addition, another interesting finding is that, on the unbalanced dataset, batch normalization (BN) placed after the addition performs better than BN placed before the addition.
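The abstract describes the inter-class difference loss only at a high level: each negative class probability should stay below the positive class probability, and the negative class probabilities should be driven toward zero. The PyTorch sketch below is one possible reading of those two constraints, not the paper's actual formulation; the function name, the margin parameter, and the use of a softmax over emotion classes are assumptions introduced here for illustration.

```python
import torch
import torch.nn.functional as F

def inter_class_difference_loss(logits, positive_mask, margin=0.0):
    """Hypothetical sketch of an inter-class-difference-style loss.

    logits:        (batch, num_emotions) raw network outputs.
    positive_mask: (batch, num_emotions) float tensor, 1.0 for emotions
                   annotated as present (positive classes), 0.0 otherwise.
    margin:        optional slack term; not taken from the paper.
    """
    probs = torch.softmax(logits, dim=-1)

    # Smallest probability assigned to any positive class in each utterance.
    pos_min = torch.where(positive_mask.bool(), probs,
                          torch.ones_like(probs)).min(dim=-1, keepdim=True).values

    neg_mask = 1.0 - positive_mask
    neg_probs = probs * neg_mask

    # Constraint 1: penalize any negative class whose probability exceeds
    # the smallest positive-class probability.
    ordering_penalty = (F.relu(neg_probs - pos_min + margin) * neg_mask).sum(dim=-1)

    # Constraint 2: push negative-class probabilities toward zero.
    suppression_penalty = neg_probs.sum(dim=-1)

    return (ordering_penalty + suppression_penalty).mean()
```

The abstract also reports that, on the unbalanced dataset, batch normalization placed after the skip-connection addition outperforms the conventional placement before the addition. The block below sketches what "BN after addition" means in a generic residual block; the class name, channel counts, and kernel sizes are illustrative and are not taken from the R-ResNet architecture itself.

```python
import torch.nn as nn

class ResidualBlockBNAfterAdd(nn.Module):
    """Generic residual block with batch normalization applied after the
    skip-connection addition (illustrative, not the paper's exact block)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn_after_add = nn.BatchNorm2d(channels)  # BN moved after the addition
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.conv2(out)
        out = out + x                              # identity skip connection
        return self.relu(self.bn_after_add(out))   # BN applied after the addition
```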


Published In

IEEE Transactions on Multimedia, Volume 25, 2023
8932 pages

Publisher

IEEE Press

Publication History

Published: 01 January 2023

Qualifiers

  • Research-article
