
Multi-Label Speech Emotion Recognition via Inter-Class Difference Loss Under Response Residual Network

Published: 01 January 2023

Abstract

Speech emotion recognition has always been a challenging task due to differences in how emotion is expressed and perceived. In current supervised speech emotion recognition systems, the soft label overcomes the disadvantage of the hard label, which discards annotation variability and the subjectivity of emotion perception; however, the soft label considers only the emotion perceptions of a few annotators and therefore still carries high statistical error. To address this issue, this paper redefines the target and designs a novel loss function (denoted the inter-class difference loss), which enables the network to adaptively learn an emotion distribution over all utterances. This not only restricts each negative class probability to be less than the positive class probability, but also pushes the negative class probabilities toward zero. To make the speech emotion recognition system more efficient, this paper proposes an end-to-end network, called the response residual network (R-ResNet), which incorporates a ResNet for feature extraction together with an emotion response module for data augmentation and variable-length data processing. Finally, the experimental results not only demonstrate the advanced performance of our work, but also confirm that ambiguous utterances contain emotional characteristics. In addition, another interesting finding is that, on the unbalanced dataset, batch normalization (BN) placed after the addition performs better than BN placed before the addition.
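The abstract describes the inter-class difference loss only at a high level: each negative class probability should stay below the positive class probability, and the negative class probabilities should be driven toward zero. The PyTorch sketch below is one possible reading of those two constraints, not the paper's actual formulation; the function name, the margin parameter, and the use of a softmax over emotion classes are assumptions introduced here for illustration.

```python
import torch
import torch.nn.functional as F

def inter_class_difference_loss(logits, positive_mask, margin=0.0):
    """Hypothetical sketch of an inter-class-difference-style loss.

    logits:        (batch, num_emotions) raw network outputs.
    positive_mask: (batch, num_emotions) float tensor, 1.0 for emotions
                   annotated as present (positive classes), 0.0 otherwise.
    margin:        optional slack term; not taken from the paper.
    """
    probs = torch.softmax(logits, dim=-1)

    # Smallest probability assigned to any positive class in each utterance.
    pos_min = torch.where(positive_mask.bool(), probs,
                          torch.ones_like(probs)).min(dim=-1, keepdim=True).values

    neg_mask = 1.0 - positive_mask
    neg_probs = probs * neg_mask

    # Constraint 1: penalize any negative class whose probability exceeds
    # the smallest positive-class probability.
    ordering_penalty = (F.relu(neg_probs - pos_min + margin) * neg_mask).sum(dim=-1)

    # Constraint 2: push negative-class probabilities toward zero.
    suppression_penalty = neg_probs.sum(dim=-1)

    return (ordering_penalty + suppression_penalty).mean()
```

The abstract also reports that, on the unbalanced dataset, batch normalization placed after the skip-connection addition outperforms the conventional placement before the addition. The block below sketches what "BN after addition" means in a generic residual block; the class name, channel counts, and kernel sizes are illustrative and are not taken from the R-ResNet architecture itself.

```python
import torch.nn as nn

class ResidualBlockBNAfterAdd(nn.Module):
    """Generic residual block with batch normalization applied after the
    skip-connection addition (illustrative, not the paper's exact block)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn_after_add = nn.BatchNorm2d(channels)  # BN moved after the addition
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.conv2(out)
        out = out + x                              # identity skip connection
        return self.relu(self.bn_after_add(out))   # BN applied after the addition
```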


Published In

IEEE Transactions on Multimedia, Volume 25, 2023
8932 pages

Publisher

IEEE Press

Publication History

Published: 01 January 2023

Qualifiers

  • Research-article
