
Speech emotion recognition using data augmentation method by cycle-generative adversarial networks

  • Original Paper
  • Published in Signal, Image and Video Processing (2022)

Abstract

One of the obstacles in developing speech emotion recognition (SER) systems is data scarcity, i.e., the lack of labeled data for training. Data augmentation is an effective way to increase the amount of training data. In this paper, we propose a cycle-generative adversarial network (cycle-GAN) for data augmentation in SER systems. For each of the five emotions considered, an adversarial network is designed to generate data whose distribution is similar to that of the real data in its class but different from those of the other classes. These networks are trained adversarially to produce feature vectors resembling those in the training set, which are then added to the original training sets. Instead of the common cross-entropy loss, we train the cycle-GANs with the Wasserstein divergence, which mitigates vanishing gradients and yields higher-quality samples. The proposed network is applied to SER on the EMO-DB dataset, and the quality of the generated data is evaluated with two classifiers, one based on a support vector machine and one on a deep neural network. The recognition performance, measured as unweighted average recall, was about 83.33%, which is better than that of the compared baseline methods.
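The abstract's key departure from a standard cycle-GAN is the training objective: the Wasserstein divergence (WGAN-div, Wu et al., ECCV 2018) replaces the usual cross-entropy loss, avoiding the vanishing gradients of the original GAN formulation. The sketch below is a minimal PyTorch illustration of that objective combined with an L1 cycle-consistency term, for a single emotion class; the MLP architecture, the feature dimension FEAT_DIM, and the hyperparameters k, p, and lam are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch (assumptions, not the authors' code): one generator pair and
# one critic for a single emotion class, trained with the Wasserstein
# divergence objective plus an L1 cycle-consistency term.
import torch
import torch.nn as nn

FEAT_DIM = 88  # assumed length of the acoustic feature vector


def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, out_dim))


G_ab = mlp(FEAT_DIM, FEAT_DIM)  # maps source-class features toward the target class
G_ba = mlp(FEAT_DIM, FEAT_DIM)  # maps them back (the cycle)
D_b = mlp(FEAT_DIM, 1)          # critic for the target class


def critic_loss(real_b, fake_b, k=2.0, p=6.0):
    """WGAN-div critic loss: Wasserstein term plus a gradient penalty of
    power p on random interpolates (k=2, p=6 are the paper's defaults);
    unlike WGAN-GP, no weight clipping or unit-norm target is needed."""
    eps = torch.rand(real_b.size(0), 1)
    x_hat = (eps * real_b + (1.0 - eps) * fake_b).requires_grad_(True)
    grad = torch.autograd.grad(D_b(x_hat).sum(), x_hat, create_graph=True)[0]
    penalty = k * grad.norm(2, dim=1).pow(p).mean()
    return D_b(fake_b).mean() - D_b(real_b).mean() + penalty


def generator_loss(real_a, lam=10.0):
    """Adversarial term plus the cycle-consistency term that keeps the
    translated features mappable back to their source class."""
    fake_b = G_ab(real_a)
    return -D_b(fake_b).mean() + lam * (G_ba(fake_b) - real_a).abs().mean()


# One step on toy data; fake samples are detached for the critic step so
# gradients stop at the generator.
real_a, real_b = torch.randn(32, FEAT_DIM), torch.randn(32, FEAT_DIM)
d_loss = critic_loss(real_b, G_ab(real_a).detach())
g_loss = generator_loss(real_a)
```

In the pipeline the abstract describes, feature vectors produced by the trained generators would be appended to each class's original training set, and the augmented sets would then be used to train the SVM and DNN classifiers evaluated on EMO-DB.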



Author information


Corresponding author

Correspondence to Arash Shilandari.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Shilandari, A., Marvi, H., Khosravi, H. et al. Speech emotion recognition using data augmentation method by cycle-generative adversarial networks. SIViP 16, 1955–1962 (2022). https://doi.org/10.1007/s11760-022-02156-9

