Abstract
This paper addresses speech emotion recognition using a channel-attention mechanism combined with a synthesis-based data augmentation approach. A convolutional neural network (CNN) produces a channel-attention map by exploiting the inter-channel relationship of features. The main issue in the speech emotion recognition domain is insufficient data for building an efficient model. The proposed work uses a style-transfer scheme to achieve data augmentation through multi-voice synthesis from text. It consists of text-to-speech (TTS) and style-transfer modules. Synthesized speech is generated from text in a target speaker’s voice by a TTS converter at the front end. The emotion of the synthesized speech is then imposed by the style-transfer module according to the emotional content it is fed. The text-to-speech module is trained on the LibriSpeech and NUS-48E corpora. The quality of the synthesized speech samples is rated through subjective evaluation using the mean opinion score (MOS). The speech emotion recognition approach is systematically evaluated on the Berlin EMO-DB corpus, where the channel-attention-based squeeze-and-excitation network (SENet) shows its promise.
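The channel attention referred to above follows the squeeze-and-excitation design: feature maps are globally pooled into per-channel descriptors, passed through a small bottleneck, and turned into weights that rescale the channels. The sketch below is a minimal PyTorch-style illustration of that idea with hypothetical tensor shapes; it is not the paper's exact network configuration.

```python
# Minimal sketch of a squeeze-and-excitation (channel-attention) block,
# assuming a PyTorch implementation; layer sizes and placement within the
# CNN are illustrative only.
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention over CNN feature maps."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)  # one summary value per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # bottleneck
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)        # (B, C) channel descriptors
        w = self.excite(w).view(b, c, 1, 1)   # learned channel weights
        return x * w                          # rescale feature maps channel-wise


# Hypothetical usage on CNN features of a log-mel spectrogram batch:
# (batch=8, channels=64, mel bins=40, frames=128)
features = torch.randn(8, 64, 40, 128)
attended = SEBlock(channels=64)(features)
```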
Availability of data and materials
The datasets analyzed in this manuscript are publicly available.
Acknowledgements
We express our sincere gratitude to our lab-mates for their help during the subjective evaluation.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Ethics declarations
Conflict of interest
The authors declare that there are no competing interests related to this manuscript.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Rajan, R., Hridya Raj, T.V. SENet-based speech emotion recognition using synthesis-style transfer data augmentation. Int J Speech Technol 26, 1017–1030 (2023). https://doi.org/10.1007/s10772-023-10071-8