Abstract
Multi-speaker text-to-speech synthesis involves generating speech in the voice of an individual speaker from a reference waveform and an input sequence of graphemes or phonemes. Deep neural networks for this task are typically trained on a large amount of speech recorded from a specific speaker in order to generate audio in that speaker's voice, and learning a new speaker not seen during training requires retraining on another large dataset. This process is expensive in terms of time and resources, so a key requirement of such techniques is to reduce time and resource consumption. In this paper, a multi-speaker text-to-speech synthesis system using a generalized end-to-end loss function is developed that generates speech in real time from a reference utterance provided by the user and an input text string. The method captures the speaker's characteristics in the generated speech from this reference utterance. The effect of the speaker encoder on the naturalness and fluency of the generated speech is assessed using the mean opinion score (MOS). The speaker encoder is trained on audio datasets of varying duration, and the effect on the produced speech is observed. Furthermore, an extensive analysis is performed of the impact of the training dataset on the speaker encoder and the generated speech, and of various speaker encoder models for the speaker verification task. Based on the loss value and the equal error rate (EER), an advanced GRU is selected for the generalized end-to-end loss function. A speaker verification regression test shows that the proposed model generates speech that the regression algorithm can separate into two sets, male and female, while a second test shows that the speaker embeddings form distinct clusters, indicating that each speaker is uniquely identified. In terms of results, our proposed model achieved a MOS of 4.02 when trained on 'Train-clean-100', 3.74 on 'Train-clean-360', and 3.25 on 'Train-clean-500'. The MOS test compares our method with prior models and demonstrates its superior performance. Finally, a cross-similarity matrix provides a visual representation of the similarity and dissimilarity between utterances, underscoring the model's robustness and efficacy.
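Since the generalized end-to-end (GE2E) loss is central to the approach summarized above, the following minimal NumPy sketch illustrates how its softmax variant can be computed from a batch of speaker-encoder embeddings. It is illustrative only: the batch layout, the function name, and the initial values of the scalars w and b are assumptions, not the implementation used in this paper.

```python
import numpy as np

def ge2e_softmax_loss(emb, w=10.0, b=-5.0):
    """Sketch of the GE2E softmax loss for emb of shape (N speakers, M utterances, D)."""
    N, M, D = emb.shape
    emb = emb / np.linalg.norm(emb, axis=-1, keepdims=True)     # L2-normalize embeddings
    cent = emb.mean(axis=1)                                     # per-speaker centroids (N, D)
    cent = cent / np.linalg.norm(cent, axis=-1, keepdims=True)
    # Exclusive centroid: the true speaker's centroid computed without the utterance itself.
    excl = (emb.sum(axis=1, keepdims=True) - emb) / (M - 1)     # (N, M, D)
    excl = excl / np.linalg.norm(excl, axis=-1, keepdims=True)

    loss = 0.0
    for j in range(N):                                # speaker index
        for i in range(M):                            # utterance index
            sims = emb[j, i] @ cent.T                 # cosine similarity to every centroid
            sims[j] = emb[j, i] @ excl[j, i]          # own-speaker term uses the exclusive centroid
            logits = w * sims + b                     # scale and bias are learnable in the real model
            loss += -logits[j] + np.log(np.exp(logits).sum())   # softmax cross-entropy
    return loss / (N * M)

# Toy usage: 4 speakers x 5 utterances of 256-dimensional embeddings.
rng = np.random.default_rng(0)
print(ge2e_softmax_loss(rng.standard_normal((4, 5, 256))))
```

The loss pushes each utterance embedding toward its own speaker's centroid and away from all other centroids, which is what allows the trained encoder to produce the well-separated per-speaker clusters described above.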
Data availability
If you are interested in obtaining the data, please contact Owais Nazir at owaisnazir22@gmail.com.
Ethics declarations
Conflict of interest
The authors affirm that they do not have any conflicts of interest.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Nazir, O., Malik, A., Singh, S. et al. Multi speaker text-to-speech synthesis using generalized end-to-end loss function. Multimed Tools Appl 83, 64205–64222 (2024). https://doi.org/10.1007/s11042-024-18121-2