
TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese

  • Project Notes
  • Published:
Language Resources and Evaluation

Abstract

Speech provides a natural means of human–computer interaction. In particular, speech synthesis systems are popular in many applications, such as personal assistants, GPS navigation, screen readers, and accessibility tools. However, not all languages are at the same level in terms of resources and systems for speech synthesis. This work creates publicly available resources for Brazilian Portuguese in the form of a novel dataset together with deep learning models for end-to-end speech synthesis. The dataset contains 10.5 h of speech from a single speaker, on which a Tacotron 2 model with the RTISI-LA vocoder achieved the best performance, reaching a MOS of 4.03. The results obtained are comparable to related work covering the English language and to the state of the art in European Portuguese.
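
RTISI-LA (real-time iterative spectrogram inversion with look-ahead) is a streaming variant of the classic Griffin–Lim phase-reconstruction algorithm. As a rough, minimal sketch of the spectrogram-inversion step such a vocoder performs, the snippet below rebuilds a waveform from a magnitude spectrogram with plain offline Griffin–Lim in librosa; the file name, sample rate, and STFT parameters are illustrative placeholders, not the settings used in this work.

```python
import librosa
import numpy as np
import soundfile as sf

# Load a clip (placeholder path; 22.05 kHz is a common TTS sample rate).
y, sr = librosa.load("sample.wav", sr=22050)

# Magnitude spectrogram: the kind of intermediate a Tacotron-style model predicts.
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256, win_length=1024))

# Offline Griffin-Lim phase reconstruction; RTISI-LA performs a similar
# iterative estimate frame by frame with limited look-ahead for streaming use.
y_rec = librosa.griffinlim(S, n_iter=60, hop_length=256, win_length=1024)

sf.write("sample_reconstructed.wav", y_rec, sr)
```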

Data availability

The dataset is available in the project's GitHub repository: https://github.com/Edresson/TTS-Portuguese-Corpus.
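
For quick inspection, the corpus can be iterated over from its transcription metadata. The sketch below assumes an LJSpeech-style, pipe-separated metadata.csv with one audio file name and its transcript per row; the file name, delimiter, and directory layout are assumptions, so check the repository README for the actual format.

```python
import csv
from pathlib import Path

# Assumed location of a local clone/download of the corpus (hypothetical path).
CORPUS_DIR = Path("TTS-Portuguese-Corpus")

# Assumption: an LJSpeech-style metadata file with "wav_name|transcript" rows.
# Adjust the file name, delimiter, and column order to the actual release.
utterances = []
with open(CORPUS_DIR / "metadata.csv", encoding="utf-8", newline="") as f:
    for fields in csv.reader(f, delimiter="|"):
        wav_name, text = fields[0], fields[1]
        utterances.append((CORPUS_DIR / "wavs" / wav_name, text))

print(f"{len(utterances)} utterances loaded")
print(utterances[0])
```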

Code availability

The trained models and demos are available in the project's GitHub repository: https://github.com/Edresson/TTS-Portuguese-Corpus.

Notes

  1. https://github.com/gunthercox/chatterbot-corpus/.

  2. Official repository: https://github.com/Edresson/TTS-Portuguese-Corpus.

  3. https://creativecommons.org/licenses/by/4.0/.

  4. https://github.com/bootphon/phonemizer.

  5. https://edresson.github.io/TTS-Portuguese-Corpus/.

  6. http://centrodeia.org/.

  7. https://cyberlabs.ai/.

Acknowledgements

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001, and by CNPq (National Council for Scientific and Technological Development) Grant 304266/2020-5. This research was carried out at the Center for Artificial Intelligence (C4AI-USP), with support from the São Paulo Research Foundation (FAPESP Grant #2019/07665-4) and the IBM Corporation. We would also like to thank CEIA (Artificial Intelligence Excellence Center; see Note 6) for partial financial support of this paper, through a project funded by the Cyberlabs Group (see Note 7). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPU used in part of the experiments presented in this research.

Funding

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001, as well as CNPq (National Council of Technological and Scientific Development) Grant 304266/2020-5.

Author information

Authors and Affiliations

Authors

Contributions

EC, ACJ and SA proposed the methodology for collecting the dataset. EC, CS, ACJ, MAP and JPT proposed the experiments. EC and FSdO implemented and trained the models. All authors contributed to the final manuscript.

Corresponding author

Correspondence to Edresson Casanova.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Casanova, E., Junior, A.C., Shulby, C. et al. TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese. Lang Resources & Evaluation 56, 1043–1055 (2022). https://doi.org/10.1007/s10579-021-09570-4

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10579-021-09570-4
