Abstract
Speech provides a natural way for human–computer interaction. In particular, speech synthesis systems are popular in applications such as personal assistants, GPS navigation, screen readers, and accessibility tools. However, not all languages are at the same level in terms of resources and systems for speech synthesis. This work creates publicly available resources for Brazilian Portuguese in the form of a novel dataset along with deep learning models for end-to-end speech synthesis. The dataset contains 10.5 hours of speech from a single speaker; of the models trained on it, a Tacotron 2 model with the RTISI-LA vocoder performed best, achieving a mean opinion score (MOS) of 4.03. The results are comparable to related work covering the English language and to the state of the art in European Portuguese.
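For context, the MOS reported above is, as commonly computed, the arithmetic mean of listener ratings collected on a five-point absolute category rating scale. Given $N$ ratings $r_i \in \{1, \dots, 5\}$:

\mathrm{MOS} = \frac{1}{N} \sum_{i=1}^{N} r_i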
Data availability
The dataset is available in the project's GitHub repository: https://github.com/Edresson/TTS-Portuguese-Corpus.
Code availability
The trained models and demos are available in the project's GitHub repository: https://github.com/Edresson/TTS-Portuguese-Corpus.
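As a convenience, the sketch below shows one way to fetch the corpus and iterate over its audio files. It is a minimal example, not an official loader: the clone step uses the repository URL above, but the assumed on-disk layout (a directory tree containing .wav files) is a guess and may differ from the actual repository structure, which may distribute the audio via a separate download.

    import subprocess
    from pathlib import Path

    import librosa  # third-party: pip install librosa

    REPO_URL = "https://github.com/Edresson/TTS-Portuguese-Corpus"
    LOCAL_DIR = Path("TTS-Portuguese-Corpus")

    # Clone the repository on first use (requires git on PATH).
    if not LOCAL_DIR.exists():
        subprocess.run(["git", "clone", REPO_URL, str(LOCAL_DIR)], check=True)

    # Iterate over audio files anywhere in the checkout; the exact wav
    # location inside the repository is an assumption.
    for wav_path in sorted(LOCAL_DIR.rglob("*.wav")):
        audio, sr = librosa.load(wav_path, sr=None)  # sr=None keeps native rate
        print(f"{wav_path.name}: {len(audio) / sr:.2f} s at {sr} Hz")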
Acknowledgements
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001, and by CNPq (National Council for Scientific and Technological Development) Grant 304266/2020-5. This research was carried out at the Center for Artificial Intelligence (C4AI-USP), with support from the São Paulo Research Foundation (FAPESP Grant #2019/07665-4) and the IBM Corporation. We also thank CEIA (Artificial Intelligence Excellence Center) for partial financial support for this paper, via a project funded by the Cyberlabs Group. Finally, we gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPU used in part of the experiments presented in this research.
Funding
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001, as well as CNPq (National Council for Scientific and Technological Development) Grant 304266/2020-5.
Author information
Contributions
EC, ACJ and SA proposed the methodology for collecting the dataset. EC, CS, ACJ, MAP and JPT proposed the experiments. EC and FSdO implemented and trained the models. All authors contributed to the final manuscript.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Casanova, E., Junior, A.C., Shulby, C. et al. TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese. Lang Resources & Evaluation 56, 1043–1055 (2022). https://doi.org/10.1007/s10579-021-09570-4