Abstract
Speech provides a natural way for human–computer interaction. In particular, speech synthesis systems are popular in applications such as personal assistants, GPS navigation, screen readers, and accessibility tools. However, not all languages are at the same level in terms of resources and systems for speech synthesis. This work creates publicly available resources for Brazilian Portuguese in the form of a novel dataset along with deep learning models for end-to-end speech synthesis. The dataset contains 10.5 hours of speech from a single speaker; of the models trained on it, a Tacotron 2 model with the RTISI-LA vocoder performed best, achieving a mean opinion score (MOS) of 4.03. The results are comparable to related work covering the English language and to the state of the art in European Portuguese.
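For context, the MOS reported above is, as commonly computed, the arithmetic mean of listener ratings collected on a five-point absolute category rating scale. Given $N$ ratings $r_i \in \{1, \dots, 5\}$:

\mathrm{MOS} = \frac{1}{N} \sum_{i=1}^{N} r_i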
Data availability
The dataset is available in the project's GitHub repository: https://github.com/Edresson/TTS-Portuguese-Corpus.
Code availability
The trained models and demos are available in the project's GitHub repository: https://github.com/Edresson/TTS-Portuguese-Corpus.
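As a convenience, the sketch below shows one way to fetch the corpus and iterate over its audio files. It is a minimal example, not an official loader: the clone step uses the repository URL above, but the assumed on-disk layout (a directory tree containing .wav files) is a guess and may differ from the actual repository structure, which may distribute the audio via a separate download.

    import subprocess
    from pathlib import Path

    import librosa  # third-party: pip install librosa

    REPO_URL = "https://github.com/Edresson/TTS-Portuguese-Corpus"
    LOCAL_DIR = Path("TTS-Portuguese-Corpus")

    # Clone the repository on first use (requires git on PATH).
    if not LOCAL_DIR.exists():
        subprocess.run(["git", "clone", REPO_URL, str(LOCAL_DIR)], check=True)

    # Iterate over audio files anywhere in the checkout; the exact wav
    # location inside the repository is an assumption.
    for wav_path in sorted(LOCAL_DIR.rglob("*.wav")):
        audio, sr = librosa.load(wav_path, sr=None)  # sr=None keeps native rate
        print(f"{wav_path.name}: {len(audio) / sr:.2f} s at {sr} Hz")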
Acknowledgements
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001, and by CNPq (National Council for Scientific and Technological Development) Grant 304266/2020-5. This research was carried out at the Center for Artificial Intelligence (C4AI-USP), with support from the São Paulo Research Foundation (FAPESP Grant #2019/07665-4) and the IBM Corporation. We also thank CEIA (Artificial Intelligence Excellence Center) for partial financial support for this paper, via a project funded by the Cyberlabs Group. Finally, we gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPU used in part of the experiments presented in this research.
Funding
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001, as well as CNPq (National Council for Scientific and Technological Development) Grant 304266/2020-5.
Author information
Contributions
EC, ACJ and SA proposed the methodology for collecting the dataset. EC, CS, ACJ, MAP and JPT proposed the experiments. EC and FSdO implemented and trained the models. All authors contributed to the final manuscript.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Casanova, E., Junior, A.C., Shulby, C. et al. TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese. Lang Resources & Evaluation 56, 1043–1055 (2022). https://doi.org/10.1007/s10579-021-09570-4