DOI: 10.1109/ICASSP.2018.8462393

On the Use of WaveNet as a Statistical Vocoder

Published: 15 April 2018

Abstract

In this paper, we explore the possibility of using the WaveNet architecture as a statistical vocoder. In that case, the generation of speech waveforms is locally conditioned only on acoustic features. Focusing on the single-speaker case for now, we investigate the impact of the local conditions as well as of the amount of data available for training. Furthermore, variations of the WaveNet architecture are considered and discussed in the context of our work. We compare our approach against a very recent study that also used the WaveNet architecture as a speech vocoder, using the same speech data. More specifically, we used two female and two male speakers from the CMU-ARCTIC database to contrast the use of cepstral coefficients and filter-bank features as local conditioners, with the goal of improving the overall quality for both male and female speakers. We also discuss the impact of the size of the training data. Objective metrics for the quality and intelligibility of the speech generated by WaveNet, as well as subjective listening tests, support our suggestions.
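
The central mechanism the abstract refers to, conditioning WaveNet's dilated causal convolutions on frame-level acoustic features, can be sketched briefly. The following PyTorch snippet is a minimal illustration of one gated residual block with local conditioning in the style of van den Oord et al.'s WaveNet, not the authors' implementation; the module and parameter names (ConditionedResidualBlock, residual_ch, cond_ch, etc.) and the feature dimensionality are hypothetical.

```python
# Minimal sketch (not the paper's implementation) of one WaveNet-style residual
# block with local conditioning on acoustic features. Tensors are (batch, channels, time).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionedResidualBlock(nn.Module):
    def __init__(self, residual_ch, skip_ch, cond_ch, dilation):
        super().__init__()
        self.dilation = dilation
        # Dilated convolution over the waveform; causal padding is applied in forward().
        self.dilated = nn.Conv1d(residual_ch, 2 * residual_ch, kernel_size=2, dilation=dilation)
        # 1x1 projection of the (already upsampled) acoustic features,
        # e.g. cepstral or filter-bank frames repeated to the waveform sample rate.
        self.cond = nn.Conv1d(cond_ch, 2 * residual_ch, kernel_size=1)
        self.res_out = nn.Conv1d(residual_ch, residual_ch, kernel_size=1)
        self.skip_out = nn.Conv1d(residual_ch, skip_ch, kernel_size=1)

    def forward(self, x, h):
        # x: residual signal (B, residual_ch, T); h: local conditions (B, cond_ch, T)
        y = F.pad(x, (self.dilation, 0))           # left-only padding keeps the conv causal
        y = self.dilated(y) + self.cond(h)         # inject conditioning before gating
        filt, gate = y.chunk(2, dim=1)
        z = torch.tanh(filt) * torch.sigmoid(gate) # gated activation unit
        return x + self.res_out(z), self.skip_out(z)


if __name__ == "__main__":
    block = ConditionedResidualBlock(residual_ch=64, skip_ch=128, cond_ch=25, dilation=4)
    wav = torch.randn(1, 64, 16000)    # embedded waveform samples
    feats = torch.randn(1, 25, 16000)  # acoustic features upsampled to the sample level
    res, skip = block(wav, feats)
    print(res.shape, skip.shape)
```

In a full vocoder, many such blocks with exponentially increasing dilations are stacked, and the frame-rate acoustic features are first upsampled (for example by repetition or transposed convolution) to the waveform sample rate before being fed in as the local condition h.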

Published In

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Apr 2018
17916 pages

Publisher

IEEE Press
