Char2Wav: End-to-End Speech Synthesis

Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, Yoshua Bengio

27 Dec 2024 (modified: 16 Apr 2017), ICLR 2017, Readers: Everyone

Abstract: We present Char2Wav, an end-to-end model for speech synthesis. Char2Wav has two components: a reader and a neural vocoder. The reader is an encoder-decoder model with attention. The encoder is a bidirectional recurrent neural network (RNN) that accepts text or phonemes as inputs, while the decoder is an RNN with attention that produces vocoder acoustic features. The neural vocoder is a conditional extension of SampleRNN that generates raw waveform samples from intermediate representations. Unlike traditional models for speech synthesis, Char2Wav learns to produce audio directly from text.
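
The data flow described in the abstract can be sketched as a toy pipeline. This is an illustrative sketch only, not the authors' implementation: every function name and parameter below (`encode`, `attend`, `reader`, `neural_vocoder`, `char2wav`, `n_frames`, `samples_per_frame`) is hypothetical. The encoder is a fixed random lookup rather than a trained bidirectional RNN. The attention is plain dot-product attention, chosen for brevity and not necessarily the mechanism Char2Wav uses. The vocoder is a simple autoregressive recurrence standing in for conditional SampleRNN. Only the structure (text, then intermediate frames, then waveform samples) mirrors the abstract.

```python
import math
import random


def encode(chars, dim=4):
    """Toy stand-in for the encoder: one fixed random vector per input symbol."""
    rng = random.Random(0)  # fixed seed so the sketch is deterministic
    table = {}
    states = []
    for ch in chars:
        if ch not in table:
            table[ch] = [rng.uniform(-1, 1) for _ in range(dim)]
        states.append(table[ch])
    return states


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states):
    """Dot-product attention (an assumption): weights and weighted sum of states."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]
    return weights, context


def reader(text, n_frames=5, dim=4):
    """Toy reader: the decoder attends to encoder states and emits one frame per step."""
    states = encode(text, dim)
    hidden = [0.0] * dim
    frames = []
    for _ in range(n_frames):
        _, context = attend(hidden, states)
        # Hypothetical recurrent update; a real decoder would use learned weights.
        hidden = [math.tanh(0.5 * h + c) for h, c in zip(hidden, context)]
        frames.append(hidden)
    return frames


def neural_vocoder(frames, samples_per_frame=8):
    """Toy vocoder: each sample depends on the previous sample and the current frame."""
    audio = []
    prev = 0.0
    for frame in frames:
        cond = sum(frame) / len(frame)  # conditioning signal from the frame
        for _ in range(samples_per_frame):
            prev = math.tanh(0.9 * prev + cond)
            audio.append(prev)
    return audio


def char2wav(text, n_frames=5, samples_per_frame=8):
    """Text in, waveform samples out: reader followed by vocoder."""
    return neural_vocoder(reader(text, n_frames), samples_per_frame)
```

Calling `char2wav("hello")` with the defaults returns a list of 40 floats (5 frames times 8 samples per frame). In the actual model both stages are learned networks, and the number of frames is determined by the decoder rather than fixed in advance.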

TL;DR: Unlike traditional models for speech synthesis, Char2Wav learns to produce audio directly from text.

Keywords: Speech, Deep learning, Applications

Conflicts: umontreal.ca, inrs.ca, iitk.ac.in
