Next-generation TTS model using flow-matching and DiT, inspired by Stable Diffusion 3.
As the first open-source TTS model to attempt combining flow-matching and DiT, StableTTS is a fast and lightweight TTS model for Chinese and English speech generation. It has only 10M parameters.
Work is in progress now. Pretrained models and detailed instructions will be released soon!
For detailed inference instructions, please refer to inference.ipynb
Setting up and training your model with StableTTS is straightforward. Follow these steps to get started:
- Generate text and audio pairs: Create a filelist of text and audio pairs in the same format as ./filelists/example.txt. Recipes for some open-source datasets can be found in ./recipes. (Because a reference encoder captures speaker identity, no speaker ID is needed for multispeaker synthesis and training.)
- Run preprocessing: Adjust the DataConfig in preprocess.py to set your input and output paths, then run the script. It processes the audio and text listed in your filelist and outputs a JSON file with paths to the resampled audio, mel features, and phonemes. Note: set chinese=False in DataConfig for English text processing.
- Adjust training configuration: In config.py, modify TrainConfig to set your filelist path and adjust training parameters as needed.
- Start the training process: Run train.py to start training your model.
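As an illustration of the first step, the sketch below writes a filelist from (audio path, transcript) pairs. The `audio_path|text` line layout, the `|` separator, and the helper name `build_filelist` are assumptions made for this example; check ./filelists/example.txt for the layout that preprocess.py actually expects.

```python
from pathlib import Path


def build_filelist(pairs, out_path, sep="|"):
    """Write (audio_path, text) pairs, one `audio_path<sep>text` line each.

    The separator is an assumption; match whatever example.txt uses.
    Returns the number of lines written.
    """
    lines = []
    for audio_path, text in pairs:
        text = " ".join(text.split())  # collapse whitespace and newlines
        if not text:
            continue  # skip clips that have no transcript
        if sep in text:
            raise ValueError(f"separator {sep!r} appears in transcript: {text!r}")
        lines.append(f"{audio_path}{sep}{text}")

    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)
```

No speaker column is written, in line with the note above that speaker IDs are not needed.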
Feel free to explore config.py and adjust the hyperparameters!
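Since the configuration is dataclass-based, one convenient pattern is to derive a variant of a config instead of editing defaults in place. The following is only a sketch: the class name `ExampleTrainConfig` and every field name and default value in it are hypothetical placeholders, not the real contents of config.py.

```python
from dataclasses import dataclass, replace


@dataclass
class ExampleTrainConfig:
    # Hypothetical fields and defaults, for illustration only;
    # see config.py for the real TrainConfig.
    train_dataset_path: str = "filelists/filelist.json"
    batch_size: int = 32
    learning_rate: float = 1e-4
    num_epochs: int = 100


# Build a small debugging variant while leaving the defaults untouched.
debug_config = replace(ExampleTrainConfig(), batch_size=4, num_epochs=1)
print(debug_config)
```

`dataclasses.replace` returns a new instance, so the default configuration stays available for full training runs.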
Model Name | Task Details | Download Link |
---|---|---|
StableTTS | text to mel | Model is currently in training... |
Vocos | mel to wav | 🤗 |
- We use the Diffusion Convolution Transformer block from Hierspeech++, which combines the original DiT with the FFT (feed-forward Transformer from FastSpeech) for better prosody.
- In the flow-matching decoder, we add a FiLM layer before the DiT block to condition the model on the timestep embedding.
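To make the FiLM idea concrete, here is a minimal dependency-free sketch of feature-wise linear modulation: a timestep embedding is projected to a per-channel scale (gamma) and shift (beta), which are then applied to every frame. This is generic FiLM written with plain Python lists for readability, not the StableTTS layer itself; the function names, the sinusoidal embedding, and the weight layout are all assumptions, and a real model would use PyTorch modules.

```python
import math


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep t (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def linear(weights, bias, x):
    """Matrix-vector product plus bias; weights holds one row per output."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(features, t_emb, w_gamma, b_gamma, w_beta, b_beta):
    """Apply gamma(t) * h + beta(t) to each frame of shape [channels]."""
    gamma = linear(w_gamma, b_gamma, t_emb)  # per-channel scale
    beta = linear(w_beta, b_beta, t_emb)     # per-channel shift
    return [[g * h + b for g, h, b in zip(gamma, frame, beta)] for frame in features]


# Two frames with two channels each, conditioned on timestep t = 0.5.
frames = [[1.0, 2.0], [3.0, 4.0]]
emb = timestep_embedding(0.5, 4)
zeros = [[0.0] * 4 for _ in range(2)]
# Zero weights with gamma bias 1 and beta bias 0 leave the features unchanged.
print(film(frames, emb, zeros, [1.0, 1.0], zeros, [0.0, 0.0]))
```

Initializing the modulation close to the identity, as in the usage lines, is a common choice because the conditioned block then starts out behaving like the unconditioned one.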
The development of our models heavily relies on insights and code from various projects. We express our heartfelt thanks to the creators of the following:
Matcha TTS: Essential flow-matching code.
Grad TTS: Diffusion model structure.
Stable Diffusion 3: Idea of combining flow-matching and DiT.
Vits: Code style and MAS insights, DistributedBucketSampler.
plowtts-pytorch: MAS code used in training.
Bert-VITS2: numba version of MAS and modern PyTorch code for Vits.
fish-speech: dataclass usage and mel-spectrogram transforms using torchaudio.
gpt-sovits: melstyle encoder for voice cloning.
diffsinger: three-section Chinese phoneme scheme for Chinese g2p.
- Release pretrained models.
- Provide finetuning instructions.
- Support Japanese language.
- User-friendly preprocessing and inference scripts.
- Enhance documentation and citations.
- Add a Chinese version of the README.
Any organization or individual is prohibited from using any technology in this repo to generate or edit someone's speech without that person's consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this requirement, you could be in violation of copyright laws.