Implementation of a TTS pipeline using Fastspeech2 model trained on a LJSpeech dataset.
See the results of both models at the end of this README.
The first model was trained with the following configuration: batch size=20, batch expand size=24, AdamW with warmup, max_lr=5e-4, len_epoch=3000, num_epochs=40, grad_norm_clip=2.
The second model uses pre layer norm in attention. It was trained with batch size=64, max_lr=1e-3. Initialization in MultiHead Attention was replaced on xavier uniform. All other configuration parameters are the same.
To use model with or without prelayer norm, add "attn_use_prelayer_norm": true/false
to model's config.
pip install -r ./requirements.txt
To reproduce training download necessary files, including LJSpeech, it's mel spectrograms, alignments for FastSpeech and Waveglow model weights, using a shell script.
sh scripts/download_data.sh
To get pitches and energies run scripts/setup.py, which saves them in the same data folder as other files.
python scripts/setup.py
Configs can be found in hw_tts/configs folder. In particular, for testing use config_test.json
.
One can redefine parameters which are set within config by passing them in terminal via flags.
python train.py -c CONFIG -r CHECKPOINT -k WANDB_KEY --wandb_run_name WANDB_RUN_NAME --n_gpu NUM_GPU --batch_size REAL_BATCH_SIZE --batch_expand_size MULTIPLIER_FOR_BATCH_SAMPLING --len_epoch ITERS_PER_EPOCH --waveglow_path WAVEGLOW_WEIGHTS_PATH --data_path PATH_TO_TRAIN_TEXTS --mel_ground_truth PATH_TO_GT_MELS --alignment_path PATH_TO_GT_ALIGNMENTS --pitch_path PATH_TO_GT_PITCHES --energy_path PATH_TO_GT_ENERGIES
To use model with or without prelayer norm, add "attn_use_prelayer_norm": true/false
to model's config.
python test.py -c hw_tts/configs/config_test.json -r CHECKPOINT -t test.txt -o output
test.txt
is a file with 3 sentences for evaluation, each on a newline.output
is a folder to save the result.
You can tune parameters for speeding up / slowing down, pitching up or down, changing energy of an audio. One can find variants of parameters, called params_list
, with which the audio will be generated also in config_test.json
. They are given as a list of triplets, where first one is related to duration (greater value means slowing down), second one to pitch (greater value means pitching up), and the third one to energy (greater means lower volume).
Current parameters list generates the following audios for texts given in test.txt
:
- regular generated audio
- audio with +20%/-20% for pitch/speed/energy
- audio with +20/-20% for pitch, speed and energy together
Generation of these 3 sentences. Filename corresponds to the order of a sentence.
A defibrillator is a device that gives a high energy electric shock to the heart of someone who is in cardiac arrest
Massachusetts Institute of Technology may be best known for its math, science and engineering education
Wasserstein distance or Kantorovich Rubinstein metric is a distance function defined between probability distributions on a given metric space
I considered to publish results of two models, that were described in a report, because their generation quality difference is quite subjective.