kkorolev1/tts_dla: Fastspeech2 implementation for TTS task

TTS HW 3

Implementation of a TTS pipeline using a FastSpeech2 model trained on the LJSpeech dataset.

WandB Report

See the results of both models at the end of this README.

Checkpoints

The first model was trained with the following configuration: batch size=20, batch expand size=24, AdamW with learning rate warmup, max_lr=5e-4, len_epoch=3000, num_epochs=40, grad_norm_clip=2.
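
How batch size and batch expand size interact can be sketched as follows. This is a minimal illustration under an assumption that is not confirmed by this README: that, as in several public FastSpeech implementations, the loader draws batch size × batch expand size samples at once, sorts them by length, and cuts them into real batches so that padding stays small. The function name `split_expanded_batch` and the stand-in samples are hypothetical.

```python
import random


def split_expanded_batch(samples, batch_size, length_of=len):
    """Sort an oversized batch by length and cut it into real batches."""
    ordered = sorted(samples, key=length_of, reverse=True)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


# Settings of the first model: batch size 20, batch expand size 24.
batch_size, batch_expand_size = 20, 24

# Stand-in "texts": strings of random length instead of real LJSpeech samples.
rng = random.Random(0)
samples = ["a" * rng.randint(10, 150) for _ in range(batch_size * batch_expand_size)]

batches = split_expanded_batch(samples, batch_size)
print(len(batches), "batches of", len(batches[0]), "samples")
```

Under this assumption each real batch holds samples of similar length, which is why the oversized draw is sorted before it is split.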

The second model uses pre-layer norm in attention. It was trained with batch size=64 and max_lr=1e-3, and the initialization in MultiHead Attention was replaced with Xavier uniform. All other configuration parameters are the same. To use the model with or without pre-layer norm, add "attn_use_prelayer_norm": true/false to the model's config.
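
The difference between the two variants is the position of layer normalization relative to the residual connection. The sketch below shows that ordering in plain Python on a single feature vector. It is an illustration only: `attention_block`, `layer_norm`, and the `halve` stand-in for multi-head attention are hypothetical names, and the repository's actual module may differ in detail.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def attention_block(x, sublayer, attn_use_prelayer_norm):
    """Residual block around `sublayer`, with layer norm before or after it."""
    if attn_use_prelayer_norm:
        # Pre-LN: normalize the sublayer input, leave the residual path untouched.
        return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]
    # Post-LN: add the residual first, then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def halve(v):
    """Hypothetical stand-in for multi-head attention: halve every feature."""
    return [0.5 * t for t in v]


x = [1.0, 2.0, 4.0, 9.0]
pre = attention_block(x, halve, attn_use_prelayer_norm=True)
post = attention_block(x, halve, attn_use_prelayer_norm=False)
print("pre-LN :", pre)
print("post-LN:", post)
```

With pre-layer norm the input passes through the residual path unchanged, whereas with post-layer norm the block output is always renormalized.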

Installation guide

pip install -r ./requirements.txt

To reproduce training, download the necessary files (LJSpeech, its mel spectrograms, alignments for FastSpeech, and the Waveglow model weights) using a shell script.

sh scripts/download_data.sh

To compute pitches and energies, run scripts/setup.py, which saves them in the same data folder as the other files.

python scripts/setup.py

Configs can be found in the hw_tts/configs folder. In particular, use config_test.json for testing.

Training

Parameters set in the config can be overridden by passing them as command-line flags.

python train.py -c CONFIG -r CHECKPOINT -k WANDB_KEY \
    --wandb_run_name WANDB_RUN_NAME --n_gpu NUM_GPU \
    --batch_size REAL_BATCH_SIZE --batch_expand_size MULTIPLIER_FOR_BATCH_SAMPLING \
    --len_epoch ITERS_PER_EPOCH --waveglow_path WAVEGLOW_WEIGHTS_PATH \
    --data_path PATH_TO_TRAIN_TEXTS --mel_ground_truth PATH_TO_GT_MELS \
    --alignment_path PATH_TO_GT_ALIGNMENTS --pitch_path PATH_TO_GT_PITCHES \
    --energy_path PATH_TO_GT_ENERGIES
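
The override behaviour can be pictured with the following sketch. It is not the code of train.py: the flat config dictionary and the `apply_overrides` helper are hypothetical, and only two of the flags above are modelled. The idea is that a flag left unset keeps the value from the config, while a flag that is passed replaces it.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Sketch of config overrides")
    parser.add_argument("-c", dest="config", default=None)
    # Overrides default to None so that "flag not given" is distinguishable.
    parser.add_argument("--batch_size", type=int, default=None)
    parser.add_argument("--len_epoch", type=int, default=None)
    return parser


def apply_overrides(config, args):
    """Return a copy of `config` updated with every flag the user passed."""
    merged = dict(config)
    for key in ("batch_size", "len_epoch"):
        value = getattr(args, key)
        if value is not None:
            merged[key] = value
    return merged


# Hypothetical flat config; the real JSON files may be nested differently.
config = {"batch_size": 20, "len_epoch": 3000}
args = build_parser().parse_args(["--batch_size", "64"])
merged = apply_overrides(config, args)
print(merged)
```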

Testing

To use the model with or without pre-layer norm, add "attn_use_prelayer_norm": true/false to the model's config.

python test.py -c hw_tts/configs/config_test.json -r CHECKPOINT -t test.txt -o output
  • test.txt is a file with 3 sentences for evaluation, one per line.
  • output is the folder where the results are saved.

You can tune parameters to speed up or slow down the audio, raise or lower its pitch, and change its energy. The parameter variants used for generation are listed in config_test.json under params_list. Each entry is a triplet: the first value controls duration (a greater value means slower speech), the second controls pitch (a greater value means higher pitch), and the third controls energy (greater means lower volume).

The current params_list generates the following audio for each text given in test.txt:

  • regular generated audio
  • audio with +20%/-20% for pitch, speed, or energy individually
  • audio with +20%/-20% for pitch, speed and energy together
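
The variants listed above amount to nine triplets per sentence. The sketch below enumerates them under the triplet order described earlier (duration, pitch, energy). The helper `build_params_list` is hypothetical and only illustrates the combinations; the actual values live in config_test.json.

```python
def build_params_list(delta=0.2):
    """Return (duration, pitch, energy) triplets: neutral, single-axis, joint."""
    up, down = round(1 + delta, 2), round(1 - delta, 2)
    params = [(1.0, 1.0, 1.0)]  # regular generated audio
    for axis in range(3):
        for factor in (up, down):
            triplet = [1.0, 1.0, 1.0]
            triplet[axis] = factor  # change one control, keep the others neutral
            params.append(tuple(triplet))
    params.append((up, up, up))        # all three controls raised together
    params.append((down, down, down))  # all three controls lowered together
    return params


params_list = build_params_list()
for duration, pitch, energy in params_list:
    print(f"duration={duration:g} pitch={pitch:g} energy={energy:g}")
```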

Results

Generated audio for the following 3 sentences. The number at the start of each filename is the index of the sentence.

A defibrillator is a device that gives a high energy electric shock to the heart of someone who is in cardiac arrest

Massachusetts Institute of Technology may be best known for its math, science and engineering education

Wasserstein distance or Kantorovich Rubinstein metric is a distance function defined between probability distributions on a given metric space

I decided to publish the results of both models described in the report, because the difference in their generation quality is quite subjective.

First model

3_speed.1_pitch.1_energy.1.mp4
3_speed.1_pitch.1_energy.1.2.mp4
3_speed.1_pitch.1_energy.0.8.mp4
3_speed.1_pitch.1.2_energy.1.mp4
3_speed.1_pitch.0.8_energy.1.mp4
3_speed.1.2_pitch.1_energy.1.mp4
3_speed.1.2_pitch.1.2_energy.1.2.mp4
3_speed.0.8_pitch.1_energy.1.mp4
3_speed.0.8_pitch.0.8_energy.0.8.mp4
2_speed.1_pitch.1_energy.1.mp4
2_speed.1_pitch.1_energy.1.2.mp4
2_speed.1_pitch.1_energy.0.8.mp4
2_speed.1_pitch.1.2_energy.1.mp4
2_speed.1_pitch.0.8_energy.1.mp4
2_speed.1.2_pitch.1_energy.1.mp4
2_speed.1.2_pitch.1.2_energy.1.2.mp4
2_speed.0.8_pitch.1_energy.1.mp4
2_speed.0.8_pitch.0.8_energy.0.8.mp4
1_speed.1_pitch.1_energy.1.mp4
1_speed.1_pitch.1_energy.1.2.mp4
1_speed.1_pitch.1_energy.0.8.mp4
1_speed.1_pitch.1.2_energy.1.mp4
1_speed.1_pitch.0.8_energy.1.mp4
1_speed.1.2_pitch.1_energy.1.mp4
1_speed.1.2_pitch.1.2_energy.1.2.mp4
1_speed.0.8_pitch.1_energy.1.mp4
1_speed.0.8_pitch.0.8_energy.0.8.mp4

Second model

3_speed.1_pitch.1_energy.1.2.mp4
3_speed.1_pitch.1_energy.0.8.mp4
3_speed.1_pitch.1.2_energy.1.mp4
3_speed.1_pitch.0.8_energy.1.mp4
3_speed.1.2_pitch.1_energy.1.mp4
3_speed.1.2_pitch.1.2_energy.1.2.mp4
3_speed.0.8_pitch.1_energy.1.mp4
3_speed.0.8_pitch.0.8_energy.0.8.mp4
2_speed.1_pitch.1_energy.1.mp4
2_speed.1_pitch.1_energy.1.2.mp4
2_speed.1_pitch.1_energy.0.8.mp4
2_speed.1_pitch.1.2_energy.1.mp4
2_speed.1_pitch.0.8_energy.1.mp4
2_speed.1.2_pitch.1_energy.1.mp4
2_speed.1.2_pitch.1.2_energy.1.2.mp4
2_speed.0.8_pitch.1_energy.1.mp4
2_speed.0.8_pitch.0.8_energy.0.8.mp4
1_speed.1_pitch.1_energy.1.mp4
1_speed.1_pitch.1_energy.1.2.mp4
1_speed.1_pitch.1_energy.0.8.mp4
1_speed.1_pitch.1.2_energy.1.mp4
1_speed.1_pitch.0.8_energy.1.mp4
1_speed.1.2_pitch.1_energy.1.mp4
1_speed.1.2_pitch.1.2_energy.1.2.mp4
1_speed.0.8_pitch.1_energy.1.mp4
1_speed.0.8_pitch.0.8_energy.0.8.mp4
3_speed.1_pitch.1_energy.1.mp4
