- Zero-shot TTS: Input a 5-second vocal sample and experience instant text-to-speech conversion.
- Few-shot TTS: Fine-tune the model with just 1 minute of training data for improved voice similarity and realism.
- Cross-lingual Support: Inference in languages different from the training dataset; currently supports English, Japanese, Korean, Cantonese, and Chinese.
- WebUI Tools: Integrated tools include vocal/accompaniment separation, automatic training-set segmentation, Chinese ASR, and text labeling, helping beginners create training datasets and GPT/SoVITS models.
Unseen speakers few-shot fine-tuning demo:
few.shot.fine.tuning.demo.mp4
The TTS annotation file format:

```
vocal_path|speaker_name|language|text
```
Language dictionary:
- 'zh': Chinese
- 'ja': Japanese
- 'en': English
- 'ko': Korean
- 'yue': Cantonese
Example:

```
xxx.wav|xxx|en|I like playing Genshin.
```
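Based on the format above, here is a minimal sketch of a validator for annotation lines. The function name and the strictness of the checks are my own assumptions, not part of the project:

```python
# Validate annotation lines of the form:
#   vocal_path|speaker_name|language|text
# Illustrative sketch only; the project itself may accept other inputs
# or perform different checks.

VALID_LANGS = {"zh", "ja", "en", "ko", "yue"}

def parse_annotation_line(line: str) -> dict:
    """Split one annotation line into its four fields, raising on malformed input."""
    # Split at most 3 times: the text field may itself contain '|'.
    parts = line.rstrip("\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}: {line!r}")
    vocal_path, speaker, lang, text = parts
    if lang not in VALID_LANGS:
        raise ValueError(f"unknown language code {lang!r}; expected one of {sorted(VALID_LANGS)}")
    return {"vocal_path": vocal_path, "speaker_name": speaker, "language": lang, "text": text}

print(parse_annotation_line("xxx.wav|xxx|en|I like playing Genshin."))
```

Running such a check over the whole annotation file before training can catch mislabeled language codes early.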
New Features in V2:
- Support for Korean and Cantonese
- An optimized text frontend
- Pre-trained model extended from 2k hours to 5k hours of training data
- Improved synthesis quality for low-quality reference audio
Use v2 from the v1 environment:
- Run `pip install -r requirements.txt` to update some packages.
- Clone the latest code from GitHub.
- Download the v2 pretrained models from Hugging Face and put them into `GPT_SoVITS\pretrained_models\gsv-v2final-pretrained`.
- Chinese v2 additional: download G2PWModel_1.1.zip (the G2PW models), unzip it, rename the folder to `G2PWModel`, and place it in `GPT_SoVITS/text`.
New Features in V3:
- Higher timbre similarity, requiring less training data to approximate the target speaker (timbre similarity is significantly improved even when using the base model directly, without fine-tuning).
- The GPT model is more stable, with fewer repetitions and omissions, and it is easier to generate speech with richer emotional expression.
Use v3 from the v2 environment:
- Run `pip install -r requirements.txt` to update some packages.
- Clone the latest code from GitHub.
- Download the v3 pretrained models (s1v3.ckpt, s2Gv3.pth, and the models--nvidia--bigvgan_v2_24khz_100band_256x folder) from Hugging Face and put them into `GPT_SoVITS\pretrained_models`.
- Additional: for the Audio Super Resolution model, see the instructions on how to download it.
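After upgrading, it can help to verify that the expected pretrained files are in place before launching. A minimal sketch, with file names taken from the steps above; the helper function itself is hypothetical and not shipped with the project:

```python
from pathlib import Path

# Expected v3 pretrained artifacts, as listed in the upgrade steps.
# This check is illustrative only; GPT-SoVITS does not provide it.
V3_PRETRAINED = [
    "s1v3.ckpt",
    "s2Gv3.pth",
    "models--nvidia--bigvgan_v2_24khz_100band_256x",
]

def missing_pretrained(root: str, expected=V3_PRETRAINED) -> list:
    """Return the names under root/GPT_SoVITS/pretrained_models that are absent."""
    base = Path(root) / "GPT_SoVITS" / "pretrained_models"
    return [name for name in expected if not (base / name).exists()]

missing = missing_pretrained(".")
if missing:
    print("Missing pretrained files:", ", ".join(missing))
```

A check like this turns a confusing runtime failure into an explicit list of what still needs to be downloaded.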
Special thanks to the following projects and contributors:
Thanks to @Naozumi520 for providing the Cantonese training set and for guidance on Cantonese-related knowledge.