GPT-SoVITS-Encore

A Powerful Few-shot Voice Conversion and Text-to-Speech Project.



Features:

  1. Zero-shot TTS: Input a 5-second vocal sample and experience instant text-to-speech conversion.

  2. Few-shot TTS: Fine-tune the model with just 1 minute of training data for improved voice similarity and realism.

  3. Cross-lingual Support: Inference in languages different from the training dataset, currently supporting English, Japanese, Korean, Cantonese and Chinese.

  4. WebUI Tools: Integrated tools include vocal/accompaniment separation, automatic training-set segmentation, Chinese ASR, and text labeling, helping beginners create training datasets and GPT/SoVITS models.

Few-shot fine-tuning demo for unseen speakers:

(Demo video: few.shot.fine.tuning.demo.mp4)

Installation

Pretrained Models

Dataset Format

The TTS annotation file format:

vocal_path|speaker_name|language|text

Language dictionary:

  • 'zh': Chinese
  • 'ja': Japanese
  • 'en': English
  • 'ko': Korean
  • 'yue': Cantonese

Example:

xxx.wav|xxx|en|I like playing Genshin.
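The annotation format above can be validated with a small parser. This is an illustrative sketch, not code from the repository; the field names and the set of language codes are taken from the dictionary in this README.

```python
# Illustrative sketch (not part of GPT-SoVITS): parse and validate one line
# of the TTS annotation format "vocal_path|speaker_name|language|text".

# Language codes listed in this README.
SUPPORTED_LANGS = {"zh", "ja", "en", "ko", "yue"}

def parse_annotation(line: str) -> dict:
    """Split an annotation line into its four fields and check the language code."""
    # maxsplit=3 keeps any '|' characters inside the text field intact.
    parts = line.rstrip("\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 pipe-separated fields, got {len(parts)}")
    vocal_path, speaker, lang, text = parts
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language code: {lang!r}")
    return {"vocal_path": vocal_path, "speaker": speaker,
            "language": lang, "text": text}
```

For the example line above, parse_annotation("xxx.wav|xxx|en|I like playing Genshin.") yields the four fields as a dict; using maxsplit=3 means a pipe occurring inside the transcript text does not break the parse.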

Finetune and inference

V2 Release Notes

New Features:

  1. Support for Korean and Cantonese

  2. An optimized text frontend

  3. Pre-trained models extended from 2k hours to 5k hours of training data

  4. Improved synthesis quality for low-quality reference audio

    more details

Use v2 from v1 environment:

  1. Run pip install -r requirements.txt to update some packages

  2. Clone the latest code from GitHub.

  3. Download v2 pretrained models from huggingface and put them into GPT_SoVITS\pretrained_models\gsv-v2final-pretrained.

    Additional for Chinese v2: download the G2PW models (G2PWModel_1.1.zip), unzip the archive, rename the folder to G2PWModel, and place it in GPT_SoVITS/text.

V3 Release Notes

New Features:

  1. The timbre similarity is higher, requiring less training data to approximate the target speaker (the timbre similarity is significantly improved using the base model directly without fine-tuning).

  2. The GPT model is more stable, with fewer repetitions and omissions, and it is easier to generate speech with richer emotional expression.

    more details

Use v3 from v2 environment:

  1. Run pip install -r requirements.txt to update some packages

  2. Clone the latest code from GitHub.

  3. Download v3 pretrained models (s1v3.ckpt, s2Gv3.pth and models--nvidia--bigvgan_v2_24khz_100band_256x folder) from huggingface and put them into GPT_SoVITS\pretrained_models.

    Additional: for the Audio Super Resolution model, see the documentation on how to download it.

Credits

Special thanks to the following projects and contributors:

Major Repository

GPT-SoVITS

Theoretical Research

Pretrained Models

Text Frontend for Inference

WebUI Tools

Thankful to @Naozumi520 for providing the Cantonese training set and for the guidance on Cantonese-related knowledge.

Thanks to all contributors for their efforts
