F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

F5-TTS: Diffusion Transformer with ConvNeXt V2, faster training and inference.

E2 TTS: Flat-UNet Transformer, closest reproduction of the paper.

Sway Sampling: Inference-time flow step sampling strategy, greatly improves performance
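
As a rough illustration of what a flow-step sampling strategy does, the sketch below warps uniformly spaced flow steps with a sway coefficient `s`. The mapping `u + s * (cos(pi/2 * u) - 1 + u)` and the default `s = -1` are taken from the F5-TTS paper as assumptions to verify there; `sway_sample` and `flow_steps` are hypothetical names, not functions from this repo.

```python
import math


def sway_sample(u, s=-1.0):
    """Warp a uniform flow step u in [0, 1] with sway coefficient s (assumed formula)."""
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)


def flow_steps(n_steps, s=-1.0):
    """Return n_steps + 1 warped time points running from 0 to (approximately) 1."""
    return [sway_sample(i / n_steps, s) for i in range(n_steps + 1)]


if __name__ == "__main__":
    # With s < 0 the points are denser near the start of the flow;
    # s = 0 leaves the uniform spacing unchanged.
    print([round(t, 3) for t in flow_steps(8)])
    print([round(t, 3) for t in flow_steps(8, s=0.0)])
```

With a negative coefficient the midpoint `u = 0.5` maps below 0.5, so more of the step budget is spent early in the flow.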

Thanks to all the contributors!

News

2024/10/08: F5-TTS & E2 TTS base models on 🤗 Hugging Face, 🤖 Model Scope, 🟣 Wisemodel.

Installation

On Windows, run install-with-uv(nocache).ps1 in PowerShell.

[Optional] A Dockerfile is also provided; the build command is listed under "3. Docker usage" below.

# Create a python 3.10 conda env (you could also use virtualenv)
conda create -n f5-tts python=3.10
conda activate f5-tts

# NVIDIA GPU: install pytorch with your CUDA version, e.g.
pip install torch==2.3.0+cu118 torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118

# AMD GPU: install pytorch with your ROCm version, e.g. (Linux only)
pip install torch==2.5.1+rocm6.2 torchaudio==2.5.1+rocm6.2 --extra-index-url https://download.pytorch.org/whl/rocm6.2

# Intel GPU: install pytorch with your XPU version, e.g.
# Intel® Deep Learning Essentials or Intel® oneAPI Base Toolkit must be installed
pip install --pre torch torchaudio --index-url https://download.pytorch.org/whl/nightly/xpu
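
After installing PyTorch, it can help to confirm which accelerator the build actually sees. The following is a minimal sketch, not part of this repo; `detect_torch_device` is a hypothetical helper name. ROCm builds report through `torch.cuda`, and the `torch.xpu` check is guarded because older PyTorch versions do not provide it.

```python
import importlib.util


def detect_torch_device():
    """Return the accelerator PyTorch can use, or a note if torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch

    # True for NVIDIA CUDA builds and also for AMD ROCm builds.
    if torch.cuda.is_available():
        return "cuda"
    # Intel GPU support; the attribute is absent on older PyTorch versions.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"
    return "cpu"


if __name__ == "__main__":
    print(detect_torch_device())
```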

Then you can choose from a few options below:

1. As a pip package (if just for inference)

pip install git+https://github.com/SWivid/F5-TTS.git

2. Local editable (if you also do training or finetuning)

git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
# git submodule update --init --recursive  # (optional, if need bigvgan)
pip install -e .

3. Docker usage

# Build from Dockerfile
docker build -t f5tts:v1 .

# Or pull from GitHub Container Registry
docker pull ghcr.io/swivid/f5-tts:main

Inference

1. Gradio App

Currently supported features:

Basic TTS with Chunk Inference
Multi-Style / Multi-Speaker Generation
Voice Chat powered by Qwen2.5-3B-Instruct
Custom inference with more language support

# Launch a Gradio app (web interface)
f5-tts_infer-gradio

# Specify the port/host
f5-tts_infer-gradio --port 7860 --host 0.0.0.0

# Launch a share link
f5-tts_infer-gradio --share

NVIDIA device docker compose file example

services:
  f5-tts:
    image: ghcr.io/swivid/f5-tts:main
    ports:
      - "7860:7860"
    environment:
      GRADIO_SERVER_PORT: 7860
    entrypoint: ["f5-tts_infer-gradio", "--port", "7860", "--host", "0.0.0.0"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  f5-tts:
    driver: local
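
Once the container is up, a quick way to confirm that the Gradio port from the compose file is reachable is a plain TCP check. This is only a sketch using the standard library; `is_port_open` is a hypothetical helper, and the default port 7860 is the one mapped in the example above.

```python
import socket


def is_port_open(host="127.0.0.1", port=7860, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    print("Gradio reachable:", is_port_open())
```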

2. CLI Inference

# Run with flags
# Leaving --ref_text "" makes an ASR model transcribe the reference audio (uses extra GPU memory)
f5-tts_infer-cli \
--model "F5-TTS" \
--ref_audio "ref_audio.wav" \
--ref_text "The content, subtitle or transcription of reference audio." \
--gen_text "Some text you want the TTS model to generate for you."

# Run with the default settings from src/f5_tts/infer/examples/basic/basic.toml
f5-tts_infer-cli
# Or with your own .toml file
f5-tts_infer-cli -c custom.toml

# Multi voice. See src/f5_tts/infer/README.md
f5-tts_infer-cli -c src/f5_tts/infer/examples/multi/story.toml
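
The flag-based invocation above can also be assembled from Python, for example when scripting several generations. The sketch below only builds the argument list from the flags shown in this section; `build_infer_command` is a hypothetical helper, and actually running the command assumes `f5-tts_infer-cli` is installed and on your PATH.

```python
import shlex


def build_infer_command(model, ref_audio, ref_text, gen_text):
    """Build the argument list for f5-tts_infer-cli from the documented flags."""
    return [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
    ]


command = build_infer_command(
    model="F5-TTS",
    ref_audio="ref_audio.wav",
    ref_text="",  # empty: the ASR model transcribes the reference audio
    gen_text="Some text to synthesize.",
)

# Print a shell-safe version of the command for inspection.
print(shlex.join(command))
# To run it: import subprocess; subprocess.run(command, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when the text contains quotes or non-ASCII characters.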

3. More instructions

For better generation results, take a moment to read the detailed guidance.
The Issues are very useful; please search them with keywords for the problem you encountered. If you find no answer, feel free to open an issue.

Training

1. Gradio App

Read training & finetuning guidance for more instructions.

# Quick start with Gradio web interface
f5-tts_finetune-gradio

Evaluation

Development

Use pre-commit to ensure code quality (will run linters and formatters automatically)

pip install pre-commit
pre-commit install

When making a pull request, before each commit, run:

pre-commit run --all-files

Note: Some model components have linting exceptions for E722 to accommodate tensor notation

Prepare Dataset

Example data processing scripts are provided for Emilia and Wenetspeech4TTS. You may write your own script, along with a Dataset class in model/dataset.py.

# Prepare a custom dataset as needed.
# Download the corresponding dataset first, then fill in its path in the scripts.

# Prepare the Emilia dataset
python scripts/prepare_emilia.py

# Prepare the Wenetspeech4TTS dataset
python scripts/prepare_wenetspeech4tts.py

Training & Finetuning

Once your datasets are prepared, you can start the training process.

# Set up the accelerate config, e.g. multi-GPU DDP, fp16
# The config is saved to ~/.cache/huggingface/accelerate/default_config.yaml
accelerate config
accelerate launch train.py

For initial guidance on finetuning, see #57.

For Gradio UI finetuning with finetune_gradio.py, see #143.

Wandb Logging

By default, the training script does NOT use logging (assuming you didn't manually log in using wandb login).

To turn on wandb logging, you can either:

Manually login with wandb login: Learn more here
Automatically login programmatically by setting an environment variable: Get an API KEY at https://wandb.ai/site/ and set the environment variable as follows:

On Mac & Linux:

export WANDB_API_KEY=<YOUR WANDB API KEY>

On Windows:

set WANDB_API_KEY=<YOUR WANDB API KEY>

Moreover, if you cannot access Wandb and want to log metrics offline, you can set the environment variable as follows:

export WANDB_MODE=offline
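
The same variables can be set from Python before training starts, which is handy in notebooks or launcher scripts. This is a minimal sketch; `wandb_env` is a hypothetical helper, and only the two variable names shown above (`WANDB_API_KEY`, `WANDB_MODE`) are used.

```python
import os


def wandb_env(api_key=None, offline=False):
    """Return the environment variables for the chosen wandb logging mode."""
    env = {}
    if api_key:
        env["WANDB_API_KEY"] = api_key
    if offline:
        env["WANDB_MODE"] = "offline"
    return env


# Example: log metrics offline for this process and any child processes.
os.environ.update(wandb_env(offline=True))
```

The variables must be set before the training process is launched, since child processes inherit the environment at start-up.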

Inference

The pretrained model checkpoints are available from 🤗 Hugging Face and 🤖 Model Scope, or are downloaded automatically when running inference-cli and gradio_app.

A single generation currently supports up to 30s, which is the TOTAL length of the prompt audio and the generated audio. Batch inference with chunks is supported by inference-cli and gradio_app.

To avoid possible inference failures, make sure you have read through the following instructions.
The longer the prompt audio, the shorter the generated output can be. Anything beyond 30s cannot be generated properly. Consider using a prompt audio shorter than 15s.
Uppercase letters are uttered letter by letter, so use lowercase letters for normal words.
Add spaces (blank: " ") or punctuation (e.g. "," ".") to explicitly introduce pauses. If the first few words are skipped in code-switched generation (because speaking speed differs between languages), this might help.
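
The length and casing advice above can be checked before running inference. The sketch below is an illustration only and is not part of this repo: all function names are hypothetical, the 30s limit is the one stated above, and the casing helper simply lowercases all-caps words unless you list them as acronyms to be spelled out.

```python
import wave

MAX_TOTAL_SECONDS = 30.0  # prompt audio + generated audio, as stated above


def wav_duration_seconds(path):
    """Duration of a PCM WAV file, read with the standard library."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def generation_budget_seconds(prompt_seconds):
    """Seconds left for generated speech once the prompt is accounted for."""
    return max(0.0, MAX_TOTAL_SECONDS - prompt_seconds)


def lowercase_shouted_words(text, keep=frozenset()):
    """Lowercase all-caps words, except those in `keep` (read letter by letter)."""
    words = []
    for word in text.split(" "):
        core = word.strip(".,!?;:\"'")
        # Single letters such as "I" are left untouched.
        if len(core) > 1 and core.isupper() and core not in keep:
            word = word.lower()
        words.append(word)
    return " ".join(words)


if __name__ == "__main__":
    print(generation_budget_seconds(12.0))
    print(lowercase_shouted_words("HELLO world, NASA says HI.", keep={"NASA"}))
```

For example, a 12s prompt leaves 18s for the generated speech, which is why a prompt shorter than 15s is suggested.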

CLI Inference

You can either specify everything in inference-cli.toml or override settings with flags. Leaving --ref_text "" makes an ASR model transcribe the reference audio automatically (this uses extra GPU memory). If you encounter a network error, consider using a local checkpoint by setting ckpt_file in inference-cli.py.

To change the model, use --ckpt_file to specify the checkpoint you want to load.
To change vocab.txt, use --vocab_file to provide your own vocab.txt file.

python inference-cli.py \
--model "F5-TTS" \
--ref_audio "tests/ref_audio/test_en_1_ref_short.wav" \
--ref_text "Some call me nature, others call me mother nature." \
--gen_text "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences."

python inference-cli.py \
--model "E2-TTS" \
--ref_audio "tests/ref_audio/test_zh_1_ref_short.wav" \
--ref_text "对，这就是我，万人敬仰的太乙真人。" \
--gen_text "突然，身边一阵笑声。我看着他们，意气风发地挺直了胸膛，甩了甩那稍显肉感的双臂，轻笑道，我身上的肉，是为了掩饰我爆棚的魅力，否则，岂不吓坏了你们呢？"

# Multi voice
python inference-cli.py -c samples/story.toml
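
The "config file first, flags override" behaviour described above can be pictured as a simple merge. The sketch below shows one common way to implement that precedence with argparse; it is not the code of inference-cli.py, `resolve_settings` is a hypothetical name, and it assumes the config keys share the names of the flags.

```python
import argparse


def resolve_settings(file_settings, argv):
    """Merge settings from a config file with command-line flags; flags win."""
    parser = argparse.ArgumentParser()
    for name in ("model", "ref_audio", "ref_text", "gen_text"):
        # default=None lets us tell "flag not given" apart from an empty string.
        parser.add_argument(f"--{name}", default=None)
    flags = vars(parser.parse_args(argv))

    merged = dict(file_settings)
    merged.update({key: value for key, value in flags.items() if value is not None})
    return merged


if __name__ == "__main__":
    from_file = {"model": "F5-TTS", "ref_text": "hello"}
    print(resolve_settings(from_file, ["--model", "E2-TTS", "--ref_text", ""]))
```

Note that an explicitly passed empty `--ref_text ""` still overrides the file value, which matches the convention of leaving it empty to request ASR transcription.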

Gradio App

Currently supported features:

Chunk inference
Podcast Generation
Multiple Speech-Type Generation

You can launch a Gradio app (web interface) as a GUI for inference. It loads the checkpoint from Hugging Face by default; you may also use a local file by editing gradio_app.py. It currently loads the ASR model, F5-TTS, and E2 TTS all at once, so it uses more GPU memory than inference-cli.

On Windows, run run_gui.ps1 in PowerShell.

Speech Editing

To test speech editing capabilities, use the following command.

python speech_edit.py

Evaluation

Prepare Test Datasets

Seed-TTS test set: Download from seed-tts-eval.
LibriSpeech test-clean: Download from OpenSLR.
Unzip the downloaded datasets and place them in the data/ directory.
Update the path for the test-clean data in scripts/eval_infer_batch.py
Our filtered LibriSpeech-PC 4-10s subset is already under data/ in this repo

Batch Inference for Test Set

To run batch inference for evaluations, execute the following commands:

# batch inference for evaluations
accelerate config  # if not set before
bash scripts/eval_infer_batch.sh

Download Evaluation Model Checkpoints

Chinese ASR Model: Paraformer-zh
English ASR Model: Faster-Whisper
WavLM Model: Download from Google Drive.

Objective Evaluation

Install packages for evaluation:

pip install -r requirements_eval.txt

Some Notes

For faster-whisper with CUDA 11:

pip install --force-reinstall ctranslate2==3.24.0

(Recommended) To avoid possible ASR failures, such as abnormal repetitions in output:

pip install faster-whisper==0.10.1

Update the path with your batch-inferenced results, and carry out WER / SIM evaluations:

# Evaluation for Seed-TTS test set
python scripts/eval_seedtts_testset.py

# Evaluation for LibriSpeech-PC test-clean (cross-sentence)
python scripts/eval_librispeech_test_clean.py
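
For readers unfamiliar with the two metrics, the sketch below illustrates how they are commonly defined. It is not the code of the evaluation scripts, which obtain transcripts from the ASR models and embeddings from the WavLM model listed above. WER is computed here as word-level edit distance divided by the number of reference words; treating SIM as cosine similarity between speaker embeddings is an assumption. Both function names are hypothetical.

```python
import math


def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


if __name__ == "__main__":
    print(word_error_rate("the cat sat", "the bat sat"))
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Lower WER and higher SIM are better.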

Acknowledgements

E2-TTS brilliant work, simple and effective
Emilia, WenetSpeech4TTS, LibriTTS, LJSpeech valuable datasets
lucidrains initial CFM structure with also bfs18 for discussion
SD3 & Hugging Face diffusers DiT and MMDiT code structure
torchdiffeq as ODE solver, Vocos and BigVGAN as vocoder
FunASR, faster-whisper, UniSpeech, SpeechMOS for evaluation tools
ctc-forced-aligner for speech edit test
mrfakename huggingface space demo ~
f5-tts-mlx Implementation with MLX framework by Lucas Newman
F5-TTS-ONNX ONNX Runtime version by DakeQQ

Citation

If our work and codebase are useful to you, please cite as:

@article{chen-etal-2024-f5tts,
      title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching}, 
      author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
      journal={arXiv preprint arXiv:2410.06885},
      year={2024},
}

License

Our code is released under MIT License. The pre-trained models are licensed under the CC-BY-NC license due to the training data Emilia, which is an in-the-wild dataset. Sorry for any inconvenience this may cause.
