This repository contains the code and configuration needed to fine-tune the `facebook/mms-tts-swh` model on the Mozilla Common Voice Swahili dataset in a cloud GPU environment (e.g., Google Colab, AWS SageMaker, GCP AI Platform, Kaggle Kernel). It is adapted from the `ylacombe/finetune-hf-vits` repository.
- Cloud environment with a CUDA-enabled GPU (e.g., NVIDIA T4, V100, A100).
- Python 3.8+
- Git
- Hugging Face account and authentication token (Settings -> Access Tokens).
- (Optional) Weights & Biases account for experiment tracking (wandb.ai).
- Clone the repository:

  ```bash
  git clone <your-repo-url>
  cd swahili-tts-cloud-gpu
  ```
- Create a Python virtual environment (recommended):

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```
- Install requirements:

  ```bash
  pip install -r requirements.txt

  # Verify torch is installed with CUDA support:
  # python -c "import torch; print(torch.cuda.is_available())"
  ```
- Build Monotonic Alignment Search. This requires C build tools (e.g., `build-essential` on Debian/Ubuntu):

  ```bash
  cd monotonic_align
  python setup.py build_ext --inplace
  cd ..
  ```
- Log in to Hugging Face:

  ```bash
  huggingface-cli login  # Enter your token with write permissions
  ```
- (Optional) Log in to Weights & Biases:

  ```bash
  wandb login  # Enter your API key
  ```
- Prepare the base model: convert the original MMS checkpoint to include the discriminator. This saves the required model files into the `./models/base_model` directory.

  ```bash
  python convert_original_discriminator_checkpoint.py \
      --language_code sw \
      --pytorch_dump_folder_path ./models/base_model
  ```

  (Note: the language code here is `sw`, consistent with the dataset config name.)
- Accept the dataset terms: you must accept the terms for the `mozilla-foundation/common_voice_17_0` dataset on the Hugging Face Hub while logged in, otherwise the script will fail during data download. Visit the dataset page and accept the terms; a quick access check is sketched below.
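  As a quick sanity check that your account can access the gated dataset, you can try streaming a single example. This is a minimal sketch; depending on your installed `datasets` version you may also need to pass `trust_remote_code=True` to `load_dataset`:

  ```python
  from datasets import load_dataset

  # Assumes you are logged in (huggingface-cli login) and have accepted the dataset terms.
  ds = load_dataset(
      "mozilla-foundation/common_voice_17_0",
      "sw",                # Swahili config, matching the language code used above
      split="validation",
      streaming=True,      # avoids downloading the full dataset for this check
  )
  print(next(iter(ds))["sentence"])
  ```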
- Review the configuration: open `training_config.json` and adjust parameters as needed (an illustrative excerpt is shown below):
  - `filter_on_speaker_id`: change `"female"` to `"male"` or `"other"` if desired.
  - `hub_model_id`: crucially, change `"your-username/swahili-mms-female-voice-cloud"` to your actual Hugging Face username and desired model name.
  - `per_device_train_batch_size`: adjust based on your GPU's VRAM (e.g., lower for a T4, higher for an A100).
  - `gradient_accumulation_steps`: adjust inversely to the batch size.
  - `fp16`: set to `false` if your GPU doesn't handle FP16 well.
  - `push_to_hub`: set to `false` if you don't want the final model uploaded automatically.
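  For orientation, an illustrative excerpt showing only the fields discussed above might look like the following. The values are examples only; check the actual `training_config.json` in this repository for the full set of fields and their defaults:

  ```json
  {
    "filter_on_speaker_id": "female",
    "hub_model_id": "your-username/swahili-mms-female-voice-cloud",
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 1,
    "fp16": true,
    "push_to_hub": true
  }
  ```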
- Configure Accelerate: set up `accelerate` for your GPU environment.

  ```bash
  accelerate config
  ```

  - Choose `This machine`.
  - Choose `multi-GPU` if you have multiple GPUs, otherwise `No distributed training`.
  - Follow the prompts, generally accepting the defaults (saying `No` to DeepSpeed, Dynamo, etc.).
  - Select `fp16` if you enabled it in the config and your GPU supports it.
- Run fine-tuning: launch the training script with `accelerate`.

  ```bash
  accelerate launch run_vits_finetuning.py training_config.json
  ```
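  If you prefer to skip the interactive `accelerate config` step, most of the same options can instead be passed directly on the command line, for example (single GPU with mixed precision):

  ```bash
  accelerate launch --mixed_precision fp16 run_vits_finetuning.py training_config.json
  ```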
- Terminal Output: Track progress directly in the terminal.
- TensorBoard: Launch TensorBoard to view loss curves and other metrics.

  ```bash
  # Run in a separate terminal or tmux/screen session
  tensorboard --logdir models/finetuned_model/runs
  ```
- Weights & Biases: If logged in, monitor experiments on the W&B dashboard.
Once training is complete (or you want to test a checkpoint):
- The best model is saved in `./models/finetuned_model`.
- If `push_to_hub` was true, it is also uploaded to your Hugging Face Hub repository.
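If `push_to_hub` was set to `false` and you later decide to publish the model, one option is to upload the output folder with the `huggingface_hub` client. This is a minimal sketch; the repository name is only an example and should match the `hub_model_id` you chose:

```python
from huggingface_hub import HfApi

repo_id = "your-username/swahili-mms-female-voice-cloud"  # example name
api = HfApi()

# Create the model repo if it doesn't exist yet, then upload the output folder.
api.create_repo(repo_id, repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./models/finetuned_model",
    repo_id=repo_id,
    repo_type="model",
)
```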
Use the model with the `transformers` pipeline:

```python
from transformers import pipeline
import scipy.io.wavfile
import torch

# Use the Hub ID or the local path to your final model/checkpoint
model_id = "your-username/swahili-mms-female-voice-cloud"  # Or "./models/finetuned_model"
device = "cuda" if torch.cuda.is_available() else "cpu"

synthesiser = pipeline("text-to-speech", model=model_id, device=device)

text = "Habari gani? Hii ni mfano wa sauti iliyotengenezwa kutoka kwenye mfano ulioboreshwa."
output_file = "finetuned_cloud_output.wav"

# Generate speech
speech = synthesiser(text)

# Save audio
scipy.io.wavfile.write(
    output_file,
    rate=speech["sampling_rate"],
    data=speech["audio"],
)

print(f"Audio saved to {output_file}")
```
```
swahili-tts-cloud-gpu/
├── .git/
├── .gitignore
├── README.md
├── requirements.txt
├── training_config.json
├── run_vits_finetuning.py
├── convert_original_discriminator_checkpoint.py
├── monotonic_align/       # Code for alignment
│   └── ...
├── utils/                 # Utility functions
│   └── ...
├── data/                  # For downloaded/processed data (ignored by git)
│   └── .gitkeep
└── models/                # For model files (ignored by git)
    ├── base_model/        # Converted base model for training
    │   └── .gitkeep
    └── finetuned_model/   # Output directory for checkpoints & final model
        └── .gitkeep
```