A robust data collection and processing pipeline for creating high-quality multilingual speech datasets for TTS benchmarking.
This pipeline automates the collection, processing, and filtering of multilingual speech data from online sources to create a clean, standardized dataset suitable for training and evaluating Text-to-Speech (TTS) systems across multiple languages.
The pipeline currently supports 15 languages:
- English (en)
- Spanish (es)
- Italian (it)
- Japanese (ja)
- Polish (pl)
- Portuguese (pt)
- Turkish (tr)
- Chinese (zh)
- French (fr)
- German (de)
- Korean (ko)
- Arabic (ar)
- Russian (ru)
- Dutch (nl)
- Hindi (hi)
The data processing pipeline consists of the following stages:
- **Video Discovery** (`01_get_videos.py`): Searches for relevant videos in target languages using the YouTube API with geo-specific queries.
- **Language Verification** (`02_to_txt.py`): Extracts video metadata and verifies that content is in the target language using fastText language detection (a minimal sketch follows this list).
- **Video Download** (`03_download_videos.py`): Downloads the verified videos for further processing.
- **Utterance Extraction** (`04_get_utterances.py`): Processes videos to extract individual utterances and associated metadata.
- **Audio Processing** (`05_extract_and_filter.py`): Extracts audio segments and performs quality filtering based on:
  - Speech quality (DNSMOS score)
  - Background noise/music detection
  - Transcript accuracy (WER/CER metrics)
  - Audio duration constraints
- **Final Dataset Creation** (`06_final_filter.py`): Performs final filtering and creates the standardized dataset.
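For illustration, the language check in stage 2 can be as simple as thresholding fastText's prediction confidence. The sketch below assumes the off-the-shelf `lid.176.bin` language-ID model; the `is_target_language` helper and the 0.8 threshold are hypothetical, not the actual code in `02_to_txt.py`:

```python
# Minimal sketch of fastText-based language verification (illustrative only).
import fasttext

# One-time download: https://fasttext.cc/docs/en/language-identification.html
model = fasttext.load_model("lid.176.bin")

def is_target_language(text: str, language_code: str, threshold: float = 0.8) -> bool:
    """Return True if fastText predicts `language_code` with enough confidence."""
    # fastText expects single-line input, so strip newlines first
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    predicted = labels[0].replace("__label__", "")
    return predicted == language_code and probs[0] >= threshold

print(is_target_language("Bonjour tout le monde", "fr"))  # True
```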
Requirements:

- Python 3.8+
- YouTube Data API key
- GeoNames API username
- GPU recommended for audio processing (with CUDA support)

Python dependencies:

- torch/torchaudio
- faster_whisper
- fasttext
- demucs
- transformers
- rich
- requests
- geopy
- python-dotenv
- **Setup Environment**:

  ```bash
  # Clone the repository
  git clone https://github.com/yourusername/TTSDS.git
  cd TTSDS

  # Create and activate a virtual environment (recommended)
  python -m venv venv
  source venv/bin/activate  # On Windows, use: venv\Scripts\activate

  # Install dependencies
  pip install -r requirements.txt

  # Set up API keys in .env file
  echo "YOUTUBE_API_KEYS=your_youtube_api_key" > .env
  echo "GEONAMES_USERNAME=your_geonames_username" >> .env
  ```

  (A sketch of how these keys can be loaded at runtime follows the usage commands below.)
- **Run the Pipeline**:

  To process a specific language:

  ```bash
  python 01_get_videos.py --language_code en
  python 02_to_txt.py --language_code en
  python 03_download_videos.py --language_code en
  python 04_get_utterances.py --language_code en
  python 05_extract_and_filter.py --language_code en --device_index 0
  python 06_final_filter.py --language_code en
  ```
  To process all supported languages (may take significant time):

  ```bash
  for lang in en es it ja pl pt tr zh fr de ko ar ru nl hi; do
      python 01_get_videos.py --language_code $lang
      python 02_to_txt.py --language_code $lang
      python 03_download_videos.py --language_code $lang
      python 04_get_utterances.py --language_code $lang
      python 05_extract_and_filter.py --language_code $lang --device_index 0
      python 06_final_filter.py --language_code $lang
  done
  ```
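The scripts are expected to read the API keys from the `.env` file created during setup. Below is a minimal sketch of how that loading might look with python-dotenv, which is in the dependency list; the variable names match the `.env` above, and the comma-splitting of `YOUTUBE_API_KEYS` is an assumption based on the plural name, not confirmed repository behavior:

```python
# Illustrative only: loading the .env keys with python-dotenv.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

# Plural name suggests several comma-separated keys for quota rotation (assumption)
youtube_keys = os.environ["YOUTUBE_API_KEYS"].split(",")
geonames_username = os.environ["GEONAMES_USERNAME"]
```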
The final dataset will be organized by language code with the following structure:
```
dataset/
├── en/
│   ├── 0001_speaker1_0.mp3
│   ├── 0001_speaker1_0.txt
│   ├── ...
├── es/
│   ├── ...
```
Each audio file has a corresponding text file containing the transcript. Files follow the naming convention `{index}_{speaker_hash}_{utterance_id}.{ext}`.
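For downstream tooling, the convention is easy to parse. The helper below is purely illustrative; the field semantics (zero-padded index, hashed speaker ID, integer utterance ID) are assumptions drawn from the example filenames above:

```python
# Illustrative parser for '{index}_{speaker_hash}_{utterance_id}.{ext}'.
from pathlib import Path

def parse_utterance_filename(path: str) -> dict:
    p = Path(path)
    index, rest = p.stem.split("_", 1)
    # rsplit keeps any underscores inside the speaker hash intact
    speaker_hash, utterance_id = rest.rsplit("_", 1)
    return {
        "index": index,
        "speaker_hash": speaker_hash,
        "utterance_id": utterance_id,
        "ext": p.suffix.lstrip("."),
    }

print(parse_utterance_filename("dataset/en/0001_speaker1_0.mp3"))
# {'index': '0001', 'speaker_hash': 'speaker1', 'utterance_id': '0', 'ext': 'mp3'}
```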
The pipeline filters utterances based on:
- Word Error Rate (WER < 10%)
- Character Error Rate (CER < 5%)
- Duration (3-30 seconds)
- Background music/noise level
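Expressed as code, the first three criteria amount to a simple threshold check. The function below illustrates the listed thresholds only; it is not the repository's actual filtering API, and DNSMOS/background-noise scoring is omitted:

```python
# Illustrative threshold check for the filtering criteria above.
def passes_filters(wer: float, cer: float, duration_s: float) -> bool:
    return wer < 0.10 and cer < 0.05 and 3.0 <= duration_s <= 30.0

print(passes_filters(wer=0.04, cer=0.01, duration_s=12.5))  # True
print(passes_filters(wer=0.15, cer=0.01, duration_s=12.5))  # False: WER too high
```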
This project is licensed under the MIT License - see the LICENSE file for details.
- This project uses the YouTube Data API for video discovery
- Audio processing leverages various open-source tools including faster-whisper, demucs, and torchaudio