This repo contains the official implementation of the ACM MM 2024 paper ArtSpeech:
We devise an articulatory representation-based text-to-speech (TTS) model, ArtSpeech, an explainable and effective network for human-like speech synthesis, by revisiting the sound production system. Current deep TTS models learn the acoustic-text mapping in a fully parametric manner, ignoring the explicit physical significance of articulation movement. ArtSpeech, in contrast, leverages articulatory representations to perform adaptive TTS, clearly describing the voice tone and speaking prosody of different speakers. Specifically, energy, F0, and vocal tract variables are used to represent the airflow forced by the articulatory organs, the degree of tension in the vocal folds of the larynx, and the coordinated movements between different organs, respectively. We also design a multi-dimensional style mapping network to extract speaking styles from the articulatory representations; guided by these styles, variation predictors produce the final mel-spectrogram output. To validate the effectiveness of our approach, we conducted comprehensive experiments and analyses on widely recognized speech corpora, namely the LJSpeech and LibriTTS datasets, yielding promising improvements in similarity between the generated results and the target speaker's voice and prosody.
Demo Page: ArtSpeech demo page
Paper Link: paper
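For intuition about the data flow described above, here is a minimal conceptual sketch (not the actual ArtSpeech code; the module name, feature dimensions, and pooling are illustrative assumptions) of how frame-level articulatory representations could be mapped to a speaking-style vector that conditions the variation predictors:

```python
import torch
import torch.nn as nn

class StyleMappingNetwork(nn.Module):
    """Illustrative sketch: map frame-level articulatory representations
    (energy, F0, vocal tract variables) to an utterance-level style vector."""
    def __init__(self, n_tvs: int = 9, style_dim: int = 128):  # n_tvs is an assumed TV count
        super().__init__()
        # per frame: energy (1) + F0 (1) + vocal tract variables (n_tvs)
        self.net = nn.Sequential(
            nn.Linear(2 + n_tvs, 256),
            nn.ReLU(),
            nn.Linear(256, style_dim),
        )

    def forward(self, energy, f0, tvs):
        # energy, f0: (batch, frames, 1); tvs: (batch, frames, n_tvs)
        feats = torch.cat([energy, f0, tvs], dim=-1)
        # temporal average pooling -> one style vector per utterance
        return self.net(feats).mean(dim=1)

# Stand-in tensors; in ArtSpeech these features come from the articulatory
# extraction pipeline, and the style vector guides the variation predictors.
energy, f0, tvs = torch.randn(2, 200, 1), torch.randn(2, 200, 1), torch.randn(2, 200, 9)
style = StyleMappingNetwork()(energy, f0, tvs)  # shape: (2, 128)
```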
- Python >= 3.8
conda create -n ArtSpeech python==3.8.0
conda activate ArtSpeech
(Some code adjustments might be necessary when using the latest versions of Python and PyTorch.)
- Clone this repository:
git clone https://github.com/Zhongxu-Wang/ArtSpeech.git
cd ArtSpeech
- Install Python requirements:
pip install torchaudio munch torch librosa pyyaml click tqdm attrdict matplotlib tensorboard Cython
Download the pre-trained weights and place them in the Outputs/LibriTTS/ directory: LibriTTS pre-training weight
Set pretrained_model in Configs/config.yaml to "Outputs/LibriTTS/epoch_2nd_00119.pth".
Before running inference, make sure you have a reference audio file and the text you want to synthesize, then set the following variables:
text = "XXXXX"
ref_wav = "ref.wav"
save_path = "output.wav"
Execute the test script:
python test.py
Create a new folder: Data/LibriTTS/train-clean-460/
Download the train-clean-100.tar.gz and train-clean-360.tar.gz datasets from LibriTTS, merge them, and place them in the train-clean-460 directory (a minimal merge sketch is shown below).
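If helpful, here is a minimal merge sketch (not part of the repo) that copies both extracted splits into the target directory; the source paths assume the archives were extracted to LibriTTS/train-clean-100 and LibriTTS/train-clean-360 in the current directory:

```python
# Merge the two extracted LibriTTS splits into Data/LibriTTS/train-clean-460/.
# Source paths are assumptions about where the archives were extracted.
import shutil
from pathlib import Path

dst = Path("Data/LibriTTS/train-clean-460")
dst.mkdir(parents=True, exist_ok=True)

for split in ("LibriTTS/train-clean-100", "LibriTTS/train-clean-360"):
    for entry in Path(split).iterdir():
        if entry.is_dir():  # speaker folders
            shutil.copytree(entry, dst / entry.name, dirs_exist_ok=True)
```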
Download the articulatory feature ground-truth (GT) predict file and place it in the Data/ directory. The predict file contains the articulatory feature GT for the LibriTTS, LJSpeech, and VCTK datasets. If you want to train on your own dataset, please refer to this project: TVsExtractor.
Run the following commands:
python train_first.py
python train_second.py
All results in the paper, as well as the pre-trained models provided in the repository, were trained using the LJSpeech and LibriTTS datasets.
Here we publish the multi-age, emotionally rich speech data we collected. We hope this will be useful for future research.
Multi-age dataset: The Multi-Age Dataset consists of 36 children's videos and 15 elderly videos sourced from YouTube. The corresponding audio has been transcribed with automatic speech recognition (ASR), segmented, and manually verified. The dataset contains a total of 4,695 spoken sentences.