This repo contains the official implementation of the ACM MM 2024 paper ArtSpeech:
We devise an articulatory representation-based text-to-speech (TTS) model, ArtSpeech, an explainable and effective network for human-like speech synthesis, by revisiting the sound production system. Current deep TTS models learn the acoustic-text mapping in a fully parametric manner, ignoring the explicit physical significance of articulation movement. ArtSpeech, in contrast, leverages articulatory representations to perform adaptive TTS, clearly describing the voice tone and speaking prosody of different speakers. Specifically, energy, F0, and vocal tract variables are used to represent the airflow forced by the articulatory organs, the degree of tension in the vocal folds of the larynx, and the coordinated movements between different organs, respectively. We also design a multi-dimensional style mapping network to extract speaking styles from the articulatory representations; guided by these styles, variation predictors produce the final mel-spectrogram output. To validate the effectiveness of our approach, we conducted comprehensive experiments and analyses on widely recognized speech corpora, namely the LJSpeech and LibriTTS datasets, yielding promising improvements in similarity between the generated results and the target speaker's voice and prosody.
Demo Page: ArtSpeech demo page
Paper Link: paper
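For intuition about the data flow described above, here is a minimal conceptual sketch (not the actual ArtSpeech code; the module name, feature dimensions, and pooling are illustrative assumptions) of how frame-level articulatory representations could be mapped to a speaking-style vector that conditions the variation predictors:

```python
import torch
import torch.nn as nn

class StyleMappingNetwork(nn.Module):
    """Illustrative sketch: map frame-level articulatory representations
    (energy, F0, vocal tract variables) to an utterance-level style vector."""
    def __init__(self, n_tvs: int = 9, style_dim: int = 128):  # n_tvs is an assumed TV count
        super().__init__()
        # per frame: energy (1) + F0 (1) + vocal tract variables (n_tvs)
        self.net = nn.Sequential(
            nn.Linear(2 + n_tvs, 256),
            nn.ReLU(),
            nn.Linear(256, style_dim),
        )

    def forward(self, energy, f0, tvs):
        # energy, f0: (batch, frames, 1); tvs: (batch, frames, n_tvs)
        feats = torch.cat([energy, f0, tvs], dim=-1)
        # temporal average pooling -> one style vector per utterance
        return self.net(feats).mean(dim=1)

# Stand-in tensors; in ArtSpeech these features come from the articulatory
# extraction pipeline, and the style vector guides the variation predictors.
energy, f0, tvs = torch.randn(2, 200, 1), torch.randn(2, 200, 1), torch.randn(2, 200, 9)
style = StyleMappingNetwork()(energy, f0, tvs)  # shape: (2, 128)
```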
- Python >= 3.8
conda create -n ArtSpeech python==3.8.0
conda activate ArtSpeech
(Some code adjustments might be necessary when using the latest versions of Python and PyTorch.)
- Clone this repository:
git clone https://github.com/Zhongxu-Wang/ArtSpeech.git
cd ArtSpeech
- Install Python requirements:
pip install torchaudio munch torch librosa pyyaml click tqdm attrdict matplotlib tensorboard Cython
Download the pre-trained weights and place them in the Outputs/LibriTTS/ directory: LibriTTS pre-training weight
Set pretrained_model in Configs/config.yaml to "Outputs/LibriTTS/epoch_2nd_00119.pth".
Before running inference, make sure you have a reference audio file and the text you want to synthesize, then set the following variables:
text = "XXXXX"
ref_wav = "ref.wav"
save_path = "output.wav"
Execute the test script:
python test.py
Create a new folder: Data/LibriTTS/train-clean-460/
Download the train-clean-100.tar.gz and train-clean-360.tar.gz datasets from LibriTTS, merge them, and place them in the train-clean-460 directory (a minimal merge sketch is shown below).
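If helpful, here is a minimal merge sketch (not part of the repo) that copies both extracted splits into the target directory; the source paths assume the archives were extracted to LibriTTS/train-clean-100 and LibriTTS/train-clean-360 in the current directory:

```python
# Merge the two extracted LibriTTS splits into Data/LibriTTS/train-clean-460/.
# Source paths are assumptions about where the archives were extracted.
import shutil
from pathlib import Path

dst = Path("Data/LibriTTS/train-clean-460")
dst.mkdir(parents=True, exist_ok=True)

for split in ("LibriTTS/train-clean-100", "LibriTTS/train-clean-360"):
    for entry in Path(split).iterdir():
        if entry.is_dir():  # speaker folders
            shutil.copytree(entry, dst / entry.name, dirs_exist_ok=True)
```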
Download the articulatory feature ground-truth (GT) predict file and place it in the Data/ directory. The predict file contains the articulatory feature GT for the LibriTTS, LJSpeech, and VCTK datasets. If you want to train on your own dataset, please refer to this project: TVsExtractor.
Run the following commands:
python train_first.py
python train_second.py
All results in the paper, as well as the pre-trained models provided in the repository, were trained using the LJSpeech and LibriTTS datasets.
Here we publish the multi-age, emotionally rich speech data we collected. We hope this will be useful for future research.
Multi-age dataset: The Multi-Age Dataset consists of 36 children's videos and 15 elderly videos sourced from YouTube. The corresponding audio has been transcribed with automatic speech recognition (ASR), segmented, and manually verified. The dataset contains a total of 4,695 spoken sentences.