Framework overview (figure)

This repo contains the official implementation of the ACM MM 2024 paper:

ArtSpeech: Adaptive Text-to-Speech Synthesis with Articulatory Representations

Zhongxu Wang, Yujia Wang, Mingzhu Li, Hua Huang

Introduction

We devise ArtSpeech, an articulatory representation-based text-to-speech (TTS) model: an explainable and effective network for human-like speech synthesis that revisits the human sound production system. Current deep TTS models learn the acoustic-text mapping in a fully parametric manner, ignoring the explicit physical significance of articulatory movement. ArtSpeech, in contrast, leverages articulatory representations to perform adaptive TTS, clearly describing the voice tone and speaking prosody of different speakers. Specifically, energy, F0, and vocal tract variables are used to represent the airflow forced by the articulatory organs, the degree of tension in the vocal folds of the larynx, and the coordinated movements between different organs, respectively. We also design a multi-dimensional style mapping network to extract speaking styles from the articulatory representations; guided by these styles, variation predictors produce the final mel-spectrogram output. To validate the effectiveness of our approach, we conducted comprehensive experiments and analyses on widely recognized speech corpora such as LJSpeech and LibriTTS, which show a promising improvement in similarity between the generated results and the target speaker’s voice and prosody.

Demo Page: ArtSpeech demo page

Paper Link: paper

Pre-requisites

  1. Python >= 3.8
conda create -n ArtSpeech python==3.8.0
conda activate ArtSpeech

(Some code adjustments might be necessary when using the latest versions of Python and PyTorch.)

  2. Clone this repository:
git clone https://github.com/Zhongxu-Wang/ArtSpeech.git
cd ArtSpeech
  3. Install Python requirements (a quick import check is shown after this list):
pip install torchaudio munch torch librosa pyyaml click tqdm attrdict matplotlib tensorboard Cython
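
To sanity-check the environment, you can try importing the core dependencies and checking for a GPU. This is only a convenience snippet, not part of the official setup:

# Quick environment check: confirm the main dependencies import and whether a GPU is visible.
import torch
import torchaudio
import librosa

print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("librosa:", librosa.__version__)
print("CUDA available:", torch.cuda.is_available())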

Inference

Download the pre-trained weights and place them in Outputs/LibriTTS/: LibriTTS pre-training weight

Set pretrained_model in Configs/config.yaml to "Outputs/LibriTTS/epoch_2nd_00119.pth"
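
After this edit, the relevant line in Configs/config.yaml should look roughly as follows (the key name and checkpoint path come from the step above; the other entries in the file are left unchanged):

pretrained_model: "Outputs/LibriTTS/epoch_2nd_00119.pth"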

Before running inference, make sure you have a reference audio file and the text you want to synthesize, and set them in the test script:

text = "XXXXX"
ref_wav = "ref.wav"
save_path = "output.wav"

Execute the test script

python test.py
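
test.py should write the synthesized waveform to save_path. As an optional check (assuming the output.wav path from the snippet above), you can load the result and inspect it:

# Optional check: load the synthesized waveform and report sample rate and duration.
import torchaudio

wav, sr = torchaudio.load("output.wav")  # save_path from the snippet above
print(f"sample rate: {sr} Hz, duration: {wav.shape[-1] / sr:.2f} s")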

Training

Dataset Preparation

Create a new folder: Data/LibriTTS/train-clean-460/. Download the train-clean-100.tar.gz and train-clean-360.tar.gz datasets from LibriTTS, merge them, and place them in the train-clean-460 directory.
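
If you prefer to merge the two subsets with a script rather than by hand, the sketch below copies every speaker folder into the merged directory. The source paths are illustrative and assume the archives have already been extracted; adjust them to your layout:

# Illustrative merge of the two LibriTTS subsets into Data/LibriTTS/train-clean-460/.
# The source paths below are assumptions; point them at your extracted archives.
import shutil
from pathlib import Path

dst = Path("Data/LibriTTS/train-clean-460")
dst.mkdir(parents=True, exist_ok=True)

for subset in ("LibriTTS/train-clean-100", "LibriTTS/train-clean-360"):
    for speaker_dir in Path(subset).iterdir():
        if speaker_dir.is_dir():
            # Copy each speaker folder; dirs_exist_ok requires Python >= 3.8.
            shutil.copytree(speaker_dir, dst / speaker_dir.name, dirs_exist_ok=True)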

Download the articulatory features GT predict file and place it in the Data/ directory. The predict file contains the ground-truth articulatory features for the LibriTTS, LJSpeech, and VCTK datasets. If you want to train on your own dataset, please refer to the TVsExtractor project.

Train the Model

Run the following commands:

python train_first.py
python train_second.py

Additional Training Data

All results in the paper, as well as the pre-trained models provided in the repository, were trained using the LJSpeech and LibriTTS datasets.

Here we publish the multi-age, emotionally rich speech data we collected. We hope this will be useful for future research.

Multi-age dataset

The Multi-Age Dataset consists of 36 children's videos and 15 elderly videos sourced from YouTube. The corresponding audio has been transcribed through Automatic Speech Recognition (ASR), segmented, and manually verified. The dataset contains a total of 4,695 spoken sentences.
