TriAAN-VC: Triple Adaptive Attention Normalization for any-to-any Voice Conversion (ICASSP 2023 Top 3% paper recognized)

This is a Pytorch implementation of TriAAN-VC: Triple Adaptive Attention Normalization for any-to-any Voice Conversion. TriAAN-VC is a deep learning model for any-to-any voice conversion. TriAAN-VC can maintain the linguistic contents of source speech and represent target characteristics, unlike previous methods. Experimental results on the VCTK dataset suggest that TriAAN-VC achieves state-of-the-art performance.

We recommend you visit our demo site.

The overall architecture of TriAAN-VC is as below:

Installation & Enviornment

The OS, Python, and PyTorch version are as below (You can also use other versions.):

Windows
Linux
python == 3.8
pytorch == 1.9.1
torchaudio == 0.9.1

You can install requirements through git and requirements.txt except for pytorch and torchaudio.

git clone https://github.com/winddori2002/TriAAN-VC.git
cd TriAAN-VC
pip install -r requirements.txt

Prepare for usage

1. Prepare dataset

We use the VCTK dataset consisting of 110 speakers with 400 utterances per speaker.

The dataset can be downloaded here.
We divide the dataset depending on seen-to-seen and unseen-to-unseen scenarios for evaluation.

2. Prepare pre-trained vocoder and feature extractor

We use the pre-trained ParallelWaveGAN as vocoder and CPC extractor as feature extractor. You can use the pre-trained weights in this repository. The vocoder is trained on the VCTK dataset and CPC extractor is trained on the LibriSpeech dataset.

This repository provides pre-trained ParallelWaveGAN provided by here and CPC extractor provided by here.
Or you can train ParallelWaveGAN and CPC.

3. Preprocess

The preprocess stages contain dataset split, feature extraction, making data paths, and eval pairs.

The steps are for VCTK dataset and if you want to use other dataset, you need to modify the details.

To split dataset and get mel-spectrograms, lf0, and metadata, run the following code and modify the directories in the ./config/preprocess.yaml.

python preprocess.py

To get CPC features and represent the paths on the metadata, run the following code. (You need pre-trained weights for CPC)

python preprocess_cpc.py

To get evaluation pairs for conversion, run the following code.

python generate_eval_pair.py

The dataset split and eval pairs are for evaluation and investigation, they are not actually necessary to train models.

How to use

1. Train

Training with settings

You can train TriAAN-VC by running the following code.

If you want to edit model settings, you can run python main.py train with other arguments.

In config/base.yaml, you can find other arguments, such as batch size, epoch, and so on.

python main.py train
Model arguments:
  encoder:
  	c_in:      256  (cpc:256, mel:80)
	c_h:       512
	c_out:     4
	num_layer: 6
  decoder:
  	c_in:      4
	c_h:       512
	c_out:     80
	num_layer: 6
	
Train arguments:
  epoch:      500
  batch_size: 64
  siam:       True (if not, siamese path is excluded)
  cpc:        True (if not, TriAAN-VC uses mel inputs)

Training with logging

The logs are uploaded on neptune.ai

python main.py train --logging True

Logging arguments:
  --logging    : True/False

2. Evaluation

After training, you can evaluate the model in terms of lingustic content (WER and CER) and target characteristic (SV).

You need to keep the model arguments in the training phase. The code only supports the version in which the number of target utterances is 1.

python main.py test
evaluation arguments:
  --checkpoint: Checkpoint path

Or, you can use the below code for testing with multi-target utterances.

python test.py --n_uttr 1 --eval True
evaluation arguments:
  --eval:       Option for evaluation
  --n_uttr:     Number of target utterances
  --checkpoint: Checkpoint path

3. Pretrained weights

The pretrained weights of TriAAN-VC is uploaded on the github release here.

We provide two versions of models depending on input types (mel, cpc).

4. Custom convert

For custom conversion, you can run the code with convert.py. The codes include data processing, predicting, and vocoding.

python convert.py 
Conversion arguments:
  --src_name:   Sample source names
  --trg_name:   Sample target names
  --checkpoint: Checkpoint path

You can find converted examples in ./samples or please visit our demo site.

Experimental Results

The experimental results are from the provided pre-trained weights, and the results can be slightly different from the paper. "VCTK Split" indicates the pre-trained weight with dataset split as in the paper.

Below, the results are summarized with the "TriAAN-VC-Split". Each score is the average of seen-to-seen and unseen-to-unseen scenarios.

Model	Pre-trained Ver.	# uttr	WER AVG (%)	CER AVG (%)	SV AVG (%)
TriAAN-VC-Mel	VCTK Split	1	27.61	14.78	89.42
TriAAN-VC-Mel	VCTK Split	3	22.86	12.15	95.92
TriAAN-VC-CPC	VCTK Split	1	21.50	11.24	92.33
TriAAN-VC-CPC	VCTK Split	3	17.42	8.86	97.75

Certificate

Fortunately, our paper was recognized as the top 3% paper in ICASSP 2023.

Citation

@article{park2023triaan,
  title={TriAAN-VC: Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion},
  author={Park, Hyun Joon and Yang, Seok Woo and Kim, Jin Sob and Shin, Wooseok and Han, Sung Won},
  journal={arXiv preprint arXiv:2303.09057},
  year={2023}
}

License

This repository is released under the MIT license. We adapted CPC codes and weights from facebookresearch/CPC_audio, released under the MIT license. We used vocoder codes from kan-bayashi/ParallelWaveGAN, released under the MIT license. For the pre-trained vocoder, we used the weights from Wendison/VQMIVC which is released under the MIT license. We also modified preprocess codes from Wendison/VQMIVC.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TriAAN-VC: Triple Adaptive Attention Normalization for any-to-any Voice Conversion (ICASSP 2023 Top 3% paper recognized)

Installation & Enviornment

Prepare for usage

1. Prepare dataset

2. Prepare pre-trained vocoder and feature extractor

3. Preprocess

How to use

1. Train

Training with settings

Training with logging

2. Evaluation

3. Pretrained weights

4. Custom convert

Experimental Results

Certificate

Citation

License

About

Releases 1

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 86 Commits
base_data		base_data
checkpoints		checkpoints
config		config
cpc		cpc
img		img
model		model
preprocess		preprocess
samples		samples
src		src
vocoder		vocoder
LICENSE		LICENSE
README.md		README.md
convert.py		convert.py
generate_eval_pair.py		generate_eval_pair.py
main.py		main.py
preprocess.py		preprocess.py
preprocess_cpc.py		preprocess_cpc.py
requirements.txt		requirements.txt
test.py		test.py

License

winddori2002/TriAAN-VC

Folders and files

Latest commit

History

Repository files navigation

TriAAN-VC: Triple Adaptive Attention Normalization for any-to-any Voice Conversion (ICASSP 2023 Top 3% paper recognized)

Installation & Enviornment

Prepare for usage

1. Prepare dataset

2. Prepare pre-trained vocoder and feature extractor

3. Preprocess

How to use

1. Train

Training with settings

Training with logging

2. Evaluation

3. Pretrained weights

4. Custom convert

Experimental Results

Certificate

Citation

License

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages