We provide the Quality_Estimation "inference_folder" code to evaluate all the utterances in a folder. We found that VQScore inference is quite fast (15.5 hours of speech takes less than 2 minutes on a single A100 GPU), making it suitable for filtering out noisy training data when training speech enhancement or TTS models.
This work is about training a speech quality estimator and enhancement model WITHOUT any labeled (paired) data. Specifically, only CLEAN speech is needed for model training.
CUDA Version: 12.2
Python: 3.8
- Note: To use 'CUDAExecutionProvider' for accelerated DNSMOS ONNX model inference, please check CUDA and ONNX Runtime version compatibility here.
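For a quick sanity check of your ONNX Runtime install, you can list the available execution providers; if 'CUDAExecutionProvider' is missing, DNSMOS inference will fall back to the CPU:

```python
# List the execution providers ONNX Runtime can actually use. If
# 'CUDAExecutionProvider' is absent, the installed onnxruntime build or the
# CUDA version is likely incompatible, and DNSMOS will run on CPU.
import onnxruntime as ort

print(ort.__version__)
print(ort.get_available_providers())
```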
If you want to train from scratch, please download the datasets to the corresponding paths specified in the .csv and .pickle files.
Speech enhancement:
=> Training: clean speech of the VoiceBank-DEMAND trainset (its original sampling rate is 48kHz, so you have to downsample it to 16kHz; see the resampling sketch after this list)
=> Validation: as in MetricGAN-U, noisy speech (speakers p226 and p287) of the VoiceBank-DEMAND trainset
=> Evaluation: noisy speech of the VoiceBank-DEMAND testset, DNS1, and DNS3
Quality estimation (VQScore):
=> Training: LibriSpeech clean-460 hours
=> Validation: noisy speech of the VoiceBank-DEMAND testset
=> Evaluation: Tencent and IUB
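A minimal resampling sketch (assuming torchaudio is available; the file paths are placeholders):

```python
# Downsample a 48kHz VoiceBank-DEMAND wav to the 16kHz rate the models expect.
import torchaudio
import torchaudio.functional as F

waveform, sr = torchaudio.load("clean_48k.wav")        # placeholder input path
waveform_16k = F.resample(waveform, orig_freq=sr, new_freq=16000)
torchaudio.save("clean_16k.wav", waveform_16k, 16000)  # placeholder output path
```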
Below is an example command to train our speech enhancement model (using only clean speech).
python trainVQVAE.py \
-c config/SE_cbook_4096_1_128_lr_1m5_1m5_github.yaml \
--tag SE_cbook_4096_1_128_lr_1m5_1m5_github
Below is an example command to train our speech quality estimator, VQScore.
python trainVQVAE.py \
-c config/QE_cbook_size_2048_1_32_IN_input_encoder_z_Librispeech_clean_github.yaml \
--tag QE_cbook_size_2048_1_32_IN_input_encoder_z_Librispeech_clean_github
Below is an example command for generating enhanced speech / estimated quality scores from the model, where '-c' is the path of the config file, '-m' is the path of the pre-trained model, and '-i' is the path of the input wav file.
- Note: Because our training data is 16kHz clean speech, only 16kHz speech input is supported.
python inference.py \
-c ./config/SE_cbook_4096_1_128_lr_1m5_1m5_github.yaml \
-m ./exp/SE_cbook_4096_1_128_lr_1m5_1m5_github/checkpoint-dnsmos_ovr=2.761_AT.pkl \
-i ./noisy_p232_005.wav
python inference.py \
-c ./config/QE_cbook_size_2048_1_32_IN_input_encoder_z_Librispeech_clean_github.yaml \
-m ./exp/QE_cbook_size_2048_1_32_IN_input_encoder_z_Librispeech_clean_github/checkpoint-dnsmos_ovr_CC=0.835.pkl \
-i ./noisy_p232_005.wav
python inference_folder.py \
-c ./config/QE_cbook_size_2048_1_32_IN_input_encoder_z_Librispeech_clean_github.yaml \
-m ./exp/QE_cbook_size_2048_1_32_IN_input_encoder_z_Librispeech_clean_github/checkpoint-dnsmos_ovr_CC=0.835.pkl \
-i <path to the folder you want to evaluate>
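Once the per-utterance scores are available, filtering a training corpus is a simple thresholding step. The sketch below is hypothetical: it assumes the scores have been collected into a dict mapping wav paths to VQScore values (adapt it to the actual output format of inference_folder.py), and the threshold is an arbitrary example.

```python
# Hypothetical post-processing: keep only utterances whose VQScore exceeds a
# threshold, e.g., to filter out noisy data before SE/TTS training.
# `scores` is assumed to map wav paths to VQScore values; adjust this to the
# actual output format of inference_folder.py.
THRESHOLD = 0.7  # arbitrary example value; tune on your own data

scores = {"a.wav": 0.81, "b.wav": 0.42}  # placeholder for the real scores
kept = [path for path, s in scores.items() if s >= THRESHOLD]
print(f"kept {len(kept)}/{len(scores)} utterances")
```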
We provide the checkpoints of trained models in the corresponding ./exp/config_name folder.
- Note that the provided checkpoints are the models after we reorganized the code, so the results are slightly different from those shown in the paper.
- However, the overall trend should be similar.
As shown in the following spectrogram, the applied adversarial noise doesn't have a fixed pattern like Gaussian noise, so it may be a good choice for training a robust speech enhancement model.
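For intuition, here is a minimal FGSM-style sketch of how gradient-based adversarial noise can be generated for a speech model. This is a generic illustration, not the exact procedure used in this repo; 'model' and 'loss_fn' are hypothetical placeholders.

```python
# Generic FGSM-style adversarial perturbation (illustration only, not the
# exact procedure of this repo). The perturbation follows the sign of the
# loss gradient, so it has no fixed pattern, unlike i.i.d. Gaussian noise.
import torch

def fgsm_perturb(model, loss_fn, wav, epsilon=0.01):
    wav = wav.clone().detach().requires_grad_(True)
    loss = loss_fn(model(wav), wav)  # e.g., a reconstruction loss
    loss.backward()
    return (wav + epsilon * wav.grad.sign()).detach()
```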
I'm open to collaboration! If you find this Self-Supervised SE/QE topic interesting, please let me know (e-mail: szuweif@nvidia.com).
If you find the code useful in your research, please cite our ICLR paper :)
- vector-quantize (for VQ-VAE)
- DNSMOS