About • Installation • How To Extract Video Embeddings • How To Train • How To Evaluate • Credits • License
## About

VAT-SS is the Andrey-Vera-Teasgen Speech Separation model family. This repository lets you train and evaluate the SS models described in the report.

Note that the base model in all configs is the state-of-the-art DPTN-AV-repack-by-teasgen, but you may also use the other SS models reported in the paper (take a look at the other configs).
## Installation

Follow these steps to install the project:

- (Optional) Create and activate a new environment using `conda`:

  ```bash
  # create env
  conda create -n project_env python=3.10
  # activate env
  conda activate project_env
  ```

- Install all required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Install `pre-commit`:

  ```bash
  pre-commit install
  ```
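To verify the installation, you can run a quick check (a minimal sketch; it only assumes `torch` is among the packages in `requirements.txt`):

```python
# Sanity check: confirm PyTorch imports and report whether CUDA is visible.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```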
## How To Extract Video Embeddings

This section is mandatory for running the train and evaluation scripts for audio-video models. Extracting video embeddings in advance is necessary to speed up the forward pass.
```bash
bash download_lipreader.sh
python make_embeddings.py \
    --cfg_path src/lipreader/configs/lrw_resnet18_mstcn.json \
    --lipreader_path lrw_resnet18_mstcn_video.pth \
    --mouths_dir mouths \
    --embeds_dir embeddings
```
The embeddings will be saved to `--embeds_dir`. Please set the correct path to your directory in all Hydra configs at the datasets level.
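As a quick sanity check, you can inspect one of the extracted files. This is a hypothetical sketch: the actual file names, extension, and tensor layout are defined by `make_embeddings.py`, so point it at a real file in your `--embeds_dir`:

```python
import torch

# Hypothetical path; replace with a real file from your --embeds_dir.
emb = torch.load("embeddings/sample_utterance.pth", map_location="cpu")
print(type(emb))
if isinstance(emb, torch.Tensor):
    print("embedding shape:", emb.shape)
```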
## How To Train

You need a single A100 80 GB GPU to reproduce training exactly; otherwise, please implement and use gradient accumulation (a generic sketch follows below).
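Gradient accumulation is not implemented in this repo; here is a minimal generic PyTorch sketch of the idea (the model, data, and loss below are placeholders, not the repo's actual ones):

```python
import torch
from torch import nn

# Placeholders for illustration only; the real model, data, and loss
# come from the repo's Hydra configs.
model = nn.Linear(16000, 16000)
criterion = nn.L1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
dataloader = [
    {"mix": torch.randn(4, 16000), "target": torch.randn(4, 16000)}
    for _ in range(8)
]

accum_steps = 4  # effective batch size = per-step batch size * accum_steps
optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = criterion(model(batch["mix"]), batch["target"])
    # Scale the loss so the accumulated gradients match one large-batch step.
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```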
To train a model, register in WandB and run the following command.

Two-step training:

```bash
python3 train.py -cn dptn_wav_av.yaml dataloader.batch_size=16 writer.run_name=av_dptn_wav_av_v1_video_tanh_gate
```
Training logs are also available in WandB:
- DPRNN & DPTN https://wandb.ai/teasgen/ss/overview
- ConvTasNet https://wandb.ai/verabuylova-nes/ss/overview
- VoiceFilter & RTFS https://wandb.ai/aapetukhov-new-economic-school/ss?nw=nwuseraapetukhov
## How To Evaluate

Read the How To Extract Video Embeddings section first.

All predictions will be saved into the data/saved/inferenced/<dataset part> directory with corresponding names. Download the SOTA pretrained model using:
```bash
wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1egOSgh3qaADxWpxd379nmhLrfZ-5xYEf' -O ./model.tar
tar xvf ./model.tar
```
To run inference and calculate metrics, provide a custom dataset, set the paths to the WAVs and video embeddings via the cmd arguments `datasets.val.audio_dir` and `datasets.val.embedding_dir`, and run:

```bash
python3 inference.py -cn inference_dptn_av.yaml dataloader.batch_size=32 inferencer.from_pretrained=model_best.pth datasets.val.part=null datasets.val.audio_dir=<PATH_TO_WAVS> datasets.val.embedding_dir=<PATH_TO_EMBEDDINGS>
```
Set `dataloader.batch_size` to at most `len(dataset)`.
If you don't have ground truth, change `device_tensors` in the `inference_dptn_av.yaml` config to `device_tensors: ["mix_spectrogram", "mix", "s1_embedding", "s2_embedding"]`; metrics then won't be calculated and only predictions will be saved. Alternatively, set it via cmd arguments: `inferencer.device_tensors="["mix_spectrogram","mix","s1_embedding","s2_embedding"]"`.
Use the following commands to run SI-SNRi calculation on the GT and predicted directories:

```bash
export PYTHONPATH=./
python3 src/utils/eval_si_snri.py --predicts-dir <PATH_TO_PREDS> --gt-dir <PATH_TO_GTS>
```

`<PATH_TO_PREDS>` is a directory containing the predictions file in `.pth` format; `<PATH_TO_GTS>` is a directory containing the `s1`, `s2`, and `mix` dirs.
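For reference, here is a minimal PyTorch sketch of how SI-SNR and SI-SNRi are typically computed; `src/utils/eval_si_snri.py` remains the authoritative implementation:

```python
import torch


def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR (dB) between two 1-D signals."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to get the scaled target.
    s_target = (torch.dot(est, ref) / (torch.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * torch.log10(torch.dot(s_target, s_target) / (torch.dot(e_noise, e_noise) + eps))


# SI-SNRi = SI-SNR of the estimate minus SI-SNR of the raw mixture (toy tensors here).
mix, s1, est1 = torch.randn(3, 16000)
print("SI-SNRi:", (si_snr(est1, s1) - si_snr(mix, s1)).item())
```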
To evaluate the computational performance of the model, run:

```bash
python3 profiler.py
```

Profiler results for the best model, DPTN-AV-repack, in a Kaggle environment with a P100 GPU:
| Metric | Value |
|---|---|
| GFLOPs | 108.56 |
| CUDA Memory (MB) | 14378.58 |
| Inference Time, mean (s) | 0.0999 |
| Inference Time, std (s) | 0.0449 |
| Number of Parameters | 40,809,590 |
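If you want to reproduce similar numbers by hand, a generic PyTorch sketch is below (the actual `profiler.py` may measure things differently; the `nn.Linear` stand-in for the SS model is hypothetical):

```python
import time

import torch


def count_parameters(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())


@torch.no_grad()
def time_inference(model, example, n_runs: int = 50):
    """Return mean/std wall-clock inference time in seconds."""
    times = []
    for _ in range(n_runs):
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # make GPU timing accurate
        start = time.perf_counter()
        model(example)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        times.append(time.perf_counter() - start)
    t = torch.tensor(times)
    return t.mean().item(), t.std().item()


model = torch.nn.Linear(16000, 16000)  # stand-in for the SS model
mean_t, std_t = time_inference(model, torch.randn(1, 16000))
print(count_parameters(model), mean_t, std_t)
```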
## Credits

This repository is based on a PyTorch Project Template.