Voice Conversion With Just Nearest Neighbors (kNN-VC)

The official code repo! This repo contains training and inference code for kNN-VC -- an any-to-any voice conversion model from our paper, "Voice Conversion With Just Nearest Neighbors". The trained checkpoints are available under the 'Releases' tab and through torch.hub.

Links:

kNN-VC method

Figure: kNN-VC setup. The source and reference utterance(s) are encoded into self-supervised features using WavLM. Each source feature is assigned to the mean of the k closest features from the reference. The resulting feature sequence is then vocoded with HiFi-GAN to arrive at the converted waveform output.
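The matching step described in the figure caption can be illustrated with a minimal NumPy sketch. This is not the repo's implementation: the function name `knn_regress` is hypothetical, the toy feature sizes are arbitrary, and the use of cosine similarity as the closeness measure is an assumption of this sketch (the caption only says "k closest").

```python
import numpy as np

def knn_regress(query, matching_set, k=4):
    """Replace each query frame with the mean of its k closest matching-set frames.

    query:        (T, D) array of source features.
    matching_set: (N, D) array of reference features, with N >= k.
    Closeness is measured by cosine similarity (an assumption of this sketch).
    """
    # Normalise rows so that a dot product equals cosine similarity.
    eps = 1e-8  # guards against division by zero for all-zero frames
    q = query / np.maximum(np.linalg.norm(query, axis=1, keepdims=True), eps)
    m = matching_set / np.maximum(np.linalg.norm(matching_set, axis=1, keepdims=True), eps)
    sim = q @ m.T  # (T, N) cosine similarities

    # Indices of the k most similar reference frames for each query frame.
    idx = np.argpartition(-sim, k - 1, axis=1)[:, :k]  # (T, k)

    # Average the original (unnormalised) reference features.
    return matching_set[idx].mean(axis=1)  # (T, D)

# Toy usage with random stand-ins for the encoder features.
rng = np.random.default_rng(0)
toy_query = rng.standard_normal((50, 8))
toy_matching_set = rng.standard_normal((200, 8))
converted = knn_regress(toy_query, toy_matching_set, k=4)
print(converted.shape)  # (50, 8)
```

The output has one frame per source frame, so it keeps the timing of the source while every frame is built purely from reference-speaker features. In the full system this sequence would then be passed to the vocoder.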

Authors:

*Equal contribution

Quickstart

We use torch.hub to make loading the model easy -- no cloning of the repo needed. The steps to perform inference are simple:

  1. Install dependencies: there are only three inference dependencies (torch, torchaudio, and numpy). Python must be version 3.10 or greater, and torch must be v2.0 or greater.
  2. Load models: load the WavLM encoder and HiFi-GAN vocoder:
import torch, torchaudio

knn_vc = torch.hub.load('bshall/knn-vc', 'knn_vc', prematched=True, trust_repo=True, pretrained=True)
# Or, to use the vocoder trained without prematched data, set prematched=False.
  3. Compute features for input and reference audio:
src_wav_path = '<path to arbitrary 16kHz waveform>.wav'
ref_wav_paths = ['<path to arbitrary 16kHz waveform from target speaker>.wav', '<path to 2nd utterance from target speaker>.wav', ...]

query_seq = knn_vc.get_features(src_wav_path)
matching_set = knn_vc.get_matching_set(ref_wav_paths)
  4. Perform the kNN matching and vocoding:
out_wav = knn_vc.match(query_seq, matching_set, topk=4)
# out_wav is a (T,) tensor: the converted 16kHz output waveform, using k=4 for kNN.

That's it! These default settings provide pretty good results, but feel free to modify the kNN topk or use the non-prematched vocoder. Note: the utterances in ref_wav_paths can be from any speaker, but they should be clean speech from the desired target speaker. The longer the cumulative duration of all reference waveforms, the better the quality will be (but the longer it will take to run). The improvement in quality diminishes beyond 5 minutes of reference speech.
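Since quality depends on the cumulative duration of the reference audio, it can help to check how much reference speech you have before building the matching set. Below is a small sketch using only Python's standard `wave` module. The helper name `total_reference_seconds` is hypothetical and not part of this repo, and the `wave` module only reads PCM WAV files, so other encodings would need a different reader.

```python
import wave

def total_reference_seconds(paths, expected_rate=16000):
    """Return the summed duration (in seconds) of the given WAV files.

    Raises ValueError if a file does not have the expected sample rate,
    since the quickstart above assumes 16kHz waveforms.
    """
    total = 0.0
    for path in paths:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            if rate != expected_rate:
                raise ValueError(
                    f"{path}: sample rate is {rate} Hz, expected {expected_rate} Hz"
                )
            # Duration of one file = number of frames / frames per second.
            total += wav.getnframes() / rate
    return total
```

For example, `total_reference_seconds(ref_wav_paths) / 60` gives the reference duration in minutes, which can be compared against the roughly 5 minute point beyond which the improvement diminishes.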

Checkpoints