- This is a comprehensive repository for multispeaker voice conversion (VC) with nonparallel modeling based on cyclic variational autoencoder (CycleVAE) [1,2] / vector-quantized VAE (CycleVQVAE) [2].
- It also provides neural vocoder implementations based on shallow WaveNet architecture [3,4] / compact WaveRNN (LPCNet-like [5]) with data-driven linear prediction (LPC) [4] for high-quality waveform synthesis (conversion / copy-synthesis).
- The VC modeling further allows speaker interpolation of spectral and/or excitation characteristics in a 2-dimensional speaker space.
- Default speech features and waveform synthesis are based on WORLD [6] (mel-cepstrum from spectral envelope, F0, aperiodicity).
- Modeling and synthesis with mel-spectrogram are under construction.
- Our group is located at Nagoya University, Japan.
- Python 3.7 or 3.6
- CUDA 10.0
$ cd tools
$ make
- Available
  - VCC 2018 [7] [12 speakers] (VC): egs/vcc18
    - download_wavs --> put in egs/vcc18/wav_22.05kHz
    - download_mdls --> put in egs/vcc18/exp
- Under construction
  - VCC 2018 (Neural vocoder)
  - VCC 2020 [14 speakers]
  - VCC 2018 + ARCTIC [2 speakers]
  - VCC 2020 + VCTK [24 speakers]
  - VCC 2018 + VCC 2020 + ARCTIC + VCTK
- Spectral CycleVAE [mdl_name: cycmcepvae-laplace]
- Spectral + excitation CycleVAE [mdl_name: cycmceplf0capvae-laplace]
- Spectral + excitation CycleVAE with 2-dim speaker space [spkidtr_dim: 2]
- Spectral CycleVQVAE [mdl_name: cycmcepvae-vq]
- Spectral + excitation CycleVQVAE [mdl_name: cycmceplf0capvae-vq]
- Spectral + excitation CycleVQVAE with 2-dim speaker space [spkidtr_dim: 2]
- Shallow WaveNet [mdl_name_wave: wavenet]
- Compact WaveRNN with data-driven LPC [mdl_name_wave: wavernn_dualgru_compact_lpc]
- Compact WaveRNN with data-driven LPC and multiple-sample output [mdl_name_wave: wavernn_dualgru_compact_lpcseg]
Located in egs/<dataset>/conf/config.yml
- fs: sampling rate
- shiftms: frame shift in ms
- right_size: limit on the number of lookup frames in the input convolution of the encoder/neural vocoder [for low-latency applications] (if 0, a balanced two-sided convolution is used; otherwise a skewed convolution)
- pad_len: maximum number of frames used for padding in batch processing
- epoch_count: maximum number of epochs
- n_half_cyc: number of half-cycles, e.g., 2 for 1 cycle and 4 for 2 cycles
  - default: 2 for CycleVAE; 4 for CycleVQVAE
- lat_dim: number of latent dimensions for the spectral latent
  - default: 32 for spectral / spectral+excitation CycleVAE; 40 for spectral CycleVQVAE; 50 for spectral+excitation CycleVQVAE
- lat_dim_e: number of latent dimensions for the excitation latent
  - default: 32 for spectral+excitation CycleVAE; 50 for spectral+excitation CycleVQVAE
- causal_conv_dec: flag to use a causal input convolution in the spectral decoder (for low latency)
- causal_conv_lf0: flag to use a causal input convolution in the excitation decoder
- ctr_size: size of the VQ codebook for CycleVQVAE
- ar_enc: flag to use an autoregressive (AR) flow (output feedback) in the encoder
  - default: false
- ar_dec: flag to use an AR flow in the decoder
  - default: true for CycleVAE; false for CycleVQVAE
- diff: flag to use differential spectrum estimation in the spectral decoder (only for a decoder with AR flow)
  - default: true for CycleVAE
- detach: flag to detach the conversion/cyclic flow from the backpropagation graph
  - default: true
- spkidtr_dim: reduced dimensionality of the N-dimensional one-hot speaker code for speaker interpolation (if 0, the one-hot speaker code is kept)
- batch_size: number of frames per batch
- batch_size_utt: number of utterances per batch in optimization
- batch_size_utt_eval: number of utterances per batch in validation
- mdl_name: type of VC model
- batch_size_wave: number of frames per batch
- batch_size_utt_wave: number of utterances per batch in optimization
- batch_size_utt_eval_wave: number of utterances per batch in validation
- hidden_units_wave_2: number of hidden units of the 2nd GRU in the compact WaveRNN
- t_start: starting step of sparsification in the compact WaveRNN
- t_end: ending step of sparsification
- interval: interval (in steps) between sparsification updates
- densities: target densities of the reset, update, and new gates of the 1st GRU in the compact WaveRNN
- n_stage: number of sparsification stages
# at each sparsification step, the target density is given by:
r = 1 - (iter_idx - t_start) / (t_end - t_start)
density = density_stage[k-1] - (density_stage[k-1] - density_stage[k]) * (1 - r)**5
# k is the stage index [0, ..., n_stage-1]
# density_stage contains the target density of each stage; for k=0, density_stage[k-1] is taken as 1
# the number of steps per stage is set to:
## [0.2, 0.3, 0.5] * t_delta for n_stage=3
## [0.1, 0.1, 0.3, 0.5] * t_delta for n_stage=4
## [0.1, 0.1, 0.15, 0.15, 0.5] * t_delta for n_stage=5
## where t_delta = t_end - t_start + 1
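For reference, a minimal runnable sketch of this schedule (not the repository's exact implementation; it assumes the training loop tracks the current stage index k and the per-stage target densities density_stage, with the density preceding stage 0 taken as 1.0):

```python
# Minimal sketch of the per-step target-density schedule described above.
# Assumptions: `k` and `density_stage` are maintained by the training loop;
# the density preceding stage 0 is 1.0.
def target_density(iter_idx, t_start, t_end, k, density_stage):
    r = 1.0 - (iter_idx - t_start) / (t_end - t_start)  # 1 at t_start, 0 at t_end
    prev = 1.0 if k == 0 else density_stage[k - 1]
    return prev - (prev - density_stage[k]) * (1.0 - r) ** 5

# Hypothetical example values: halfway through sparsification, in stage k=1
# target_density(15000, t_start=10000, t_end=20000, k=1, density_stage=[0.8, 0.5, 0.3])
```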
- lpc: number of data-driven LPC coefficients (modeled/estimated by the network) in the compact WaveRNN
- seg: number of samples in the multiple-sample output
- mdl_name_wave: type of neural vocoder model
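The keys above can be checked directly against a dataset's config; a minimal sketch, assuming a flat key layout in the YAML file and using egs/vcc18 as an example path:

```python
# Minimal sketch: read a dataset config.yml and print a few of the keys documented above.
# The path and the flat key layout are assumptions; adjust to the actual config.yml.
import yaml

with open("egs/vcc18/conf/config.yml") as f:
    conf = yaml.safe_load(f)

for key in ("fs", "shiftms", "n_half_cyc", "lat_dim", "spkidtr_dim",
            "mdl_name", "mdl_name_wave"):
    print(key, "=", conf.get(key))
```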
Located in egs/<dataset>/conf/run.sh
- stage=0: prepare lists of the training/validation/testing sets (egs/<dataset>/data/<tr/dv/ts>_<dataset>_<sampling-rate>) and the speaker configs (egs/<dataset>/conf/spkr.yml)
- stage=init: compute speaker statistics [histograms of F0 and normalized power] with the initial F0 configurations (egs/<dataset>/init_spk_stat/tr_<dataset>_<sampling-rate>). For new speakers, run stage=0 and stage=init, then set the initial F0 and power values in conf/spkr.yml following the procedure given in this slide.
- stage=1: feature extraction
- stage=2: feature statistics calculation
- stage=3: pre-emphasis (noise shaping) of the waveform data for neural vocoder development (illustrated in the sketch after this list)
- stage=4: VC model training
- stage=5: decoding of reconstruction and cyclic-reconstruction features, used to compute global variance (GV) [8] statistics and as data augmentation for neural vocoder training
- stage=6: decoding of converted features/waveforms from the VC model using the conventional vocoder
- stage=7: neural vocoder training
- stage=8: decoding of converted / copy-synthesis waveforms from the neural vocoder
- stage=9: de-emphasis (restoring the noise shaping) of the synthesized waveforms
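For intuition on stages 3, 5, and 9, a minimal sketch of first-order pre-emphasis / de-emphasis and of the global variance computation; the filter coefficient and feature shapes are illustrative assumptions, not the repository's exact settings:

```python
# Illustrative sketches for stage 3 / 9 (pre-emphasis and its inverse) and
# stage 5 (global variance statistics). Coefficient and shapes are assumptions.
import numpy as np

def pre_emphasis(x, alpha=0.85):
    # stage 3: y[n] = x[n] - alpha * x[n-1] (noise shaping before vocoder training)
    return np.append(x[0], x[1:] - alpha * x[:-1])

def de_emphasis(y, alpha=0.85):
    # stage 9: inverse filter, x[n] = y[n] + alpha * x[n-1]
    x = np.zeros(len(y))
    for n in range(len(y)):
        x[n] = y[n] + (alpha * x[n - 1] if n > 0 else 0.0)
    return x

def global_variance(utterance_feats):
    # stage 5: per-dimension variance of each utterance's (T x D) features,
    # averaged over utterances, following the GV definition of [8]
    return np.mean([f.var(axis=0) for f in utterance_feats], axis=0)
```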
Run as $ bash run.sh
- n_jobs=: number of jobs/threads for feature extraction, statistics calculation, and pre-emphasis processing
- spks=: list of speakers
- data_name=: name of the dataset
- GPU_device=: index of the GPU device used during training
- idx_resume=: resume VC model training from this checkpoint [set the argument in the running call of STAGE 4]
- idx_resume_wave=: resume neural vocoder training from this checkpoint [set the argument in the running call of STAGE 7]
- min_idx=: decode with the VC model checkpoint of this index
- n_interp=: number of interpolated speaker points when decoding a spectral-excitation VC conversion (if 0, it is a plain source-to-target conversion; see the sketch after this list)
- min_idx_wave=: decode with the neural vocoder checkpoint of this index
- GPU_device_str=: indices of the GPUs used during decoding
- n_gpus=: number of GPUs used during decoding
- spks_trg_rec=: list of speakers considered for reconstruction/cyclic-reconstruction in stage=5
- spks_src_dec=: list of source speakers considered for conversion in stage=6
- spks_trg_dec=: list of target speakers considered for conversion in stage=6
- decode_batch_size=: number of concurrent utterances when decoding with the neural vocoder
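As a conceptual illustration of n_interp, the intermediate speaker points can be thought of as points on the line between the source and target coordinates in the 2-dimensional speaker space; this is a sketch of the idea only (the actual speaker-code transform is learned by the model):

```python
# Illustrative linear interpolation between two (assumed) 2-D speaker coordinates.
import numpy as np

def interpolate_speaker_points(src_xy, trg_xy, n_interp):
    # n_interp intermediate points plus the source (alpha=0) and target (alpha=1);
    # with n_interp=0 this reduces to a plain source-to-target pair
    alphas = np.linspace(0.0, 1.0, n_interp + 2)
    return [(1.0 - a) * np.asarray(src_xy, dtype=float)
            + a * np.asarray(trg_xy, dtype=float) for a in alphas]
```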
Located in egs/<dataset>/local
- proc_loss_log_vae-spec.awk: for the spectral model
- proc_loss_log_vae-spec-excit.awk: for the spectral-excitation model
- loss_summary.sh: runs the awk scripts to summarize model accuracies during training
- summary_acc.awk: for decoding accuracies on the development/testing sets
- summary_acc.sh: extracts the desired accuracy statistics for the development/testing sets
- get_max_frame.sh: gets the maximum frame number statistics of each speaker, used for the pad_len config
TensorBoard graph statistics are also computed during training and written to the corresponding model folder egs/<dataset>/exp/<model_expdir>.
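These can be viewed by pointing TensorBoard at the model folder, e.g., `tensorboard --logdir egs/<dataset>/exp/<model_expdir>` (assuming TensorBoard is installed in the environment).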
2-dimensional speaker space on VCC 2018 dataset with spectral-excitation CycleVAE
[1] P. L. Tobing, Y.-C. Wu, T. Hayashi, K. Kobayashi, and T. Toda, "Non-parallel voice conversion with cyclic variational autoencoder," in Proc. INTERSPEECH, Graz, Austria, Sep. 2019, pp. 674--678.
[2] P. L. Tobing, T. Hayashi, Y.-C. Wu, K. Kobayashi, and T. Toda, "Cyclic spectral modeling for unsupervised unit discovery into voice conversion with excitation and waveform modeling," Accepted for INTERSPEECH 2020.
[3] P. L. Tobing, T. Hayashi, and T. Toda, "Investigation of shallow WaveNet vocoder with Laplacian distribution output," in Proc. IEEE ASRU, Sentosa, Singapore, Dec. 2019, pp. 176--183.
[4] P. L. Tobing, Y.-C. Wu, T. Hayashi, K. Kobayashi, and T. Toda, “Efficient shallow WaveNet vocoder using multiple samples output based on Laplacian distribution and linear prediction,” in Proc. ICASSP, Barcelona, Spain, May 2020, pp. 7204--7208.
[5] J.-M. Valin and J. Skoglund, “A real-time wideband neural vocoder at 1.6 kb/s using LPCNet,” in Proc. INTERSPEECH, Graz, Austria, Sep. 2019, pp. 3406--3410.
[6] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder based high-quality speech synthesis system for real-time applications,” IEICE Trans. Inf. Syst., vol. 99, no. 7, pp. 1877--1884, 2016.
[7] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, “The Voice Conversion Challenge 2018: Promoting development of parallel and nonparallel methods,” in Proc. Speaker Odyssey, Les Sables d’Olonne, France, Jun. 2018, pp. 195--202.
[8] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 8, pp. 2222--2235, 2007.
- Complete VC pretrained models
- Complete neural vocoder decoding scripts
- Complete neural vocoder models
- Mel-spectrogram modeling