CN114582363A - High-quality voice conversion method for non-parallel corpus - Google Patents
Info
- Publication number
- CN114582363A (application CN202210156203.4A)
- Authority
- CN
- China
- Prior art keywords
- voice
- loss
- conversion
- speaker
- mel spectrogram
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Complex Calculations (AREA)
Abstract
The invention relates to a voice conversion method for non-parallel corpora, which comprises the following steps: (1) acquiring a voice database of a source speaker, and extracting the source speaker's Mel spectrogram x as the voice feature for conversion; (2) creating a time mask m of the same size as x, applying m to x, and filling the missing frames of x to obtain x'; (3) extracting the pitch frequency F0 of the source speaker, and converting F0 into the fundamental frequency F0' of the target speaker through a logarithmic Gaussian normalized transformation; (4) training a CycleGAN model, adding a gradient penalty to the adversarial loss; (5) modifying the overall objective function accordingly; (6) inputting x' obtained in (2), the fundamental frequency F0' obtained in (3), and the created time mask m into the generator G_{X→Y}, where F0' serves as an auxiliary feature that steers the conversion of the Mel spectrogram, and the generator converts x' into the Mel spectrogram y' of the target voice; (7) feeding the converted Mel spectrogram y' into a vocoder to synthesize the voice waveform, obtaining speech similar to the target speaker's.
Description
Technical Field
The invention belongs to the technical field of voice conversion, and particularly relates to a high-quality voice conversion method for non-parallel corpora.
Background
Voice conversion refers to converting the personal characteristics of a source speaker's voice into those of a target speaker, so that the converted speech sounds like the target speaker while the linguistic content of the source speech is preserved. With the growing demand for personalized voices, voice conversion has been applied in fields such as psychology, biomedicine, and information security. Existing voice conversion methods can be divided into parallel and non-parallel conversion according to whether parallel voice data are available. Parallel voice conversion has many mature implementations; however, in practice parallel voice data are not easy to collect and require time-alignment preprocessing, and inaccurate alignment degrades the conversion result. Non-parallel voice conversion imposes no requirement on parallel data or time alignment, so data collection is simple and inexpensive, and most current research favors it. However, existing non-parallel voice conversion schemes still leave room for improvement in similarity.
Disclosure of Invention
To address the deficiencies of the prior art, the invention discloses a high-quality non-parallel voice conversion method that uses a cycle-consistent generative adversarial network (CycleGAN) to realize voice conversion. In the training stage, missing frames are filled using a time mask, and an R1 zero-centered gradient penalty (GP) is added to the adversarial loss of CycleGAN, so as to improve the naturalness and similarity of the converted speech and to mitigate the instability of CycleGAN training.
The invention adopts the following technical scheme:
The voice conversion method for non-parallel corpora is carried out according to the following steps:
(1) acquiring a voice database of a source speaker, and extracting a Mel spectrogram x of the source speaker as a voice feature for conversion;
(2) creating a time mask m of the same size as the source speaker's Mel spectrogram x, applying m to x, and filling the missing frames of x to obtain the frame-filled Mel spectrogram x' of the source speech;
(3) extracting the pitch frequency F0 of the source speaker, and converting F0 into the fundamental frequency F0' of the target speaker through the logarithmic Gaussian normalized transformation:
F0' = exp( (σ_y / σ_x)·(log F0 - μ_x) + μ_y )    (14)
where μ_x, σ_x and μ_y, σ_y are the mean and standard deviation of the source speaker's and the target speaker's F0 on a logarithmic scale, respectively;
(4) training the CycleGAN model, adding a gradient penalty to the adversarial loss, i.e.:
L'_adv = L_adv + (γ/2)·E_{x~P_D(x)}[ ||∇D(x)||^2 ]    (15)
where E(·) denotes mathematical expectation, P_D(x) is the distribution of the real data, and γ is a hyper-parameter;
(5) the overall objective function becomes:
L = L'_adv + L_adv2 + λ_cyc·L_cyc + λ_id·L_id    (16)
(6) inputting x' obtained in step (2), the fundamental frequency F0' obtained in step (3), and the created time mask m together into the generator G_{X→Y}; F0' is used as an auxiliary feature to steer the conversion direction of the Mel spectrogram, and the generator converts x' into the Mel spectrogram y' of the target voice;
(7) feeding the converted Mel spectrogram y' obtained in the previous step into a vocoder to synthesize the voice waveform, so as to obtain speech similar to the target speaker's.
Preferably, step (2) is specifically as follows: given the input Mel spectrogram x of the source speaker, a time mask m of the same size as x is used, in which part of the values are 0 and the rest are 1, the zero region being determined randomly according to a preset rule; the time mask m is applied element-wise to the source Mel spectrum x, i.e.:
x'=x·m (1)
the generator G_{X→Y} of CycleGAN synthesizes y' from x', m, and the auxiliary feature F0', i.e.:
y' = G_{X→Y}(x', m, F0')    (2)
using m as conditioning information, G_{X→Y} fills in the missing frames, while the auxiliary feature F0' steers the conversion of the Mel spectrum; for the resulting y', an adversarial loss is used to ensure that it is similar to the true target feature;
the inverse generator G_{Y→X} is then used to reconstruct x'', i.e.:
x'' = G_{Y→X}(y', m', F0')    (3)
based on the assumption that the missing frames have been filled by the previous operation, m' is an all-ones matrix; a second adversarial loss is used to ensure that the reconstructed x'' is similar to the original x.
Preferably, in step (4), a gradient penalty is applied to the discriminator on real samples using the R1 zero-centered gradient penalty technique;
the regularization term of the R1 zero-centered gradient penalty is defined as:
R_1 = (γ/2)·E_{x~P_D(x)}[ ||∇D(x)||^2 ]    (4)
where E(·) denotes mathematical expectation, P_D(x) is the distribution of the real data, and γ is a hyper-parameter.
Preferably, in the CycleGAN model of step (4), the generator G is trained using four losses to learn the mapping between X and Y;
the adversarial loss of discriminator D_Y is expressed as:
L_adv(G_{X→Y}, D_Y) = E_{y~P(Y)}[log D_Y(y)] + E_{x~P(X)}[log(1 - D_Y(G_{X→Y}(x)))]    (5)
the adversarial loss of discriminator D_X is expressed as:
L_adv(G_{Y→X}, D_X) = E_{x~P(X)}[log D_X(x)] + E_{y~P(Y)}[log(1 - D_X(G_{Y→X}(y)))]    (6)
where P(X) and P(Y) are the distributions of the source speech data and the target speech data, respectively;
the adversarial loss of CycleGAN is then:
L_adv = L_adv(G_{X→Y}, D_Y) + L_adv(G_{Y→X}, D_X)    (7)
the cycle-consistency loss preserves the speech content during the conversion process, and its expression is:
L_cyc(G_{X→Y}, G_{Y→X}) = E_{x~P(X)}[ ||G_{Y→X}(G_{X→Y}(x)) - x||_1 ] + E_{y~P(Y)}[ ||G_{X→Y}(G_{Y→X}(y)) - y||_1 ]    (8)
the identity-mapping loss is introduced to better preserve the input, and its expression is:
L_id(G_{X→Y}, G_{Y→X}) = E_{x~P(X)}[ ||G_{Y→X}(x) - x||_1 ] + E_{y~P(Y)}[ ||G_{X→Y}(y) - y||_1 ]    (9)
an additional discriminator D'_X is added, and an adversarial loss on the cycle-converted feature, called the second adversarial loss, is introduced, namely:
L_adv2(G_{X→Y}, G_{Y→X}, D'_X) = E_{x~P(X)}[log D'_X(x)] + E_{x~P(X)}[log(1 - D'_X(G_{Y→X}(G_{X→Y}(x))))]    (10)
similarly, an additional discriminator D'_Y is added for the reverse conversion, and its second adversarial loss is:
L_adv2(G_{X→Y}, G_{Y→X}, D'_Y) = E_{y~P(Y)}[log D'_Y(y)] + E_{y~P(Y)}[log(1 - D'_Y(G_{X→Y}(G_{Y→X}(y))))]    (11)
the second adversarial loss of CycleGAN is then:
L_adv2 = L_adv2(G_{X→Y}, G_{Y→X}, D'_X) + L_adv2(G_{X→Y}, G_{Y→X}, D'_Y)    (12)
the overall objective function of the resulting model is thus:
L = L_adv + L_adv2 + λ_cyc·L_cyc(G_{X→Y}, G_{Y→X}) + λ_id·L_id(G_{X→Y}, G_{Y→X})    (13)
where λ_cyc and λ_id are the hyper-parameters of the cycle-consistency loss and the identity-mapping loss, respectively, adjusting the weights of the corresponding losses during training.
Preferably, the generator in the CycleGAN model is a fully convolutional network comprising convolutional layers, gated linear units, 1-dimensional convolutions, and 2-dimensional convolutions; the 2-dimensional convolutions are applied in the down-sampling and up-sampling modules; the 1-dimensional convolutions are applied in the residual blocks and are responsible for the main conversion process; 1×1 convolutional layers are applied before the feature input and after the output to adjust the channel size, with 2 input channels used to receive m and x'; the gated linear units adaptively learn the sequential and hierarchical structure of the acoustic features.
Preferably, the discriminator in the CycleGAN model is a 2-dimensional convolutional neural network used to discriminate data based on 2-dimensional spectral texture, and comprises a down-sampling module, gated linear units, and two convolutional layers; data enter through the first convolutional layer, pass through the gated linear units and the down-sampling module in sequence, and then through the last convolutional layer.
The technical scheme of the invention has the following advantages:
(1) The method uses a time mask to fill in the missing frames of the Mel spectrogram, which effectively protects the harmonic structure of the speech during voice conversion; combined with the CycleGAN model for converting the voice features, the generated converted speech is of higher quality and is produced in less time.
(2) The invention adds a zero-centered gradient penalty to the training process of the CycleGAN model, addressing the instability of generative adversarial network training by penalizing a discriminator that deviates from the Nash equilibrium.
(3) The voice conversion method provided by the invention requires no parallel voice data, which lowers the cost of data collection and effectively saves resources and time.
Drawings
Fig. 1 is a flow chart of a voice conversion method according to a preferred embodiment of the present invention.
FIG. 2 is a diagram of the training process of cycleGAN.
Fig. 3 is a diagram of a generator structure.
Fig. 4 is a diagram of the discriminator structure.
Detailed Description
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings.
In this embodiment, a generative adversarial network is first trained to perform the Mel-spectrum conversion; to improve the convergence of CycleGAN, a zero-centered gradient penalty is added to the adversarial loss during training. The Mel spectrogram x of the source speaker's voice is then extracted, the created time mask m is applied to x to fill the missing frames, and the frame-filled Mel spectrogram x' is obtained. The transformed fundamental frequency F0' is then fed as an auxiliary feature, together with the time mask m and x', into the generator G to generate the converted target speech feature y'. Finally, y' and F0' are used as input to the MelGAN vocoder to synthesize the voice waveform and obtain the converted target speech.
The preferred embodiment of the invention computes the adversarial loss L_adv to ensure that the generated speech features are approximately consistent with the target speech features, introduces a cycle-consistency loss L_cyc to constrain the mapping between source and target and an identity-mapping loss L_id to further preserve the linguistic content, and then adds a zero-centered gradient penalty to the adversarial loss to stabilize model training and keep the generative adversarial network close to the Nash equilibrium. The loss function of the whole system consists of the adversarial loss L_adv, the second adversarial loss L_adv2, the cycle-consistency loss L_cyc, and the identity-mapping loss L_id. The flow of the voice conversion method of this embodiment is shown in Fig. 1, and the main content of each part is described in detail below.
1. Filling missing frames with a time mask
In the preferred embodiment of the invention, the generator uses the preceding and following frames to obtain useful information for filling the missing frames through the cyclic conversion process. First, given the input Mel spectrogram x of the source speaker, a time mask m of the same size as x is used, in which part of the values are 0 and the rest are 1, the zero region being determined randomly according to a preset rule. The time mask m is applied element-wise to the source Mel spectrum x, i.e.:
x'=x·m (1)
Next, the generator G_{X→Y} of CycleGAN synthesizes y' from x', m, and the auxiliary feature F0', i.e.:
y' = G_{X→Y}(x', m, F0')    (2)
Using m as conditioning information, G_{X→Y} can fill in the missing frames, while the auxiliary feature F0' steers the conversion of the Mel spectrum. For the resulting y', an adversarial loss is used to ensure that it is similar to the true target feature.
Then, the inverse generator G_{Y→X} is used to reconstruct x'', i.e.:
x'' = G_{Y→X}(y', m', F0')    (3)
Based on the assumption that the missing frames have been filled by the previous operation, m' is an all-ones matrix. A second adversarial loss is used to ensure that the reconstructed x'' is similar to the original x.
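The masking operation above can be illustrated with a minimal NumPy sketch, assuming the Mel spectrogram is an (n_mels × T) array; the 80-band shape, the maximum mask width, and the single contiguous zero block are illustrative stand-ins for the patent's preset rule.

```python
import numpy as np

def make_time_mask(mel, max_masked_frames=32, rng=None):
    """Build a {0,1} mask of the same shape as mel, zeroing one random
    contiguous block of frames (the 'missing' frames the generator must fill)."""
    rng = np.random.default_rng() if rng is None else rng
    n_mels, n_frames = mel.shape
    width = int(rng.integers(0, max_masked_frames + 1))
    start = int(rng.integers(0, max(n_frames - width, 0) + 1))
    mask = np.ones_like(mel)
    mask[:, start:start + width] = 0.0   # zero region chosen by the (assumed) preset rule
    return mask

mel = np.random.randn(80, 256).astype(np.float32)  # stand-in for a real Mel spectrogram x
m = make_time_mask(mel)
x_masked = mel * m                                  # x' = x · m, fed to G_{X→Y} together with m and F0'
```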
2. R1 zero-centered gradient penalty
In the traditional GAN training process, when the samples generated by the generator differ greatly from the real samples, the discriminator drives the optimization of the generator through gradient descent. However, as the discriminator becomes better and better at distinguishing real from fake samples, even generated samples that are close to the real ones are judged as fake, which pushes the GAN away from the Nash equilibrium, makes training unstable, and degrades the quality of the generated samples.
The preferred embodiment of the invention applies a gradient penalty to the discriminator on real samples using a zero-centered gradient penalty technique: when the samples generated by the generator are similar to the real samples, the discriminator produces gradients close to zero, preventing the generator from being pushed away from the Nash equilibrium.
The regularization term of the R1 zero-centered gradient penalty is defined as:
R_1 = (γ/2)·E_{x~P_D(x)}[ ||∇D(x)||^2 ]    (4)
where E(·) denotes mathematical expectation, P_D(x) is the distribution of the real data, and γ is a hyper-parameter. The invention adds this zero-centered gradient penalty to the adversarial loss of CycleGAN to stabilize the training process of the model.
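A minimal PyTorch sketch of this regularizer, evaluated on real samples only as described above, is shown below; treating γ as the hyper-parameter of equation (4) and the discriminator call signature are assumptions of this sketch.

```python
import torch

def r1_penalty(discriminator, real, gamma=10.0):
    """R1 regularizer of equation (4): (gamma / 2) * E[ ||grad_x D(x)||^2 ] on real samples."""
    real = real.detach().requires_grad_(True)
    scores = discriminator(real)
    (grad,) = torch.autograd.grad(outputs=scores.sum(), inputs=real, create_graph=True)
    return 0.5 * gamma * grad.pow(2).flatten(start_dim=1).sum(dim=1).mean()

# Typical use inside the discriminator update, added to the usual adversarial loss:
# d_loss = d_adv_loss + r1_penalty(D_Y, real_target_mels)
```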
3. CycleGAN model
The method uses a CycleGAN model to convert the Mel spectrogram of the source speech into the Mel spectrogram of the target speech. In CycleGAN, the generator G is trained with four losses to learn the mapping between X and Y. The training process of CycleGAN is shown in Fig. 2.
The adversarial loss measures how similar the converted feature is to the target feature: the smaller the adversarial loss, the more similar the converted acoustic feature is to the target acoustic feature. The adversarial loss of discriminator D_Y is expressed as:
L_adv(G_{X→Y}, D_Y) = E_{y~P(Y)}[log D_Y(y)] + E_{x~P(X)}[log(1 - D_Y(G_{X→Y}(x)))]    (5)
Likewise, the adversarial loss of discriminator D_X is:
L_adv(G_{Y→X}, D_X) = E_{x~P(X)}[log D_X(x)] + E_{y~P(Y)}[log(1 - D_X(G_{Y→X}(y)))]    (6)
where P(X) and P(Y) are the distributions of the source speech data and the target speech data, respectively.
The adversarial loss of CycleGAN is then:
L_adv = L_adv(G_{X→Y}, D_Y) + L_adv(G_{Y→X}, D_X)    (7)
The cycle-consistency loss preserves the speech content during the conversion process and is expressed as:
L_cyc(G_{X→Y}, G_{Y→X}) = E_{x~P(X)}[ ||G_{Y→X}(G_{X→Y}(x)) - x||_1 ] + E_{y~P(Y)}[ ||G_{X→Y}(G_{Y→X}(y)) - y||_1 ]    (8)
The identity-mapping loss is introduced to better preserve the input and is expressed as:
L_id(G_{X→Y}, G_{Y→X}) = E_{x~P(X)}[ ||G_{Y→X}(x) - x||_1 ] + E_{y~P(Y)}[ ||G_{X→Y}(y) - y||_1 ]    (9)
The second adversarial loss alleviates the over-smoothing caused by the use of the cycle-consistency loss. An additional discriminator D'_X is added, and an adversarial loss on the cycle-converted feature, called the second adversarial loss, is introduced, namely:
L_adv2(G_{X→Y}, G_{Y→X}, D'_X) = E_{x~P(X)}[log D'_X(x)] + E_{x~P(X)}[log(1 - D'_X(G_{Y→X}(G_{X→Y}(x))))]    (10)
Similarly, an additional discriminator D'_Y is added for the reverse conversion, and its second adversarial loss is:
L_adv2(G_{X→Y}, G_{Y→X}, D'_Y) = E_{y~P(Y)}[log D'_Y(y)] + E_{y~P(Y)}[log(1 - D'_Y(G_{X→Y}(G_{Y→X}(y))))]    (11)
The second adversarial loss of CycleGAN is then:
L_adv2 = L_adv2(G_{X→Y}, G_{Y→X}, D'_X) + L_adv2(G_{X→Y}, G_{Y→X}, D'_Y)    (12)
Thus, the overall objective function of the model is:
L = L_adv + L_adv2 + λ_cyc·L_cyc(G_{X→Y}, G_{Y→X}) + λ_id·L_id(G_{X→Y}, G_{Y→X})    (13)
where λ_cyc and λ_id are the hyper-parameters of the cycle-consistency loss and the identity-mapping loss, respectively, adjusting the weights of the corresponding losses during training.
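As a concrete illustration of how equation (13) is assembled, the following PyTorch-style sketch computes the generator-side objective; the bare G(x) call signatures (mask and F0' conditioning omitted), discriminators assumed to output probabilities, the non-saturating form of the adversarial terms, the L1 form of the cycle and identity terms, and the λ values are simplifying assumptions rather than the patent's exact implementation.

```python
import torch
import torch.nn.functional as F

def generator_objective(x, y, G_xy, G_yx, D_x, D_y, D2_x, D2_y,
                        lambda_cyc=10.0, lambda_id=5.0):
    """L = L_adv + L_adv2 + lambda_cyc * L_cyc + lambda_id * L_id (generator side)."""
    eps = 1e-8
    fake_y, fake_x = G_xy(x), G_yx(y)             # first conversion
    cyc_x, cyc_y = G_yx(fake_y), G_xy(fake_x)     # cycle conversion

    # first adversarial loss: the generators try to fool D_Y and D_X
    l_adv = -(torch.log(D_y(fake_y) + eps).mean() + torch.log(D_x(fake_x) + eps).mean())
    # second adversarial loss on the cycle-converted features (extra discriminators D'_X, D'_Y)
    l_adv2 = -(torch.log(D2_x(cyc_x) + eps).mean() + torch.log(D2_y(cyc_y) + eps).mean())
    # cycle-consistency loss (eq. 8) and identity-mapping loss (eq. 9), in L1 form
    l_cyc = F.l1_loss(cyc_x, x) + F.l1_loss(cyc_y, y)
    l_id = F.l1_loss(G_yx(x), x) + F.l1_loss(G_xy(y), y)

    return l_adv + l_adv2 + lambda_cyc * l_cyc + lambda_id * l_id
```

The discriminators themselves are updated with the complementary real/fake terms of equations (5)-(7) and (10)-(12), plus the R1 penalty of equation (4).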
The generator in the CycleGAN model is a fully convolutional network composed of 1-dimensional and 2-dimensional CNNs. The 2-dimensional convolutions are applied in the down-sampling and up-sampling modules, which preserves the temporal structure while effectively capturing the overall relationships and directionality of the input features. The 1-dimensional convolutions are applied in the residual blocks and are responsible for the main conversion process. Before the feature input and after the output, 1×1 convolutions are applied to adjust the channel size; the number of input channels is 2, used to receive m and x'. A gated linear unit (GLU) is used to adaptively learn the sequential and hierarchical structure of the acoustic features. The structure of the generator is shown in Fig. 3.
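A heavily simplified PyTorch sketch of such a 2D-1D-2D generator is given below: a 1×1 input convolution over the two channels (x' and m), GLU-gated 2-D down-sampling, 1-D residual blocks for the main conversion, and 2-D up-sampling back to the Mel shape. The layer widths, kernel sizes, and the omission of the F0' conditioning and of normalization layers are illustrative assumptions, not the patent's exact configuration.

```python
import torch
import torch.nn as nn

class GLUConv2d(nn.Module):
    """2-D convolution followed by a gated linear unit (the gate halves the channels)."""
    def __init__(self, c_in, c_out, kernel, stride, pad):
        super().__init__()
        self.conv = nn.Conv2d(c_in, 2 * c_out, kernel, stride, pad)

    def forward(self, x):
        a, b = self.conv(x).chunk(2, dim=1)
        return a * torch.sigmoid(b)

class ResBlock1d(nn.Module):
    """1-D gated residual block responsible for the main conversion."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv1d(channels, 2 * channels, 3, 1, 1)
        self.out = nn.Conv1d(channels, channels, 3, 1, 1)

    def forward(self, x):
        a, b = self.gate(x).chunk(2, dim=1)
        return x + self.out(a * torch.sigmoid(b))

class Generator(nn.Module):
    """Simplified 2D -> 1D -> 2D fully convolutional generator."""
    def __init__(self, n_mels=80, base=64, n_res=6):
        super().__init__()
        self.inp = nn.Conv2d(2, base, 1)                       # 2 input channels: x' and m
        self.down = nn.Sequential(GLUConv2d(base, 2 * base, 4, 2, 1),
                                  GLUConv2d(2 * base, 4 * base, 4, 2, 1))
        c_flat = 4 * base * (n_mels // 4)
        self.to_1d = nn.Conv1d(c_flat, 256, 1)                 # 1x1 conv adjusts channel size
        self.res = nn.Sequential(*[ResBlock1d(256) for _ in range(n_res)])
        self.to_2d = nn.Conv1d(256, c_flat, 1)
        self.up = nn.Sequential(nn.Upsample(scale_factor=2),
                                GLUConv2d(4 * base, 2 * base, 3, 1, 1),
                                nn.Upsample(scale_factor=2),
                                GLUConv2d(2 * base, base, 3, 1, 1))
        self.out = nn.Conv2d(base, 1, 1)                       # 1x1 conv back to one channel

    def forward(self, x_masked, mask):                         # both (B, n_mels, T), T % 4 == 0
        h = self.down(self.inp(torch.stack([x_masked, mask], dim=1)))
        b, c, f, t = h.shape
        h = self.res(self.to_1d(h.reshape(b, c * f, t)))       # 2-D features -> 1-D sequence
        h = self.to_2d(h).reshape(b, c, f, t)                  # back to 2-D
        return self.out(self.up(h)).squeeze(1)                 # (B, n_mels, T)
```

For example, Generator()(torch.randn(1, 80, 128), torch.ones(1, 80, 128)) returns a (1, 80, 128) converted spectrogram.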
The discriminator is a 2-dimensional convolutional neural network used to discriminate data based on 2-dimensional spectral texture. It mainly consists of a down-sampling module, gated linear units, and convolutional layers; the last convolutional layer reduces the number of parameters and stabilizes the training of the GAN model. Its structure is shown in Fig. 4.
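A correspondingly simplified discriminator sketch follows, with a first convolutional layer, GLU-gated down-sampling, and a last convolutional layer producing a patch of real/fake scores; the layer sizes are again assumptions.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """2-D CNN that judges real vs. converted Mel spectrograms from their spectral texture."""
    def __init__(self, base=64):
        super().__init__()
        def glu_down(c_in, c_out):                              # stride-2 conv + gated linear unit
            return nn.Sequential(nn.Conv2d(c_in, 2 * c_out, 4, 2, 1), nn.GLU(dim=1))
        self.net = nn.Sequential(
            nn.Conv2d(1, base, 3, 1, 1),                        # first convolutional layer
            glu_down(base, 2 * base),                           # GLU-gated down-sampling module
            glu_down(2 * base, 4 * base),
            glu_down(4 * base, 8 * base),
            nn.Conv2d(8 * base, 1, 3, 1, 1))                    # last convolutional layer

    def forward(self, mel):                                     # mel: (B, n_mels, T)
        return torch.sigmoid(self.net(mel.unsqueeze(1)))        # patch of real/fake probabilities
```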
4. Voice conversion process
The method proposed by the preferred embodiment of the invention consists mainly of two parts. The first part adds a zero-centered gradient penalty to the training process of CycleGAN, alleviating the vanishing-gradient problem of GAN training, and uses the trained CycleGAN model to synthesize the target Mel spectrum from the source speaker's Mel spectrum, the time mask, and the fundamental frequency. The second part fills in the frames missing from the extracted Mel spectrum: the created time mask is multiplied with the Mel spectrum to obtain the voice features used for conversion. Finally, the converted target Mel spectrogram is sent to a vocoder to synthesize the voice waveform and obtain the converted speech, i.e., speech that carries the identity information of the target speaker while preserving the content of the source speaker.
The specific process of the voice conversion in this embodiment is as follows:
(1) acquiring a voice database of the source speaker, and extracting the Mel spectrogram x of the source speaker as the voice feature for conversion.
(2) creating a time mask m of the same size as the source speaker's Mel spectrogram x, applying m to x, and filling the missing frames of x to obtain x'.
(3) extracting the pitch frequency F0 of the source speaker, and converting F0 into the fundamental frequency F0' of the target speaker through the logarithmic Gaussian normalized transformation:
F0' = exp( (σ_y / σ_x)·(log F0 - μ_x) + μ_y )    (14)
where μ_x, σ_x and μ_y, σ_y are the mean and standard deviation of the source speaker's and the target speaker's F0 on a logarithmic scale, respectively.
(4) training the CycleGAN model, adding a gradient penalty to the adversarial loss, i.e.:
L'_adv = L_adv + (γ/2)·E_{x~P_D(x)}[ ||∇D(x)||^2 ]    (15)
(5) the overall objective function becomes:
L = L'_adv + L_adv2 + λ_cyc·L_cyc + λ_id·L_id    (16)
(6) inputting x' and F0' obtained in steps (2) and (3), together with the created time mask m, into the generator G_{X→Y}; F0' is used as an auxiliary feature to steer the conversion direction of the Mel spectrogram, and the generator converts x' into the Mel spectrogram y' of the target voice.
(7) feeding the converted Mel spectrogram y' obtained in the previous step into a vocoder to synthesize the voice waveform, obtaining high-quality speech similar to the target speaker's.
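Steps (2), (3), (6), and (7) at conversion time can be summarized in the following NumPy sketch; the all-ones mask at inference, the placeholder generator and vocoder callables, and the voiced-frame handling in the F0 transformation of equation (14) are assumptions of this illustration, not details given in the patent.

```python
import numpy as np

def convert_f0(f0_src, mu_x, sigma_x, mu_y, sigma_y):
    """Log-Gaussian normalized F0 transformation of equation (14), applied to voiced frames only."""
    f0_tgt = np.zeros_like(f0_src, dtype=np.float64)
    voiced = f0_src > 0
    f0_tgt[voiced] = np.exp((np.log(f0_src[voiced]) - mu_x) / sigma_x * sigma_y + mu_y)
    return f0_tgt

def convert_utterance(mel_src, f0_src, f0_stats, generator, vocoder):
    """Inference pipeline: mask, F0 conversion, Mel conversion, waveform synthesis."""
    m = np.ones_like(mel_src)                   # assumption: no frames are masked out at test time
    x_in = mel_src * m                          # x' = x · m
    f0_tgt = convert_f0(f0_src, *f0_stats)      # f0_stats = (mu_x, sigma_x, mu_y, sigma_y)
    mel_conv = generator(x_in, m, f0_tgt)       # y' = G_{X→Y}(x', m, F0')
    return vocoder(mel_conv, f0_tgt)            # synthesize the waveform (e.g. with a MelGAN vocoder)
```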
The foregoing is illustrative only of the preferred embodiments of the invention and the technical principles applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions can be made without departing from the scope of the invention. Therefore, although the present invention has been described in some detail by way of the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from its spirit, the scope of the invention being determined by the appended claims.
Claims (6)
1. A speech conversion method for non-parallel corpora, characterized by comprising the following steps:
(1) acquiring a voice database of a source speaker, and extracting a Mel spectrogram x of the source speaker as a voice feature for conversion;
(2) creating a time mask m of the same size as the source speaker's Mel spectrogram x, applying m to x, and filling the missing frames of x to obtain the frame-filled Mel spectrogram x' of the source speech;
(3) extracting the pitch frequency F0 of the source speaker, and converting F0 into the fundamental frequency F0' of the target speaker through the logarithmic Gaussian normalized transformation:
F0' = exp( (σ_y / σ_x)·(log F0 - μ_x) + μ_y )    (14)
where μ_x, σ_x and μ_y, σ_y are the mean and standard deviation of the source speaker's and the target speaker's F0 on a logarithmic scale, respectively;
(4) training the CycleGAN model, adding a gradient penalty to the adversarial loss, i.e.:
L'_adv = L_adv + (γ/2)·E_{x~P_D(x)}[ ||∇D(x)||^2 ]    (15)
where E(·) denotes mathematical expectation, P_D(x) is the distribution of the real data, and γ is a hyper-parameter;
(5) the overall objective function becomes:
L = L'_adv + L_adv2 + λ_cyc·L_cyc + λ_id·L_id    (16)
(6) inputting x' obtained in step (2), the fundamental frequency F0' obtained in step (3), and the created time mask m into the generator G_{X→Y}; F0' is used as an auxiliary feature to steer the conversion direction of the Mel spectrogram, and the generator converts x' into the Mel spectrogram y' of the target voice;
(7) feeding the converted Mel spectrogram y' obtained in the previous step into a vocoder to synthesize the voice waveform, so as to obtain speech similar to the target speaker's.
2. The method as claimed in claim 1, wherein step (2) is specifically as follows: given the input Mel spectrogram x of the source speaker, a time mask m of the same size as x is used, in which part of the values are 0 and the rest are 1, the zero region being determined randomly according to a preset rule; the time mask m is applied element-wise to the source Mel spectrum x, i.e.:
x'=x·m (1)
the generator G_{X→Y} of CycleGAN synthesizes y' from x', m, and the auxiliary feature F0', i.e.:
y' = G_{X→Y}(x', m, F0')    (2)
using m as conditioning information, G_{X→Y} fills in the missing frames, while the auxiliary feature F0' steers the conversion of the Mel spectrum; for the resulting y', an adversarial loss is used to ensure that it is similar to the true target feature;
the inverse generator G_{Y→X} is then used to reconstruct x'', i.e.:
x'' = G_{Y→X}(y', m', F0')    (3)
based on the assumption that the missing frames have been filled by the previous operation, m' is an all-ones matrix; a second adversarial loss is used to ensure that the reconstructed x'' is similar to the original x.
3. The speech conversion method for non-parallel corpora according to claim 1 or 2, wherein in step (4), a gradient penalty is applied to the discriminator on real samples using the R1 zero-centered gradient penalty technique;
the regularization term of the R1 zero-centered gradient penalty is defined as:
R_1 = (γ/2)·E_{x~P_D(x)}[ ||∇D(x)||^2 ]    (4)
4. The speech conversion method for non-parallel corpora according to claim 3, wherein in the CycleGAN model of step (4), the generator G is trained using four losses to learn the mapping between X and Y;
the adversarial loss of discriminator D_Y is expressed as:
L_adv(G_{X→Y}, D_Y) = E_{y~P(Y)}[log D_Y(y)] + E_{x~P(X)}[log(1 - D_Y(G_{X→Y}(x)))]    (5)
the adversarial loss of discriminator D_X is expressed as:
L_adv(G_{Y→X}, D_X) = E_{x~P(X)}[log D_X(x)] + E_{y~P(Y)}[log(1 - D_X(G_{Y→X}(y)))]    (6)
where P(X) and P(Y) are the distributions of the source speech data and the target speech data, respectively;
the adversarial loss of CycleGAN is then:
L_adv = L_adv(G_{X→Y}, D_Y) + L_adv(G_{Y→X}, D_X)    (7)
the cycle-consistency loss preserves the speech content during the conversion process, and its expression is:
L_cyc(G_{X→Y}, G_{Y→X}) = E_{x~P(X)}[ ||G_{Y→X}(G_{X→Y}(x)) - x||_1 ] + E_{y~P(Y)}[ ||G_{X→Y}(G_{Y→X}(y)) - y||_1 ]    (8)
the identity-mapping loss is introduced to better preserve the input, and its expression is:
L_id(G_{X→Y}, G_{Y→X}) = E_{x~P(X)}[ ||G_{Y→X}(x) - x||_1 ] + E_{y~P(Y)}[ ||G_{X→Y}(y) - y||_1 ]    (9)
an additional discriminator D'_X is added, and an adversarial loss on the cycle-converted feature, called the second adversarial loss, is introduced, namely:
L_adv2(G_{X→Y}, G_{Y→X}, D'_X) = E_{x~P(X)}[log D'_X(x)] + E_{x~P(X)}[log(1 - D'_X(G_{Y→X}(G_{X→Y}(x))))]    (10)
similarly, an additional discriminator D'_Y is added for the reverse conversion, and its second adversarial loss is:
L_adv2(G_{X→Y}, G_{Y→X}, D'_Y) = E_{y~P(Y)}[log D'_Y(y)] + E_{y~P(Y)}[log(1 - D'_Y(G_{X→Y}(G_{Y→X}(y))))]    (11)
the second adversarial loss of CycleGAN is then:
L_adv2 = L_adv2(G_{X→Y}, G_{Y→X}, D'_X) + L_adv2(G_{X→Y}, G_{Y→X}, D'_Y)    (12)
the overall objective function of the resulting model is thus:
L = L_adv + L_adv2 + λ_cyc·L_cyc(G_{X→Y}, G_{Y→X}) + λ_id·L_id(G_{X→Y}, G_{Y→X})    (13)
where λ_cyc and λ_id are the hyper-parameters of the cycle-consistency loss and the identity-mapping loss, respectively, adjusting the weights of the corresponding losses during training.
5. The speech conversion method for non-parallel corpora according to claim 4, wherein the generator in the CycleGAN model is a fully convolutional network comprising convolutional layers, gated linear units, 1-dimensional convolutions, and 2-dimensional convolutions; the 2-dimensional convolutions are applied in the down-sampling and up-sampling modules; the 1-dimensional convolutions are applied in the residual blocks and are responsible for the main conversion process; 1×1 convolutional layers are applied before the feature input and after the output to adjust the channel size, with 2 input channels used to receive m and x'; the gated linear units adaptively learn the sequential and hierarchical structure of the acoustic features.
6. The method as claimed in claim 4, wherein the discriminator in the CycleGAN model is a 2-dimensional convolutional neural network used to discriminate data based on 2-dimensional spectral texture, and comprises a down-sampling module, gated linear units, and two convolutional layers; data enter through the first convolutional layer, pass through the gated linear units and the down-sampling module in sequence, and then through the last convolutional layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210156203.4A CN114582363A (en) | 2022-02-21 | 2022-02-21 | High-quality voice conversion method for non-parallel corpus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210156203.4A CN114582363A (en) | 2022-02-21 | 2022-02-21 | High-quality voice conversion method for non-parallel corpus |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114582363A true CN114582363A (en) | 2022-06-03 |
Family
ID=81771061
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210156203.4A Pending CN114582363A (en) | 2022-02-21 | 2022-02-21 | High-quality voice conversion method for non-parallel corpus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114582363A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115294970A (en) * | 2022-10-09 | 2022-11-04 | 苏州大学 | Voice conversion method, device and storage medium for pathological voice |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |