CN114038447B - Training method of speech synthesis model, speech synthesis method, device and medium
- Publication number: CN114038447B
- Authority: CN (China)
- Legal status: Active
Abstract
The application relates to the technical field of artificial intelligence and discloses a training method for a speech synthesis model, comprising the following steps: obtaining a training text and performing phoneme conversion on the training text with a preset speech synthesis model to obtain a phoneme sequence, wherein the speech synthesis model comprises an encoder, a decoder, a residual network and a WaveGlow vocoder; sequentially processing the phoneme sequence with the encoder, the decoder and the residual network to obtain a target Mel spectrum; performing parallel audio conversion on the target Mel spectrum with the WaveGlow vocoder to obtain target audio; and performing loss calculation on the target audio to obtain a training loss value, then adjusting the parameters of the speech synthesis model according to the loss value to obtain the target speech synthesis model. The application also relates to a speech synthesis method, a device, equipment and a storage medium. The application can improve the accuracy of the speech synthesis model and accelerate the speed at which it synthesizes speech.
Description
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a training method for a speech synthesis model, a speech synthesis method, a device, an electronic apparatus, and a storage medium.
Background
Speech synthesis, the technique of converting arbitrary input text into corresponding speech, is an important research branch in the field of natural language processing.
At present, a commonly used speech synthesis model is the Tacotron model. Tacotron takes character embeddings as its input during training, so a large training corpus is needed to ensure that the trained model does not produce pronunciation errors. In practice, however, a large training corpus is often unavailable, which lowers speech synthesis accuracy; and even when a large corpus is obtained, synthesis speed suffers. In addition, the Tacotron model adopts WaveNet as its vocoder, and WaveNet generates the speech waveform sample point by sample point, so synthesis is very slow and the accuracy is not high.
Disclosure of Invention
In order to solve the above technical problems or at least partially solve the above technical problems, the present application provides a training method of a speech synthesis model, a speech synthesis method, a device, an electronic apparatus and a storage medium.
In a first aspect, the present application provides a method for training a speech synthesis model, the method comprising:
obtaining a training text, and performing phoneme conversion on the training text by using a preset speech synthesis model to obtain a phoneme sequence, wherein the speech synthesis model comprises an encoder, a decoder, a residual network and a WaveGlow vocoder;
sequentially processing the phoneme sequence by using the encoder, the decoder and the residual network to obtain a target Mel spectrum;
performing parallel audio conversion on the target Mel spectrum by using the WaveGlow vocoder to obtain target audio;
and performing loss calculation on the target audio to obtain a training loss value, and adjusting parameters of the speech synthesis model according to the loss value to obtain the target speech synthesis model.
Optionally, sequentially processing the phoneme sequence by using the encoder, the decoder and the residual network to obtain a target Mel spectrum includes:
extracting context features of the phoneme sequence by using the encoder to obtain a hidden feature matrix;
predicting the Mel spectrum of the training text by using the decoder according to the hidden feature matrix to obtain a predicted Mel spectrum;
and performing residual connection on the predicted Mel spectrum by using the residual network to obtain a target Mel spectrum.
Optionally, the decoder includes an attention network and a post-processing network, and predicting the Mel spectrum of the training text by using the decoder according to the hidden feature matrix to obtain a predicted Mel spectrum includes:
extracting a context vector from the hidden feature matrix by using the attention network to obtain a context vector of a first current time step;
concatenating the context vector of the first current time step with a preset Mel spectrum, and inputting the concatenated result into a preset two-layer long short-term memory (LSTM) stack to obtain a context vector of a second current time step;
performing a first linear projection on the context vector of the second current time step by using the post-processing network to obtain a context scalar of the current time step;
performing a second linear projection on the context vector of the second current time step by using the post-processing network, and performing Mel spectrum prediction on the projected context vector to obtain the Mel spectrum of the second current time step;
calculating the probability that the Mel spectrum prediction is complete by using a preset first activation function according to the context scalar of the current time step;
judging whether the probability that the Mel spectrum prediction is complete is smaller than a preset threshold value;
and when the probability is smaller than the threshold value, concatenating the context vector of the second current time step with the Mel spectrum of the second current time step and returning to the step of inputting the result into the preset two-layer LSTM stack, until the probability is no longer smaller than the threshold value, at which point the Mel spectrum prediction ends and the predicted Mel spectrum is obtained.
Optionally, the attention network includes an attention weight value, a linear layer, a second activation function, and a mapping function, and the extracting, by using a preset attention network, a context vector in the hidden feature matrix to obtain a context vector of the first current time step includes:
performing linear projection on the hidden feature matrix by using the linear layer to obtain a key matrix;
inputting the attention weight value into a preset convolution layer to generate a position feature matrix;
performing linear projection on the position feature matrix by using the linear layer to obtain an additional feature matrix;
adding the additional feature matrix and the key matrix, and processing an addition result by utilizing the second activation function to obtain an attention probability vector;
mapping the attention probability vector by using the mapping function to obtain a current attention weight vector;
And multiplying the current attention weight vector with the hidden feature matrix to obtain a context vector of the first current time step.
Optionally, concatenating the context vector of the first current time step with a preset Mel spectrum and inputting the concatenated result into a preset two-layer long short-term memory (LSTM) stack to obtain a context vector of a second current time step includes:
concatenating the context vector of the first current time step with the preset Mel spectrum and inputting the result into one of the LSTM layers to obtain a query vector;
concatenating the query vector with the context vector of the first current time step and inputting the result into the other LSTM layer to obtain a decoder hidden state;
and concatenating the decoder hidden state with the context vector of the first current time step to obtain the context vector of the second current time step.
Optionally, the performing residual connection on the predicted mel spectrum by using the residual network to obtain a target mel spectrum includes:
Residual calculation is carried out on the predicted Mel frequency spectrum by using a preset residual network, so as to obtain a residual Mel frequency spectrum;
and superposing the residual Mel frequency spectrum and the predicted Mel frequency spectrum to obtain a target Mel frequency spectrum.
In a second aspect, the present application provides a method of speech synthesis, the method comprising:
acquiring a voice text to be synthesized;
and performing voice synthesis on the voice text to be synthesized by using a target voice synthesis model to obtain synthesized voice, wherein the target voice synthesis model is obtained by training by adopting the training method of the voice synthesis model.
In a third aspect, the present application provides a training apparatus for a speech synthesis model, the apparatus comprising:
the training text conversion module, used for obtaining a training text and performing phoneme conversion on the training text by using a preset speech synthesis model to obtain a phoneme sequence, wherein the speech synthesis model comprises an encoder, a decoder, a residual network and a WaveGlow vocoder;
the target audio generation module, used for sequentially processing the phoneme sequence by using the encoder, the decoder and the residual network to obtain a target Mel spectrum, and performing parallel audio conversion on the target Mel spectrum by using the WaveGlow vocoder to obtain target audio;
the model loss value calculation module, used for performing loss calculation on the target audio to obtain a training loss value, and adjusting parameters of the speech synthesis model according to the loss value to obtain a target speech synthesis model.
In a fourth aspect, the present application provides a speech synthesis apparatus, the apparatus comprising:
The text acquisition module is used for acquiring a voice text to be synthesized;
the model speech synthesis module, used for performing speech synthesis on the speech text to be synthesized by using a target speech synthesis model to obtain synthesized speech, wherein the target speech synthesis model is trained by using the above training apparatus for the speech synthesis model.
In a fifth aspect, an electronic device is provided, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the steps of the method for training a speech synthesis model according to any one of the embodiments of the first aspect or implement the steps of the method for speech synthesis according to the second aspect when executing the program stored in the memory.
In a sixth aspect, a computer readable storage medium is provided, on which a computer program is stored, which computer program, when executed by a processor, implements the steps of the method for training a speech synthesis model according to any one of the embodiments of the first aspect, or implements the steps of the method for speech synthesis according to the second aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
According to the training method, apparatus, device and storage medium for a speech synthesis model provided by the embodiments of the application, the training text is converted into a phoneme sequence to obtain the pronunciation attributes of each word in the training text. This avoids pronunciation errors caused by polyphonic characters, reduces the amount of training corpus required, and improves the accuracy of the speech synthesis model. Further, the phoneme sequence is processed sequentially by an encoder, a decoder and a residual network to obtain a target Mel spectrum, and finally the target Mel spectrum is converted to target audio in parallel by a WaveGlow vocoder. Because WaveGlow converts the target Mel spectrum into a speech waveform in parallel, slow sample-by-sample generation is avoided and synthesis is accelerated. The training method, apparatus, device and storage medium therefore improve the accuracy of the speech synthesis model and accelerate the speed at which it synthesizes speech.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flow chart of a training method of a speech synthesis model according to an embodiment of the present application;
FIG. 2 is a flow chart of a speech synthesis method according to an embodiment of the present application;
FIG. 3 is a schematic block diagram of a training device for a speech synthesis model according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of a speech synthesis apparatus according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the internal structure of an electronic device for implementing a training method of a speech synthesis model according to an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Fig. 1 is a flow chart of a training method of a speech synthesis model according to an embodiment of the present application. In this embodiment, the training method of the speech synthesis model includes:
S11, obtaining a training text, and performing phoneme conversion on the training text by using a preset speech synthesis model to obtain a phoneme sequence, wherein the speech synthesis model comprises an encoder, a decoder, a residual network and a WaveGlow vocoder.
In the embodiment of the invention, the training text may come from various sources and be of various text types, including Chinese, English and the like. A phoneme is the smallest phonetic unit, divided according to the natural attributes of speech and analyzed according to the pronunciation actions within a syllable, where one action forms one phoneme; for example, the phonemes of a Chinese character may be its pinyin and tone.
In an alternative embodiment of the invention, when an English text is selected as the training text, the same letters may be pronounced differently in different words, and these complex pronunciation rules can only be learned from a large amount of training data. When a neural network learns these rules from insufficient training data, it is difficult to learn all of them; in particular, rules that occur too rarely in the training data will not be learned sufficiently, which inevitably causes pronunciation errors in some synthesized speech. Similarly, Chinese characters exhibit the same phenomenon, such as polyphonic characters. Phonemes are the most basic units of pronunciation and reflect pronunciation attributes, so converting the training text into a phoneme sequence as input avoids such pronunciation errors.
In detail, performing phoneme conversion on the training text by using a preset speech synthesis model to obtain a phoneme sequence includes the following steps (a code sketch follows the list):
performing language analysis on the training text by using a language analysis tool, and determining the language of the training text;
performing sentence segmentation on the training text according to the determined language to obtain sentence-segmented text;
converting non-characters in the sentence-segmented text into characters according to a preset text format rule;
performing word segmentation on the sentence-segmented text to obtain word-segmented text;
mapping the word-segmented text according to a preset word-phoneme mapping dictionary to obtain phonemes;
performing vector conversion on the phonemes to obtain phoneme vectors;
and encoding and ordering the phoneme vectors according to the text order to obtain the phoneme sequence.
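The steps above map directly onto a small preprocessing pipeline. The following Python sketch is a minimal illustration under stated assumptions: the sentence splitter, digit table and toy pronunciation lexicon are hypothetical stand-ins for the real language-analysis tool, text format rules and word-phoneme mapping dictionary, none of which the text specifies.

```python
import re

# Toy word-to-phoneme mapping dictionary (a real one would be a large
# pronunciation lexicon, e.g. pinyin-with-tone entries for Chinese words).
WORD_TO_PHONEMES = {
    "你好": ["ni3", "hao3"], "世界": ["shi4", "jie4"],
    "一": ["yi1"], "二": ["er4"], "三": ["san1"],
}
DIGIT_WORDS = {"1": "一", "2": "二", "3": "三"}  # non-character -> character rule

def text_to_phoneme_sequence(training_text):
    phonemes = []
    # Sentence segmentation (a trivial splitter stands in for a real one).
    for sentence in filter(None, re.split(r"[。.!?！？]", training_text)):
        # Convert non-characters such as Arabic numerals into characters.
        sentence = re.sub(r"\d", lambda m: DIGIT_WORDS.get(m.group(), ""), sentence)
        # Greedy word segmentation against the lexicon, longest match first.
        i = 0
        while i < len(sentence):
            for n in (2, 1):
                word = sentence[i:i + n]
                if word in WORD_TO_PHONEMES:
                    phonemes.extend(WORD_TO_PHONEMES[word])  # word -> phonemes
                    i += n
                    break
            else:
                i += 1  # character missing from the toy lexicon, skip it
    # Encode each phoneme as an integer id, preserving the text order.
    vocab = {p: idx for idx, p in enumerate(sorted(set(phonemes)))}
    return [vocab[p] for p in phonemes]

print(text_to_phoneme_sequence("你好，世界。有3朵花。"))  # integer-coded phonemes in text order
```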
According to the embodiment of the invention, language analysis of the training text determines its pronunciation rules; word segmentation then allows the text to be accurately converted into phonemes; finally, the phonemes are encoded and ordered to obtain the phoneme sequence. Keeping the phonemes in text order guarantees the accuracy of the phoneme sequence, avoids confusing the pronunciation of the training text, and improves the accuracy of speech synthesis.
In the embodiment of the present invention, the preset text format rule may specify that any Arabic numerals in the training text are converted into words and normalized according to the set rule. For example, in "there are 123 flowers", "123" is an Arabic numeral and needs to be converted into its spoken word form ("one hundred and twenty-three"), which facilitates the subsequent conversion of text into phonemes.
In another alternative embodiment of the present invention, the phoneme conversion may be implemented by using an open-source grapheme-to-phoneme conversion tool G2P, so as to obtain a phoneme sequence.
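For English text, the open-source g2p_en package is one such grapheme-to-phoneme tool. Assuming it is installed (pip install g2p_en), a conversion looks roughly like this; the output shown in the comment is indicative, not exact:

```python
from g2p_en import G2p

g2p = G2p()
# Numbers are normalized to words before conversion; the result is a list of
# ARPAbet phonemes with stress digits, with spaces separating the words.
print(g2p("There are 123 flowers"))
# e.g. ['DH', 'EH1', 'R', ' ', 'AA1', 'R', ' ', 'W', 'AH1', 'N', ...]
```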
S12, sequentially processing the phoneme sequence by using the encoder, the decoder and the residual network to obtain a target Mel spectrum.
In an embodiment of the invention, the encoder comprises convolution layers and a bidirectional long short-term memory (BiLSTM) network. The decoder may be an autoregressive recurrent neural network including an attention network and a post-processing network. The residual network comprises convolution layers together with their activation and normalization functions.
In the embodiment of the invention, the conversion from text data to audio data is realized by converting the phoneme sequence into the target Mel frequency spectrum.
In detail, sequentially processing the phoneme sequence by using the encoder, the decoder and the residual network to obtain a target Mel spectrum includes:
extracting context features of the phoneme sequence by using the encoder to obtain a hidden feature matrix;
predicting the Mel spectrum of the training text by using the decoder according to the hidden feature matrix to obtain a predicted Mel spectrum;
and performing residual connection on the predicted Mel spectrum by using the residual network to obtain a target Mel spectrum.
In the embodiment of the present invention, the hidden feature matrix includes information such as a context vector of the phoneme sequence.
In the embodiment of the invention, the meaning of each word in the training text is often closely related to its context. For example, in the sentence "我爱好中文" ("I love Chinese"), the character "好" has two pronunciations (hǎo and hào), and its reading cannot be determined by analyzing the character alone, which easily causes pronunciation errors; the context feature information of each word therefore needs to be extracted.
In detail, the encoder includes convolution layers and a bidirectional long short-term memory (BiLSTM) network, and extracting the context features of the phoneme sequence by using the encoder to obtain a hidden feature matrix includes:
convolving the phoneme sequence with a preset number of convolution layers to obtain a feature matrix of the phoneme sequence;
applying rectified linear unit (ReLU) activation and batch normalization to the feature matrix to obtain an optimized feature matrix;
and processing the optimized feature matrix with the preset bidirectional long short-term memory network to obtain the hidden feature matrix.
In the embodiment of the invention, the bidirectional long short-term memory network can be used to acquire and store the context vectors of the phoneme sequence.
In the embodiment of the invention, the encoder performs feature extraction on the phoneme sequence to obtain the hidden feature matrix, which contains information such as the context vectors of the phoneme sequence. Obtaining the hidden feature matrix therefore yields the context features of the phoneme sequence, strengthening the influence of context on each phoneme and improving the pronunciation accuracy of the speech synthesis model. A minimal encoder sketch follows.
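The sketch below illustrates such an encoder in PyTorch, assuming Tacotron 2-style sizes (512-dimensional embeddings, three 5-wide convolution layers, 256 LSTM units per direction); these dimensions are illustrative assumptions, not values fixed by the text.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Convolution stack + bidirectional LSTM over embedded phonemes."""
    def __init__(self, vocab_size, dim=512, n_convs=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=5, padding=2),
                nn.BatchNorm1d(dim),  # batch normalization
                nn.ReLU(),            # rectified linear unit activation
            )
            for _ in range(n_convs)
        )
        # Bidirectional LSTM; two dim//2 directions give dim-sized hidden features.
        self.bilstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, phoneme_ids):                      # (batch, seq_len)
        x = self.embedding(phoneme_ids).transpose(1, 2)  # (batch, dim, seq)
        for conv in self.convs:
            x = conv(x)
        hidden, _ = self.bilstm(x.transpose(1, 2))       # (batch, seq, dim)
        return hidden                                    # hidden feature matrix

enc = Encoder(vocab_size=100)
hidden = enc(torch.randint(0, 100, (2, 17)))  # -> (2, 17, 512) hidden feature matrix
```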
Further, the decoder includes an attention network and a post-processing network, and predicting the Mel spectrum of the training text by using the decoder according to the hidden feature matrix to obtain a predicted Mel spectrum includes:
extracting a context vector from the hidden feature matrix by using the attention network to obtain a context vector of a first current time step;
concatenating the context vector of the first current time step with a preset Mel spectrum, and inputting the concatenated result into a preset two-layer long short-term memory (LSTM) stack to obtain a context vector of a second current time step;
performing a first linear projection on the context vector of the second current time step by using the post-processing network to obtain a context scalar of the current time step;
performing a second linear projection on the context vector of the second current time step by using the post-processing network, and performing Mel spectrum prediction on the projected context vector to obtain the Mel spectrum of the second current time step;
calculating the probability that the Mel spectrum prediction is complete by using a preset first activation function according to the context scalar of the current time step;
judging whether the probability that the Mel spectrum prediction is complete is smaller than a preset threshold value;
and when the probability is smaller than the threshold value, concatenating the context vector of the second current time step with the Mel spectrum of the second current time step and returning to the step of inputting the result into the preset two-layer LSTM stack, until the probability is no longer smaller than the threshold value, at which point the Mel spectrum prediction ends and the predicted Mel spectrum is obtained.
In the embodiment of the present invention, the attention network includes the position-sensitive attention mechanism and the two-layer long short-term memory stack, and is mainly used for determining which part of the encoder output needs to be attended to. The first activation function may be a sigmoid function.
The extracting the context vector in the hidden feature matrix by using a position sensitive attention mechanism in a preset attention network to obtain a context vector of a first current time step includes:
performing linear projection on the hidden feature matrix by using the linear layer to obtain a key matrix;
inputting the attention weight value into a preset convolution layer to generate a position feature matrix;
performing linear projection on the position feature matrix by using the linear layer to obtain an additional feature matrix;
adding the additional feature matrix and the key matrix, and processing an addition result by utilizing the second activation function to obtain an attention probability vector;
mapping the attention probability vector by using the mapping function to obtain a current attention weight vector;
And multiplying the current attention weight vector with the hidden feature matrix to obtain a context vector of the first current time step.
Here, the attention weight value may be obtained by concatenating the attention weights of the previous time step with the cumulative sum of all previous attention weights. The second activation function may be a Tanh function. The mapping function may be a softmax function.
For example, the hidden feature matrix is fed into a 128-unit linear layer for linear projection to generate the key matrix. The attention weights of the previous time step, concatenated with the cumulative sum of all previous attention weights, are fed into a convolution layer of 32 one-dimensional convolution kernels of length 31 to generate the position features. The position features are processed by another 128-unit linear layer and added to the key matrix as additional features; the sum is passed through a Tanh function and then through a linear layer to generate the attention probability vector. Finally, the attention probability vector is processed by a softmax function to obtain the current attention weight vector. A code sketch of this computation follows.
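Translated into code, the computation reads as below. Note that the description above scores with keys plus location features only; a full Tacotron 2-style implementation would also add a projected query term, which is omitted here to stay faithful to the text. The 128 attention units and 32 kernels of length 31 follow the example; the 512-dimensional memory size is an assumption.

```python
import torch
import torch.nn as nn

class LocationSensitiveAttention(nn.Module):
    def __init__(self, hidden_dim=512, attn_dim=128, n_filters=32, kernel_size=31):
        super().__init__()
        self.key_proj = nn.Linear(hidden_dim, attn_dim, bias=False)  # 128-unit linear layer
        self.location_conv = nn.Conv1d(2, n_filters, kernel_size,
                                       padding=(kernel_size - 1) // 2, bias=False)
        self.location_proj = nn.Linear(n_filters, attn_dim, bias=False)
        self.score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, memory, prev_weights, cum_weights):
        # memory: (batch, seq, hidden_dim) -- the encoder's hidden feature matrix.
        keys = self.key_proj(memory)                                 # key matrix
        # Previous-step weights concatenated with their accumulation, convolved
        # into a position feature matrix, then linearly projected.
        loc = torch.stack([prev_weights, cum_weights], dim=1)        # (batch, 2, seq)
        extra = self.location_proj(self.location_conv(loc).transpose(1, 2))
        # Add to the keys, pass through Tanh, project to scores, map with softmax.
        energies = self.score(torch.tanh(keys + extra)).squeeze(-1)
        weights = torch.softmax(energies, dim=-1)                    # attention weight vector
        context = torch.bmm(weights.unsqueeze(1), memory).squeeze(1) # context vector
        return context, weights

attn = LocationSensitiveAttention()
mem = torch.randn(2, 17, 512)
w = torch.zeros(2, 17)
ctx, w = attn(mem, w, w)   # ctx: (2, 512) context vector of the current time step
```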
In addition, concatenating the context vector of the first current time step with a preset Mel spectrum and inputting the concatenated result into a preset two-layer long short-term memory stack to obtain a context vector of a second current time step includes:
concatenating the context vector of the first current time step with the preset Mel spectrum and inputting the result into one of the LSTM layers to obtain a query vector;
concatenating the query vector with the context vector of the first current time step and inputting the result into the other LSTM layer to obtain a decoder hidden state;
and concatenating the decoder hidden state with the context vector of the first current time step to obtain the context vector of the second current time step.
Here, the first current time step can be understood as the previous time step in the current loop, and the second current time step as the current time step in the current loop.
For example, the context vector of the previous time step is concatenated with a preset Mel spectrum and fed into the first LSTM layer to obtain the query vector; the query vector is concatenated with the context vector of the previous time step and fed into the second LSTM layer to generate the decoder hidden state; and the decoder hidden state is concatenated once more with the context vector of the previous time step to obtain the context vector of the current time step, as sketched below.
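The sketch below shows one step of this recurrence, assuming Tacotron 2-style sizes (80-bin Mel frames, 512-dimensional context vectors, 1024-unit LSTM cells) and a sigmoid stop gate as the first activation function; these sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, mel_dim=80, ctx_dim=512, rnn_dim=1024):
        super().__init__()
        self.lstm1 = nn.LSTMCell(mel_dim + ctx_dim, rnn_dim)   # first memory layer
        self.lstm2 = nn.LSTMCell(rnn_dim + ctx_dim, rnn_dim)   # second memory layer
        self.mel_proj = nn.Linear(rnn_dim + ctx_dim, mel_dim)  # second linear projection
        self.stop_proj = nn.Linear(rnn_dim + ctx_dim, 1)       # first linear projection

    def forward(self, prev_mel, prev_ctx, state1=None, state2=None):
        # Concatenate the previous context vector with the previous Mel frame and
        # feed the result into the first LSTM layer to obtain the query vector.
        query, cell1 = self.lstm1(torch.cat([prev_mel, prev_ctx], dim=-1), state1)
        # The query concatenated with the previous context enters the second LSTM,
        # producing the decoder hidden state.
        hidden, cell2 = self.lstm2(torch.cat([query, prev_ctx], dim=-1), state2)
        # The decoder hidden state concatenated with the context once more gives
        # the context vector of the current time step.
        ctx = torch.cat([hidden, prev_ctx], dim=-1)
        mel = self.mel_proj(ctx)                        # Mel spectrum of this step
        stop_prob = torch.sigmoid(self.stop_proj(ctx))  # completion probability
        return mel, stop_prob, (query, cell1), (hidden, cell2)

step = DecoderStep()
mel, ctx = torch.zeros(1, 80), torch.zeros(1, 512)
mel, stop_prob, s1, s2 = step(mel, ctx)
```

In use, the attention module from the previous sketch recomputes the context vector at every step, and the autoregressive loop terminates once stop_prob is no longer smaller than the preset threshold.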
In the embodiment of the invention, because the predicted Mel spectrum is produced by fusing the context vector of the previous time step at every step, it easily loses the features of the current word itself, leading to errors in word-sense understanding and hence pronunciation errors. Residual connection is therefore applied to the predicted Mel spectrum to deepen the influence of each word on itself, improving the accuracy of word-sense understanding and reducing pronunciation errors.
Accordingly, performing residual connection on the predicted Mel spectrum to obtain a target Mel spectrum includes:
Residual calculation is carried out on the predicted Mel frequency spectrum by using a preset residual network, so as to obtain a residual Mel frequency spectrum;
and superposing the residual Mel frequency spectrum and the predicted Mel frequency spectrum to obtain a target Mel frequency spectrum.
In an alternative embodiment of the invention, the predicted Mel spectrum obtained through N decoding steps is fed into a residual network, which generates a residual that is superposed with the input to produce the target Mel spectrum. The residual network consists of 5 convolution layers, each composed of 512 convolution kernels of shape 5×1 and followed by batch normalization; all layers except the last are activated with a Tanh activation function. A minimal sketch follows.
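Under the figures just given, a minimal residual post-network sketch; the final layer projects back to the Mel dimension so that the residual can be superposed on the prediction, an assumption consistent with the superposition step.

```python
import torch
import torch.nn as nn

class ResidualPostNet(nn.Module):
    """Five-layer convolutional residual network over predicted Mel spectra."""
    def __init__(self, mel_dim=80, channels=512, n_layers=5, kernel_size=5):
        super().__init__()
        layers = []
        for i in range(n_layers):
            in_ch = mel_dim if i == 0 else channels
            out_ch = mel_dim if i == n_layers - 1 else channels
            block = [nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
                     nn.BatchNorm1d(out_ch)]  # batch normalization after every layer
            if i < n_layers - 1:
                block.append(nn.Tanh())       # Tanh on all but the last layer
            layers.append(nn.Sequential(*block))
        self.net = nn.Sequential(*layers)

    def forward(self, predicted_mel):          # (batch, mel_dim, frames)
        residual = self.net(predicted_mel)     # residual Mel spectrum
        return predicted_mel + residual        # superposed target Mel spectrum

postnet = ResidualPostNet()
target = postnet(torch.randn(2, 80, 100))      # -> (2, 80, 100)
```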
S13, performing parallel audio conversion on the target Mel spectrum by using the WaveGlow vocoder to obtain target audio.
In the traditional speech synthesis model, WaveNet is adopted as the vocoder to convert the target Mel spectrum into target audio. The main structure of the WaveNet vocoder is a stack of dilated convolutions, which enlarges the receptive field of the whole network. Because WaveNet is autoregressive, predicting the acoustic waveform at the current time step must be conditioned on the previously generated waveform, so its speech synthesis is very slow.
In the embodiment of the invention, WaveGlow is selected as the vocoder to convert the target Mel spectrum into target audio. The WaveGlow vocoder is a flow-based model that can generate high-quality audio samples in parallel, thereby improving the speed of speech synthesis.
In detail, performing parallel audio conversion on the target Mel spectrum by using the WaveGlow vocoder in the speech synthesis model to obtain target audio includes:
performing parallel speech waveform conversion on the target Mel spectrum by using the WaveGlow vocoder to obtain a target speech waveform;
and performing audio conversion on the target speech waveform to obtain the target audio.
In the embodiment of the invention, performing audio conversion on the target speech waveform comprises sampling, quantizing and encoding the target speech waveform signal to obtain the target audio. Sampling discretizes the continuous waveform signal along the time axis, and quantizing the sampled signal converts each amplitude sample from a continuous value into a discrete-valued representation. A sketch of this stage follows.
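The sketch below assumes a pretrained WaveGlow checkpoint in the style of NVIDIA's open-source implementation, whose checkpoint stores the model under a "model" key and whose model object exposes an infer(mel, sigma) method; the checkpoint path, 22,050 Hz sampling rate and 16-bit quantization are illustrative assumptions.

```python
import torch
import numpy as np
from scipy.io.wavfile import write

# Assumed checkpoint layout (NVIDIA-style): model object under the "model" key.
waveglow = torch.load("waveglow_checkpoint.pt")["model"].eval()

target_mel = torch.zeros(1, 80, 100)  # placeholder for the residual network's output

with torch.no_grad():
    audio = waveglow.infer(target_mel, sigma=0.666)  # parallel waveform generation

# Audio conversion: quantize the waveform samples to 16-bit integers and encode
# them as a WAV file; the 22,050 Hz sampling rate is an assumed value.
pcm = (audio.squeeze().cpu().numpy() * 32767).astype(np.int16)
write("target_audio.wav", 22050, pcm)
```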
S14, carrying out loss calculation on the target audio to obtain a training loss value, and adjusting parameters of the speech synthesis model according to the loss value to obtain the target speech synthesis model.
In the embodiment of the invention, a loss function in the speech synthesis model is used to perform loss calculation on the target audio to obtain a training loss value, and the parameters of the speech synthesis model are adjusted according to the loss value until the loss value is smaller than a preset threshold, yielding the target speech synthesis model. A condensed sketch of this training step follows.
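In the sketch below, the MSE loss, Adam optimizer, learning rate and threshold are illustrative assumptions, since the text does not specify the loss function's form; model and loader stand for the assembled speech synthesis model and the training-text data loader, defined elsewhere.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer
loss_threshold = 1e-3                                      # preset threshold (assumed)

for text_batch, target_audio in loader:
    pred_audio = model(text_batch)                # phonemes -> Mel -> WaveGlow audio
    loss = F.mse_loss(pred_audio, target_audio)   # training loss value (assumed MSE)
    optimizer.zero_grad()
    loss.backward()                               # gradients for parameter adjustment
    optimizer.step()                              # adjust model parameters
    if loss.item() < loss_threshold:              # stop once the loss is small enough
        break
```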
According to the training method for a speech synthesis model provided by the embodiment of the invention, the training text is converted into a phoneme sequence to obtain the pronunciation attributes of each word, which avoids pronunciation errors caused by polyphonic characters, reduces the amount of training corpus required, and improves the accuracy of the speech synthesis model. The phoneme sequence is then processed sequentially by an encoder, a decoder and a residual network to obtain a target Mel spectrum, and finally the target Mel spectrum is converted to target audio in parallel by a WaveGlow vocoder. Because WaveGlow converts the target Mel spectrum into a speech waveform in parallel, slow sample-by-sample generation is avoided and synthesis is accelerated. The training method provided by the embodiment of the invention therefore improves the accuracy of the speech synthesis model and accelerates the speed at which it synthesizes speech.
Fig. 2 is a flow chart of a speech synthesis method according to an embodiment of the present application. In this embodiment, the speech synthesis method includes:
S21, acquiring a voice text to be synthesized.
In this embodiment, the voice text to be synthesized may be obtained from any channel, for example, the voice text to be synthesized is input by a user or obtained from a database.
The type of the voice text to be synthesized can be a Chinese type or an English type, etc.
The speech text to be synthesized may be paragraph text that needs abstract extraction, where the paragraph text includes a plurality of sentences.
S22, performing voice synthesis on the voice text to be synthesized by using a target voice synthesis model to obtain synthesized voice.
In this embodiment, the target speech synthesis model is obtained by training the speech synthesis model training method described in the foregoing method embodiment.
In this embodiment, text cleaning may be performed on the text of the speech to be synthesized, that is, characters that cannot be pronounced in the text of the speech to be synthesized are deleted, and then the text of the speech to be synthesized is processed by using the target speech synthesis model, so as to obtain the synthesized speech.
In this embodiment, since the target speech synthesis model is obtained by training using the model training method described in the foregoing method embodiment, speech synthesis can be performed on a text to be synthesized by using the target speech synthesis model, so as to obtain synthesized speech.
As shown in fig. 3, an embodiment of the present application provides a schematic block diagram of a training device 30 for a speech synthesis model, where the training device 30 for a speech synthesis model includes: a training text conversion module 31, a target audio generation module 32 and a model loss value calculation module 33.
The training text conversion module 31 is configured to obtain a training text, and perform phoneme conversion on the training text by using a preset speech synthesis model to obtain a phoneme sequence, where the speech synthesis model includes an encoder, a decoder, a residual network and a WaveGlow vocoder;
the target audio generating module 32 is configured to sequentially process the phoneme sequence by using the encoder, the decoder and the residual network to obtain a target mel frequency spectrum, and perform parallel audio conversion on the target mel frequency spectrum by using the WaveGlow vocoder to obtain a target audio;
the model loss value calculation module 33 is configured to perform loss calculation on the target audio to obtain a training loss value, and adjust parameters of the speech synthesis model according to the loss value to obtain a target speech synthesis model.
In detail, each module in the training device 30 for a speech synthesis model in the embodiment of the present application adopts the same technical means as the training method for a speech synthesis model described in fig. 1 and can produce the same technical effects when in use, and will not be described here again.
As shown in fig. 4, an embodiment of the present application provides a schematic block diagram of a speech synthesis apparatus 40, where the speech synthesis apparatus 40 includes: a text acquisition module 41 and a model speech synthesis module 42.
The text obtaining module 41 is configured to obtain a voice text to be synthesized;
the model speech synthesis module 42 is configured to perform speech synthesis on the speech text to be synthesized by using a target speech synthesis model to obtain synthesized speech, where the target speech synthesis model is trained by using the foregoing training device for the speech synthesis model.
In detail, each module in the speech synthesis apparatus 40 in the embodiment of the present application adopts the same technical means as the speech synthesis method described in fig. 2 and can produce the same technical effects when in use, and will not be described again here.
As shown in fig. 5, an embodiment of the present application provides an electronic device, which includes a processor 111, a communication interface 112, a memory 113, and a communication bus 114, where the processor 111, the communication interface 112, and the memory 113 perform communication with each other through the communication bus 114.
A memory 113 for storing a computer program.
In one embodiment of the present application, the processor 111 is configured to implement the training method of the speech synthesis model provided in any one of the foregoing method embodiments, or implement the speech synthesis method provided in any one of the foregoing method embodiments, when executing the program stored in the memory 113.
The training method of the speech synthesis model comprises the following steps:
obtaining a training text, and performing phoneme conversion on the training text by using a preset speech synthesis model to obtain a phoneme sequence, wherein the speech synthesis model comprises an encoder, a decoder, a residual network and a WaveGlow vocoder;
sequentially processing the phoneme sequence by using the encoder, the decoder and the residual network to obtain a target Mel spectrum;
performing parallel audio conversion on the target Mel spectrum by using the WaveGlow vocoder to obtain target audio;
and performing loss calculation on the target audio to obtain a training loss value, and adjusting parameters of the speech synthesis model according to the loss value to obtain the target speech synthesis model.
The voice synthesis method comprises the following steps:
acquiring a voice text to be synthesized;
And performing voice synthesis on the voice text to be synthesized by using a target voice synthesis model to obtain synthesized voice, wherein the target voice synthesis model is obtained by training by adopting the training method of the voice synthesis model in any one of the method embodiments.
The communication bus 114 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 114 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface 112 is used for communication between the above-described electronic device and other devices.
The memory 113 may include Random Access Memory (RAM) or non-volatile memory, such as at least one disk memory. Optionally, the memory 113 may be at least one storage device located remotely from the processor 111.
The processor 111 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The embodiment of the present application also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method for training a speech synthesis model provided in any one of the method embodiments described above, or implements the steps of the method for speech synthesis provided in any one of the method embodiments described above.
In the above embodiments, the implementation may be wholly or partly realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center containing an integration of one or more available media. The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., DVD), or semiconductor media (e.g., Solid State Disk (SSD)), among others.
It should be noted that in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between these entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is only a specific embodiment of the invention to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (8)
1. A method of training a speech synthesis model, the method comprising:
obtaining a training text, and performing phoneme conversion on the training text by using a preset speech synthesis model to obtain a phoneme sequence, wherein the speech synthesis model comprises an encoder, a decoder, a residual network and a WaveGlow vocoder;
sequentially processing the phoneme sequence by using the encoder, the decoder and the residual network to obtain a target Mel spectrum;
performing parallel audio conversion on the target Mel spectrum by using the WaveGlow vocoder to obtain target audio;
performing loss calculation on the target audio to obtain a training loss value, and adjusting parameters of the speech synthesis model according to the loss value to obtain a target speech synthesis model;
sequentially processing the phoneme sequence by using the encoder, the decoder and the residual network to obtain a target Mel spectrum comprises: extracting context features of the phoneme sequence by using the encoder to obtain a hidden feature matrix; predicting the Mel spectrum of the training text by using the decoder according to the hidden feature matrix to obtain a predicted Mel spectrum; and performing residual connection on the predicted Mel spectrum by using the residual network to obtain the target Mel spectrum;
the decoder comprises an attention network and a post-processing network, and predicting the Mel spectrum of the training text by using the decoder according to the hidden feature matrix to obtain a predicted Mel spectrum comprises the following steps: extracting a context vector from the hidden feature matrix by using the attention network to obtain a context vector of a first current time step; concatenating the context vector of the first current time step with a preset Mel spectrum, and inputting the concatenated result into a preset two-layer long short-term memory (LSTM) stack to obtain a context vector of a second current time step; performing a first linear projection on the context vector of the second current time step by using the post-processing network to obtain a context scalar of the current time step; performing a second linear projection on the context vector of the second current time step by using the post-processing network, and performing Mel spectrum prediction on the projected context vector to obtain the Mel spectrum of the second current time step; calculating the probability that the Mel spectrum prediction is complete by using a preset first activation function according to the context scalar of the current time step; judging whether the probability that the Mel spectrum prediction is complete is smaller than a preset threshold value; when the probability is smaller than the threshold value, concatenating the context vector of the second current time step with the Mel spectrum of the second current time step and returning to the step of inputting the result into the preset two-layer LSTM stack, until the probability is no longer smaller than the threshold value, at which point the Mel spectrum prediction ends and a predicted Mel spectrum is obtained;
the attention network comprises an attention weight value, a linear layer, a second activation function and a mapping function, and extracting the context vector from the hidden feature matrix by using the preset attention network to obtain the context vector of the first current time step comprises the following steps: performing linear projection on the hidden feature matrix by using the linear layer to obtain a key matrix;
inputting the attention weight value into a preset convolution layer to generate a position feature matrix; performing linear projection on the position feature matrix by using the linear layer to obtain an additional feature matrix; adding the additional feature matrix and the key matrix, and processing the addition result by using the second activation function to obtain an attention probability vector;
mapping the attention probability vector by using the mapping function to obtain a current attention weight vector;
And multiplying the current attention weight vector with the hidden feature matrix to obtain a context vector of the first current time step.
2. The method for training a speech synthesis model according to claim 1, wherein concatenating the context vector of the first current time step with a preset Mel spectrum and inputting the concatenated result into a preset two-layer long short-term memory (LSTM) stack to obtain the context vector of the second current time step comprises: concatenating the context vector of the first current time step with the preset Mel spectrum and inputting the result into one of the LSTM layers to obtain a query vector; concatenating the query vector with the context vector of the first current time step and inputting the result into the other LSTM layer to obtain a decoder hidden state; and concatenating the decoder hidden state with the context vector of the first current time step to obtain the context vector of the second current time step.
3. The method for training a speech synthesis model according to claim 1, wherein performing residual connection on the predicted mel spectrum by using the residual network to obtain a target mel spectrum comprises:
Residual calculation is carried out on the predicted Mel frequency spectrum by using a preset residual network, so as to obtain a residual Mel frequency spectrum;
and superposing the residual Mel frequency spectrum and the predicted Mel frequency spectrum to obtain a target Mel frequency spectrum.
4. A method of speech synthesis, the method comprising:
acquiring a voice text to be synthesized;
and performing voice synthesis on the voice text to be synthesized by using a target voice synthesis model to obtain synthesized voice, wherein the target voice synthesis model is obtained by training by using the training method of the voice synthesis model according to any one of claims 1 to 3.
5. A training device for a speech synthesis model, comprising:
the training text conversion module, used for obtaining a training text and performing phoneme conversion on the training text by using a preset speech synthesis model to obtain a phoneme sequence, wherein the speech synthesis model comprises an encoder, a decoder, a residual network and a WaveGlow vocoder;
the target audio generation module, used for sequentially processing the phoneme sequence by using the encoder, the decoder and the residual network to obtain a target Mel spectrum, and performing parallel audio conversion on the target Mel spectrum by using the WaveGlow vocoder to obtain target audio;
The model loss value calculation module is used for carrying out loss calculation on the target audio to obtain a training loss value, and adjusting parameters of the speech synthesis model according to the loss value to obtain a target speech synthesis model;
sequentially processing the phoneme sequence by using the encoder, the decoder and the residual network to obtain a target Mel spectrum comprises: extracting context features of the phoneme sequence by using the encoder to obtain a hidden feature matrix; predicting the Mel spectrum of the training text by using the decoder according to the hidden feature matrix to obtain a predicted Mel spectrum; and performing residual connection on the predicted Mel spectrum by using the residual network to obtain the target Mel spectrum;
wherein the decoder comprises an attention network and a post-processing network, and predicting the mel spectrum of the training text with the decoder according to the hidden feature matrix to obtain a predicted mel spectrum comprises: extracting a context vector from the hidden feature matrix using the attention network to obtain a context vector of a first current time step; performing a concatenation operation on the context vector of the first current time step and a preset mel spectrum, and inputting the concatenation result into a preset two-layer long short-term memory layer to obtain a context vector of a second current time step; performing a first linear projection on the context vector of the second current time step using the post-processing network to obtain a context scalar of the current time step; performing a second linear projection on the context vector of the second current time step using the post-processing network, and performing mel spectrum prediction on the linearly projected context vector to obtain a mel spectrum of the second current time step; calculating a probability that the mel spectrum prediction is complete, using a preset first activation function, according to the context scalar of the current time step; judging whether the probability is smaller than a preset threshold; when the probability is smaller than the threshold, concatenating the context vector of the second current time step with the mel spectrum of the second current time step and returning to the step of inputting into the preset two-layer long short-term memory layer, until the probability is not smaller than the threshold, whereupon the mel spectrum prediction ends and the predicted mel spectrum is obtained (a control-flow sketch of this loop follows this claim);
wherein the attention network comprises an attention weight value, a linear layer, a second activation function and a mapping function, and extracting a context vector from the hidden feature matrix using the preset attention network to obtain the context vector of the first current time step comprises: performing a linear projection on the hidden feature matrix using the linear layer to obtain a key matrix; inputting the attention weight value into a preset convolution layer to generate a position feature matrix; performing a linear projection on the position feature matrix using the linear layer to obtain an additional feature matrix; adding the additional feature matrix to the key matrix, and processing the addition result with the second activation function to obtain an attention probability vector; mapping the attention probability vector with the mapping function to obtain a current attention weight vector; and multiplying the current attention weight vector by the hidden feature matrix to obtain the context vector of the first current time step.
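This attention network is a location-sensitive attention in the style of Tacotron 2: position features come from convolving the previous attention weights, and the mapping function is a softmax. A hedged PyTorch sketch; the dimensions and the final scalar projection `v` are assumptions the claims leave implicit:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    """Attention network per the claims (dims and the scalar projection
    `v` are assumptions; the claims do not fix them)."""
    def __init__(self, enc_dim=512, attn_dim=128, n_filters=32, kernel=31):
        super().__init__()
        self.key_proj = nn.Linear(enc_dim, attn_dim, bias=False)        # -> key matrix
        self.location_conv = nn.Conv1d(1, n_filters, kernel, padding=kernel // 2)
        self.location_proj = nn.Linear(n_filters, attn_dim, bias=False)  # -> additional feature matrix
        self.v = nn.Linear(attn_dim, 1, bias=False)   # collapses features to one energy per step

    def forward(self, hidden_feats, prev_weights):
        # Previous attention weights (B, T) -> conv layer -> position features (B, T, n_filters)
        loc = self.location_conv(prev_weights.unsqueeze(1)).transpose(1, 2)
        additional = self.location_proj(loc)           # additional feature matrix
        keys = self.key_proj(hidden_feats)             # key matrix
        # Add, apply tanh (the "second activation function"), project to energies
        energies = self.v(torch.tanh(keys + additional)).squeeze(-1)
        weights = F.softmax(energies, dim=-1)          # the mapping function
        # Weight vector x hidden feature matrix -> context vector
        context = torch.bmm(weights.unsqueeze(1), hidden_feats).squeeze(1)
        return context, weights
```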
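Decoding terminates on a predicted stop probability: the first activation function is conventionally a sigmoid applied to the context scalar, and prediction ends once the probability reaches the threshold. A control-flow sketch only; `decoder_step`, the two projections and the 0.5 threshold are hypothetical stand-ins for the machinery claimed above:

```python
import torch

def decode(decoder_step, scalar_proj, mel_proj, context, go_frame,
           threshold=0.5, max_steps=1000):
    """Autoregressive mel prediction loop (all module arguments are
    hypothetical callables standing in for the claimed networks)."""
    frames, prev_mel = [], go_frame
    for _ in range(max_steps):
        ctx2 = decoder_step(context, prev_mel)        # context vector of the 2nd current step
        stop_prob = torch.sigmoid(scalar_proj(ctx2))  # probability the prediction is complete
        frames.append(mel_proj(ctx2))                 # mel spectrum of this step
        if stop_prob.item() >= threshold:             # not smaller than threshold: stop
            break
        prev_mel = frames[-1]                         # feed the frame back and loop
    return torch.stack(frames, dim=1)                 # the predicted mel spectrum
```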
6. A speech synthesis apparatus, the apparatus comprising:
a text acquisition module, configured to acquire a text to be synthesized;
and a model speech synthesis module, configured to perform speech synthesis on the text to be synthesized using a target speech synthesis model to obtain synthesized speech, wherein the target speech synthesis model is trained using the training device for a speech synthesis model according to claim 5.
7. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the training method of a speech synthesis model according to any one of claims 1 to 3 or the speech synthesis method according to claim 4.
8. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the training method of a speech synthesis model according to any one of claims 1 to 3 or the speech synthesis method according to claim 4.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111460685.4A CN114038447B (en) | 2021-12-02 | | Training method of speech synthesis model, speech synthesis method, device and medium |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111460685.4A CN114038447B (en) | 2021-12-02 | | Training method of speech synthesis model, speech synthesis method, device and medium |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN114038447A (en) | 2022-02-11 |
| CN114038447B (en) | 2024-11-12 |
Citations (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113345415A (en) * | 2021-06-01 | 2021-09-03 | 平安科技(深圳)有限公司 | Speech synthesis method, apparatus, device and storage medium |
| CN113450765A (en) * | 2021-07-29 | 2021-09-28 | 平安科技(深圳)有限公司 | Speech synthesis method, apparatus, device and storage medium |
Similar Documents

| Publication | Title |
|---|---|
| US11929059B2 (en) | Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature |
| CN113470662B (en) | Generating and using text-to-speech data for keyword detection system and speaker adaptation in speech recognition system |
| CN113168828B (en) | Conversation agent pipeline based on synthetic data training |
| JP7464621B2 (en) | Speech synthesis method, device, and computer-readable storage medium |
| CN113439301A (en) | Reconciling between analog data and speech recognition output using sequence-to-sequence mapping |
| CN112435654B (en) | Data enhancement of speech data by frame insertion |
| CN111339278B (en) | Method and device for generating training speech generating model and method and device for generating answer speech |
| CN112397056B (en) | Voice evaluation method and computer storage medium |
| CN113205792A (en) | Mongolian speech synthesis method based on Transformer and WaveNet |
| WO2023245389A1 (en) | Song generation method, apparatus, electronic device, and storage medium |
| WO2019167296A1 (en) | Device, method, and program for natural language processing |
| CN114023300A (en) | Chinese speech synthesis method based on diffusion probability model |
| CN113450757A (en) | Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium |
| Suyanto et al. | End-to-end speech recognition models for a low-resourced Indonesian language |
| Dossou et al. | OkwuGbé: End-to-End Speech Recognition for Fon and Igbo |
| CN113327578A (en) | Acoustic model training method and device, terminal device and storage medium |
| CN114974218A (en) | Voice conversion model training method and device and voice conversion method and device |
| JP6577900B2 (en) | Phoneme error acquisition device, phoneme error acquisition method, and program |
| CN114038447B (en) | Training method of speech synthesis model, speech synthesis method, device and medium |
| Thalengala et al. | Study of sub-word acoustical models for Kannada isolated word recognition system |
| CN115775554A (en) | Method, device, storage medium and equipment for disambiguating polyphone |
| CN114038447A (en) | Training method of speech synthesis model, speech synthesis method, apparatus and medium |
| Raju et al. | Speech recognition to build context: A survey |
| CN115294955B (en) | Model training and speech synthesis method, device, equipment and medium |
| Akther et al. | Automated speech-to-text conversion systems in Bangla language: a systematic literature review |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |