CN113053353B - Training method and device of speech synthesis model - Google Patents
- Publication number
- CN113053353B (application CN202110259482.2A)
- Authority
- CN
- China
- Prior art keywords
- current
- speaker
- historical
- training
- decoder
- Prior art date: 2021-03-10
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Abstract
The embodiment of the invention provides a method and a device for training a speech synthesis model, comprising the following steps: training a historical speech synthesis model to obtain a multi-speaker model; acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units; and training the current speech synthesis model based on the multi-speaker model, the current speaker ID and the current input text. Because the multi-speaker model is trained in advance, the accuracy in the training process can be improved; and because the input text takes initials and finals as units, the number of distinct phonemes covered does not shrink even when the data volume of the input text is reduced, which preserves the precision of speech synthesis and solves the prior-art problem that a short input text contains too few phonemes and phoneme features, resulting in low speech synthesis accuracy.
Description
Technical Field
The invention relates to the technical field of intelligent voice, in particular to a method and a device for training a voice synthesis model.
Background
Speech synthesis is the process of going from text to speech: the text is input to an acoustic model to obtain acoustic features, which are then input to a synthesizer to obtain audio. In the prior art, a highly parallel acoustic model, FPUTS (Fully Parallel UFANS-based End-to-End Text-to-Speech System), is adopted for speech synthesis; a schematic diagram of the FPUTS acoustic model is shown in fig. 1, where the encoder, the alignment module and the decoder are all neural networks. The general process of generating audio is as follows: a speaker vector is obtained from the speaker's ID (e.g., 0-100); the speaker vector is an N-dimensional vector, and different speakers have different speaker vectors. The speaker vector and the text are input into the encoder for encoding. The speaker vector and the text are also input into the alignment module, which produces the pronunciation duration information of the audio (this module determines the pronunciation duration, speaking speed and the like of the final audio). Finally, the pronunciation duration information and the encoding from the encoder enter the decoder for decoding to obtain the final audio.
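As a concrete illustration of this pipeline (not the patented FPUTS network: the embedding/GRU/linear modules, the dimensions and the 80-bin mel output below are stand-in assumptions chosen only to make the data flow explicit), a minimal sketch in PyTorch might look like:

```python
import torch
import torch.nn as nn

class TTSPipeline(nn.Module):
    """Sketch: speaker ID -> speaker vector; (vector + text) -> encoder;
    alignment module -> per-phoneme durations; expanded encoding -> decoder."""
    def __init__(self, num_speakers=101, num_phonemes=460, dim=256, n_mels=80):
        super().__init__()
        self.speakers = nn.Embedding(num_speakers, dim)   # one N-dim vector per speaker ID
        self.phonemes = nn.Embedding(num_phonemes, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.align = nn.Linear(dim, 1)                    # stand-in alignment: duration per phoneme
        self.decoder = nn.GRU(dim, n_mels, batch_first=True)

    def forward(self, speaker_id, phoneme_ids):           # batch size 1 for clarity
        spk = self.speakers(speaker_id)                   # (1, dim)
        x = self.phonemes(phoneme_ids) + spk.unsqueeze(1) # speaker vector joins the text
        enc, _ = self.encoder(x)
        dur = self.align(enc).squeeze(-1).exp()           # positive duration (frames) per phoneme
        reps = dur.round().long().clamp(min=1)[0]         # (T,) integer repeat counts
        frames = torch.repeat_interleave(enc[0], reps, dim=0)  # expand encoding by duration
        mel, _ = self.decoder(frames.unsqueeze(0))        # acoustic features for the synthesizer
        return mel, dur
```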
The FPUTS model needs to be trained in advance; after training is completed, speech synthesis can be performed. The training process of FPUTS is shown in fig. 2 and has two steps. The first step is training the alignment module, see fig. 2 (a): here the main components of the model are the speaker vectors, the encoder, the alignment module and a decoder with a very simple structure (the simplicity of the decoder is very important for training the alignment module); training on the data yields a trained alignment module. The second step is training the speaker vectors, the encoder and the decoder, see fig. 2 (b): here the model consists mainly of the encoder, the alignment module trained in the first step, and the (complex, final) decoder. The alignment module is fixed at this point and does not participate in training.
The inventors studied the training process of the FPUTS model and found that when the input text is short, the data amount of the input text is small; if the phonemes are then divided in units of words (whole pinyin syllables), the input text contains few distinct phoneme features, resulting in low accuracy of speech synthesis during the training process.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for training a speech synthesis model, so as to solve the prior-art problem that, in the training process of the FPUTS model, when the input text is short the data amount of the input text is small, and when phonemes are divided in units of words the input text contains few phoneme features, resulting in low accuracy of speech synthesis. The specific scheme is as follows:
a method of training a speech synthesis model, comprising:
training a historical speech synthesis model to obtain a multi-speaker model;
acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units;
training a current speech synthesis model based on the multi-speaker model, the current speaker ID, and the current input text.
Optionally, in the foregoing method, the historical speech synthesis model is trained to obtain the multi-speaker model, wherein the historical speech synthesis model includes: a history encoder, a history decoder and a history alignment module, the history decoder including a first history decoder and a second history decoder; the multi-speaker model includes: a first speaker model and a second speaker model; and the training process comprises the following steps:
acquiring a historical speaker ID and a historical input text in training data;
determining a historical speaker vector based on the historical speaker ID, and training the historical alignment module through the historical speaker vector, the historical input text, the historical encoder and the first historical decoder to obtain the first speaker model, wherein the first speaker model comprises: a target historical alignment module;

training the historical speaker vector, the historical encoder and the second historical decoder by using the historical input text and the target historical alignment module to obtain the second speaker model.
The method described above, optionally, further includes:
the phonemes in the historical input text are in units of words (whole pinyin syllables).
Optionally, in the above method, the current speech synthesis model is trained based on the multi-speaker model, the current speaker ID and the current input text, wherein the current speech synthesis model includes: a current encoder, a current decoder and a current alignment module, the current decoder including a first current decoder and a second current decoder; and the training process comprises:

determining a first speaker vector for the current speaker ID as a linear combination of the speaker vectors in the first speaker model;

training the current alignment module using the first speaker vector, the current input text, the current encoder and the first current decoder to obtain a target current alignment module;

determining a second speaker vector based on the second speaker model and the current speaker ID, and training the second speaker vector, the current encoder and the second current decoder by using the current input text and the target current alignment module.
The above method, optionally, further includes:
acquiring a first training result of a first historical decoder in the first speaker model and a second training result of a second historical decoder in the second speaker model;
assigning the first training result to the first current decoder as an initial value;
and assigning the second training result to the second current decoder as an initial value.
An apparatus for training a speech synthesis model, comprising:
the first training module is used for training the historical speech synthesis model to obtain a multi-speaker model;
the acquisition module is used for acquiring the current speaker ID and the current input text in the current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units;
a second training module to train a current speech synthesis model based on the multi-speaker model, the current speaker ID, and the current input text.
The above apparatus, optionally, the historical speech synthesis model includes: a history encoder, a history decoder and a history alignment module, the history decoder including a first history decoder and a second history decoder; the multi-speaker model includes: a first speaker model and a second speaker model, and the first training module comprises:
the device comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring a historical speaker ID and a historical input text in training data;
a first training unit, configured to determine a historical speaker vector based on the historical speaker ID, based on the historical speaker vector and the historical input text, train the historical alignment module through the historical encoder and the historical first decoder, and obtain a first speaker model, where the first speaker model includes: a historical target alignment module;
and the second training unit is used for training the history input text, the history encoder, the second history decoder and the target history alignment module based on the history speaker vector to obtain a second speaker model.
The above apparatus, optionally, further comprises:
the phonemes in the historical input text are in units of words (whole pinyin syllables).
The above apparatus, optionally, the current speech synthesis model includes: a current encoder, a current decoder and a current alignment module, the current decoder including a first current decoder and a second current decoder, and the second training module comprises:

a determining unit, configured to determine a first speaker vector for the current speaker ID as a linear combination of the speaker vectors in the first speaker model;

a third training unit, configured to train the current alignment module using the first speaker vector, the current input text, the current encoder and the first current decoder to obtain a target current alignment module;

a fourth training unit, configured to determine a second speaker vector based on the second speaker model and the current speaker ID, and train the second speaker vector, the current encoder and the second current decoder by using the current input text and the target current alignment module.
The above apparatus, optionally, further comprises:
a second obtaining unit, configured to obtain a first training result of a first historical decoder in the first speaker model and a second training result of a second historical decoder in the second speaker model;
a first assigning unit configured to assign the first training result to the first current decoder as an initial value;
and the second assignment unit is used for assigning the second training result to the second current decoder as an initial value.
Compared with the prior art, the invention has the following advantages:
the embodiment of the invention provides a method and a device for training a speech synthesis model, which comprises the following steps: training a historical speech synthesis model to obtain a multi-speaker model; acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of a historical input text of the multi-speaker model in a training process, and phonemes of the current input text take vowels as a unit; training a current speech synthesis model based on the multi-speaker model, the current speaker ID, and the current input text. In the training process, on the premise that the number of phonemes of the input text is less than that of phonemes of a historical input text of the multi-speaker model in the training process, the current speech synthesis model is trained based on the multi-speaker model, the current speaker ID and the current input text, and the accuracy in the training process can be improved due to the fact that the multi-speaker model is trained in advance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of an FPUTS model according to the prior art;
FIG. 2 is a diagram illustrating an FPUTS model training process according to the prior art;
FIG. 3 is a flowchart of a method for training a speech synthesis model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an FPUTS model training process disclosed in an embodiment of the present application;
fig. 5 is a block diagram of a structure of a training apparatus for a speech synthesis model according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The invention discloses a method and a device for training a speech synthesis model, applied to the training process of an FPUTS-based speech synthesis model. To solve the problem described above, the present invention provides a training method whose execution flow is shown in fig. 3 and which comprises the following steps:
s101, training a historical speech synthesis model to obtain a multi-speaker model;
in the embodiment of the invention, the speech synthesis is a process of synthesizing text into speech, and the historical speech synthesis model (FPTUS model) comprises a historical coder, a historical decoder and a historical alignment module, wherein the decoder comprises a first historical decoder and a second historical decoder; the multi-speaker model includes: a first speaker model and a second speaker model, where a process of training the historical speech synthesis model is the same as the training process shown in fig. 2, a training alignment module obtains a historical speaker ID and a historical input text in training data, where the training data is given in advance, the training data includes the historical speaker ID and the historical input text corresponding to the ID, the historical speaker ID is pre-assigned based on experience or specific conditions, a historical speaker vector is determined based on the historical speaker ID, the historical input text, the historical encoder and the historical first decoder train the historical alignment module to obtain the first speaker model, where the first speaker model includes: a historical target alignment module; and training a speaker vector, an encoder and a decoder, wherein based on the historical speaker vector, the historical input text, the historical encoder, the second historical decoder and the target historical alignment module are trained to obtain a second speaker model.
S102, acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units;
In the embodiment of the present invention, the current training data is given in advance and includes a current speaker ID and the current input text corresponding to that ID; the current speaker ID and the current input text are obtained from the current training data, where the current speaker ID may be set based on experience or specific conditions, and the data size of the current input text is less than the data size of the historical input text of the multi-speaker model during its training. The current input text is a series of phonemes; for example, the current input text is "yi xi lie yin su". In the usage scenario of the historical speech synthesis model, a whole pinyin syllable is taken as a unit, so in this example 'yi', 'xi', 'lie', 'yin' and 'su' are five different phoneme units, and this phoneme system has 460 different phonemes. For a high amount of data such a phoneme system can naturally be used, but a low amount of data may not be able to cover the whole inventory. Therefore, in the embodiment of the present invention, the phoneme system is modified to take the initials and finals as units; the example becomes 'y i x i l ie y in s u'. With this phoneme system there are only 80 different phonemes, so even a low amount of data can still cover them completely, although the accuracy requirement for model training becomes higher.
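A simple way to picture the modified phoneme system is a splitter that breaks each whole pinyin syllable into its initial and final; the sketch below uses the standard pinyin initial list (an assumption, not data from the patent) and reproduces the example above:

```python
# Mandarin pinyin initials, multi-letter ones first so "zh" matches before "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_syllable(syllable: str) -> list[str]:
    """Split one pinyin syllable into [initial, final]; zero-initial
    syllables (e.g. "a") stay whole."""
    for ini in INITIALS:
        if syllable.startswith(ini) and len(syllable) > len(ini):
            return [ini, syllable[len(ini):]]   # e.g. "xi" -> ["x", "i"]
    return [syllable]

text = "yi xi lie yin su"
units = [u for syl in text.split() for u in split_syllable(syl)]
print(units)  # ['y', 'i', 'x', 'i', 'l', 'ie', 'y', 'in', 's', 'u']
```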
S103, training a current voice synthesis model based on the multi-speaker model, the current speaker ID and the current input text.
In the embodiment of the present invention, as shown in fig. 4, the multi-speaker model includes a first speaker model and a second speaker model. The first speaker model consists of the speaker vectors, the encoder, the alignment module and the (simple) decoder obtained in the first training step shown in fig. 2, labeled multi-speaker vector mul,1, encoder mul,1, alignment module mul,1 and (simple) decoder mul,1. The second speaker model consists of the speaker vectors, the encoder, the alignment module and the (complex) decoder obtained in the second training step, labeled multi-speaker vector mul,2, encoder mul,2, alignment module mul,2 and (complex) decoder mul,2. Note that alignment module mul,1 and alignment module mul,2 are identical.
The speaker vector determines the characteristics of the synthesized audio, such as pronunciation duration, speech rate and pitch: with the same encoder, alignment module and decoder, different speaker vectors synthesize different voices. In the multi-speaker model, assuming that data from a total of N speakers participated in training, there are N different speaker vectors h_i, i = 1, ..., N.
Thus, the speaker vector of the current speaker needs to be determined first. To make full use of the multi-speaker model, the speaker vector of the current speaker is defined as

l = Σ_i p_i × h_i    (1)

where the p_i are trainable variables, so that l is a linear combination of the multi-speaker vectors. This linear combination is labeled in fig. 4.
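Expressed as code, equation (1) can be a small module whose only trainable parameters are the combination weights p_i; the uniform initialisation below is an assumption:

```python
import torch
import torch.nn as nn

class CombinedSpeakerVector(nn.Module):
    """Equation (1) as a module: l = sum_i p_i * h_i, with the N multi-speaker
    vectors h_i fixed and only the weights p_i trainable."""
    def __init__(self, multi_speaker_vectors: torch.Tensor):  # shape (N, dim)
        super().__init__()
        n = multi_speaker_vectors.size(0)
        self.register_buffer("h", multi_speaker_vectors.detach())  # fixed h_i
        self.p = nn.Parameter(torch.full((n,), 1.0 / n))            # trainable p_i

    def forward(self) -> torch.Tensor:
        return self.p @ self.h  # (N,) @ (N, dim) -> (dim,), i.e. l
```

During fine-tuning, gradients flow only into p, so the new speaker is represented entirely within the span of the pretrained multi-speaker vectors.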
As shown in fig. 4 (a), the speaker vector is a linear combination of the speaker vectors in the first training step of the multi-speaker model shown in fig. 2.
The encoder (current encoder) part abstracts the current input text; this module does not receive speaker vector information during training, so it is taken directly from the multi-speaker model and kept fixed when training on a small amount of data.
The alignment module (current alignment module) and the (simple) decoder (first current decoder) use the corresponding parts from the first training step of the multi-speaker model shown in fig. 2 as initial values, but are still trained. This speeds up convergence and improves final accuracy.
As shown in fig. 4 (b), the speaker vector in the second training step is a linear combination of the speaker vectors from the second training step of the multi-speaker model shown in fig. 2.
As before, the encoder (current encoder) part abstracts the input text; this module does not receive speaker vector information during training, so it is taken directly from the multi-speaker model and kept fixed when training on a small amount of data.
The alignment module is the one trained in fig. 4 (a) and is kept fixed.
The (complex) decoder (second current decoder) uses the corresponding part from the second training step of the multi-speaker model shown in fig. 2 as an initial value, but is still trained.
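Putting the transfer together, the sketch below shows how the current model could be initialised from the multi-speaker model for each of the two fine-tuning stages; the module names follow the earlier sketches and are assumptions, not the patent's identifiers:

```python
def init_from_multi_speaker(current, multi, stage: int):
    """Initialise the current (small-data) model from the multi-speaker model."""
    # Encoder: taken directly from the multi-speaker model and kept fixed.
    current.encoder.load_state_dict(multi.encoder.state_dict())
    current.encoder.requires_grad_(False)
    if stage == 1:
        # Alignment module and simple decoder start from the multi-speaker
        # weights as initial values but remain trainable (faster convergence).
        current.align.load_state_dict(multi.align.state_dict())
        current.simple_decoder.load_state_dict(multi.simple_decoder.state_dict())
    else:
        # Stage 2: the alignment module trained in stage 1 is frozen; the
        # complex decoder starts from the multi-speaker weights and is trained.
        current.align.requires_grad_(False)
        current.complex_decoder.load_state_dict(multi.complex_decoder.state_dict())
```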
The embodiment of the invention provides a method for training a speech synthesis model, comprising the following steps: training a historical speech synthesis model to obtain a multi-speaker model; acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units; and training the current speech synthesis model based on the multi-speaker model, the current speaker ID and the current input text. In the training process, even though the current input text covers fewer phonemes than the historical input text of the multi-speaker model, the current speech synthesis model is trained based on the multi-speaker model, the current speaker ID and the current input text, and because the multi-speaker model is trained in advance, the accuracy in the training process can be improved.
In the embodiment of the invention, the existing mature FPUTS-based synthesis algorithm is combined with a dedicated transfer-learning algorithm, so that the data cost can be reduced to as little as one fifth while the synthesis quality remains essentially unaffected.
Based on the foregoing speech synthesis model training method, in an embodiment of the present invention, a speech synthesis model training apparatus is provided, a structural block diagram of the training apparatus is shown in fig. 5, and the training apparatus includes:
a first training module 201, an acquisition module 202 and a second training module 203.
Wherein,
the first training module 201 is configured to train a historical speech synthesis model to obtain a multiple speaker model;
the obtaining module 202 is configured to obtain a current speaker ID and a current input text in current training data, where the data amount of the current input text is less than the data amount of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units;
the second training module 203 is configured to train a current speech synthesis model based on the multi-speaker model, the current speaker ID, and the current input text.
The invention provides a training device for a speech synthesis model, which is used for: training a historical speech synthesis model to obtain a multi-speaker model; acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units; and training the current speech synthesis model based on the multi-speaker model, the current speaker ID and the current input text. In the training process, even though the current input text covers fewer phonemes than the historical input text of the multi-speaker model, the current speech synthesis model is trained based on the multi-speaker model, the current speaker ID and the current input text, and because the multi-speaker model is trained in advance, the accuracy in the training process can be improved.
In an embodiment of the present invention, the historical speech synthesis model includes: a history encoder, a history decoder, and a history alignment module, the history decoder including a first history decoder and a second history decoder; the multi-speaker model includes: a first speaker model and a second speaker model, and the first training module 201 comprises:
a first acquisition unit 204, a first training unit 205 and a second training unit 206.
Wherein,
the first obtaining unit 204 is configured to obtain a historical speaker ID and a historical input text in training data;
the first training unit 205 is configured to determine a historical speaker vector based on the historical speaker ID, and train the historical alignment module based on the historical speaker vector, the historical input text, the historical encoder, and the first historical decoder to obtain a first speaker model, where the first speaker model includes: a target historical alignment module;
the second training unit 206 is configured to train the historical speaker vector, the history encoder and the second history decoder by using the history input text and the target historical alignment module to obtain a second speaker model.
In this embodiment of the present invention, the first training module 201 further includes:
the phonemes in the historical input text are in units of words (whole pinyin syllables).
In an embodiment of the present invention, the current speech synthesis model includes: a current encoder, a current decoder, and a current alignment module, the decoders including a first current decoder and a second current decoder, the second training module 203 including:
a determination unit 207, a third training unit 208 and a fourth training unit 209.
Wherein,
the determining unit 207 is configured to determine a first speaker vector for the current speaker ID as a linear combination of the speaker vectors in the first speaker model;
the third training unit 208 is configured to train the current alignment module using the first speaker vector, the current input text, the current encoder, and the first current decoder to obtain a target current alignment module;
the fourth training unit 209 is configured to determine a second speaker vector based on the second speaker model and the current speaker ID, and train the second speaker vector, the current encoder and the second current decoder by using the current input text and the target current alignment module.
In this embodiment of the present invention, the second training module 203 further includes:
a second retrieving unit 210, a first assigning unit 211 and a second assigning unit 212.
Wherein,
the second obtaining unit 210 is configured to obtain a first training result of a first historical decoder in the first speaker model and a second training result of a second historical decoder in the second speaker model;
the first assigning unit 211 is configured to assign the first training result to the first current decoder as an initial value;
the second assigning unit 212 is configured to assign the second training result to the second current decoder as an initial value.
It should be noted that, in this specification, each embodiment is described in a progressive manner, and each embodiment focuses on differences from other embodiments, and portions that are the same as and similar to each other in each embodiment may be referred to. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, in this document, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the units may be implemented in the same software and/or hardware or in a plurality of software and/or hardware when implementing the invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The above detailed description is provided for the training method and apparatus of a speech synthesis model provided by the present invention, and the present document applies specific examples to explain the principle and the implementation of the present invention, and the description of the above embodiments is only used to help understand the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
Claims (8)
1. A method for training a speech synthesis model, comprising:
training a historical speech synthesis model to obtain a multi-speaker model;
acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in a training process, and the phonemes of the current input text take initials and finals as units;
training a current speech synthesis model based on the multi-speaker model, the current speaker ID and the current input text;
the historical speech synthesis model includes: a history encoder, a history decoder and a history alignment module, the history decoder including a first history decoder and a second history decoder; the multi-speaker model includes: a first speaker model and a second speaker model; and the training process comprises the following steps:
acquiring a historical speaker ID and a historical input text in training data;
determining a historical speaker vector based on the historical speaker ID, and training the historical alignment module through the historical speaker vector, the historical input text, the historical encoder and the first historical decoder to obtain the first speaker model, wherein the first speaker model comprises: a target historical alignment module;
training the historical speaker vector, the historical encoder and the second historical decoder by using the historical input text and the target historical alignment module to obtain the second speaker model.
2. The method of claim 1, further comprising:
The phonemes in the historical input text are in units of words (whole pinyin syllables).
3. The method of claim 1, wherein the current speech synthesis model is trained based on the multi-speaker model, the current speaker ID and the current input text, the current speech synthesis model comprising: a current encoder, a current decoder and a current alignment module, the current decoder including a first current decoder and a second current decoder, and the training process comprising:
determining a first speaker vector for the current speaker ID as a linear combination of the speaker vectors in the first speaker model;
training the current alignment module using the first speaker vector, the current input text, the current encoder and the first current decoder to obtain a target current alignment module;
determining a second speaker vector based on the second speaker model and the current speaker ID, and training the second speaker vector, the current encoder and the second current decoder by using the current input text and the target current alignment module.
4. The method of claim 3, further comprising:
acquiring a first training result of a first historical decoder in the first speaker model and a second training result of a second historical decoder in the second speaker model;
assigning the first training result to the first current decoder as an initial value;
and assigning the second training result to the second current decoder as an initial value.
5. An apparatus for training a speech synthesis model, comprising:
the first training module is used for training the historical speech synthesis model to obtain a multi-speaker model;
the acquisition module is used for acquiring the current speaker ID and the current input text in the current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units;
a second training module for training a current speech synthesis model based on the multi-speaker model, the current speaker ID and the current input text;
the historical speech synthesis model includes: a history encoder, a history decoder and a history alignment module, the history decoder including a first history decoder and a second history decoder; the multi-speaker model includes: a first speaker model and a second speaker model, and the first training module comprises:
a first acquisition unit, configured to acquire a historical speaker ID and a historical input text in the training data;
a first training unit, configured to determine a historical speaker vector based on the historical speaker ID, and train the historical alignment module through the historical speaker vector, the historical input text, the historical encoder and the first historical decoder to obtain the first speaker model, wherein the first speaker model includes: a target historical alignment module;
a second training unit, configured to train the historical speaker vector, the history encoder and the second history decoder by using the history input text and the target historical alignment module to obtain the second speaker model.
6. The apparatus of claim 5, further comprising:
The phonemes in the historical input text are in units of words (whole pinyin syllables).
7. The apparatus of claim 5, wherein the current speech synthesis model comprises: a current encoder, a current decoder and a current alignment module, the current decoder including a first current decoder and a second current decoder, and the second training module comprises:
a determining unit, configured to determine a first speaker vector for the current speaker ID as a linear combination of the speaker vectors in the first speaker model;
a third training unit, configured to train the current alignment module using the first speaker vector, the current input text, the current encoder and the first current decoder to obtain a target current alignment module;
a fourth training unit, configured to determine a second speaker vector based on the second speaker model and the current speaker ID, and train the second speaker vector, the current encoder and the second current decoder by using the current input text and the target current alignment module.
8. The apparatus of claim 7, further comprising:
a second obtaining unit, configured to obtain a first training result of a first historical decoder in the first speaker model and a second training result of a second historical decoder in the second speaker model;
a first assigning unit configured to assign the first training result to the first current decoder as an initial value;
a second assigning unit, configured to assign the second training result to the second current decoder as an initial value.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110259482.2A | 2021-03-10 | 2021-03-10 | Training method and device of speech synthesis model |

Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN113053353A | 2021-06-29 |
| CN113053353B | 2022-10-04 |

Family ID: 76511007
Families Citing this family (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR102598057B1 * | 2018-09-10 | 2023-11-06 | 삼성전자주식회사 | Apparatus and Methof for controlling the apparatus therof |
| CN113781996B * | 2021-08-20 | 2023-06-27 | 北京淇瑀信息科技有限公司 | Voice synthesis model training method and device and electronic equipment |
Family Cites Families (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CA2388352A1 * | 2002-05-31 | 2003-11-30 | Voiceage Corporation | A method and device for frequency-selective pitch enhancement of synthesized speech |
| US20190019500A1 * | 2017-07-13 | 2019-01-17 | Electronics And Telecommunications Research Institute | Apparatus for deep learning based text-to-speech synthesizing by using multi-speaker data and method for the same |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101116135A (en) * | 2005-02-10 | 2008-01-30 | 皇家飞利浦电子股份有限公司 | Sound synthesis |
EP3739572A1 (en) * | 2018-01-11 | 2020-11-18 | Neosapience, Inc. | Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium |
WO2019175574A1 (en) * | 2018-03-14 | 2019-09-19 | Papercup Technologies Limited | A speech processing system and a method of processing a speech signal |
CN109036375A (en) * | 2018-07-25 | 2018-12-18 | 腾讯科技(深圳)有限公司 | Phoneme synthesizing method, model training method, device and computer equipment |
CN111048064A (en) * | 2020-03-13 | 2020-04-21 | 同盾控股有限公司 | Voice cloning method and device based on single speaker voice synthesis data set |
CN111489734A (en) * | 2020-04-03 | 2020-08-04 | 支付宝(杭州)信息技术有限公司 | Model training method and device based on multiple speakers |
CN111681639A (en) * | 2020-05-28 | 2020-09-18 | 上海墨百意信息科技有限公司 | Multi-speaker voice synthesis method and device and computing equipment |
CN111724765A (en) * | 2020-06-30 | 2020-09-29 | 上海优扬新媒信息技术有限公司 | Method and device for converting text into voice and computer equipment |
CN112133282A (en) * | 2020-10-26 | 2020-12-25 | 厦门大学 | Lightweight multi-speaker speech synthesis system and electronic equipment |
CN112435650A (en) * | 2020-11-11 | 2021-03-02 | 四川长虹电器股份有限公司 | Multi-speaker and multi-language voice synthesis method and system |
CN112466276A (en) * | 2020-11-27 | 2021-03-09 | 出门问问(苏州)信息科技有限公司 | Speech synthesis system training method and device and readable storage medium |
Non-Patent Citations (3)

| Title |
|---|
| Junmo Lee, et al.; "DNN based multi-speaker speech synthesis with temporal auxiliary speaker ID embedding"; 2019 International Conference on Electronics, Information, and Communication (ICEIC); IEEE; 2019-05-06. * |
| Xie Yongbin (谢永斌); "Research on end-to-end speech synthesis technology based on a small data set"; China Masters' Theses Full-text Database (Information Science and Technology), No. 2; China Academic Journal Electronic Publishing House; 2021-02-15. * |
| Zhang Peng (张鹏); "Research and implementation of an embedded speech synthesis system"; China Masters' Theses Full-text Database (Information Science and Technology), No. 8; China Academic Journal Electronic Publishing House; 2006-08-15. * |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| 2022-09-16 | TA01 | Transfer of patent application right | Applicant after: Du Xiaoman Technology (Beijing) Co.,Ltd., Room 606, 6/F, Building 4, West District, Courtyard 10, Northwest Wangdong Road, Haidian District, Beijing 100193. Applicant before: Chongqing duxiaoman Youyang Technology Co.,Ltd., B7-7-2, Yuxing Plaza, No. 5 Huangyang Road, Yubei District, Chongqing 401120. |
| | GR01 | Patent grant | |