CN113053353B - Training method and device of speech synthesis model - Google Patents
- Publication number
- CN113053353B (application CN202110259482.2A)
- Authority
- CN
- China
- Prior art keywords
- current
- speaker
- historical
- training
- decoder
- Prior art date: 2021-03-10
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Abstract
The embodiment of the invention provides a method and a device for training a speech synthesis model, comprising the following steps: training a historical speech synthesis model to obtain a multi-speaker model; acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units; and training the current speech synthesis model based on the multi-speaker model, the current speaker ID and the current input text. Because the multi-speaker model is trained in advance, the accuracy in the training process can be improved; and because the input text takes initials and finals as units, the number of distinct phonemes covered does not shrink even when the data volume of the input text is reduced, which preserves the precision of speech synthesis and solves the prior-art problem that a short input text contains too few phonemes and phoneme features, resulting in low speech synthesis accuracy.
Description
Technical Field
The invention relates to the technical field of intelligent voice, in particular to a method and a device for training a voice synthesis model.
Background
Speech synthesis is the process of going from text to speech: the text is input to an acoustic model to obtain acoustic features, which are then input to a synthesizer to obtain audio. In the prior art, a highly parallel acoustic model, FPUTS (Fully Parallel UFANS-based End-to-End Text-to-Speech System), is adopted for speech synthesis; a schematic diagram of the FPUTS acoustic model is shown in fig. 1, where the encoder, the alignment module and the decoder are all neural networks. The general process of generating audio is as follows: a speaker vector is obtained from the speaker's ID (e.g., 0-100); the speaker vector is an N-dimensional vector, and different speakers have different speaker vectors. The speaker vector and the text are input into the encoder for encoding. The speaker vector and the text are also input into the alignment module, which produces the pronunciation duration information of the audio (this module determines the pronunciation duration, speaking speed and the like of the final audio). Finally, the pronunciation duration information and the encoding from the encoder enter the decoder for decoding to obtain the final audio.
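As a concrete illustration of this pipeline (not the patented FPUTS network: the embedding/GRU/linear modules, the dimensions and the 80-bin mel output below are stand-in assumptions chosen only to make the data flow explicit), a minimal sketch in PyTorch might look like:

```python
import torch
import torch.nn as nn

class TTSPipeline(nn.Module):
    """Sketch: speaker ID -> speaker vector; (vector + text) -> encoder;
    alignment module -> per-phoneme durations; expanded encoding -> decoder."""
    def __init__(self, num_speakers=101, num_phonemes=460, dim=256, n_mels=80):
        super().__init__()
        self.speakers = nn.Embedding(num_speakers, dim)   # one N-dim vector per speaker ID
        self.phonemes = nn.Embedding(num_phonemes, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.align = nn.Linear(dim, 1)                    # stand-in alignment: duration per phoneme
        self.decoder = nn.GRU(dim, n_mels, batch_first=True)

    def forward(self, speaker_id, phoneme_ids):           # batch size 1 for clarity
        spk = self.speakers(speaker_id)                   # (1, dim)
        x = self.phonemes(phoneme_ids) + spk.unsqueeze(1) # speaker vector joins the text
        enc, _ = self.encoder(x)
        dur = self.align(enc).squeeze(-1).exp()           # positive duration (frames) per phoneme
        reps = dur.round().long().clamp(min=1)[0]         # (T,) integer repeat counts
        frames = torch.repeat_interleave(enc[0], reps, dim=0)  # expand encoding by duration
        mel, _ = self.decoder(frames.unsqueeze(0))        # acoustic features for the synthesizer
        return mel, dur
```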
The FPUTS model needs to be trained in advance; after training is completed, speech synthesis can be performed. The training process of FPUTS is shown in fig. 2 and has two steps. The first step is training the alignment module, see fig. 2 (a): here the main components of the model are the speaker vectors, the encoder, the alignment module and a decoder with a very simple structure (the simplicity of the decoder is very important for training the alignment module); training on the data yields a trained alignment module. The second step is training the speaker vectors, the encoder and the decoder, see fig. 2 (b): here the model consists mainly of the encoder, the alignment module trained in the first step, and the (complex, final) decoder. The alignment module is fixed at this point and does not participate in training.
The inventors studied the training process of the FPUTS model and found that when the input text is short, the data amount of the input text is small; if the phonemes are then divided in units of words (whole pinyin syllables), the input text contains few distinct phoneme features, resulting in low accuracy of speech synthesis during the training process.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for training a speech synthesis model, so as to solve the prior-art problem that, in the training process of the FPUTS model, when the input text is short the data amount of the input text is small, and when phonemes are divided in units of words the input text contains few phoneme features, resulting in low accuracy of speech synthesis. The specific scheme is as follows:
a method of training a speech synthesis model, comprising:
training a historical speech synthesis model to obtain a multi-speaker model;
acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units;
training a current speech synthesis model based on the multi-speaker model, the current speaker ID, and the current input text.
Optionally, in the foregoing method, the historical speech synthesis model is trained to obtain the multi-speaker model, wherein the historical speech synthesis model includes: a history encoder, a history decoder and a history alignment module, the history decoder including a first history decoder and a second history decoder; the multi-speaker model includes: a first speaker model and a second speaker model; and the training process comprises the following steps:
acquiring a historical speaker ID and a historical input text in training data;
determining a historical speaker vector based on the historical speaker ID, and training the historical alignment module through the historical speaker vector, the historical input text, the historical encoder and the first historical decoder to obtain the first speaker model, wherein the first speaker model comprises: a target historical alignment module;

training the historical speaker vector, the historical encoder and the second historical decoder by using the historical input text and the target historical alignment module to obtain the second speaker model.
The method described above, optionally, further includes:
the phonemes in the historical input text are in units of words (whole pinyin syllables).
Optionally, in the above method, the current speech synthesis model is trained based on the multi-speaker model, the current speaker ID and the current input text, wherein the current speech synthesis model includes: a current encoder, a current decoder and a current alignment module, the current decoder including a first current decoder and a second current decoder; and the training process comprises:

determining a first speaker vector for the current speaker ID as a linear combination of the speaker vectors in the first speaker model;

training the current alignment module using the first speaker vector, the current input text, the current encoder and the first current decoder to obtain a target current alignment module;

determining a second speaker vector based on the second speaker model and the current speaker ID, and training the second speaker vector, the current encoder and the second current decoder by using the current input text and the target current alignment module.
The above method, optionally, further includes:
acquiring a first training result of a first historical decoder in the first speaker model and a second training result of a second historical decoder in the second speaker model;
assigning the first training result to the first current decoder as an initial value;
and assigning the second training result to the second current decoder as an initial value.
An apparatus for training a speech synthesis model, comprising:
the first training module is used for training the historical speech synthesis model to obtain a multi-speaker model;
the acquisition module is used for acquiring the current speaker ID and the current input text in the current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units;
a second training module to train a current speech synthesis model based on the multi-speaker model, the current speaker ID, and the current input text.
The above apparatus, optionally, the historical speech synthesis model includes: a history encoder, a history decoder and a history alignment module, the history decoder including a first history decoder and a second history decoder; the multi-speaker model includes: a first speaker model and a second speaker model, and the first training module comprises:
the device comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring a historical speaker ID and a historical input text in training data;
a first training unit, configured to determine a historical speaker vector based on the historical speaker ID, based on the historical speaker vector and the historical input text, train the historical alignment module through the historical encoder and the historical first decoder, and obtain a first speaker model, where the first speaker model includes: a historical target alignment module;
and the second training unit is used for training the history input text, the history encoder, the second history decoder and the target history alignment module based on the history speaker vector to obtain a second speaker model.
The above apparatus, optionally, further comprises:
the phonemes in the historical input text are in units of words (whole pinyin syllables).
The above apparatus, optionally, the current speech synthesis model includes: a current encoder, a current decoder and a current alignment module, the current decoder including a first current decoder and a second current decoder, and the second training module comprises:

a determining unit, configured to determine a first speaker vector for the current speaker ID as a linear combination of the speaker vectors in the first speaker model;

a third training unit, configured to train the current alignment module using the first speaker vector, the current input text, the current encoder and the first current decoder to obtain a target current alignment module;

a fourth training unit, configured to determine a second speaker vector based on the second speaker model and the current speaker ID, and train the second speaker vector, the current encoder and the second current decoder by using the current input text and the target current alignment module.
The above apparatus, optionally, further comprises:
a second obtaining unit, configured to obtain a first training result of a first historical decoder in the first speaker model and a second training result of a second historical decoder in the second speaker model;
a first assigning unit configured to assign the first training result to the first current decoder as an initial value;
and the second assignment unit is used for assigning the second training result to the second current decoder as an initial value.
Compared with the prior art, the invention has the following advantages:
the embodiment of the invention provides a method and a device for training a speech synthesis model, which comprises the following steps: training a historical speech synthesis model to obtain a multi-speaker model; acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of a historical input text of the multi-speaker model in a training process, and phonemes of the current input text take vowels as a unit; training a current speech synthesis model based on the multi-speaker model, the current speaker ID, and the current input text. In the training process, on the premise that the number of phonemes of the input text is less than that of phonemes of a historical input text of the multi-speaker model in the training process, the current speech synthesis model is trained based on the multi-speaker model, the current speaker ID and the current input text, and the accuracy in the training process can be improved due to the fact that the multi-speaker model is trained in advance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of an FPUTS model according to the prior art;
FIG. 2 is a diagram illustrating an FPUTS model training process according to the prior art;
FIG. 3 is a flowchart of a method for training a speech synthesis model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an FPUTS model training process disclosed in an embodiment of the present application;
fig. 5 is a block diagram of a structure of a training apparatus for a speech synthesis model according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The invention discloses a method and a device for training a speech synthesis model, applied to the training process of an FPUTS-based speech synthesis model. To solve the problem described above, the present invention provides a training method whose execution flow is shown in fig. 3 and which comprises the following steps:
s101, training a historical speech synthesis model to obtain a multi-speaker model;
in the embodiment of the invention, the speech synthesis is a process of synthesizing text into speech, and the historical speech synthesis model (FPTUS model) comprises a historical coder, a historical decoder and a historical alignment module, wherein the decoder comprises a first historical decoder and a second historical decoder; the multi-speaker model includes: a first speaker model and a second speaker model, where a process of training the historical speech synthesis model is the same as the training process shown in fig. 2, a training alignment module obtains a historical speaker ID and a historical input text in training data, where the training data is given in advance, the training data includes the historical speaker ID and the historical input text corresponding to the ID, the historical speaker ID is pre-assigned based on experience or specific conditions, a historical speaker vector is determined based on the historical speaker ID, the historical input text, the historical encoder and the historical first decoder train the historical alignment module to obtain the first speaker model, where the first speaker model includes: a historical target alignment module; and training a speaker vector, an encoder and a decoder, wherein based on the historical speaker vector, the historical input text, the historical encoder, the second historical decoder and the target historical alignment module are trained to obtain a second speaker model.
S102, acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units;
In the embodiment of the present invention, the current training data is given in advance and includes a current speaker ID and the current input text corresponding to that ID; the current speaker ID and the current input text are obtained from the current training data, where the current speaker ID may be set based on experience or specific conditions, and the data size of the current input text is less than the data size of the historical input text of the multi-speaker model during its training. The current input text is a series of phonemes; for example, the current input text is "yi xi lie yin su". In the usage scenario of the historical speech synthesis model, a whole pinyin syllable is taken as a unit, so in this example 'yi', 'xi', 'lie', 'yin' and 'su' are five different phoneme units, and this phoneme system has 460 different phonemes. For a high amount of data such a phoneme system can naturally be used, but a low amount of data may not be able to cover the whole inventory. Therefore, in the embodiment of the present invention, the phoneme system is modified to take the initials and finals as units; the example becomes 'y i x i l ie y in s u'. With this phoneme system there are only 80 different phonemes, so even a low amount of data can still cover them completely, although the accuracy requirement for model training becomes higher.
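A simple way to picture the modified phoneme system is a splitter that breaks each whole pinyin syllable into its initial and final; the sketch below uses the standard pinyin initial list (an assumption, not data from the patent) and reproduces the example above:

```python
# Mandarin pinyin initials, multi-letter ones first so "zh" matches before "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def split_syllable(syllable: str) -> list[str]:
    """Split one pinyin syllable into [initial, final]; zero-initial
    syllables (e.g. "a") stay whole."""
    for ini in INITIALS:
        if syllable.startswith(ini) and len(syllable) > len(ini):
            return [ini, syllable[len(ini):]]   # e.g. "xi" -> ["x", "i"]
    return [syllable]

text = "yi xi lie yin su"
units = [u for syl in text.split() for u in split_syllable(syl)]
print(units)  # ['y', 'i', 'x', 'i', 'l', 'ie', 'y', 'in', 's', 'u']
```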
S103, training a current voice synthesis model based on the multi-speaker model, the current speaker ID and the current input text.
In the embodiment of the present invention, as shown in fig. 4, the multi-speaker model includes a first speaker model and a second speaker model. The first speaker model consists of the speaker vectors, the encoder, the alignment module and the (simple) decoder obtained in the first training step shown in fig. 2, labeled multi-speaker vector mul,1, encoder mul,1, alignment module mul,1 and (simple) decoder mul,1. The second speaker model consists of the speaker vectors, the encoder, the alignment module and the (complex) decoder obtained in the second training step, labeled multi-speaker vector mul,2, encoder mul,2, alignment module mul,2 and (complex) decoder mul,2. Note that alignment module mul,1 and alignment module mul,2 are identical.
The speaker vector determines the characteristics of the synthesized audio, such as pronunciation duration, speech rate and pitch: with the same encoder, alignment module and decoder, different speaker vectors synthesize different voices. In the multi-speaker model, assuming that data from a total of N speakers participated in training, there are N different speaker vectors h_i, i = 1, ..., N.
Thus, the speaker vector of the current speaker needs to be determined first. To make full use of the multi-speaker model, the speaker vector of the current speaker is defined as

l = Σ_i p_i × h_i    (1)

where the p_i are trainable variables, so that l is a linear combination of the multi-speaker vectors. This linear combination is labeled in fig. 4.
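Expressed as code, equation (1) can be a small module whose only trainable parameters are the combination weights p_i; the uniform initialisation below is an assumption:

```python
import torch
import torch.nn as nn

class CombinedSpeakerVector(nn.Module):
    """Equation (1) as a module: l = sum_i p_i * h_i, with the N multi-speaker
    vectors h_i fixed and only the weights p_i trainable."""
    def __init__(self, multi_speaker_vectors: torch.Tensor):  # shape (N, dim)
        super().__init__()
        n = multi_speaker_vectors.size(0)
        self.register_buffer("h", multi_speaker_vectors.detach())  # fixed h_i
        self.p = nn.Parameter(torch.full((n,), 1.0 / n))            # trainable p_i

    def forward(self) -> torch.Tensor:
        return self.p @ self.h  # (N,) @ (N, dim) -> (dim,), i.e. l
```

During fine-tuning, gradients flow only into p, so the new speaker is represented entirely within the span of the pretrained multi-speaker vectors.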
As shown in fig. 4 (a), the speaker vector is a linear combination of the speaker vectors in the first training step of the multi-speaker model shown in fig. 2.
The encoder (current encoder) part abstracts the current input text; this module does not receive speaker vector information during training, so it is taken directly from the multi-speaker model and kept fixed when training on a small amount of data.
The alignment module (current alignment module) and the (simple) decoder (first current decoder) use the corresponding parts from the first training step of the multi-speaker model shown in fig. 2 as initial values, but are still trained. This speeds up convergence and improves final accuracy.
As shown in fig. 4 (b), the speaker vector in the second training step is a linear combination of the speaker vectors from the second training step of the multi-speaker model shown in fig. 2.
As before, the encoder (current encoder) part abstracts the input text; this module does not receive speaker vector information during training, so it is taken directly from the multi-speaker model and kept fixed when training on a small amount of data.
The alignment module is the one trained in fig. 4 (a) and is kept fixed.
The (complex) decoder (second current decoder) uses the corresponding part from the second training step of the multi-speaker model shown in fig. 2 as an initial value, but is still trained.
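Putting the transfer together, the sketch below shows how the current model could be initialised from the multi-speaker model for each of the two fine-tuning stages; the module names follow the earlier sketches and are assumptions, not the patent's identifiers:

```python
def init_from_multi_speaker(current, multi, stage: int):
    """Initialise the current (small-data) model from the multi-speaker model."""
    # Encoder: taken directly from the multi-speaker model and kept fixed.
    current.encoder.load_state_dict(multi.encoder.state_dict())
    current.encoder.requires_grad_(False)
    if stage == 1:
        # Alignment module and simple decoder start from the multi-speaker
        # weights as initial values but remain trainable (faster convergence).
        current.align.load_state_dict(multi.align.state_dict())
        current.simple_decoder.load_state_dict(multi.simple_decoder.state_dict())
    else:
        # Stage 2: the alignment module trained in stage 1 is frozen; the
        # complex decoder starts from the multi-speaker weights and is trained.
        current.align.requires_grad_(False)
        current.complex_decoder.load_state_dict(multi.complex_decoder.state_dict())
```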
The embodiment of the invention provides a method for training a speech synthesis model, comprising the following steps: training a historical speech synthesis model to obtain a multi-speaker model; acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units; and training the current speech synthesis model based on the multi-speaker model, the current speaker ID and the current input text. In the training process, even though the current input text covers fewer phonemes than the historical input text of the multi-speaker model, the current speech synthesis model is trained based on the multi-speaker model, the current speaker ID and the current input text, and because the multi-speaker model is trained in advance, the accuracy in the training process can be improved.
In the embodiment of the invention, the existing mature FPUTS-based synthesis algorithm is combined with a dedicated transfer-learning algorithm, so that the data cost can be reduced to as little as one fifth while the synthesis quality remains essentially unaffected.
Based on the foregoing speech synthesis model training method, in an embodiment of the present invention, a speech synthesis model training apparatus is provided, a structural block diagram of the training apparatus is shown in fig. 5, and the training apparatus includes:
a first training module 201, an acquisition module 202 and a second training module 203.
Wherein,
the first training module 201 is configured to train a historical speech synthesis model to obtain a multiple speaker model;
the obtaining module 202 is configured to obtain a current speaker ID and a current input text in current training data, where the data amount of the current input text is less than the data amount of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units;
the second training module 203 is configured to train a current speech synthesis model based on the multi-speaker model, the current speaker ID, and the current input text.
The invention provides a training device for a speech synthesis model, which is used for: training a historical speech synthesis model to obtain a multi-speaker model; acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units; and training the current speech synthesis model based on the multi-speaker model, the current speaker ID and the current input text. In the training process, even though the current input text covers fewer phonemes than the historical input text of the multi-speaker model, the current speech synthesis model is trained based on the multi-speaker model, the current speaker ID and the current input text, and because the multi-speaker model is trained in advance, the accuracy in the training process can be improved.
In an embodiment of the present invention, the historical speech synthesis model includes: a history encoder, a history decoder, and a history alignment module, the history decoder including a first history decoder and a second history decoder; the multi-speaker model includes: a first speaker model and a second speaker model, and the first training module 201 comprises:
a first acquisition unit 204, a first training unit 205 and a second training unit 206.
Wherein,
the first obtaining unit 204 is configured to obtain a historical speaker ID and a historical input text in training data;
the first training unit 205 is configured to determine a historical speaker vector based on the historical speaker ID, and train the historical alignment module based on the historical speaker vector, the historical input text, the historical encoder, and the first historical decoder to obtain a first speaker model, where the first speaker model includes: a target historical alignment module;
the second training unit 206 is configured to train the historical speaker vector, the history encoder and the second history decoder by using the history input text and the target historical alignment module to obtain a second speaker model.
In this embodiment of the present invention, the first training module 201 further includes:
the phonemes in the historical input text are in units of words (whole pinyin syllables).
In an embodiment of the present invention, the current speech synthesis model includes: a current encoder, a current decoder, and a current alignment module, the decoders including a first current decoder and a second current decoder, the second training module 203 including:
a determination unit 207, a third training unit 208 and a fourth training unit 209.
Wherein,
the determining unit 207 is configured to determine a first speaker vector for the current speaker ID as a linear combination of the speaker vectors in the first speaker model;
the third training unit 208 is configured to train the current alignment module using the first speaker vector, the current input text, the current encoder, and the first current decoder to obtain a target current alignment module;
the fourth training unit 209 is configured to determine a second speaker vector based on the second speaker model and the current speaker ID, and train the second speaker vector, the current encoder and the second current decoder by using the current input text and the target current alignment module.
In this embodiment of the present invention, the second training module 203 further includes:
a second retrieving unit 210, a first assigning unit 211 and a second assigning unit 212.
Wherein,
the second obtaining unit 210 is configured to obtain a first training result of a first historical decoder in the first speaker model and a second training result of a second historical decoder in the second speaker model;
the first assigning unit 211 is configured to assign the first training result to the first current decoder as an initial value;
the second assigning unit 212 is configured to assign the second training result to the second current decoder as an initial value.
It should be noted that, in this specification, each embodiment is described in a progressive manner, and each embodiment focuses on differences from other embodiments, and portions that are the same as and similar to each other in each embodiment may be referred to. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, in this document, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the units may be implemented in the same software and/or hardware or in a plurality of software and/or hardware when implementing the invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The above detailed description is provided for the training method and apparatus of a speech synthesis model provided by the present invention, and the present document applies specific examples to explain the principle and the implementation of the present invention, and the description of the above embodiments is only used to help understand the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
Claims (8)
1. A method for training a speech synthesis model, comprising:
training a historical speech synthesis model to obtain a multi-speaker model;
acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in a training process, and the phonemes of the current input text take initials and finals as units;
training a current speech synthesis model based on the multi-speaker model, the current speaker ID and the current input text;
the historical speech synthesis model includes: a history encoder, a history decoder and a history alignment module, the history decoder including a first history decoder and a second history decoder; the multi-speaker model includes: a first speaker model and a second speaker model; and the training process comprises the following steps:
acquiring a historical speaker ID and a historical input text in training data;
determining a historical speaker vector based on the historical speaker ID, and training the historical alignment module through the historical speaker vector, the historical input text, the historical encoder and the first historical decoder to obtain the first speaker model, wherein the first speaker model comprises: a target historical alignment module;
training the historical speaker vector, the historical encoder and the second historical decoder by using the historical input text and the target historical alignment module to obtain the second speaker model.
2. The method of claim 1, further comprising:
The phonemes in the historical input text are in units of words (whole pinyin syllables).
3. The method of claim 1, wherein the current speech synthesis model is trained based on the multi-speaker model, the current speaker ID and the current input text, the current speech synthesis model comprising: a current encoder, a current decoder and a current alignment module, the current decoder including a first current decoder and a second current decoder, and the training process comprising:
determining a first speaker vector for the current speaker ID as a linear combination of the speaker vectors in the first speaker model;
training the current alignment module using the first speaker vector, the current input text, the current encoder and the first current decoder to obtain a target current alignment module;
determining a second speaker vector based on the second speaker model and the current speaker ID, and training the second speaker vector, the current encoder and the second current decoder by using the current input text and the target current alignment module.
4. The method of claim 3, further comprising:
acquiring a first training result of a first historical decoder in the first speaker model and a second training result of a second historical decoder in the second speaker model;
assigning the first training result to the first current decoder as an initial value;
and assigning the second training result to the second current decoder as an initial value.
5. An apparatus for training a speech synthesis model, comprising:
the first training module is used for training the historical speech synthesis model to obtain a multi-speaker model;
the acquisition module is used for acquiring the current speaker ID and the current input text in the current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units;
a second training module for training a current speech synthesis model based on the multi-speaker model, the current speaker ID and the current input text;
the historical speech synthesis model includes: a history encoder, a history decoder and a history alignment module, the history decoder including a first history decoder and a second history decoder; the multi-speaker model includes: a first speaker model and a second speaker model, and the first training module comprises:
a first acquisition unit, configured to acquire a historical speaker ID and a historical input text in the training data;
a first training unit, configured to determine a historical speaker vector based on the historical speaker ID, and train the historical alignment module through the historical speaker vector, the historical input text, the historical encoder and the first historical decoder to obtain the first speaker model, wherein the first speaker model includes: a target historical alignment module;
a second training unit, configured to train the historical speaker vector, the history encoder and the second history decoder by using the history input text and the target historical alignment module to obtain the second speaker model.
6. The apparatus of claim 5, further comprising:
The phonemes in the historical input text are in units of words (whole pinyin syllables).
7. The apparatus of claim 5, wherein the current speech synthesis model comprises: a current encoder, a current decoder and a current alignment module, the current decoder including a first current decoder and a second current decoder, and the second training module comprises:
a determining unit, configured to determine a first speaker vector for the current speaker ID as a linear combination of the speaker vectors in the first speaker model;
a third training unit, configured to train the current alignment module using the first speaker vector, the current input text, the current encoder and the first current decoder to obtain a target current alignment module;
a fourth training unit, configured to determine a second speaker vector based on the second speaker model and the current speaker ID, and train the second speaker vector, the current encoder and the second current decoder by using the current input text and the target current alignment module.
8. The apparatus of claim 7, further comprising:
a second obtaining unit, configured to obtain a first training result of a first historical decoder in the first speaker model and a second training result of a second historical decoder in the second speaker model;
a first assigning unit configured to assign the first training result to the first current decoder as an initial value;
a second assigning unit, configured to assign the second training result to the second current decoder as an initial value.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110259482.2A | 2021-03-10 | 2021-03-10 | Training method and device of speech synthesis model |

Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN113053353A | 2021-06-29 |
| CN113053353B | 2022-10-04 |

Family ID: 76511007
Families Citing this family (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| KR102598057B1 * | 2018-09-10 | 2023-11-06 | 삼성전자주식회사 | Apparatus and Methof for controlling the apparatus therof |
| CN113781996B * | 2021-08-20 | 2023-06-27 | 北京淇瑀信息科技有限公司 | Voice synthesis model training method and device and electronic equipment |
Family Cites Families (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CA2388352A1 * | 2002-05-31 | 2003-11-30 | Voiceage Corporation | A method and device for frequency-selective pitch enhancement of synthesized speech |
| US20190019500A1 * | 2017-07-13 | 2019-01-17 | Electronics And Telecommunications Research Institute | Apparatus for deep learning based text-to-speech synthesizing by using multi-speaker data and method for the same |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101116135A (en) * | 2005-02-10 | 2008-01-30 | 皇家飞利浦电子股份有限公司 | Sound synthesis |
EP3739572A1 (en) * | 2018-01-11 | 2020-11-18 | Neosapience, Inc. | Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium |
WO2019175574A1 (en) * | 2018-03-14 | 2019-09-19 | Papercup Technologies Limited | A speech processing system and a method of processing a speech signal |
CN109036375A (en) * | 2018-07-25 | 2018-12-18 | 腾讯科技(深圳)有限公司 | Phoneme synthesizing method, model training method, device and computer equipment |
CN111048064A (en) * | 2020-03-13 | 2020-04-21 | 同盾控股有限公司 | Voice cloning method and device based on single speaker voice synthesis data set |
CN111489734A (en) * | 2020-04-03 | 2020-08-04 | 支付宝(杭州)信息技术有限公司 | Model training method and device based on multiple speakers |
CN111681639A (en) * | 2020-05-28 | 2020-09-18 | 上海墨百意信息科技有限公司 | Multi-speaker voice synthesis method and device and computing equipment |
CN111724765A (en) * | 2020-06-30 | 2020-09-29 | 上海优扬新媒信息技术有限公司 | Method and device for converting text into voice and computer equipment |
CN112133282A (en) * | 2020-10-26 | 2020-12-25 | 厦门大学 | Lightweight multi-speaker speech synthesis system and electronic equipment |
CN112435650A (en) * | 2020-11-11 | 2021-03-02 | 四川长虹电器股份有限公司 | Multi-speaker and multi-language voice synthesis method and system |
CN112466276A (en) * | 2020-11-27 | 2021-03-09 | 出门问问(苏州)信息科技有限公司 | Speech synthesis system training method and device and readable storage medium |
Non-Patent Citations (3)

| Title |
|---|
| Junmo Lee, et al.; "DNN based multi-speaker speech synthesis with temporal auxiliary speaker ID embedding"; 2019 International Conference on Electronics, Information, and Communication (ICEIC); IEEE; 2019-05-06. * |
| Xie Yongbin (谢永斌); "Research on end-to-end speech synthesis technology based on a small data set"; China Masters' Theses Full-text Database (Information Science and Technology), No. 2; China Academic Journal Electronic Publishing House; 2021-02-15. * |
| Zhang Peng (张鹏); "Research and implementation of an embedded speech synthesis system"; China Masters' Theses Full-text Database (Information Science and Technology), No. 8; China Academic Journal Electronic Publishing House; 2006-08-15. * |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| 2022-09-16 | TA01 | Transfer of patent application right | Applicant after: Du Xiaoman Technology (Beijing) Co.,Ltd., Room 606, 6/F, Building 4, West District, Courtyard 10, Northwest Wangdong Road, Haidian District, Beijing 100193. Applicant before: Chongqing duxiaoman Youyang Technology Co.,Ltd., B7-7-2, Yuxing Plaza, No. 5 Huangyang Road, Yubei District, Chongqing 401120. |
| | GR01 | Patent grant | |