
CN113053353B - Training method and device of speech synthesis model - Google Patents

Training method and device of speech synthesis model

Info

Publication number
CN113053353B
Authority
CN
China
Prior art keywords
current
speaker
historical
training
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110259482.2A
Other languages
Chinese (zh)
Other versions
CN113053353A (en)
Inventor
黄选平
马达标
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Du Xiaoman Technology Beijing Co Ltd
Original Assignee
Du Xiaoman Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Du Xiaoman Technology Beijing Co Ltd filed Critical Du Xiaoman Technology Beijing Co Ltd
Priority to CN202110259482.2A
Publication of CN113053353A
Application granted
Publication of CN113053353B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

An embodiment of the invention provides a method and a device for training a speech synthesis model. The method comprises the following steps: training a historical speech synthesis model to obtain a multi-speaker model; acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of the historical input text used to train the multi-speaker model, and the phonemes of the current input text take initials and finals as units; and training the current speech synthesis model based on the multi-speaker model, the current speaker ID and the current input text. Because the multi-speaker model is trained in advance, accuracy during training is improved; and even when the data volume of the input text is reduced, the phoneme inventory can still be fully covered because the input text takes initials and finals as units. This ensures the precision of speech synthesis and solves the prior-art problem of low speech synthesis precision caused by the input text containing few phonemes and few phoneme features.

Description

Training method and device of speech synthesis model
Technical Field
The invention relates to the technical field of intelligent speech, and in particular to a method and a device for training a speech synthesis model.
Background
Speech synthesis is the process of going from text to speech: the text is input to an acoustic model to obtain acoustic features, which are then input to a synthesizer to obtain audio. In the prior art, a highly parallel acoustic model, FPUTS (Fully Parallel UFANS-based End-to-End Text-to-Speech System), is adopted for speech synthesis. A schematic diagram of the FPUTS acoustic model is shown in fig. 1, where the encoder, the alignment module and the decoder are all composed of neural networks. The general process of generating audio is as follows: a speaker vector is obtained from the speaker's ID (e.g. 0-100); the vector is N-dimensional, and different speakers have different speaker vectors. The speaker vector and the text are input into the encoder for encoding. The speaker vector and the text are also input into the alignment module to obtain the pronunciation-duration information of the audio (this module determines the pronunciation duration, speaking speed and so on of the final audio). Finally, the duration information and the encoding from the encoder enter the decoder for decoding, which yields the final audio.
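For illustration only, the following minimal PyTorch sketch mirrors the flow just described (speaker ID to speaker vector, encoder, alignment module, length regulation, decoder). The layer choices, dimensions and batch-of-one handling are assumptions made for readability, not the actual FPUTS architecture.

import torch
import torch.nn as nn

class FPUTSSketch(nn.Module):
    """Minimal sketch of the FPUTS-style flow described above; the layer
    choices and dimensions are illustrative assumptions, not the patent's
    actual architecture."""

    def __init__(self, n_speakers=101, n_phonemes=80, dim=64, n_mels=80):
        super().__init__()
        self.speaker_table = nn.Embedding(n_speakers, dim)      # speaker ID (e.g. 0-100) -> N-dim speaker vector
        self.phoneme_emb = nn.Embedding(n_phonemes, dim)
        self.encoder = nn.GRU(2 * dim, dim, batch_first=True)   # encodes text together with the speaker vector
        self.duration_head = nn.Linear(dim, 1)                  # stands in for the alignment module
        self.decoder = nn.GRU(dim, n_mels, batch_first=True)    # encoding + durations -> acoustic frames

    def forward(self, phoneme_ids, speaker_id):
        # phoneme_ids: (1, T) long tensor; speaker_id: (1,) long tensor.
        spk = self.speaker_table(speaker_id)                      # speaker vector
        txt = self.phoneme_emb(phoneme_ids)
        spk_seq = spk.unsqueeze(1).expand(-1, txt.size(1), -1)
        enc, _ = self.encoder(torch.cat([txt, spk_seq], dim=-1))  # speaker-conditioned text encoding
        dur = self.duration_head(enc).squeeze(-1).exp()           # pronunciation-duration information
        frames = dur.round().clamp(min=1).long()[0]
        # Length-regulate: repeat each phoneme encoding by its predicted frame count.
        expanded = enc[0].repeat_interleave(frames, dim=0).unsqueeze(0)
        mel, _ = self.decoder(expanded)                           # acoustic features; a vocoder then yields audio
        return mel, dur

model = FPUTSSketch()
mel, dur = model(torch.tensor([[3, 17, 42]]), torch.tensor([5]))  # toy phoneme IDs, speaker ID 5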
The FPUTS model needs to be trained in advance; once training is complete, speech synthesis can be performed. The training process of FPUTS is shown in fig. 2. The first step is training the alignment module, see fig. 2(a). At this stage the main components of the model are the speaker vectors, the encoder, the alignment module, and a decoder with a very simple structure (the simplicity of the decoder is very important for training the alignment module); training on the data yields a trained alignment module. The second step is training the speaker vectors, the encoder and the decoder, see fig. 2(b). Here the model is mainly composed of the encoder, the alignment module trained in the first step, and the (complex, final) decoder. The alignment module is fixed at this stage and does not participate in training.
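As a companion sketch (again an assumption-laden illustration, reusing the FPUTSSketch class above), the two-step schedule can be driven by freezing and unfreezing parameters:

import torch

def set_trainable(module, flag):
    # Freeze or unfreeze every parameter of a sub-module.
    for p in module.parameters():
        p.requires_grad_(flag)

model = FPUTSSketch()

# Step 1 (fig. 2a): the speaker vectors, encoder, alignment module and a very
# simple decoder all train together, so the alignment module learns durations.
opt1 = torch.optim.Adam(model.parameters(), lr=1e-3)
# ... per batch: loss = mel_loss + duration_loss; loss.backward(); opt1.step()

# Step 2 (fig. 2b): the trained alignment module is fixed and does not
# participate; the speaker vectors, encoder and final (complex) decoder train.
set_trainable(model.duration_head, False)
opt2 = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
# ... per batch: loss = mel_loss against ground-truth acoustic features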
The inventors studied the training process of the FPUTS model and found that when the input text is short, its data volume is small, and when the phonemes are divided in units of whole syllables, the input text contains few phoneme features, resulting in low speech synthesis accuracy during training.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for training a speech synthesis model, so as to solve the prior-art problem that, during the training of an FPUTS model, a short input text has a small data volume, and with phonemes divided in units of whole syllables the input text contains few phoneme features, resulting in low speech synthesis accuracy. The specific scheme is as follows:
a method of training a speech synthesis model, comprising:
training a historical speech synthesis model to obtain a multi-speaker model;
acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units;
training a current speech synthesis model based on the multi-speaker model, the current speaker ID, and the current input text.
Optionally, in the foregoing method, the historical speech synthesis model is trained to obtain the multi-speaker model, where the historical speech synthesis model includes: a historical encoder, a historical decoder and a historical alignment module, the decoder including a first historical decoder and a second historical decoder; and the multi-speaker model includes: a first speaker model and a second speaker model. The training process comprises the following steps:
acquiring a historical speaker ID and a historical input text in training data;
determining a historical speaker vector based on the historical speaker ID, and training the historical alignment module based on the historical speaker vector and the historical input text, through the historical encoder and the first historical decoder, to obtain the first speaker model, wherein the first speaker model comprises: a historical target alignment module;
training the second historical decoder based on the historical speaker vector, the historical input text, the historical encoder and the historical target alignment module, to obtain the second speaker model.
The method described above, optionally, further includes:
the phonemes in the historical input text are in units of words.
Optionally, in the above method, the current speech synthesis model is trained based on the multi-speaker model, the current speaker ID and the current input text, where the current speech synthesis model includes: a current encoder, a current decoder and a current alignment module, the decoder including a first current decoder and a second current decoder. The training process comprises:
linearly combining the first speaker model and the current speaker ID to determine a first speaker vector;
training the current alignment module by adopting the first speaker vector, the current input text, the current encoder and the first current decoder to obtain a target current alignment module;
determining a second speaker vector based on the second speaker model and the current speaker ID, and training the second speaker vector and the second current decoder using the current input text, the current encoder and the target current alignment module.
The above method, optionally, further includes:
acquiring a first training result of a first historical decoder in the first speaker model and a second training result of a second historical decoder in the second speaker model;
assigning the first training result to the first current decoder as an initial value;
and assigning the second training result to the second current decoder as an initial value.
An apparatus for training a speech synthesis model, comprising:
the first training module is used for training the historical speech synthesis model to obtain a multi-speaker model;
the acquisition module is used for acquiring the current speaker ID and the current input text in the current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units;
a second training module to train a current speech synthesis model based on the multi-speaker model, the current speaker ID, and the current input text.
The above apparatus, optionally, wherein the historical speech synthesis model includes: a historical encoder, a historical decoder and a historical alignment module, the decoder including a first historical decoder and a second historical decoder; the multi-speaker model includes: a first speaker model and a second speaker model, and the first training module comprises:
the device comprises a first acquisition unit, a second acquisition unit and a third acquisition unit, wherein the first acquisition unit is used for acquiring a historical speaker ID and a historical input text in training data;
a first training unit, configured to determine a historical speaker vector based on the historical speaker ID, and to train the historical alignment module based on the historical speaker vector and the historical input text, through the historical encoder and the first historical decoder, to obtain the first speaker model, wherein the first speaker model includes: a historical target alignment module;
and a second training unit, configured to train the second historical decoder based on the historical speaker vector, the historical input text, the historical encoder and the historical target alignment module, to obtain the second speaker model.
The above apparatus, optionally, further comprises:
the phonemes in the historical input text are in units of words.
The above apparatus, optionally, wherein the current speech synthesis model includes: a current encoder, a current decoder and a current alignment module, the decoder including a first current decoder and a second current decoder, and the second training module comprises:
a determining unit, configured to perform linear combination on the first speaker model and the current speaker ID, and determine a first speaker vector;
a third training unit, configured to train the current alignment module using the first speaker vector, the current input text, the current encoder, and the first current decoder to obtain a target current alignment module;
a fourth training unit, configured to determine a second speaker vector based on the second speaker model and the current speaker ID, and to train the second speaker vector and the second current decoder using the current input text, the current encoder and the target current alignment module.
The above apparatus, optionally, further comprises:
a second obtaining unit, configured to obtain a first training result of a first historical decoder in the first speaker model and a second training result of a second historical decoder in the second speaker model;
a first assigning unit configured to assign the first training result to the first current decoder as an initial value;
and the second assignment unit is used for assigning the second training result to the second current decoder as an initial value.
Compared with the prior art, the invention has the following advantages:
the embodiment of the invention provides a method and a device for training a speech synthesis model, which comprises the following steps: training a historical speech synthesis model to obtain a multi-speaker model; acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of a historical input text of the multi-speaker model in a training process, and phonemes of the current input text take vowels as a unit; training a current speech synthesis model based on the multi-speaker model, the current speaker ID, and the current input text. In the training process, on the premise that the number of phonemes of the input text is less than that of phonemes of a historical input text of the multi-speaker model in the training process, the current speech synthesis model is trained based on the multi-speaker model, the current speaker ID and the current input text, and the accuracy in the training process can be improved due to the fact that the multi-speaker model is trained in advance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of an FPUTS model according to the prior art;
FIG. 2 is a diagram illustrating an FPUTS model training process according to the prior art;
FIG. 3 is a flowchart of a method for training a speech synthesis model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an FPUTS model training process disclosed in an embodiment of the present application;
fig. 5 is a block diagram of a structure of a training apparatus for a speech synthesis model according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The invention discloses a method and a device for training a speech synthesis model, applied to the training process of an FPUTS-based speech synthesis model. To solve the problem described above, the invention provides the following training method; its execution flow is shown in fig. 3, and it comprises the following steps:
s101, training a historical speech synthesis model to obtain a multi-speaker model;
in the embodiment of the invention, the speech synthesis is a process of synthesizing text into speech, and the historical speech synthesis model (FPTUS model) comprises a historical coder, a historical decoder and a historical alignment module, wherein the decoder comprises a first historical decoder and a second historical decoder; the multi-speaker model includes: a first speaker model and a second speaker model, where a process of training the historical speech synthesis model is the same as the training process shown in fig. 2, a training alignment module obtains a historical speaker ID and a historical input text in training data, where the training data is given in advance, the training data includes the historical speaker ID and the historical input text corresponding to the ID, the historical speaker ID is pre-assigned based on experience or specific conditions, a historical speaker vector is determined based on the historical speaker ID, the historical input text, the historical encoder and the historical first decoder train the historical alignment module to obtain the first speaker model, where the first speaker model includes: a historical target alignment module; and training a speaker vector, an encoder and a decoder, wherein based on the historical speaker vector, the historical input text, the historical encoder, the second historical decoder and the target historical alignment module are trained to obtain a second speaker model.
S102, acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units;
In the embodiment of the present invention, the current training data is given in advance and includes a current speaker ID and a current input text corresponding to that ID; the current speaker ID and the current input text are obtained from the current training data, where the current speaker ID may be set based on experience or specific conditions, and the data volume of the current input text is less than the data volume of the historical input text used to train the multi-speaker model. The current input text is a series of phonemes. For example, take the current input text "yi xi lie yin su": in the usage scenario of the historical speech synthesis model, pinyin syllables are taken as the unit, so in this example 'yi', 'xi', 'lie', 'yin' and 'su' are five different phoneme units, and this phoneme system has 460 different phonemes. With a high amount of data such a phoneme system can naturally be used, but with a low amount of data it may fail to cover the whole inventory. Therefore, in the embodiment of the present invention, the phoneme system is modified to take initials and finals as units, and the example becomes 'y i x i l ie y in s u'. With this phoneme system there are only 80 different phonemes, so even a low amount of data can still cover them completely, though the accuracy requirement for model training becomes higher.
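The following toy Python snippet illustrates the two phoneme inventories; the syllable-splitting table is hypothetical and covers only this example, not a full pinyin inventory.

# Hypothetical, example-only mapping from whole pinyin syllables to
# (initial, final) units; a real system would cover the full inventory.
SYLLABLE_SPLIT = {
    "yi": ("y", "i"), "xi": ("x", "i"), "lie": ("l", "ie"),
    "yin": ("y", "in"), "su": ("s", "u"),
}

def to_initial_final_units(pinyin_text):
    units = []
    for syllable in pinyin_text.split():
        units.extend(SYLLABLE_SPLIT[syllable])
    return units

print("yi xi lie yin su".split())
# ['yi', 'xi', 'lie', 'yin', 'su']  -- whole-syllable system, 460 unit types overall
print(to_initial_final_units("yi xi lie yin su"))
# ['y', 'i', 'x', 'i', 'l', 'ie', 'y', 'in', 's', 'u']  -- initial/final system, 80 unit types overall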
S103, training a current voice synthesis model based on the multi-speaker model, the current speaker ID and the current input text.
In the embodiment of the present invention, as shown in fig. 4, the multi-speaker model includes: a first speaker model, comprising the speaker vectors, encoder, alignment module and (simple) decoder obtained in the first training step shown in fig. 2, labeled speaker vectors_mul,1, encoder_mul,1, alignment module_mul,1 and (simple) decoder_mul,1; and a second speaker model, comprising the speaker vectors, encoder, alignment module and (complex) decoder obtained in the second training step, labeled speaker vectors_mul,2, encoder_mul,2, alignment module_mul,2 and (complex) decoder_mul,2. Note that alignment module_mul,1 and alignment module_mul,2 are identical.
The speaker vector determines the characteristics of the synthesized audio, such as pronunciation duration, speech rate and pitch: with the same encoder, alignment module and decoder, different speaker vectors synthesize different voices. In the multi-speaker model, assuming that data from a total of N speakers participated in training, there are N different speaker vectors h_i, i = 1...N.
Thus, the speaker vector for the current input text needs to be determined first. To make full use of the multi-speaker model, the speaker vector of the new speaker is defined as

l = ∑_i p_i × h_i    (1)

where the p_i are trainable variables, so that l is a linear combination of the multi-speaker vectors. The linear combinations are labeled in fig. 4.
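As a sketch, equation (1) amounts to a single trainable weight vector over the frozen pretrained speaker vectors; the dimensions and uniform initialization below are assumptions:

import torch
import torch.nn as nn

N, dim = 100, 64                              # N pretrained speakers, vector size (both assumed)
h = torch.randn(N, dim)                       # pretrained speaker vectors h_i, kept frozen
p = nn.Parameter(torch.full((N,), 1.0 / N))   # trainable combination weights p_i

def new_speaker_vector():
    # Equation (1): l = sum_i p_i * h_i; only p is updated during low-data training.
    return p @ h                              # shape (dim,)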
As shown in fig. 4 (a), the speaker vector is a linear combination of the speaker vectors in the first training step of the multi-speaker model shown in fig. 2.
The encoder (current encoder) abstracts the current input text; this module does not receive speaker-vector information during training, so it is taken directly from the multi-speaker model and kept fixed when training on a small amount of data.
The alignment module (current alignment module) and the (simple) decoder (first current decoder) use the corresponding parts of the first training step of the multi-speaker model shown in fig. 2 as initial values, but are still trained. This speeds up convergence and improves final accuracy.
As shown in fig. 4(b), the speaker vector in the second training step is a linear combination of the speaker vectors from the second training step of the multi-speaker model shown in fig. 2.
As before, the encoder (current encoder) abstracts the input text; it does not receive speaker-vector information during training, so it is taken directly from the multi-speaker model and kept fixed when training on a small amount of data.
The alignment module is the one trained in fig. 4(a) and is held fixed.
The (complex) decoder (second current decoder) uses the corresponding part of the second training step of the multi-speaker model shown in fig. 2 as an initial value, but is still trained.
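Putting fig. 4 together, a hedged adaptation sketch follows (reusing FPUTSSketch and set_trainable from the earlier sketches; the Adam optimizer and L1 mel loss are assumptions, and the sketch's speaker table stands in for the combination weights p_i of equation (1)):

import torch
import torch.nn.functional as F

def adapt_to_new_speaker(current, pretrained, batches):
    current.load_state_dict(pretrained.state_dict())  # multi-speaker weights as initial values
    set_trainable(current.encoder, False)             # encoder taken over and kept fixed
    set_trainable(current.duration_head, False)       # alignment module kept fixed (after fig. 4a)
    # Still trainable: the decoder, plus the speaker table, which here stands in
    # for the combination weights p_i of equation (1).
    opt = torch.optim.Adam((q for q in current.parameters() if q.requires_grad), lr=1e-4)
    for phoneme_ids, speaker_id, mel_target in batches:
        mel, _ = current(phoneme_ids, speaker_id)
        T = min(mel.size(1), mel_target.size(1))      # crude length matching for the sketch
        loss = F.l1_loss(mel[:, :T], mel_target[:, :T])
        opt.zero_grad(); loss.backward(); opt.step()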
The embodiment of the invention provides a training method for a speech synthesis model, comprising: training a historical speech synthesis model to obtain a multi-speaker model; acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units; and training the current speech synthesis model based on the multi-speaker model, the current speaker ID and the current input text. Although the current input text contains less data than the historical input text used to train the multi-speaker model, the current speech synthesis model is trained on the basis of the multi-speaker model, the current speaker ID and the current input text, and because the multi-speaker model is trained in advance, the accuracy of the training process is improved.
In the embodiment of the invention, the existing, mature FPUTS-based synthesis algorithm is combined with this specific transfer algorithm, so that the data cost can be reduced to as little as one fifth while the synthesis quality remains essentially unaffected.
Based on the foregoing speech synthesis model training method, in an embodiment of the present invention, a speech synthesis model training apparatus is provided, a structural block diagram of the training apparatus is shown in fig. 5, and the training apparatus includes:
a first training module 201, an acquisition module 202 and a second training module 203.
Wherein,
the first training module 201 is configured to train a historical speech synthesis model to obtain a multi-speaker model;
the obtaining module 202 is configured to obtain a current speaker ID and a current input text in current training data, where the data amount of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units;
the second training module 203 is configured to train a current speech synthesis model based on the multi-speaker model, the current speaker ID, and the current input text.
The invention provides a training device for a speech synthesis model, which: trains a historical speech synthesis model to obtain a multi-speaker model; acquires a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units; and trains the current speech synthesis model based on the multi-speaker model, the current speaker ID and the current input text. Although the current input text contains less data than the historical input text used to train the multi-speaker model, the current speech synthesis model is trained on the basis of the pre-trained multi-speaker model, which improves the accuracy of the training process.
In an embodiment of the present invention, the historical speech synthesis model includes: a historical encoder, a historical decoder and a historical alignment module, the decoder including a first historical decoder and a second historical decoder; the multi-speaker model includes: a first speaker model and a second speaker model, and the first training module 201 comprises:
a first acquisition unit 204, a first training unit 205 and a second training unit 206.
Wherein,
the first obtaining unit 204 is configured to obtain a historical speaker ID and a historical input text in training data;
the first training unit 205 is configured to determine a historical speaker vector based on the historical speaker ID, and to train the historical alignment module based on the historical speaker vector, the historical input text, the historical encoder and the first historical decoder, to obtain the first speaker model, where the first speaker model includes: a historical target alignment module;
the second training unit 206 is configured to train the second historical decoder based on the historical speaker vector, the historical input text, the historical encoder and the historical target alignment module, to obtain the second speaker model.
In this embodiment of the present invention, the first training module 201 further includes:
the phonemes in the historical input text are in units of words.
In an embodiment of the present invention, the current speech synthesis model includes: a current encoder, a current decoder, and a current alignment module, the decoders including a first current decoder and a second current decoder, the second training module 203 including:
a determination unit 207, a third training unit 208 and a fourth training unit 209.
Wherein,
the determining unit 207 is configured to perform linear combination on the first speaker model and the current speaker ID to determine a first speaker vector;
the third training unit 208 is configured to train the current alignment module using the first speaker vector, the current input text, the current encoder, and the first current decoder to obtain a target current alignment module;
the fourth training unit 209 is configured to determine a second speaker vector based on the second speaker model and the current speaker ID, and train the second speaker vector, the current encoder, the current decoder, and the second current decoder using the second speaker vector, the current input text, the current encoder, and the target current alignment module.
In this embodiment of the present invention, the second training module 203 further includes:
a second retrieving unit 210, a first assigning unit 211 and a second assigning unit 212.
Wherein,
the second obtaining unit 210 is configured to obtain a first training result of a first historical decoder in the first speaker model and a second training result of a second historical decoder in the second speaker model;
the first assigning unit 211 is configured to assign the first training result to the first current decoder as an initial value;
the second assigning unit 212 is configured to assign the second training result to the second current decoder as an initial value.
It should be noted that, in this specification, each embodiment is described in a progressive manner, and each embodiment focuses on differences from other embodiments, and portions that are the same as and similar to each other in each embodiment may be referred to. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, in this document, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the units may be implemented in the same software and/or hardware or in a plurality of software and/or hardware when implementing the invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The above detailed description is provided for the training method and apparatus of a speech synthesis model provided by the present invention, and the present document applies specific examples to explain the principle and the implementation of the present invention, and the description of the above embodiments is only used to help understand the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (8)

1. A method for training a speech synthesis model, comprising:
training a historical speech synthesis model to obtain a multi-speaker model;
acquiring a current speaker ID and a current input text in current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units;
training a current speech synthesis model based on the multi-speaker model, the current speaker ID and the current input text;
the historical speech synthesis model includes: a history encoder, a history decoder, and a history alignment module, the decoders including a first history decoder and a second history decoder; the multi-speaker model includes: the training process comprises the following steps:
acquiring a historical speaker ID and a historical input text in training data;
determining a historical speaker vector based on the historical speaker ID, and training the historical alignment module based on the historical speaker vector, the historical input text, the historical encoder and the first historical decoder, to obtain the first speaker model, wherein the first speaker model comprises: a historical target alignment module;
training the second historical decoder based on the historical speaker vector, the historical input text, the historical encoder and the historical target alignment module, to obtain the second speaker model.
2. The method of claim 1, further comprising:
the phonemes in the historical input text are in units of words.
3. The method of claim 1, wherein the current speech synthesis model is trained based on the multi-speaker model, the current speaker ID and the current input text, and wherein the current speech synthesis model comprises: a current encoder, a current decoder and a current alignment module, the decoder including a first current decoder and a second current decoder, the training process comprising:
linearly combining the first speaker model and the current speaker ID to determine a first speaker vector;
training the current alignment module by adopting the first speaker vector, the current input text, the current encoder and the first current decoder to obtain a target current alignment module;
determining a second speaker vector based on the second speaker model and the current speaker ID, and training the second speaker vector and the second current decoder using the current input text, the current encoder and the target current alignment module.
4. The method of claim 3, further comprising:
acquiring a first training result of a first historical decoder in the first speaker model and a second training result of a second historical decoder in the second speaker model;
assigning the first training result to the first current decoder as an initial value;
and assigning the second training result to the second current decoder as an initial value.
5. An apparatus for training a speech synthesis model, comprising:
the first training module is used for training the historical speech synthesis model to obtain a multi-speaker model;
the acquisition module is used for acquiring the current speaker ID and the current input text in the current training data, wherein the data volume of the current input text is less than that of the historical input text of the multi-speaker model in the training process, and the phonemes of the current input text take initials and finals as units;
a second training module for training a current speech synthesis model based on the multi-speaker model, the current speaker ID and the current input text;
the historical speech synthesis model includes: a history encoder, a history decoder, and a history alignment module, the decoders including a first history decoder and a second history decoder; the multi-speaker model includes: a first speaker model and a second speaker model, the first training module comprising:
the first acquisition unit is used for acquiring a historical speaker ID and a historical input text in the training data;
a first training unit, configured to determine a historical speaker vector based on the historical speaker ID, and to train the historical alignment module based on the historical speaker vector and the historical input text, through the historical encoder and the first historical decoder, to obtain the first speaker model, wherein the first speaker model includes: a historical target alignment module;
and a second training unit, used for training the second historical decoder based on the historical speaker vector, the historical input text, the historical encoder and the historical target alignment module, to obtain the second speaker model.
6. The apparatus of claim 5, further comprising:
the phonemes in the historical input text are in units of words.
7. The apparatus of claim 5, wherein the current speech synthesis model comprises: a current encoder, a current decoder, and a current alignment module, the decoders including a first current decoder and a second current decoder, the second training module comprising:
a determining unit, configured to perform linear combination on the first speaker model and the current speaker ID, and determine a first speaker vector;
a third training unit, configured to train the current alignment module using the first speaker vector, the current input text, the current encoder, and the first current decoder to obtain a target current alignment module;
a fourth training unit, configured to determine a second speaker vector based on the second speaker model and the current speaker ID, and to train the second speaker vector and the second current decoder using the current input text, the current encoder and the target current alignment module.
8. The apparatus of claim 7, further comprising:
a second obtaining unit, configured to obtain a first training result of a first historical decoder in the first speaker model and a second training result of a second historical decoder in the second speaker model;
a first assigning unit configured to assign the first training result to the first current decoder as an initial value;
a second assigning unit, configured to assign the second training result to the second current decoder as an initial value.
CN202110259482.2A 2021-03-10 2021-03-10 Training method and device of speech synthesis model Active CN113053353B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110259482.2A CN113053353B (en) 2021-03-10 2021-03-10 Training method and device of speech synthesis model

Publications (2)

Publication Number Publication Date
CN113053353A (en) 2021-06-29
CN113053353B (en) 2022-10-04

Family

ID: 76511007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110259482.2A Active CN113053353B (en) 2021-03-10 2021-03-10 Training method and device of speech synthesis model

Country Status (1)

Country Link
CN (1) CN113053353B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102598057B1 (en) * 2018-09-10 2023-11-06 삼성전자주식회사 Apparatus and Methof for controlling the apparatus therof
CN113781996B (en) * 2021-08-20 2023-06-27 北京淇瑀信息科技有限公司 Voice synthesis model training method and device and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2388352A1 * 2002-05-31 2003-11-30 Voiceage Corporation A method and device for frequency-selective pitch enhancement of synthesized speech
US20190019500A1 (en) * 2017-07-13 2019-01-17 Electronics And Telecommunications Research Institute Apparatus for deep learning based text-to-speech synthesizing by using multi-speaker data and method for the same

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101116135A (en) * 2005-02-10 2008-01-30 皇家飞利浦电子股份有限公司 Sound synthesis
EP3739572A1 (en) * 2018-01-11 2020-11-18 Neosapience, Inc. Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium
WO2019175574A1 (en) * 2018-03-14 2019-09-19 Papercup Technologies Limited A speech processing system and a method of processing a speech signal
CN109036375A (en) * 2018-07-25 2018-12-18 腾讯科技(深圳)有限公司 Phoneme synthesizing method, model training method, device and computer equipment
CN111048064A (en) * 2020-03-13 2020-04-21 同盾控股有限公司 Voice cloning method and device based on single speaker voice synthesis data set
CN111489734A (en) * 2020-04-03 2020-08-04 支付宝(杭州)信息技术有限公司 Model training method and device based on multiple speakers
CN111681639A (en) * 2020-05-28 2020-09-18 上海墨百意信息科技有限公司 Multi-speaker voice synthesis method and device and computing equipment
CN111724765A (en) * 2020-06-30 2020-09-29 上海优扬新媒信息技术有限公司 Method and device for converting text into voice and computer equipment
CN112133282A (en) * 2020-10-26 2020-12-25 厦门大学 Lightweight multi-speaker speech synthesis system and electronic equipment
CN112435650A (en) * 2020-11-11 2021-03-02 四川长虹电器股份有限公司 Multi-speaker and multi-language voice synthesis method and system
CN112466276A (en) * 2020-11-27 2021-03-09 出门问问(苏州)信息科技有限公司 Speech synthesis system training method and device and readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DNN based multi-speaker speech synthesis with temporal auxiliary speaker ID embedding; Junmo Lee et al.; 2019 International Conference on Electronics, Information, and Communication (ICEIC); IEEE; 2019-05-06; full text *
Research on end-to-end speech synthesis technology based on a small dataset (in Chinese); Xie Yongbin; China Masters' Theses Full-text Database (Information Science and Technology); China Academic Journal Electronic Publishing House; 2021-02-15 (No. 2); full text *
Research and implementation of an embedded speech synthesis system (in Chinese); Zhang Peng; China Masters' Theses Full-text Database (Information Science and Technology); China Academic Journal Electronic Publishing House; 2006-08-15 (No. 8); full text *

Also Published As

Publication number Publication date
CN113053353A (en) 2021-06-29

Similar Documents

Publication Publication Date Title
US20240062743A1 (en) Unsupervised Parallel Tacotron Non-Autoregressive and Controllable Text-To-Speech
US11908448B2 (en) Parallel tacotron non-autoregressive and controllable TTS
WO2021061484A1 (en) Text-to-speech processing
CN116364055B (en) Speech generation method, device, equipment and medium based on pre-training language model
CN112489629B (en) Voice transcription model, method, medium and electronic equipment
US11763797B2 (en) Text-to-speech (TTS) processing
JP2006084715A (en) Method and device for element piece set generation
EP4078571A1 (en) A text-to-speech synthesis method and system, a method of training a text-to-speech synthesis system, and a method of calculating an expressivity score
US20240087558A1 (en) Methods and systems for modifying speech generated by a text-to-speech synthesiser
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN113781995A (en) Speech synthesis method, device, electronic equipment and readable storage medium
CN113053353B (en) Training method and device of speech synthesis model
CN112509550A (en) Speech synthesis model training method, speech synthesis device and electronic equipment
CN113628609A (en) Automatic audio content generation
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
US20110313772A1 (en) System and method for unit selection text-to-speech using a modified viterbi approach
Oh et al. Durflex-evc: Duration-flexible emotional voice conversion with parallel generation
CN114783410B (en) Speech synthesis method, system, electronic device and storage medium
CN114299989A (en) Voice filtering method and device, electronic equipment and storage medium
Ronanki Prosody generation for text-to-speech synthesis
Zhou et al. Learning and modeling unit embeddings using deep neural networks for unit-selection-based mandarin speech synthesis
CN117558263B (en) Speech recognition method, device, equipment and readable storage medium
CN115206281B (en) Voice synthesis model training method and device, electronic equipment and medium
CN112542160B (en) Coding method for modeling unit of acoustic model and training method for acoustic model
Zhang et al. The TJU-Didi-Huiyan system for Blizzard Challenge 2019

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220916

Address after: 100193 Room 606, 6 / F, building 4, West District, courtyard 10, northwest Wangdong Road, Haidian District, Beijing

Applicant after: Du Xiaoman Technology (Beijing) Co.,Ltd.

Address before: 401120 b7-7-2, Yuxing Plaza, No.5, Huangyang Road, Yubei District, Chongqing

Applicant before: Chongqing duxiaoman Youyang Technology Co.,Ltd.

GR01 Patent grant