CN105719641B - Sound selection method and apparatus for waveform concatenation speech synthesis - Google Patents
Sound selection method and apparatus for waveform concatenation speech synthesis
- Publication number: CN105719641B
- Application number: CN201610035220.7A
- Authority
- CN
- China
- Prior art keywords
- phone
- hmm
- markup information
- waveform
- machine learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
Abstract
The present invention proposes a sound selection method and apparatus for waveform concatenation speech synthesis. The method includes: obtaining markup information, where the markup information is obtained by performing front-end processing on a text to be synthesized; obtaining a pre-generated machine learning model; and performing machine-learning preselection according to the markup information and the machine learning model to obtain candidate phone waveform segments. The method can improve the preselection effect in speech synthesis.
Description
Technical field
The present invention relates to the field of speech synthesis technology, and in particular to a sound selection method and apparatus for waveform concatenation speech synthesis.
Background art
Speech synthesis, also known as text-to-speech (TTS) technology, mainly solves the problem of converting text information into audible acoustic information.

In speech synthesis, front-end processing is first performed on the input text, and acoustic parameter prediction is then carried out to obtain acoustic parameters. Finally, speech is either synthesized directly from the acoustic parameters by a vocoder, or a selection module picks waveform segments from a sound library for waveform concatenation. Compared with vocoder-synthesized speech, speech synthesized by waveform concatenation has higher sound quality and better preserves the style of the original speaker.
When building a speech synthesis system based on waveform concatenation, the related art usually first obtains candidate phone waveform segments according to the markup information and then performs a series of preselection steps on the candidates, including: duration preselection, prosody preselection, context preselection, Kullback-Leibler divergence (KLD) preselection, neighbor preselection, and so on. An optimal phone waveform segment sequence is then selected from the preselected waveform segments, and the segments of this sequence are concatenated to obtain the synthesized speech.

The above scheme in the related art has the following problems:

(1) The preselection steps are mutually independent, and the various kinds of information are not considered jointly, so it is difficult to obtain a good preselection effect.

(2) The preselection steps require tuning thresholds and weights. This tuning demands a large amount of careful manual work and easily improves one aspect at the expense of another; moreover, after the thresholds and weights have been tuned for one sound library, switching to another sound library generally requires re-tuning these parameters.

(3) Multiple preselection steps are needed, and the amount of computation is large (especially for KLD preselection).

(4) The engineering implementation of this approach is relatively complicated: it involves maintaining a large number of parameters, and the code complexity is high, making it difficult to maintain.
Summary of the invention
The present invention aims to solve at least some of the technical problems in the related art.

To this end, one object of the present invention is to provide a sound selection method for waveform concatenation speech synthesis, which can improve the phone preselection effect in speech synthesis.

Another object of the present invention is to provide a sound selection apparatus for waveform concatenation speech synthesis.

To achieve the above objects, an embodiment of the first aspect of the present invention provides a sound selection method for waveform concatenation speech synthesis, comprising: obtaining markup information, where the markup information is obtained by performing front-end processing on a text to be synthesized; obtaining a pre-generated machine learning model; and performing machine-learning preselection according to the markup information and the machine learning model to obtain candidate phone waveform segments.

In the sound selection method for waveform concatenation speech synthesis provided by the embodiment of the first aspect, preselection is performed using a machine learning model, so that various kinds of information can be considered jointly, thereby improving the preselection effect in speech synthesis.

To achieve the above objects, an embodiment of the second aspect of the present invention provides a sound selection apparatus for waveform concatenation speech synthesis, comprising: a first obtaining module, configured to obtain markup information, where the markup information is obtained by performing front-end processing on a text to be synthesized; a second obtaining module, configured to obtain a pre-generated machine learning model; and a preselection module, configured to perform machine-learning preselection according to the markup information and the machine learning model to obtain candidate phone waveform segments.

In the sound selection apparatus for waveform concatenation speech synthesis provided by the embodiment of the second aspect, preselection is performed using a machine learning model, so that various kinds of information can be considered jointly, thereby improving the preselection effect in speech synthesis.

Additional aspects and advantages of the present invention will be set forth in part in the following description, will in part become apparent from the description, or may be learned by practice of the present invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow diagram of a sound selection method for waveform concatenation speech synthesis proposed by one embodiment of the present invention;
Fig. 2 is a flow diagram of a sound selection method for waveform concatenation speech synthesis proposed by another embodiment of the present invention;
Fig. 3 is a schematic diagram of a phone tree in an embodiment of the present invention;
Fig. 4 is a flow diagram of a speech synthesis method in an embodiment of the present invention;
Fig. 5 is a structural diagram of a sound selection apparatus for waveform concatenation speech synthesis proposed by another embodiment of the present invention;
Fig. 6 is a structural diagram of a sound selection apparatus for waveform concatenation speech synthesis proposed by a further embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, where throughout, the same or similar labels indicate the same or similar modules, or modules with the same or similar functions. The embodiments described below with reference to the accompanying drawings are exemplary, are only used to explain the present invention, and are not to be construed as limiting the present invention. On the contrary, the embodiments of the present invention include all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
Fig. 1 is a flow diagram of a sound selection method for waveform concatenation speech synthesis proposed by one embodiment of the present invention.

Referring to Fig. 1, the method comprises:

S11: obtaining markup information, where the markup information is obtained by performing front-end processing on a text to be synthesized.

Front-end processing mainly includes: preprocessing, word segmentation, part-of-speech tagging, phonetic transcription, prosodic hierarchy prediction, and so on. The markup information mainly includes: contextual information of the phone, prosodic position information, tone information, and so on.
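As an illustration only, the per-phone markup information produced by the front end might be represented as a small record like the one below. The patent does not prescribe a concrete data format, so all field names here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class PhoneMarkup:
    """Per-phone markup information produced by front-end processing.

    All field names are illustrative; the patent does not fix a schema.
    """
    phone: str             # current phone, e.g. "ai4"
    left_phone: str        # phone to the left (context information)
    right_phone: str       # phone to the right (context information)
    tone: int              # tone information (Mandarin tones 1-5)
    prosody_position: str  # prosodic position, e.g. "word-final"

# One phone of a text to be synthesized, after front-end processing.
markup = PhoneMarkup(phone="ai4", left_phone="k", right_phone="b",
                     tone=4, prosody_position="word-final")
print(markup.phone)  # ai4
```

A record like this would be the input both to phone tree traversal at synthesis time and to HMM naming at training time.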
S12: obtaining a pre-generated machine learning model.

Optionally, the machine learning model may be a phone tree or a deep neural network model. The machine learning model may be generated in a training stage by training on the markup information of phone samples and the corresponding speech data.

In the present embodiment, the machine learning model is taken to be a phone tree as an example.
Correspondingly, referring to Fig. 2, in some embodiments the method further comprises:

S21: obtaining the markup information of phone samples and the waveform segments of phone samples, training hidden Markov models (Hidden Markov Model, HMM) according to the markup information of the phone samples, and establishing the correspondence between HMMs and waveform segments.

The HMM training may be implemented with the HMM Tool Kit (HTK) and the HMM-based Speech Synthesis System (HTS).

After training is completed, each phone sample in the training data corresponds to one HMM, and each HMM is named by its markup information. In the training data, the same phone generally corresponds to one HMM; in rare cases, one HMM corresponds to multiple phones.

For example, for the final phone ai4, the names of the corresponding HMM acoustic models can be represented in simplified form as: k-ai+b, t-ai+h, s-ai+n, and so on. It can be understood that a complete HMM name, i.e., a full phone label, also contains a large amount of other information.
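A triphone-style HMM name such as k-ai+b can be decomposed into a left context, a center phone, and a right context. The sketch below assumes only the simplified "L-C+R" convention shown in the examples; as noted, a full HMM label in practice carries many additional markup fields:

```python
import re

def parse_hmm_name(name: str):
    """Split a triphone-style HMM name like 'k-ai+b' into
    (left context, center phone, right context).

    Assumes the simplified 'L-C+R' naming convention; real labels
    contain many more fields.
    """
    m = re.fullmatch(r"(.+)-(.+)\+(.+)", name)
    if not m:
        raise ValueError(f"not a triphone label: {name!r}")
    return m.group(1), m.group(2), m.group(3)

print(parse_hmm_name("k-ai+b"))  # ('k', 'ai', 'b')
print(parse_hmm_name("t-ai+h"))  # ('t', 'ai', 'h')
```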
S22: for each phone, performing decision-tree clustering on the HMMs corresponding to that phone to obtain the phone tree corresponding to that phone.

For a specific phone, such as "ai4", decision-tree clustering is carried out using all of its HMMs. Through decision-tree clustering, each non-leaf node in the constructed phone tree corresponds to an optimal splitting question, and each leaf node is associated with a subset of the HMMs. Markup information such as prosodic position and context is used during clustering.

At the beginning, all HMMs are on the root node. Then the question that maximizes the log-likelihood gain between before and after the split is selected as the optimal splitting question, and the HMMs associated with the root node are split into two parts; the child nodes then continue to split. When the log-likelihood gain of a split is less than a certain threshold, splitting stops. The threshold is determined by the minimum description length (MDL) criterion.

A phone tree clustering process can be as shown in Fig. 3. Each non-leaf node in Fig. 3 corresponds to an optimal splitting question, and each leaf node is associated with a subset of the HMMs.

In Fig. 3, "L" and "R" respectively indicate the phones to the left and right of the current phone, "voice" indicates a final (vowel), "silence" indicates silence, and "w" and "g" indicate two specific phones.

For example, in Fig. 3 the optimal splitting question used at the root node is to judge whether the phone to the left of the current phone is a final (L=voice?).
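The greedy splitting procedure described above can be sketched as follows. This is a simplified illustration, not the patent's implementation: each HMM is reduced to a single scalar statistic modeled by a 1-D Gaussian, the questions are plain predicates on the HMM name, and the MDL-derived stopping threshold is passed in as a constant:

```python
import math

def node_loglik(values):
    """Log-likelihood of values under a maximum-likelihood Gaussian
    (closed form in n and the sample variance)."""
    n = len(values)
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, 1e-8)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)

def best_split(items, questions):
    """Pick the question maximizing the log-likelihood gain of a split.

    items:     list of (HMM label, scalar statistic) pairs
    questions: list of (question name, predicate on label) pairs
    """
    parent = node_loglik([v for _, v in items])
    best = (-math.inf, None, None, None)
    for qname, pred in questions:
        yes = [(l, v) for l, v in items if pred(l)]
        no = [(l, v) for l, v in items if not pred(l)]
        if not yes or not no:
            continue
        gain = (node_loglik([v for _, v in yes])
                + node_loglik([v for _, v in no]) - parent)
        if gain > best[0]:
            best = (gain, qname, yes, no)
    return best

def grow_tree(items, questions, threshold):
    """Split recursively while the log-likelihood gain exceeds the
    (MDL-derived) threshold; otherwise return a leaf of HMM labels."""
    gain, qname, yes, no = best_split(items, questions)
    if qname is None or gain < threshold:
        return {"leaf": [l for l, _ in items]}
    return {"question": qname,
            "yes": grow_tree(yes, questions, threshold),
            "no": grow_tree(no, questions, threshold)}

# Toy data: triphone labels of phone "ai" paired with a scalar statistic.
hmms = [("k-ai+b", 1.0), ("t-ai+h", 1.1), ("s-ai+n", 0.9),
        ("sil-ai+w", 5.0), ("sil-ai+g", 5.2)]
questions = [("L=silence", lambda label: label.startswith("sil-"))]
tree = grow_tree(hmms, questions, threshold=0.5)
```

With these toy values, the root asks "L=silence" and the two leaves separate the silence-context HMMs from the rest, mirroring the structure of Fig. 3.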
S13: performing machine-learning preselection according to the markup information and the machine learning model to obtain candidate phone waveform segments.

Taking the phone tree as an example, this may specifically include:

according to the markup information corresponding to the text to be synthesized, for each phone, traversing the phone tree corresponding to that phone to obtain the HMMs associated with a leaf node of the phone tree;

according to the correspondence between HMMs and waveform segments, obtaining the waveform segments corresponding to the HMMs associated with the leaf node, and determining these waveform segments as the candidate phone waveform segments.

For example, in the synthesis stage, after the markup information corresponding to the text to be synthesized is obtained, a leaf node can be found according to the markup information and the phone tree shown in Fig. 3, following the splitting rules of Fig. 3, and the HMMs can then be obtained from that leaf node. Specifically, the name of each HMM can be recorded in the leaf node, and the name of each HMM corresponds to the markup information of a phone at training time.

From the above description it can be seen that phone tree preselection retains only a very small part of all phone waveform segments (the part corresponding to some leaf node); moreover, in theory, the fundamental frequency and spectral properties of the phone candidates at that leaf node have good consistency with the target acoustic parameters predicted by the acoustic model.
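Preselection by tree traversal can be sketched as follows: at each non-leaf node the markup information answers the node's question, and the HMM names stored at the reached leaf index the candidate waveform segments. The nested-dict tree, the boolean markup answers, and all names below are a hypothetical representation chosen for illustration:

```python
def traverse(tree, markup):
    """Walk a phone decision tree using the markup information of the
    phone to be synthesized; return the HMM labels at the reached leaf.

    tree:   nested dicts, either {'leaf': [...]} or
            {'question': name, 'yes': subtree, 'no': subtree}
    markup: dict mapping question names to booleans (illustrative).
    """
    node = tree
    while "leaf" not in node:
        node = node["yes"] if markup[node["question"]] else node["no"]
    return node["leaf"]

def preselect(tree, markup, hmm_to_waveforms):
    """Machine-learning preselection: collect the candidate waveform
    segments associated with every HMM at the reached leaf node."""
    candidates = []
    for hmm_name in traverse(tree, markup):
        candidates.extend(hmm_to_waveforms.get(hmm_name, []))
    return candidates

# A toy phone tree for "ai" and an HMM-to-waveform correspondence table.
phone_tree = {"question": "L=voice",
              "yes": {"leaf": ["k-ai+b", "t-ai+h"]},
              "no": {"leaf": ["sil-ai+w"]}}
waveforms = {"k-ai+b": ["seg_001"], "t-ai+h": ["seg_042"],
             "sil-ai+w": ["seg_107"]}
print(preselect(phone_tree, {"L=voice": True}, waveforms))
# ['seg_001', 'seg_042']
```

Note how the traversal cost depends only on tree depth and leaf size, not on the total sound library scale, which is the complexity property the description emphasizes.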
In some embodiments, a speech synthesis method using the preselection scheme of the present embodiment, referring to Fig. 4, may include:

S41: performing front-end processing on the text to be synthesized to obtain markup information.

Front-end processing includes, for example, preprocessing, word segmentation, part-of-speech tagging, phonetic transcription, prosody prediction, and so on.

S42: performing machine-learning preselection according to the markup information and a pre-generated machine learning model.

The machine learning model is, for example, a phone tree; for the preselection process based on the phone tree, reference may be made to the related description in the above flow, and details are not repeated here.

S43: performing acoustic parameter prediction according to the markup information to obtain acoustic parameters.

The acoustic parameters include: spectrum, fundamental frequency, and duration.

S44: performing cost calculation according to the acoustic parameters and the candidate phone waveform segments obtained after preselection, and selecting the optimal phone waveform segment sequence.

In the cost calculation, the sequence with the minimum cost can be found by a dynamic programming method and taken as the optimal phone waveform segment sequence. After the optimal phone waveform segment sequence is obtained, the waveform segments in the sequence can be concatenated to obtain the synthesized speech.
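The dynamic-programming search for the minimum-cost sequence can be sketched as a Viterbi-style pass: each position's candidates carry a target cost against the predicted acoustic parameters, and each transition carries a concatenation cost. The cost functions and the single-pitch-value segments below are placeholders for illustration, not the patent's cost definition:

```python
def select_optimal_sequence(candidates, target_cost, concat_cost):
    """Viterbi-style dynamic programming over phone positions: find the
    minimum-total-cost sequence of phone waveform segments.

    candidates:  list over phone positions; each entry is the list of
                 candidate segments for that position
    target_cost: f(position, segment) -> float
    concat_cost: f(previous segment, segment) -> float
    """
    # best[i][j]: minimum cost of a path ending at candidate j, position i
    best = [[target_cost(0, c) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(candidates)):
        row, ptr = [], []
        for c in candidates[i]:
            costs = [best[i - 1][j] + concat_cost(p, c)
                     for j, p in enumerate(candidates[i - 1])]
            j = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[j] + target_cost(i, c))
            ptr.append(j)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    return path[::-1]

# Toy example: segments are (id, pitch); the target cost prefers pitch
# near the predicted value, the concatenation cost prefers smoothness.
predicted = [100.0, 110.0]
cands = [[("a", 98.0), ("b", 150.0)], [("c", 112.0), ("d", 60.0)]]
seq = select_optimal_sequence(
    cands,
    target_cost=lambda i, s: abs(s[1] - predicted[i]),
    concat_cost=lambda p, s: abs(s[1] - p[1]))
print([s[0] for s in seq])  # ['a', 'c']
```

Because preselection has already trimmed each position's candidate list to a single leaf's worth of segments, this search runs over small candidate sets regardless of the total sound library size.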
In the present embodiment, preselection is performed using a machine learning model, so that various kinds of information can be considered jointly, thereby improving the preselection effect in speech synthesis. Specifically, the entire preselection process is carried out within the same framework: phone tree preselection, for example, only needs to traverse a decision tree according to the markup to complete a reliable preselection. The candidate phones obtained by this method have good consistency with the target acoustic parameters. The entire preselection process does not require a large amount of manual adjustment of thresholds and weights; thus, if the training data changes, only the preselection model needs to be retrained (for phone tree preselection, only the phone tree needs to be rebuilt). The preselection process is the traversal of a decision tree, so the computational complexity is very small; moreover, the computational complexity is unrelated to the scale of the sound library and is related only to the scale of the candidate phones corresponding to the leaf nodes, and through the MDL criterion the number of candidate phones at a leaf node can be kept relatively stable. The engineering implementation of this method is simple, clear, and easy to maintain.
Fig. 5 is a structural diagram of a sound selection apparatus for waveform concatenation speech synthesis proposed by another embodiment of the present invention. Referring to Fig. 5, the apparatus includes: a first obtaining module 51, a second obtaining module 52, and a preselection module 53.

The first obtaining module 51 is configured to obtain markup information, where the markup information is obtained by performing front-end processing on a text to be synthesized.

Front-end processing mainly includes: preprocessing, word segmentation, part-of-speech tagging, phonetic transcription, prosodic hierarchy prediction, and so on. The markup information mainly includes: contextual information of the phone, prosodic position information, tone information, and so on.

The second obtaining module 52 is configured to obtain a pre-generated machine learning model.

Optionally, the machine learning model may be a phone tree or a deep neural network model. The machine learning model may be generated in a training stage by training on the markup information of phone samples and the corresponding speech data.

The preselection module 53 is configured to perform machine-learning preselection according to the markup information and the machine learning model to obtain candidate phone waveform segments.
In the present embodiment, the machine learning model is taken to be a phone tree as an example.

In some embodiments, referring to Fig. 6, the apparatus 50 further includes:

a modeling module 54, configured to obtain the markup information of phone samples and the waveform segments of phone samples, train HMMs according to the markup information of the phone samples, and establish the correspondence between HMMs and waveform segments.

The HMM training may be implemented with the HMM Tool Kit (HTK) and the HMM-based Speech Synthesis System (HTS).

After training is completed, each phone sample in the training data corresponds to one HMM, and each HMM is named by its markup information. In the training data, the same phone generally corresponds to one HMM; in rare cases, one HMM corresponds to multiple phones.

For example, for the final phone ai4, the names of the corresponding HMM acoustic models can be represented in simplified form as: k-ai+b, t-ai+h, s-ai+n, and so on. It can be understood that a complete HMM name, i.e., a full phone label, also contains a large amount of other information.

a clustering module 55, configured to perform, for each phone, decision-tree clustering on the HMMs corresponding to that phone to obtain the phone tree corresponding to that phone.

For a specific phone, such as "ai4", decision-tree clustering is carried out using all of its HMMs. Through decision-tree clustering, each non-leaf node in the constructed phone tree corresponds to an optimal splitting question, and each leaf node is associated with a subset of the HMMs. The questions used during clustering are based on markup information such as prosodic position and context.

At the beginning, all HMMs are on the root node. Then the question that maximizes the log-likelihood gain between before and after the split is selected as the optimal splitting question, and the HMMs associated with the root node are split into two parts; the child nodes then continue to split. When the log-likelihood gain of a split is less than a certain threshold, splitting stops. The threshold is determined by the minimum description length (MDL) criterion.

A phone tree clustering process can be as shown in Fig. 3. Each non-leaf node in Fig. 3 corresponds to an optimal splitting question, and each leaf node is associated with a subset of the HMMs. In Fig. 3, "L" and "R" respectively indicate the phones to the left and right of the current phone, "voice" indicates a final (vowel), "silence" indicates silence, and "w" and "g" indicate two specific phones. For example, in Fig. 3 the optimal splitting question used at the root node is to judge whether the phone to the left of the current phone is a final (L=voice?).
Optionally, the preselection module 53 is specifically configured to:

according to the markup information corresponding to the text to be synthesized, for each phone, traverse the phone tree corresponding to that phone to obtain the HMMs associated with a leaf node of the phone tree;

according to the correspondence between HMMs and waveform segments, obtain the waveform segments corresponding to the HMMs associated with the leaf node, and determine these waveform segments as the candidate phone waveform segments.

For example, in the synthesis stage, after the markup information corresponding to the text to be synthesized is obtained, a leaf node can be found according to the markup information and the phone tree shown in Fig. 3, following the splitting rules of Fig. 3, and the HMMs can then be obtained from that leaf node. Specifically, the name of each HMM can be recorded in the leaf node, and the name of each HMM corresponds to the markup information of a phone at training time.

From the above description it can be seen that phone tree preselection retains only a very small part of all phone waveform segments (the part corresponding to some leaf node); moreover, in theory, the fundamental frequency and spectral properties of the phone candidates at that leaf node have good consistency with the target acoustic parameters predicted by the acoustic model.
In some embodiments, referring to Fig. 6, the apparatus 50 further includes:

a third obtaining module 56, configured to obtain acoustic parameters, where the acoustic parameters are obtained by performing acoustic parameter prediction according to the markup information.

The acoustic parameters include: spectrum, fundamental frequency, and duration.

a determining module 57, configured to perform cost calculation according to the acoustic parameters and the candidate phone waveform segments, and to select the optimal phone waveform segment sequence, so as to concatenate the waveform segments in the optimal phone waveform segment sequence to obtain the synthesized speech.

In the cost calculation, the sequence with the minimum cost can be found by a dynamic programming method and taken as the optimal phone waveform segment sequence. After the optimal phone waveform segment sequence is obtained, the waveform segments in the sequence can be concatenated to obtain the synthesized speech.
In the present embodiment, preselection is performed using a machine learning model, so that various kinds of information can be considered jointly, thereby improving the preselection effect in speech synthesis. Specifically, the entire preselection process is carried out within the same framework: phone tree preselection, for example, only needs to traverse a decision tree according to the markup to complete a reliable preselection. The candidate phones obtained by this method have good consistency with the target acoustic parameters. The entire preselection process does not require a large amount of manual adjustment of thresholds and weights; thus, if the training data changes, only the preselection model needs to be retrained (for phone tree preselection, only the phone tree needs to be rebuilt). The preselection process is the traversal of a decision tree, so the computational complexity is very small; moreover, the computational complexity is unrelated to the scale of the sound library and is related only to the scale of the candidate phones corresponding to the leaf nodes, and through the MDL criterion the number of candidate phones at a leaf node can be kept relatively stable. The engineering implementation of this method is simple, clear, and easy to maintain.
It should be noted that, in the description of the present invention, the terms "first", "second", and the like are used for description purposes only and shall not be construed as indicating or implying relative importance. In addition, in the description of the present invention, unless otherwise specified, "multiple" means at least two.

Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of executable instruction code that includes one or more steps for implementing a specific logical function or process; and the scope of the preferred embodiments of the present invention includes other implementations, in which functions may be executed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order according to the functions involved, as should be understood by those of ordinary skill in the art to which the embodiments of the present invention belong.
It should be appreciated that the parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented with software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, implementation may use any one of the following techniques known in the art, or a combination thereof: discrete logic circuits having logic gate circuits for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gate circuits, programmable gate arrays (PGA), field programmable gate arrays (FPGA), and so on.

Those of ordinary skill in the art can understand that all or part of the steps carried by the methods of the above embodiments can be completed by instructing relevant hardware through a program; the program can be stored in a computer-readable storage medium, and when executed, the program performs one of the steps of the method embodiments or a combination thereof.

In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, or each unit may physically exist alone, or two or more units may be integrated in one module. The above integrated module may be implemented in the form of hardware, or may be implemented in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", "some examples", and the like means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although the embodiments of the present invention have been shown and described above, it can be understood that the above embodiments are exemplary and shall not be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.
Claims (8)
1. a kind of select sound method for waveform concatenation speech synthesis characterized by comprising
Markup information is obtained, the markup information is treated after synthesis text carries out front-end processing and obtained;
Obtain pre-generated machine learning model;
Machine learning pre-selection is carried out according to the markup information and the machine learning model, obtains candidate phone waveform segment;
It is described that machine learning pre-selection is carried out according to the markup information and the machine learning model, obtain candidate phone corrugated sheet
It is disconnected, comprising:
According to the corresponding markup information of the text to be synthesized, corresponding each phone traverses the corresponding phone tree of the phone, obtains
Take the associated HMM of leaf node of the phone tree;
According to the corresponding relationship of the HMM and waveform segment, corrugated sheet corresponding with the associated HMM of the leaf node is obtained
It is disconnected, the waveform segment is determined as to obtain candidate phone waveform segment.
2. The method according to claim 1, wherein when the machine learning model is a phone tree, the method further comprises:
obtaining markup information of phone samples and waveform segments of the phone samples, training HMMs according to the markup information of the phone samples, and establishing the correspondence between the HMMs and the waveform segments; and
for each phone, performing decision-tree clustering on the HMMs corresponding to that phone to obtain the phone tree corresponding to that phone.
3. The method according to claim 2, wherein each non-leaf node in the phone tree corresponds to an optimal splitting question, and each leaf node is associated with one or more HMMs.
4. The method according to claim 3, wherein the optimal splitting question is the question that maximizes the increment in log-likelihood before and after splitting; splitting stops when the log-likelihood increment before and after splitting is less than a preset threshold, the preset threshold being determined according to the MDL criterion.
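The MDL-based stopping rule of claims 3 and 4 can be illustrated numerically: a split is accepted only while the log-likelihood gain exceeds a threshold that grows with the number of extra model parameters and the amount of training data. The threshold form below follows the common MDL formulation used in decision-tree state clustering and is an assumption for illustration, not the patent's exact formula.

```python
import math

def mdl_threshold(delta_params, n_frames, weight=1.0):
    """Common MDL penalty: 0.5 * (extra parameters) * log(data size)."""
    return weight * 0.5 * delta_params * math.log(n_frames)

def accept_split(loglik_gain, delta_params, n_frames):
    """Accept a split only if its log-likelihood gain beats the MDL penalty."""
    return loglik_gain > mdl_threshold(delta_params, n_frames)

# Example: a split adding 39 Gaussian parameters, trained on 100000 frames.
# Threshold = 0.5 * 39 * log(100000) ~ 224.5
print(accept_split(300.0, 39, 100_000))  # True: gain exceeds the penalty, keep splitting
print(accept_split(100.0, 39, 100_000))  # False: gain below threshold, stop here
```

The `weight` factor lets the tree size be tuned: a larger weight stops splitting earlier and yields broader leaves with more candidate HMMs each.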
5. The method according to claim 1, further comprising:
obtaining acoustic parameters, the acoustic parameters being obtained by performing acoustic parameter prediction according to the markup information; and
performing cost calculation according to the acoustic parameters and the candidate phone waveform segments, selecting an optimal phone waveform segment sequence, and splicing the waveform segments in the optimal phone waveform segment sequence to obtain synthesized speech.
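The cost calculation in claim 5 is the standard unit-selection search: a target cost compares each candidate segment's acoustic parameters to the predicted ones, a concatenation cost penalizes mismatched joins, and dynamic programming picks the minimum-total-cost segment sequence. The sketch below uses scalar "acoustic parameters" and absolute-difference costs purely for illustration; real systems use multidimensional spectral and prosodic features.

```python
def select_sequence(predicted, candidates):
    """Viterbi-style search over per-phone candidate segments.
    predicted: list of target acoustic values, one per phone.
    candidates: list (per phone) of (segment_id, acoustic_value) pairs."""
    # cost[i][j]: best total cost of any path ending at candidate j of phone i
    cost = [[abs(v - predicted[0]) for _, v in candidates[0]]]
    back = []
    for i in range(1, len(candidates)):
        row, ptr = [], []
        for seg_id, v in candidates[i]:
            target = abs(v - predicted[i])  # target cost vs. predicted parameter
            # Concatenation cost: mismatch at the join with each predecessor.
            best = min(range(len(candidates[i - 1])),
                       key=lambda k: cost[-1][k] + abs(candidates[i - 1][k][1] - v))
            row.append(cost[-1][best] + abs(candidates[i - 1][best][1] - v) + target)
            ptr.append(best)
        cost.append(row)
        back.append(ptr)
    # Trace back the minimum-cost path.
    j = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j][0] for i, j in enumerate(path)]

predicted = [1.0, 2.0]
candidates = [[("s1", 0.9), ("s2", 3.0)], [("s3", 2.1), ("s4", 0.0)]]
print(select_sequence(predicted, candidates))  # ['s1', 's3']
```

Because the machine-learning pre-selection of claim 1 has already pruned each phone's candidate list, this search runs over far fewer segments than a search over the full unit inventory would.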
6. A sound selection device for waveform concatenation speech synthesis, comprising:
a first obtaining module, configured to obtain markup information, the markup information being obtained after performing front-end processing on a text to be synthesized;
a second obtaining module, configured to obtain a pre-generated machine learning model; and
a pre-selection module, configured to perform machine learning pre-selection according to the markup information and the machine learning model to obtain candidate phone waveform segments;
wherein the pre-selection module is specifically configured to:
for each phone corresponding to the markup information of the text to be synthesized, traverse the phone tree corresponding to that phone to obtain the HMM associated with a leaf node of the phone tree; and
obtain, according to the correspondence between HMMs and waveform segments, the waveform segments corresponding to the HMM associated with the leaf node, and determine those waveform segments as the candidate phone waveform segments.
7. The device according to claim 6, wherein when the machine learning model is a phone tree, the device further comprises:
a modeling module, configured to obtain markup information of phone samples and waveform segments of the phone samples, train HMMs according to the markup information of the phone samples, and establish the correspondence between the HMMs and the waveform segments; and
a clustering module, configured to, for each phone, perform decision-tree clustering on the HMMs corresponding to that phone to obtain the phone tree corresponding to that phone.
8. The device according to claim 6, further comprising:
a third obtaining module, configured to obtain acoustic parameters, the acoustic parameters being obtained by performing acoustic parameter prediction according to the markup information; and
a determining module, configured to perform cost calculation according to the acoustic parameters and the candidate phone waveform segments, select an optimal phone waveform segment sequence, and splice the waveform segments in the optimal phone waveform segment sequence to obtain synthesized speech.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610035220.7A CN105719641B (en) | 2016-01-19 | 2016-01-19 | Sound method and apparatus are selected for waveform concatenation speech synthesis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105719641A CN105719641A (en) | 2016-06-29 |
CN105719641B true CN105719641B (en) | 2019-07-30 |
Family
ID=56147931
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610035220.7A Active CN105719641B (en) | 2016-01-19 | 2016-01-19 | Sound method and apparatus are selected for waveform concatenation speech synthesis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105719641B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110047463B (en) * | 2019-01-31 | 2021-03-02 | 北京捷通华声科技股份有限公司 | Voice synthesis method and device and electronic equipment |
CN112151009B (en) * | 2020-09-27 | 2024-06-25 | 平安科技(深圳)有限公司 | Voice synthesis method and device based on prosody boundary, medium and equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1835075A (en) * | 2006-04-07 | 2006-09-20 | 安徽中科大讯飞信息科技有限公司 | Speech synthesis method combining natural sample selection and acoustic parameter modeling |
CN103531196A (en) * | 2013-10-15 | 2014-01-22 | 中国科学院自动化研究所 | Sound selection method for waveform concatenation speech synthesis |
CN105206264A (en) * | 2015-09-22 | 2015-12-30 | 百度在线网络技术(北京)有限公司 | Speech synthesis method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5722295B2 (en) * | 2012-11-12 | 2015-05-20 | 日本電信電話株式会社 | Acoustic model generation method, speech synthesis method, apparatus and program thereof |
2016-01-19: Application CN201610035220.7A filed in China; granted as CN105719641B (status: Active)
Non-Patent Citations (3)
Title |
---|
Song Yang, "Research on Unit Selection Speech Synthesis Based on Statistical Acoustic Modeling," China Masters' Theses Full-text Database, Information Science and Technology, No. 10, 2014-10-15, pp. 11-18 |
Qi Yaohui et al., "Optimization of Triphone Models in Mandarin Continuous Speech Recognition Systems," Application Research of Computers, Vol. 30, No. 10, 2013-10, pp. 2920-2922 |
Yu Yansuo et al., "Research on Speech Synthesis Methods for Large Corpora," Acta Scientiarum Naturalium Universitatis Pekinensis (Journal of Peking University, Natural Science Edition), Vol. 50, No. 5, 2014-09, pp. 791-796 |
Also Published As
Publication number | Publication date |
---|---|
CN105719641A (en) | 2016-06-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP2179414B1 (en) | Synthesis by generation and concatenation of multi-form segments | |
CN101490740B (en) | Audio combining device | |
JP4469883B2 (en) | Speech synthesis method and apparatus | |
US8386256B2 (en) | Method, apparatus and computer program product for providing real glottal pulses in HMM-based text-to-speech synthesis | |
CN105206258A (en) | Generation method and device of acoustic model as well as voice synthetic method and device | |
CN104916284A (en) | Prosody and acoustics joint modeling method and device for voice synthesis system | |
CN110459202A (en) | A kind of prosodic labeling method, apparatus, equipment, medium | |
CN106057192A (en) | Real-time voice conversion method and apparatus | |
CN108172211B (en) | Adjustable waveform splicing system and method | |
KR20170107683A (en) | Text-to-Speech Synthesis Method using Pitch Synchronization in Deep Learning Based Text-to-Speech Synthesis System | |
CN105206264B (en) | Phoneme synthesizing method and device | |
CN112185341A (en) | Dubbing method, apparatus, device and storage medium based on speech synthesis | |
CN101887719A (en) | Speech synthesis method, system and mobile terminal equipment with speech synthesis function | |
CN105719641B (en) | Sound method and apparatus are selected for waveform concatenation speech synthesis | |
Toman et al. | Unsupervised and phonologically controlled interpolation of Austrian German language varieties for speech synthesis | |
Mizutani et al. | Concatenative speech synthesis based on the plural unit selection and fusion method | |
JP2013164609A (en) | Singing synthesizing database generation device, and pitch curve generation device | |
JP4274852B2 (en) | Speech synthesis method and apparatus, computer program and information storage medium storing the same | |
WO2008056604A1 (en) | Sound collection system, sound collection method, and collection processing program | |
JP3281281B2 (en) | Speech synthesis method and apparatus | |
WO2012032748A1 (en) | Audio synthesizer device, audio synthesizer method, and audio synthesizer program | |
CN115273806A (en) | Song synthesis model training method and device and song synthesis method and device | |
JP5935545B2 (en) | Speech synthesizer | |
JP2011141470A (en) | Phoneme information-creating device, voice synthesis system, voice synthesis method and program | |
JP4826493B2 (en) | Speech synthesis dictionary construction device, speech synthesis dictionary construction method, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||