
CN105719641B - Unit selection method and apparatus for waveform concatenation speech synthesis - Google Patents

Unit selection method and apparatus for waveform concatenation speech synthesis

Info

Publication number
CN105719641B
CN105719641B (application CN201610035220.7A)
Authority
CN
China
Prior art keywords
phone
hmm
markup information
waveform
machine learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610035220.7A
Other languages
Chinese (zh)
Other versions
CN105719641A (en)
Inventor
张辉
李秀林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201610035220.7A priority Critical patent/CN105719641B/en
Publication of CN105719641A publication Critical patent/CN105719641A/en
Application granted granted Critical
Publication of CN105719641B publication Critical patent/CN105719641B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 - Prosody rules derived from text; Stress or intonation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 - Voice editing, e.g. manipulating the voice of the synthesiser

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention proposes a unit selection method and apparatus for waveform concatenation speech synthesis. The method comprises: obtaining markup information, the markup information being obtained after front-end processing of the text to be synthesized; obtaining a pre-generated machine learning model; and performing machine-learning preselection according to the markup information and the machine learning model to obtain candidate phone waveform segments. The method can improve the preselection effect in speech synthesis.

Description

Unit selection method and apparatus for waveform concatenation speech synthesis
Technical field
The present invention relates to the field of speech synthesis technology, and in particular to a unit selection method and apparatus for waveform concatenation speech synthesis.
Background art
Speech synthesis, also known as text-to-speech (TTS) technology, mainly solves the problem of converting text information into audible acoustic information.
In speech synthesis, the input text first undergoes front-end processing, and acoustic parameter prediction is then performed to obtain acoustic parameters. Finally, either a vocoder synthesizes audio directly from the acoustic parameters, or a unit selection module selects waveform segments from the speech corpus for waveform concatenation. Compared with vocoder-synthesized speech, speech synthesized by waveform concatenation has higher sound quality and better preserves the style of the original speaker.
In constructing a speech synthesis system based on waveform concatenation, the related art usually first obtains candidate phone waveform segments according to the markup information, and then performs a series of preselection steps on the candidates, including duration preselection, prosodic position preselection, context preselection, Kullback-Leibler divergence (KLD) preselection, neighbor preselection, and the like. The optimal phone waveform segment sequence is then selected from the preselected segments, and the synthesized speech is obtained by concatenating the segments of that sequence.
The above related-art scheme may have the following problems:
(1) The preselection steps are mutually independent and do not integrate all the available information, so it is difficult to obtain a good preselection effect;
(2) The preselection steps require tuning thresholds and weights. This tuning demands a large amount of careful manual work and easily attends to one aspect at the expense of another; moreover, after the thresholds and weights have been tuned for one speech corpus, switching to another corpus generally requires re-tuning these parameters;
(3) Multiple preselection steps are needed, and the amount of computation is large (especially for KLD preselection);
(4) The engineering implementation of this approach is relatively complicated and involves maintaining a large number of parameters; the code complexity is high and the code is hard to maintain.
Summary of the invention
The present invention aims to solve at least one of the technical problems in the related art.
To this end, one object of the present invention is to provide a unit selection method for waveform concatenation speech synthesis; this method can improve the effect of phone preselection in speech synthesis.
Another object of the present invention is to provide a unit selection apparatus for waveform concatenation speech synthesis.
To achieve the above objects, the unit selection method for waveform concatenation speech synthesis proposed by the embodiment of the first aspect of the present invention comprises: obtaining markup information, the markup information being obtained after front-end processing of the text to be synthesized; obtaining a pre-generated machine learning model; and performing machine-learning preselection according to the markup information and the machine learning model to obtain candidate phone waveform segments.
The unit selection method for waveform concatenation speech synthesis proposed by the embodiment of the first aspect performs preselection with a machine learning model, so that various kinds of information can be considered together, thereby improving the preselection effect in speech synthesis.
To achieve the above objects, the unit selection apparatus for waveform concatenation speech synthesis proposed by the embodiment of the second aspect of the present invention comprises: a first obtaining module, configured to obtain markup information, the markup information being obtained after front-end processing of the text to be synthesized; a second obtaining module, configured to obtain a pre-generated machine learning model; and a preselection module, configured to perform machine-learning preselection according to the markup information and the machine learning model to obtain candidate phone waveform segments.
The unit selection apparatus for waveform concatenation speech synthesis proposed by the embodiment of the second aspect performs preselection with a machine learning model, so that various kinds of information can be considered together, thereby improving the preselection effect in speech synthesis.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and will in part become apparent from the description or be learned by practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and easily understood from the following description of the embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a schematic flowchart of a unit selection method for waveform concatenation speech synthesis proposed by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a unit selection method for waveform concatenation speech synthesis proposed by another embodiment of the present invention;
Fig. 3 is a schematic diagram of a phone tree in an embodiment of the present invention;
Fig. 4 is a schematic flowchart of a speech synthesis method in an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a unit selection apparatus for waveform concatenation speech synthesis proposed by another embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a unit selection apparatus for waveform concatenation speech synthesis proposed by yet another embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar modules, or modules having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and shall not be construed as limiting the present invention. On the contrary, the embodiments of the present invention include all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
Fig. 1 is a schematic flowchart of a unit selection method for waveform concatenation speech synthesis proposed by an embodiment of the present invention. Referring to Fig. 1, the method comprises:
S11: obtaining markup information, the markup information being obtained after front-end processing of the text to be synthesized.
Front-end processing mainly includes: preprocessing, word segmentation, part-of-speech tagging, phonetic transcription, prosodic hierarchy prediction, and the like.
Markup information mainly includes: the contextual information of each phone, prosodic position information, tone information, and the like.
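As a concrete but purely illustrative sketch of such markup information, a single phone's record could be held in a small dictionary; the field names below are our own assumption, since the patent does not fix a data format:

```python
# Hypothetical markup record for one phone; the field names are
# illustrative assumptions, not defined by the patent.
markup = {
    "phone": "ai4",                      # current phone (a toned final)
    "left_phone": "k",                   # contextual info: left neighbour
    "right_phone": "b",                  # contextual info: right neighbour
    "prosodic_position": "word_final",   # prosodic position information
    "tone": 4,                           # tone information
}

# The markup for a whole utterance is then a sequence of such records.
utterance_markup = [markup]
print(utterance_markup[0]["phone"])  # → ai4
```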
S12: obtaining a pre-generated machine learning model.
Optionally, the machine learning model may be a phone tree or a deep neural network model.
The machine learning model may be generated in a training stage by training on the markup information of phone samples and on speech data.
In this embodiment, the machine learning model is taken to be a phone tree as an example.
Correspondingly, referring to Fig. 2, in some embodiments the method further comprises:
S21: obtaining the markup information of phone samples and the waveform segments of the phone samples; training hidden Markov models (Hidden Markov Model, HMM) according to the markup information of the phone samples; and establishing a correspondence between HMMs and waveform segments.
When training the HMMs, the HMM-based Speech Synthesis System (HMM-based Speech Synthesis System, HTS) built on the Hidden Markov Model Toolkit (HMM Tool Kit, HTK) may be used.
After training, each phone sample in the training data corresponds to one HMM, and each HMM is named by its markup information.
In the training data, one phone generally corresponds to one HMM; in rare cases one HMM may correspond to multiple phones.
For example, for the final phone ai4, the names of the corresponding HMM acoustic models may be represented simply as: k-ai+b, t-ai+h, s-ai+n, and so on. It should be understood that a complete HMM name, i.e. a phone label, also contains a large amount of other information.
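The simplified labels above follow a left-center+right pattern. As a small illustration (the helper below is our own, not part of the patent), such a label can be parsed back into its context fields:

```python
def parse_hmm_name(name: str) -> tuple:
    """Split a simplified HMM label such as 'k-ai+b' into
    (left context, center phone, right context)."""
    left, rest = name.split("-", 1)
    center, right = rest.split("+", 1)
    return (left, center, right)

# All three example labels share the same center phone "ai".
for label in ("k-ai+b", "t-ai+h", "s-ai+n"):
    print(parse_hmm_name(label))
# → ('k', 'ai', 'b'), ('t', 'ai', 'h'), ('s', 'ai', 'n')
```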
S22: for each phone, performing decision-tree clustering on the HMMs corresponding to the phone to obtain the phone tree corresponding to the phone.
For a specific phone, e.g. "ai4", decision-tree clustering is performed using all of its HMMs.
Through decision-tree clustering, in the constructed phone tree each non-leaf node corresponds to an optimal splitting question, and each leaf node is associated with a subset of the HMMs.
Markup information such as prosodic position and context is used during clustering.
At the beginning, all HMMs are at the root node. The question that maximizes the log-likelihood gain between before and after splitting is then selected as the optimal splitting question, and the HMMs associated with the root node are split into two parts; the child nodes then continue to split in the same way. Splitting stops when the log-likelihood gain of a split is less than a certain threshold, where the threshold is determined by the minimum description length (MDL) criterion.
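The greedy top-down splitting described above can be sketched as follows. For brevity each HMM is reduced to a single scalar statistic scored under a one-dimensional Gaussian fit; the question set, data layout, and fixed stopping threshold are illustrative stand-ins for real HTS clustering, where the threshold would come from the MDL criterion:

```python
import math

def gaussian_loglik(values):
    """Log-likelihood of 1-D values under their own ML Gaussian fit
    (a stand-in for the HMM-state statistics used in real clustering)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n + 1e-9
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def best_split(items, questions):
    """Pick the yes/no question with maximal log-likelihood gain."""
    base = gaussian_loglik([f for _, f in items])
    best = None
    for q in questions:
        yes = [f for ctx, f in items if q(ctx)]
        no = [f for ctx, f in items if not q(ctx)]
        if not yes or not no:
            continue  # question does not actually partition the data
        gain = gaussian_loglik(yes) + gaussian_loglik(no) - base
        if best is None or gain > best[0]:
            best = (gain, q)
    return best

def grow_tree(items, questions, threshold):
    """Greedy top-down splitting; stop when the gain drops below the
    threshold (which plays the role of the MDL-derived threshold)."""
    split = best_split(items, questions)
    if split is None or split[0] < threshold:
        return {"leaf": items}
    _, q = split
    yes_items = [(c, f) for c, f in items if q(c)]
    no_items = [(c, f) for c, f in items if not q(c)]
    return {"question": q.__name__,
            "yes": grow_tree(yes_items, questions, threshold),
            "no": grow_tree(no_items, questions, threshold)}

# Toy data: (context, feature) pairs; the feature stands in for e.g.
# a mean F0 statistic of the HMM.
def left_is_vowel(ctx):  # one candidate splitting question
    return ctx["left"] in {"a", "ai", "o"}

items = [({"left": "a"}, 1.0), ({"left": "ai"}, 1.1),
         ({"left": "k"}, 5.0), ({"left": "t"}, 5.2)]
tree = grow_tree(items, [left_is_vowel], threshold=0.0)
print("question" in tree)  # → True
```

The split separates the low-feature HMMs (vowel left context) from the high-feature ones, exactly the behavior the log-likelihood gain criterion rewards.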
A phone-tree clustering process may be as shown in Fig. 3. Each non-leaf node in Fig. 3 corresponds to an optimal splitting question, and each leaf node is associated with a subset of the HMMs.
In Fig. 3, "L" and "R" denote the phones to the left and right of the current phone respectively, "voice" denotes a final (vowel), "silence" denotes silence, and "w" and "g" denote two specific phones.
For example, in Fig. 3 the optimal splitting question used at the root node is whether the phone to the left of the current phone is a final (L=voice?).
S13: performing machine-learning preselection according to the markup information and the machine learning model to obtain candidate phone waveform segments.
Taking the phone tree as an example, this may specifically include:
according to the markup information corresponding to the text to be synthesized, for each phone, traversing the phone tree corresponding to that phone to obtain the HMMs associated with a leaf node of the phone tree;
according to the correspondence between HMMs and waveform segments, obtaining the waveform segments corresponding to the HMMs associated with the leaf node, and determining those waveform segments as the candidate phone waveform segments.
For example, in the synthesis stage, after the markup information corresponding to the text to be synthesized is obtained, a leaf node can be found according to that markup information and the splitting rules of the phone tree shown in Fig. 3, and the HMMs can then be obtained from that leaf node. Specifically, a leaf node may record the name of each HMM, and the name of each HMM corresponds to the markup information of a phone at training time.
From the above description it can be seen that phone-tree preselection yields only a very small part of all phone waveform segments (the part corresponding to some leaf node); moreover, in theory the fundamental frequency and spectral properties of the phone candidates corresponding to a leaf node have good consistency with the target acoustic parameters predicted by the acoustic model.
In some embodiments, a speech synthesis method using the preselection scheme of this embodiment, referring to Fig. 4, may comprise:
S41: performing front-end processing on the text to be synthesized to obtain the markup information.
Front-end processing is, for example, preprocessing, word segmentation, part-of-speech tagging, phonetic transcription, prosody prediction, and the like.
S42: performing machine-learning preselection according to the markup information and the pre-generated machine learning model.
The machine learning model is, for example, a phone tree; for the phone-tree preselection procedure, refer to the relevant description in the flow above, which is not repeated here.
S43: performing acoustic parameter prediction according to the markup information to obtain acoustic parameters.
The acoustic parameters include: spectrum, fundamental frequency, and duration.
S44: performing cost calculation according to the acoustic parameters and the candidate phone waveform segments obtained after preselection, and selecting the optimal phone waveform segment sequence.
In the cost calculation, the sequence with the minimum cost can be found by dynamic programming and taken as the optimal phone waveform segment sequence.
After the optimal phone waveform segment sequence is obtained, the waveform segments in that sequence can be concatenated to obtain the synthesized speech.
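The dynamic-programming search can be sketched as a standard Viterbi-style pass over the per-phone candidate lists; the scalar target and concatenation costs below are simple placeholders, not the patent's actual cost definitions:

```python
def viterbi_select(candidates_per_phone, targets, target_cost, concat_cost):
    """Find the minimum-total-cost path through per-phone candidate lists.
    candidates_per_phone[i]: candidate segments for phone i;
    targets[i]: predicted acoustic target for phone i."""
    n = len(candidates_per_phone)
    # cost[i][j]: best total cost ending at candidate j of phone i
    cost = [[target_cost(c, targets[0]) for c in candidates_per_phone[0]]]
    back = [[-1] * len(candidates_per_phone[0])]
    for i in range(1, n):
        row, brow = [], []
        for c in candidates_per_phone[i]:
            prev = candidates_per_phone[i - 1]
            best = min(range(len(prev)),
                       key=lambda j: cost[i - 1][j] + concat_cost(prev[j], c))
            row.append(cost[i - 1][best] + concat_cost(prev[best], c)
                       + target_cost(c, targets[i]))
            brow.append(best)
        cost.append(row)
        back.append(brow)
    # Backtrace from the cheapest final candidate.
    j = min(range(len(cost[-1])), key=lambda k: cost[-1][k])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates_per_phone[i][p] for i, p in enumerate(path)]

# Toy example: candidates and targets are scalar "acoustic features".
tc = lambda c, t: (c - t) ** 2          # target cost placeholder
cc = lambda a, b: abs(a - b) * 0.1      # concatenation cost placeholder
segs = viterbi_select([[1.0, 5.0], [1.2, 4.8]], [1.0, 1.0], tc, cc)
print(segs)  # → [1.0, 1.2]
```

As in the patent, the search trades off closeness to the predicted targets against smoothness at the concatenation points, and the candidates it ranks are exactly those surviving preselection.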
In this embodiment, preselection is performed with a machine learning model, so various kinds of information can be considered together, thereby improving the preselection effect in speech synthesis. Specifically, the entire preselection is carried out within a single framework: for phone-tree preselection, for example, reliable preselection can be completed simply by traversing the decision tree according to the markup. The candidate phones obtained by this method have good consistency with the target acoustic parameters. The whole preselection process does not require a large amount of manual threshold and weight adjustment; thus, if the training data changes, only the preselection model needs to be retrained (for phone-tree preselection, only the phone tree needs to be rebuilt). The preselection process is the traversal of a decision tree, whose computational complexity is very small; moreover, the complexity is independent of the corpus size and depends only on the number of candidate phones at the leaf nodes, and through the MDL criterion the number of candidate phones at a leaf node can remain relatively stable. The engineering implementation of this method is simple, clear, and easy to maintain.
Fig. 5 is a schematic structural diagram of a unit selection apparatus for waveform concatenation speech synthesis proposed by another embodiment of the present invention. Referring to Fig. 5, the apparatus 50 includes: a first obtaining module 51, a second obtaining module 52, and a preselection module 53.
The first obtaining module 51 is configured to obtain markup information, the markup information being obtained after front-end processing of the text to be synthesized.
Front-end processing mainly includes: preprocessing, word segmentation, part-of-speech tagging, phonetic transcription, prosodic hierarchy prediction, and the like.
Markup information mainly includes: the contextual information of each phone, prosodic position information, tone information, and the like.
The second obtaining module 52 is configured to obtain a pre-generated machine learning model.
Optionally, the machine learning model may be a phone tree or a deep neural network model.
The machine learning model may be generated in a training stage by training on the markup information of phone samples and on speech data.
The preselection module 53 is configured to perform machine-learning preselection according to the markup information and the machine learning model to obtain the preselected phones.
In this embodiment, the machine learning model is taken to be a phone tree as an example.
In some embodiments, referring to Fig. 6, the apparatus 50 further includes:
a modeling module 54, configured to obtain the markup information of phone samples and the waveform segments of the phone samples, train HMMs according to the markup information of the phone samples, and establish the correspondence between HMMs and waveform segments.
When training the HMMs, the HMM-based Speech Synthesis System (HMM-based Speech Synthesis System, HTS) built on the Hidden Markov Model Toolkit (HMM Tool Kit, HTK) may be used.
After training, each phone sample in the training data corresponds to one HMM, and each HMM is named by its markup information.
In the training data, one phone generally corresponds to one HMM; in rare cases one HMM may correspond to multiple phones.
For example, for the final phone ai4, the names of the corresponding HMM acoustic models may be represented simply as: k-ai+b, t-ai+h, s-ai+n, and so on. It should be understood that a complete HMM name, i.e. a phone label, also contains a large amount of other information.
The apparatus further includes a clustering module 55, configured to, for each phone, perform decision-tree clustering on the HMMs corresponding to the phone to obtain the phone tree corresponding to the phone.
For a specific phone, e.g. "ai4", decision-tree clustering is performed using all of its HMMs.
Through decision-tree clustering, in the constructed phone tree each non-leaf node corresponds to an optimal splitting question, and each leaf node is associated with a subset of the HMMs.
The questions used during clustering are based on markup information such as prosodic position and context.
At the beginning, all HMMs are at the root node. The question that maximizes the log-likelihood gain between before and after splitting is then selected as the optimal splitting question, and the HMMs associated with the root node are split into two parts; the child nodes then continue to split in the same way. Splitting stops when the log-likelihood gain of a split is less than a certain threshold, where the threshold is determined by the minimum description length (MDL) criterion.
A phone-tree clustering process may be as shown in Fig. 3. Each non-leaf node in Fig. 3 corresponds to an optimal splitting question, and each leaf node is associated with a subset of the HMMs.
In Fig. 3, "L" and "R" denote the phones to the left and right of the current phone respectively, "voice" denotes a final (vowel), "silence" denotes silence, and "w" and "g" denote two specific phones.
For example, in Fig. 3 the optimal splitting question used at the root node is whether the phone to the left of the current phone is a final (L=voice?).
Optionally, the preselection module 53 is specifically configured to:
according to the markup information corresponding to the text to be synthesized, for each phone, traverse the phone tree corresponding to that phone to obtain the HMMs associated with a leaf node of the phone tree;
according to the correspondence between HMMs and waveform segments, obtain the waveform segments corresponding to the HMMs associated with the leaf node, and determine those waveform segments as the candidate phone waveform segments.
For example, in the synthesis stage, after the markup information corresponding to the text to be synthesized is obtained, a leaf node can be found according to that markup information and the splitting rules of the phone tree shown in Fig. 3, and the HMMs can then be obtained from that leaf node. Specifically, a leaf node may record the name of each HMM, and the name of each HMM corresponds to the markup information of a phone at training time.
From the above description it can be seen that phone-tree preselection yields only a very small part of all phone waveform segments (the part corresponding to some leaf node); moreover, in theory the fundamental frequency and spectral properties of the phone candidates corresponding to a leaf node have good consistency with the target acoustic parameters predicted by the acoustic model.
In some embodiments, referring to Fig. 6, the apparatus 50 further includes:
a third obtaining module 56, configured to obtain acoustic parameters, the acoustic parameters being obtained after acoustic parameter prediction is performed according to the markup information;
wherein the acoustic parameters include: spectrum, fundamental frequency, and duration;
a determining module 57, configured to perform cost calculation according to the acoustic parameters and the candidate phone waveform segments, select the optimal phone waveform segment sequence, and concatenate the waveform segments in the optimal phone waveform segment sequence to obtain the synthesized speech.
In the cost calculation, the sequence with the minimum cost can be found by dynamic programming and taken as the optimal phone waveform segment sequence.
After the optimal phone waveform segment sequence is obtained, the waveform segments in that sequence can be concatenated to obtain the synthesized speech.
In this embodiment, preselection is performed with a machine learning model, so various kinds of information can be considered together, thereby improving the preselection effect in speech synthesis. Specifically, the entire preselection is carried out within a single framework: for phone-tree preselection, for example, reliable preselection can be completed simply by traversing the decision tree according to the markup. The candidate phones obtained by this method have good consistency with the target acoustic parameters. The whole preselection process does not require a large amount of manual threshold and weight adjustment; thus, if the training data changes, only the preselection model needs to be retrained (for phone-tree preselection, only the phone tree needs to be rebuilt). The preselection process is the traversal of a decision tree, whose computational complexity is very small; moreover, the complexity is independent of the corpus size and depends only on the number of candidate phones at the leaf nodes, and through the MDL criterion the number of candidate phones at a leaf node can remain relatively stable. The engineering implementation of this method is simple, clear, and easy to maintain.
It should be noted that in the description of the present invention the terms "first", "second", and the like are used for descriptive purposes only and shall not be construed as indicating or implying relative importance. In addition, in the description of the present invention, unless otherwise specified, "a plurality of" means at least two.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of executable instruction code comprising one or more steps for implementing a specific logical function or process; and the scope of the preferred embodiments of the present invention includes other implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
It should be understood that each part of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one of the following technologies known in the art, or a combination thereof, may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit with suitable combinational logic gate circuits, a programmable gate array (PGA), a field programmable gate array (FPGA), and so on.
Those of ordinary skill in the art can understand that all or part of the steps carried in the above embodiment methods may be completed by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and when executed, the program performs one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In the description of this specification, descriptions referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" mean that specific features, structures, materials, or characteristics described in conjunction with the embodiment or example are included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although the embodiments of the present invention have been shown and described above, it is to be understood that the above embodiments are exemplary and shall not be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.

Claims (8)

1. a kind of select sound method for waveform concatenation speech synthesis characterized by comprising
Markup information is obtained, the markup information is treated after synthesis text carries out front-end processing and obtained;
Obtain pre-generated machine learning model;
Machine learning pre-selection is carried out according to the markup information and the machine learning model, obtains candidate phone waveform segment;
It is described that machine learning pre-selection is carried out according to the markup information and the machine learning model, obtain candidate phone corrugated sheet It is disconnected, comprising:
According to the corresponding markup information of the text to be synthesized, corresponding each phone traverses the corresponding phone tree of the phone, obtains Take the associated HMM of leaf node of the phone tree;
According to the corresponding relationship of the HMM and waveform segment, corrugated sheet corresponding with the associated HMM of the leaf node is obtained It is disconnected, the waveform segment is determined as to obtain candidate phone waveform segment.
2. the method according to claim 1, wherein when the machine learning model is phone tree, the side Method further include:
The markup information of phone sample and the waveform segment of phone sample are obtained, and according to the markup information of the phone sample, Training obtains HMM, and, establish the corresponding relationship of HMM Yu waveform segment;
Corresponding each phone, HMM corresponding to the phone carry out decision tree-based clustering, obtain the corresponding phone tree of the phone.
3. according to the method described in claim 2, it is characterized in that, each non-leaf nodes is one corresponding in the phone tree Optimal fragmentation problem, each leaf node are associated with one or more HMM.
4. The method according to claim 3, wherein the optimal splitting question is the question that maximizes the increment of the log-likelihood value before and after splitting, wherein splitting stops when the log-likelihood increment before and after splitting is less than a preset threshold, and the preset threshold is determined according to the MDL (Minimum Description Length) criterion.
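The split-and-stop rule of claims 3-4 can be sketched as follows. This is an illustrative assumption based on the common form of the MDL penalty (parameter-count cost scaled by the log of the data size); the patent does not give the exact formula, and the `alpha` factor and function names are hypothetical.

```python
import math

def mdl_threshold(num_new_params, num_frames, alpha=1.0):
    """Illustrative MDL penalty for one split: the description-length
    cost of the extra parameters, alpha * (#new params)/2 * log(#frames)."""
    return alpha * 0.5 * num_new_params * math.log(num_frames)

def choose_split(candidate_gains, num_new_params, num_frames):
    """Pick the question with the largest log-likelihood gain; return
    None (i.e., stop splitting) if even the best gain falls below the
    MDL-derived threshold."""
    best_q = max(candidate_gains, key=candidate_gains.get)
    if candidate_gains[best_q] < mdl_threshold(num_new_params, num_frames):
        return None
    return best_q
```

With this rule the tree stops growing automatically: no manually tuned depth limit is needed, because the threshold scales with both model complexity and the amount of training data.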
5. The method according to claim 1, further comprising:
obtaining acoustic parameters, wherein the acoustic parameters are obtained by performing acoustic parameter prediction according to the markup information;
performing cost calculation according to the acoustic parameters and the candidate phone waveform segments to select an optimal phone waveform segment sequence, and concatenating the waveform segments in the optimal phone waveform segment sequence to obtain synthesized speech.
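The cost calculation and optimal-sequence selection of claim 5 is, in unit-selection systems generally, a dynamic-programming (Viterbi) search over the lattice of candidate segments. The sketch below is illustrative only: the actual target-cost and concatenation-cost functions (e.g., distances between a segment's features and the predicted acoustic parameters) are left as caller-supplied callables, since the patent does not specify them at this level.

```python
def select_optimal_sequence(candidates, target_cost, concat_cost):
    """Viterbi search over the candidate lattice: total cost is the sum
    of per-segment target costs plus concatenation costs between
    adjacent segments; returns the minimum-cost segment sequence."""
    # best[i][j] = (cumulative cost, backpointer) for the j-th candidate
    # of the i-th phone
    best = [[(target_cost(0, s), None) for s in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for s in candidates[i]:
            c, bp = min(
                (best[i - 1][k][0] + concat_cost(p, s), k)
                for k, p in enumerate(candidates[i - 1]))
            row.append((c + target_cost(i, s), bp))
        best.append(row)
    # backtrack from the cheapest final candidate
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]
```

The selected sequence would then be handed to the waveform concatenation stage; smoothing at the join points is outside the scope of this sketch.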
6. A sound selection apparatus for waveform concatenation speech synthesis, characterized by comprising:
a first obtaining module, configured to obtain markup information, wherein the markup information is obtained by performing front-end processing on a text to be synthesized;
a second obtaining module, configured to obtain a pre-generated machine learning model;
a pre-selection module, configured to perform machine learning pre-selection according to the markup information and the machine learning model to obtain candidate phone waveform segments;
wherein the pre-selection module is specifically configured to:
for each phone corresponding to the markup information of the text to be synthesized, traverse a phone tree corresponding to the phone to obtain an HMM associated with a leaf node of the phone tree;
obtain, according to a correspondence between HMMs and waveform segments, the waveform segments corresponding to the HMM associated with the leaf node, and determine the obtained waveform segments as the candidate phone waveform segments.
7. The apparatus according to claim 6, wherein, when the machine learning model is a phone tree, the apparatus further comprises:
a modeling module, configured to obtain markup information of phone samples and waveform segments of the phone samples, train HMMs according to the markup information of the phone samples, and establish a correspondence between the HMMs and the waveform segments;
a clustering module, configured to, for each phone, perform decision-tree clustering on the HMMs corresponding to the phone to obtain the phone tree corresponding to the phone.
8. The apparatus according to claim 6, further comprising:
a third obtaining module, configured to obtain acoustic parameters, wherein the acoustic parameters are obtained by performing acoustic parameter prediction according to the markup information;
a determining module, configured to perform cost calculation according to the acoustic parameters and the candidate phone waveform segments, select an optimal phone waveform segment sequence, and concatenate the waveform segments in the optimal phone waveform segment sequence to obtain synthesized speech.
CN201610035220.7A 2016-01-19 2016-01-19 Sound selection method and apparatus for waveform concatenation speech synthesis Active CN105719641B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610035220.7A CN105719641B (en) 2016-01-19 2016-01-19 Sound selection method and apparatus for waveform concatenation speech synthesis


Publications (2)

Publication Number Publication Date
CN105719641A CN105719641A (en) 2016-06-29
CN105719641B true CN105719641B (en) 2019-07-30

Family

ID=56147931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610035220.7A Active CN105719641B (en) 2016-01-19 2016-01-19 Sound selection method and apparatus for waveform concatenation speech synthesis

Country Status (1)

Country Link
CN (1) CN105719641B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047463B (en) * 2019-01-31 2021-03-02 北京捷通华声科技股份有限公司 Voice synthesis method and device and electronic equipment
CN112151009B (en) * 2020-09-27 2024-06-25 平安科技(深圳)有限公司 Voice synthesis method and device based on prosody boundary, medium and equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1835075A (en) * 2006-04-07 2006-09-20 Anhui USTC iFlytek Information Technology Co., Ltd. Speech synthesis method combining natural sample selection and acoustic parameter modeling
CN103531196A (en) * 2013-10-15 2014-01-22 Institute of Automation, Chinese Academy of Sciences Sound selection method for waveform concatenation speech synthesis
CN105206264A (en) * 2015-09-22 2015-12-30 Baidu Online Network Technology (Beijing) Co., Ltd. Speech synthesis method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5722295B2 (en) * 2012-11-12 2015-05-20 日本電信電話株式会社 Acoustic model generation method, speech synthesis method, apparatus and program thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Research on Unit Selection Speech Synthesis Methods Based on Statistical Acoustic Modeling"; Song Yang; China Master's Theses Full-text Database, Information Science and Technology; Oct. 15, 2014 (No. 10); pp. 11-18
"Optimization of Triphone Models in Chinese Continuous Speech Recognition Systems"; Qi Yaohui et al.; Application Research of Computers; Oct. 2013; Vol. 30, No. 10; pp. 2920-2922
"Research on Speech Synthesis Methods for Large Corpora"; Yu Yansuo et al.; Journal of Peking University (Natural Science Edition); Sep. 2014; Vol. 50, No. 5; pp. 791-796

Also Published As

Publication number Publication date
CN105719641A (en) 2016-06-29

Similar Documents

Publication Publication Date Title
EP2179414B1 (en) Synthesis by generation and concatenation of multi-form segments
CN101490740B (en) Audio combining device
JP4469883B2 (en) Speech synthesis method and apparatus
US8386256B2 (en) Method, apparatus and computer program product for providing real glottal pulses in HMM-based text-to-speech synthesis
CN105206258A (en) Generation method and device of acoustic model as well as voice synthetic method and device
CN104916284A (en) Prosody and acoustics joint modeling method and device for voice synthesis system
CN110459202A (en) Prosody labeling method, apparatus, device, and medium
CN106057192A (en) Real-time voice conversion method and apparatus
CN108172211B (en) Adjustable waveform splicing system and method
KR20170107683A (en) Text-to-Speech Synthesis Method using Pitch Synchronization in Deep Learning Based Text-to-Speech Synthesis System
CN105206264B (en) Speech synthesis method and device
CN112185341A (en) Dubbing method, apparatus, device and storage medium based on speech synthesis
CN101887719A (en) Speech synthesis method, system and mobile terminal equipment with speech synthesis function
CN105719641B (en) Sound selection method and apparatus for waveform concatenation speech synthesis
Toman et al. Unsupervised and phonologically controlled interpolation of Austrian German language varieties for speech synthesis
Mizutani et al. Concatenative speech synthesis based on the plural unit selection and fusion method
JP2013164609A (en) Singing synthesizing database generation device, and pitch curve generation device
JP4274852B2 (en) Speech synthesis method and apparatus, computer program and information storage medium storing the same
WO2008056604A1 (en) Sound collection system, sound collection method, and collection processing program
JP3281281B2 (en) Speech synthesis method and apparatus
WO2012032748A1 (en) Audio synthesizer device, audio synthesizer method, and audio synthesizer program
CN115273806A (en) Song synthesis model training method and device and song synthesis method and device
JP5935545B2 (en) Speech synthesizer
JP2011141470A (en) Phoneme information-creating device, voice synthesis system, voice synthesis method and program
JP4826493B2 (en) Speech synthesis dictionary construction device, speech synthesis dictionary construction method, and program

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant