
CN103531196A - Sound selection method for waveform concatenation speech synthesis - Google Patents

Sound selection method for waveform concatenation speech synthesis Download PDF

Info

Publication number
CN103531196A
CN103531196A
Authority
CN
China
Prior art keywords
primitive
obtains
candidate
syllable
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310481306.9A
Other languages
Chinese (zh)
Other versions
CN103531196B (en)
Inventor
陶建华
张冉
温正棋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Extreme Element (Hangzhou) Intelligent Technology Co., Ltd.
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation, Chinese Academy of Sciences
Priority to CN201310481306.9A
Publication of CN103531196A
Application granted
Publication of CN103531196B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a sound selection method for waveform concatenation speech synthesis. The method comprises the following steps: on the basis of the original audio, carrying out hidden-Markov-model training to obtain an acoustic model set and the corresponding feature decision trees; inputting a number of training texts and, on the basis of the feature decision trees, searching out the related acoustic models to obtain the corresponding target speech and target syllables; training a similarity classifier from the similarity between the target speech and its corresponding candidate primitives and from the likelihood of each acoustic parameter of the candidate primitives under the current acoustic model; and, for an arbitrary input text to be synthesized, rejecting the dissimilar candidate primitives with the similarity classifier, selecting the best primitive from the remaining candidates under the concatenation-cost-minimization criterion, and concatenating the selected primitives to obtain the synthetic speech. The method can synthesize speech of higher sound quality.

Description

Sound selection method for waveform concatenation speech synthesis
Technical field
The present invention relates to the field of intelligent information processing, and in particular to a sound selection method for waveform concatenation speech synthesis.
Background art
Speech is one of the main means of human information exchange, and speech synthesis technology aims to enable computers to produce continuous speech of high intelligibility and high naturalness. Early research on speech synthesis mainly adopted parametric synthesis methods; later, with the development of computer technology, waveform concatenation methods appeared. As speech corpora keep growing, the number of candidate primitives grows with them, and how to select the best primitives for a given input text and concatenate them attracts more and more attention.
The parametric speech synthesis system based on hidden Markov models and the concatenation system based on unit selection have been the mainstream speech synthesis technologies of the past decade or so. Hybrid speech synthesis systems combine the advantages of both: the acoustic models trained for the former guide unit selection, so that more suitable primitives are selected for concatenation. The sound selection method of such hybrid systems is more stable than traditional concatenation methods and requires less manual intervention, but it still has many deficiencies, mainly the following:
1. The sound selection method does not reflect the perceptual effect on the human ear: a high score under the existing selection method does not mean that the speech best suited to human hearing has been selected;
2. The sound selection method selects units by weighted summation of factors: a sub-cost is computed for each feature of a primitive, each sub-cost is given a weight, and the weighted sub-costs are summed into a total selection cost. This assumes that all factors contribute linearly to the acceptability of a primitive, which clearly does not match the facts.
Summary of the invention
To solve one or more of the above problems, the invention provides a sound selection method for waveform concatenation speech synthesis. The method incorporates human subjective auditory perception, selects the primitives best suited to the human ear, and finally concatenates them into good speech.
The sound selection method for waveform concatenation speech synthesis provided by the invention comprises the following steps:
Parameters are extracted from the original speech corpus and, combined with the corresponding text annotation, hidden Markov model training is carried out. A number of training texts are input and analyzed; the decision trees are searched for the relevant models, a parameter generation algorithm synthesizes the corresponding target speech, and the speech is cut into syllables to obtain the target syllables. The similarity between each synthesized syllable and its candidate primitives, judged by human listeners, serves as the class attribute, while the likelihood of each acoustic parameter of a candidate primitive under the current model serves as the input feature vector; from these a similarity classifier is trained. Given any text to be synthesized, the classifier rejects the dissimilar candidate primitives; from the remaining candidates the best primitive is selected under the concatenation-cost-minimization criterion, and the selected primitives are finally concatenated into the synthetic speech.
From the above technical scheme it can be seen that the sound selection method for waveform concatenation speech synthesis of the invention has the following beneficial effects:
(1) Primitives similar to a parametrically synthesized syllable share its stress and intonation, so speech selected by this criterion and concatenated is both stable and consistent;
(2) Primitives similar to a parametrically synthesized syllable are also easier to concatenate, because their features tend to agree at the boundaries; little or no smoothing is needed, which preserves the smoothness and naturalness of the original speech;
(3) Human subjective hearing is introduced into the selection, so the selected result better matches human subjective preference.
Brief description of the drawings
Fig. 1 is a flow diagram of the sound selection method for waveform concatenation speech synthesis according to an embodiment of the invention;
Fig. 2 is the acoustic model training flow according to an embodiment of the invention;
Fig. 3 is the hidden Markov model training flow diagram according to an embodiment of the invention;
Fig. 4 is the generation flow diagram of the target syllables according to an embodiment of the invention;
Fig. 5 is the classifier training flow diagram according to an embodiment of the invention;
Fig. 6 is the flow diagram of sound selection with the classifier according to an embodiment of the invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the invention clearer, the invention is described below in more detail with reference to specific embodiments and the accompanying drawings.
It should be noted that similar or identical parts use the same figure numbers throughout the drawings and description. Implementations not shown or described in the drawings are forms known to those of ordinary skill in the art. In addition, although parameters with particular values may be given herein, the parameters need not exactly equal the corresponding values; they may approximate them within acceptable error margins or design constraints.
Fig. 1 is a flow diagram of the sound selection method for waveform concatenation speech synthesis according to an embodiment of the invention. As shown in Fig. 1, the method comprises the following steps:
Step S1: based on the original audio extracted from an audio database, carry out hidden Markov model training to obtain an acoustic model set and the corresponding feature decision trees.
As shown in Fig. 2, step S1 further comprises the following steps:
Step S11: obtain the original audio in the audio database.
Step S12: extract the spectrum parameters and fundamental frequency (F0) parameters frame by frame from the original audio.
Step S12 further comprises the following steps:
Step S121: apply framing and windowing to the original audio.
Framing and windowing are audio signal processing techniques conventional in the prior art and are not repeated here.
Step S122: for every frame obtained, extract its Mel cepstral coefficients, for example with the STRAIGHT algorithm.
In an embodiment of the invention, 25th-order static Mel cepstral coefficients are extracted first, and their first-order and second-order differences are then computed, giving a final 75-dimensional Mel cepstral feature.
Step S123: compute the F0 parameters of every frame.
In an embodiment of the invention, the F0 of every frame is computed first, and its first-order and second-order differences are likewise computed, giving a final 3-dimensional F0 feature.
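To illustrate the feature layout just described, the following is a minimal Python sketch of stacking first- and second-order differences onto the static parameters. The exact difference formula is not specified in the patent, so the simple numpy gradient used here is an assumption, and the input arrays are stand-ins for real STRAIGHT output.

    import numpy as np

    def add_deltas(static):
        """Append first- and second-order differences to a (frames, dims) matrix."""
        delta = np.gradient(static, axis=0)     # first-order difference
        delta2 = np.gradient(delta, axis=0)     # second-order difference
        return np.concatenate([static, delta, delta2], axis=1)

    mcep = np.random.randn(200, 25)             # stand-in: 25 static Mel cepstral coefficients per frame
    print(add_deltas(mcep).shape)               # (200, 75): the 75-dimensional spectrum feature
    f0 = np.random.rand(200, 1)                 # stand-in: per-frame F0
    print(add_deltas(f0).shape)                 # (200, 3): the 3-dimensional F0 feature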
Step S13: synchronously annotate the text corresponding to the original audio, marking the contextual feature information of each syllable in the original audio, and at the same time segmentally label the original audio.
In an embodiment of the invention, contextual feature annotation is carried out in units of syllables, using 66-dimensional prosodic structure features and 24-dimensional pronunciation features; the annotation is done mainly by hand.
The exact boundaries in the segmental labeling are not critical, and the invention adopts the result of automatic segmentation.
Step S14: based on the spectrum and F0 parameters of the original audio, the contextual feature annotation and the segmental labeling, carry out conventional hidden Markov model training to obtain a model set covering duration, F0 and spectrum, together with a feature decision tree for each.
In this step, multi-space probability distributions are used for modeling. In an embodiment of the invention, 10-state hidden Markov models are trained for the given parameters and feature sequences. The concrete training flow is shown in Fig. 3.
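For orientation only, the following is a rough Python sketch of fitting one 10-state left-to-right HMM with the hmmlearn library. It is a simplified stand-in for what this step describes: systems of this kind model F0 with multi-space probability distributions and cluster states with decision trees, neither of which hmmlearn provides, and the feature matrix below is random placeholder data.

    import numpy as np
    from hmmlearn import hmm

    n_states = 10
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            init_params="mc",   # let fit() initialize means/covariances only
                            params="tmc",       # re-estimate transitions, means, covariances
                            n_iter=20)
    # Left-to-right topology: each state either loops or advances to the next.
    trans = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        trans[i, i] = trans[i, i + 1] = 0.5
    trans[-1, -1] = 1.0
    model.startprob_ = np.eye(n_states)[0]      # always start in the first state
    model.transmat_ = trans

    # Placeholder observations: 75-dim spectrum + 3-dim F0 per frame, two utterances.
    feats = np.random.randn(400, 78)
    model.fit(feats, lengths=[200, 200])
    print(model.score(feats[:200]))             # log-likelihood of one utterance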
Step S2: input a number of training texts, search the feature decision trees for the associated acoustic models, and from them obtain the corresponding target speech and target syllables.
As shown in Fig. 4, step S2 further comprises the following steps:
Step S21: input a number of syllable-balanced training texts; front-end text analysis, using methods such as maximum entropy, extracts the features in the text and produces the corresponding contextual feature sequence.
Maximum-entropy-based text analysis is a text analysis technique conventional in the prior art and is not repeated here.
Chinese has more than 1300 common syllables; therefore, in an embodiment of the invention, 500 syllable-balanced texts are input and passed through front-end text analysis to obtain the corresponding context attributes.
Step S22: input the contextual feature sequence into the feature decision trees to obtain the acoustic model sequence matching the current context.
In this step, according to the contextual features in the contextual feature sequence, decisions are made separately on the clustering trees for duration, F0 and spectrum parameters, yielding the corresponding acoustic model sequence and duration models.
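As an illustration of this lookup, a minimal Python sketch of walking one context clustering tree follows; the node layout, question predicates and model identifiers are hypothetical stand-ins, not the trees actually produced by the training in step S14.

    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class Node:
        question: Optional[Callable[[dict], bool]] = None   # yes/no context question
        yes: Optional["Node"] = None
        no: Optional["Node"] = None
        model_id: Optional[str] = None                      # set only at leaves

    def decide(tree, context):
        """Follow the yes/no context questions down to a leaf and return its model id."""
        node = tree
        while node.model_id is None:
            node = node.yes if node.question(context) else node.no
        return node.model_id

    # Hypothetical two-question tree for one F0 state cluster.
    tree = Node(question=lambda c: c["tone"] == 4,
                yes=Node(model_id="f0_cluster_17"),
                no=Node(question=lambda c: c["position_in_phrase"] == "final",
                        yes=Node(model_id="f0_cluster_3"),
                        no=Node(model_id="f0_cluster_8")))
    print(decide(tree, {"tone": 2, "position_in_phrase": "final"}))   # f0_cluster_3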
Step S23: based on the acoustic model sequence, obtain the target speech parameters with a parameter generation algorithm.
The target speech parameters comprise the F0 and spectrum parameters.
Step S24: based on the target speech parameters, synthesize the target sentence speech with a vocoder, and cut the target sentence speech into target syllables.
In this step, the target syllables obtained by cutting serve as the target speech for the similarity comparison.
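Since the syllable boundaries are known from the duration models (or a forced alignment), the cutting itself reduces to slicing the synthesized waveform; a minimal Python sketch, with the boundary list assumed as given:

    import numpy as np

    def cut_syllables(wave, boundaries, sr):
        """Slice a waveform into syllable segments; boundaries is a list of
        (start_sec, end_sec) per syllable, assumed known from the duration
        models or a forced alignment."""
        return [wave[int(s * sr):int(e * sr)] for s, e in boundaries]

    sr = 16000
    wave = np.random.randn(2 * sr)               # stand-in for the vocoder output
    target_syllables = cut_syllables(wave, [(0.0, 0.31), (0.31, 0.64)], sr)
    print([len(s) for s in target_syllables])    # samples per target syllable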
Step S3: train a similarity classifier from the similarity between the target speech and its corresponding candidate primitives, together with the likelihood of each acoustic parameter of the candidate primitives under the current acoustic model.
As shown in Fig. 5, step S3 further comprises the following steps:
Step S31: cut the sentences in the audio database by syllable; the segments obtained, one per syllable, are the candidate primitives. Group identical syllables into one class to build the candidate primitive library, and assign the spectrum and F0 parameters extracted in step S12 frame by frame to each candidate primitive in the library.
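A minimal Python sketch of such a candidate primitive library as a mapping from syllable label to candidates; the corpus contents and field names are hypothetical placeholders:

    from collections import defaultdict, namedtuple

    # Hypothetical record for one syllable segment cut from a database sentence.
    Syllable = namedtuple("Syllable", "label wave mcep f0 dur")

    corpus = [                                   # placeholder for the audio database
        [Syllable("ni3", wave=[0.0] * 4800, mcep=None, f0=None, dur=0.30),
         Syllable("hao3", wave=[0.0] * 5600, mcep=None, f0=None, dur=0.35)],
    ]

    primitive_db = defaultdict(list)             # syllable label -> candidate primitives
    for sentence in corpus:
        for syl in sentence:                     # one candidate primitive per syllable segment
            primitive_db[syl.label].append(syl)  # mcep/f0 hold the frame-wise parameters of step S12

    print({label: len(cands) for label, cands in primitive_db.items()})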
Step S32: bring the acoustic parameters of each primitive corresponding to each target syllable, in turn, into the context acoustic models obtained in step S22; compute the probability of the duration, F0 and spectrum of each primitive under its corresponding acoustic model; and take the set of all these probabilities as the feature set.
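A minimal Python sketch of turning those per-stream probabilities into one feature vector per candidate. The patent does not spell out the model form at this level, so the scalar Gaussian duration model and diagonal Gaussians over frames are illustrative assumptions:

    import numpy as np
    from scipy.stats import norm, multivariate_normal

    def likelihood_features(primitive, models):
        """Log-likelihoods of a candidate's duration, spectrum and F0 under the
        target-context models, concatenated into one feature vector."""
        feats = [norm.logpdf(primitive["dur"], models["dur_mean"], models["dur_std"])]
        for stream in ("mcep", "f0"):
            gauss = multivariate_normal(models[stream + "_mean"],
                                        np.diag(models[stream + "_var"]))
            feats.append(gauss.logpdf(primitive[stream]).mean())  # mean per-frame score
        return np.array(feats)

    # Placeholder candidate and target-context models (diagonal Gaussians).
    primitive = {"dur": 0.30,
                 "mcep": np.random.randn(30, 75),
                 "f0": np.random.randn(30, 3)}
    models = {"dur_mean": 0.28, "dur_std": 0.05,
              "mcep_mean": np.zeros(75), "mcep_var": np.ones(75),
              "f0_mean": np.zeros(3), "f0_var": np.ones(3)}
    print(likelihood_features(primitive, models))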
Step S33: convene a number of native Chinese speakers to give a binary annotation, similar or dissimilar, of the similarity between the target syllables and the candidate primitives, and take the result as the class attribute.
The number of syllables per class varies; to reduce manual effort, in an embodiment of the invention at most 30 syllables per class are used for the similarity comparison.
Step S34: based on the class attribute and the feature set, train the similarity classifier.
In an embodiment of the invention, the similarity classifier may be a CART classifier or an SVM classifier; experiments show that an SVM with a second-order polynomial kernel classifies better.
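A minimal Python sketch of that classifier with scikit-learn, using the second-order polynomial kernel the experiments favored; X stands for the likelihood feature vectors of step S32 and y for the binary listener judgments of step S33 (placeholder data here):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 7))                # placeholder likelihood feature vectors (step S32)
    y = rng.integers(0, 2, size=300)             # placeholder labels: 1 similar, 0 dissimilar (step S33)

    clf = SVC(kernel="poly", degree=2)           # second-order polynomial kernel
    clf.fit(X, y)

    candidates = rng.normal(size=(20, 7))
    keep = clf.predict(candidates) == 1          # steps S43/S44: drop predicted-dissimilar units
    print(int(keep.sum()), "of", len(candidates), "candidates kept")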
Step S4: input any text to be synthesized, reject the dissimilar candidate primitives with the similarity classifier, select the best primitive from the remaining candidates under the concatenation-cost-minimization criterion, and concatenate the selected primitives into the synthetic speech.
As shown in Fig. 6, step S4 further comprises the following steps:
Step S41: input the text to be synthesized and obtain the corresponding acoustic models according to step S22.
Step S42: compute, according to step S32, the set of likelihoods of each acoustic parameter of every primitive under the current acoustic model, and take it as the feature set.
Step S43: input the feature set into the similarity classifier, which predicts whether each primitive belongs to the similar class or the dissimilar class.
Step S44: remove all primitives in the dissimilar class, and select among the remaining primitives under the concatenation-cost-minimization criterion.
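Selection under the concatenation-cost-minimization criterion is naturally solved by dynamic programming over the lattice of remaining candidates. A minimal Python sketch follows; the patent does not define the exact join cost, so the spectral-distance placeholder is an assumption:

    import numpy as np

    def select_units(candidates, join_cost):
        """Pick one candidate per target position minimizing the summed
        concatenation cost, via Viterbi-style dynamic programming."""
        best = np.zeros(len(candidates[0]))      # best path cost ending in each candidate
        back = []                                # backpointers, one array per transition
        for prev, cur in zip(candidates, candidates[1:]):
            costs = np.array([[best[i] + join_cost(p, c) for i, p in enumerate(prev)]
                              for c in cur])
            back.append(costs.argmin(axis=1))
            best = costs.min(axis=1)
        path = [int(best.argmin())]
        for bp in reversed(back):                # trace the cheapest path backwards
            path.append(int(bp[path[-1]]))
        return list(reversed(path))

    def join_cost(prev_unit, cur_unit):
        """Assumed join cost: distance between boundary feature vectors."""
        return float(np.linalg.norm(prev_unit - cur_unit))

    # Placeholder lattice: 3 target syllables with a few surviving candidates each,
    # each candidate summarized here by a 4-dim boundary feature vector.
    rng = np.random.default_rng(1)
    candidates = [rng.normal(size=(n, 4)) for n in (3, 2, 4)]
    print(select_units(candidates, join_cost))   # one chosen index per target syllable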
Step S45: apply windowing to smooth the selected primitives and obtain the final synthetic speech.
In summary, the invention proposes a sound selection method for waveform concatenation speech synthesis, and the method can synthesize speech of higher sound quality.
It should be noted that the above implementations of the components are not limited to the ones mentioned in the embodiments; those of ordinary skill in the art can simply substitute them, for example:
(1) The spectrum parameter used in training is the Mel cepstral coefficient; it can be replaced by other parameters, such as line spectral pair parameters of a different order.
(2) The number of input sentences used in classifier training can be increased or decreased according to the desired computational accuracy.
The specific embodiments described above further explain the objects, technical solutions and beneficial effects of the invention. It should be understood that the above are only specific embodiments of the invention and do not limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (9)

  1. A sound selection method for waveform concatenation speech synthesis, characterized in that the method comprises the following steps:
    Step S1: based on the original audio extracted from an audio database, carry out hidden Markov model training to obtain an acoustic model set and the corresponding feature decision trees;
    Step S2: input a number of training texts, search the feature decision trees for the associated acoustic models, and from them obtain the corresponding target speech and target syllables;
    Step S3: train a similarity classifier from the similarity between the target speech and its corresponding candidate primitives, together with the likelihood of each acoustic parameter of the candidate primitives under the current acoustic model;
    Step S4: input any text to be synthesized, reject the dissimilar candidate primitives with the similarity classifier, select the best primitive from the remaining candidates under the concatenation-cost-minimization criterion, and concatenate the selected primitives into the synthetic speech.
  2. The method according to claim 1, characterized in that step S1 further comprises the following steps:
    Step S11: obtain the original audio in the audio database;
    Step S12: extract the spectrum parameters and fundamental frequency (F0) parameters frame by frame from the original audio;
    Step S13: synchronously annotate the text corresponding to the original audio, marking the contextual feature information of each syllable in the original audio, and at the same time segmentally label the original audio;
    Step S14: based on the spectrum and F0 parameters of the original audio, the contextual feature annotation and the segmental labeling, carry out conventional hidden Markov model training to obtain a model set covering duration, F0 and spectrum, together with a feature decision tree for each.
  3. The method according to claim 2, characterized in that step S12 further comprises the following steps:
    Step S121: apply framing and windowing to the original audio;
    Step S122: extract the Mel cepstral coefficients of every frame obtained;
    Step S123: compute the F0 parameters of every frame.
  4. The method according to claim 1, characterized in that step S2 further comprises the following steps:
    Step S21: input a number of syllable-balanced training texts and obtain the corresponding contextual feature sequence through text analysis;
    Step S22: input the contextual feature sequence into the feature decision trees to obtain the acoustic model sequence matching the current context;
    Step S23: based on the acoustic model sequence, obtain the target speech parameters with a parameter generation algorithm;
    Step S24: based on the target speech parameters, synthesize the target sentence speech with a vocoder, and cut the target sentence speech into target syllables.
  5. The method according to claim 4, characterized in that the text analysis extracts the features in the text.
  6. The method according to claim 4, characterized in that, in step S22, according to the contextual features in the contextual feature sequence, decisions are made separately on the clustering trees for duration, F0 and spectrum parameters, yielding the corresponding acoustic model sequence and duration models.
  7. The method according to claim 4, characterized in that the target speech parameters comprise the F0 and spectrum parameters.
  8. The method according to claim 4, characterized in that step S3 further comprises the following steps:
    Step S31: cut the sentences in the audio database by syllable, the segments obtained, one per syllable, being the candidate primitives; group identical syllables into one class to build the candidate primitive library, and assign the spectrum and F0 parameters extracted in step S12 frame by frame to each candidate primitive in the library;
    Step S32: bring the acoustic parameters of each primitive corresponding to each target syllable, in turn, into the context acoustic models obtained in step S22, compute the probability of the duration, F0 and spectrum of each primitive under its corresponding acoustic model, and take the set of all these probabilities as the feature set;
    Step S33: convene a number of native Chinese speakers to give a binary annotation, similar or dissimilar, of the similarity between the target syllables and the candidate primitives, and take the result as the class attribute;
    Step S34: based on the class attribute and the feature set, train the similarity classifier.
  9. The method according to claim 8, characterized in that step S4 further comprises the following steps:
    Step S41: input the text to be synthesized and obtain the corresponding acoustic models according to step S22;
    Step S42: compute, according to step S32, the set of likelihoods of each acoustic parameter of every primitive under the current acoustic model, and take it as the feature set;
    Step S43: input the feature set into the similarity classifier, which predicts whether each primitive belongs to the similar class or the dissimilar class;
    Step S44: remove all primitives in the dissimilar class, and select among the remaining primitives under the concatenation-cost-minimization criterion;
    Step S45: apply windowing to smooth the selected primitives and obtain the final synthetic speech.
CN201310481306.9A 2013-10-15 2013-10-15 Sound selection method for waveform concatenation speech synthesis Active CN103531196B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310481306.9A CN103531196B (en) 2013-10-15 2013-10-15 Sound selection method for waveform concatenation speech synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310481306.9A CN103531196B (en) 2013-10-15 2013-10-15 Sound selection method for waveform concatenation speech synthesis

Publications (2)

Publication Number Publication Date
CN103531196A 2014-01-22
CN103531196B CN103531196B (en) 2016-04-13

Family

ID=49933149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310481306.9A Active CN103531196B (en) 2013-10-15 2013-10-15 Sound selection method for waveform concatenation speech synthesis

Country Status (1)

Country Link
CN (1) CN103531196B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04313034A (en) * 1990-10-16 1992-11-05 International Business Machines Corp Synthesized-speech generating method
CN101178896A (en) * 2007-12-06 2008-05-14 安徽科大讯飞信息科技股份有限公司 Unit selection voice synthetic method based on acoustics statistical model
CN101471071A (en) * 2007-12-26 2009-07-01 中国科学院自动化研究所 Speech synthesis system based on mixed hidden Markov model
CN102496363A (en) * 2011-11-11 2012-06-13 北京宇音天下科技有限公司 Correction method for Chinese speech synthesis tone

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104575488A (en) * 2014-12-25 2015-04-29 北京时代瑞朗科技有限公司 Text information-based waveform concatenation voice synthesizing method
WO2017028003A1 (en) * 2015-08-14 2017-02-23 华侃如 Hidden markov model-based voice unit concatenation method
CN105304081A (en) * 2015-11-09 2016-02-03 上海语知义信息技术有限公司 Smart household voice broadcasting system and voice broadcasting method
CN105719641A (en) * 2016-01-19 2016-06-29 百度在线网络技术(北京)有限公司 Voice selection method and device used for waveform splicing of voice synthesis
CN105719641B (en) * 2016-01-19 2019-07-30 百度在线网络技术(北京)有限公司 Sound method and apparatus are selected for waveform concatenation speech synthesis
CN105654940A (en) * 2016-01-26 2016-06-08 百度在线网络技术(北京)有限公司 Voice synthesis method and device
CN106356052A (en) * 2016-10-17 2017-01-25 腾讯科技(深圳)有限公司 Voice synthesis method and device
CN106356052B (en) * 2016-10-17 2019-03-15 腾讯科技(深圳)有限公司 Phoneme synthesizing method and device
US10832652B2 (en) 2016-10-17 2020-11-10 Tencent Technology (Shenzhen) Company Limited Model generating method, and speech synthesis method and apparatus
CN106652986A (en) * 2016-12-08 2017-05-10 腾讯音乐娱乐(深圳)有限公司 Song audio splicing method and device
CN106652986B (en) * 2016-12-08 2020-03-20 腾讯音乐娱乐(深圳)有限公司 Song audio splicing method and equipment
CN106970950A (en) * 2017-03-07 2017-07-21 腾讯音乐娱乐(深圳)有限公司 The lookup method and device of similar audio data
CN106970950B (en) * 2017-03-07 2021-08-24 腾讯音乐娱乐(深圳)有限公司 Similar audio data searching method and device
CN107492371A (en) * 2017-07-17 2017-12-19 广东讯飞启明科技发展有限公司 A kind of big language material sound storehouse method of cutting out
CN107507619A (en) * 2017-09-11 2017-12-22 厦门美图之家科技有限公司 Phonetics transfer method, device, electronic equipment and readable storage medium storing program for executing
CN107507619B (en) * 2017-09-11 2021-08-20 厦门美图之家科技有限公司 Voice conversion method and device, electronic equipment and readable storage medium
CN109147799A (en) * 2018-10-18 2019-01-04 广州势必可赢网络科技有限公司 A kind of method, apparatus of speech recognition, equipment and computer storage medium
CN109686358A (en) * 2018-12-24 2019-04-26 广州九四智能科技有限公司 The intelligent customer service phoneme synthesizing method of high-fidelity
CN111899715A (en) * 2020-07-14 2020-11-06 升智信息科技(南京)有限公司 Speech synthesis method
CN111899715B (en) * 2020-07-14 2024-03-29 升智信息科技(南京)有限公司 Speech synthesis method
CN113011127A (en) * 2021-02-08 2021-06-22 杭州网易云音乐科技有限公司 Text phonetic notation method and device, storage medium and electronic equipment
CN113096650A (en) * 2021-03-03 2021-07-09 河海大学 Acoustic decoding method based on prior probability
CN113096650B (en) * 2021-03-03 2023-12-08 河海大学 Acoustic decoding method based on prior probability
CN113421544A (en) * 2021-06-30 2021-09-21 平安科技(深圳)有限公司 Singing voice synthesis method and device, computer equipment and storage medium
CN113421544B (en) * 2021-06-30 2024-05-10 平安科技(深圳)有限公司 Singing voice synthesizing method, singing voice synthesizing device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN103531196B (en) 2016-04-13

Similar Documents

Publication Publication Date Title
CN103531196B (en) Sound selection method for waveform concatenation speech synthesis
CN102779508B (en) Speech corpus generation apparatus and method, and speech synthesis system and method
CN101178896B (en) Unit selection speech synthesis method based on acoustic statistical model
CN107452379B (en) Dialect language identification method and virtual reality teaching method and system
CN104112444B (en) Waveform concatenation speech synthesis method based on text information
Xie et al. Sequence error (SE) minimization training of neural network for voice conversion.
CN101000765A (en) Speech synthesis method based on prosodic features
CN108228576B (en) Text translation method and device
CN1835075B (en) Speech synthesis method combining natural sample selection and acoustic parameter modeling
CN105023573A (en) Speech syllable/vowel/phone boundary detection using auditory attention cues
CN105760852A (en) Driver emotion real time identification method fusing facial expressions and voices
CN101751922A (en) Text-independent speech conversion system based on HMM model state mapping
JP4829477B2 (en) Voice quality conversion device, voice quality conversion method, and voice quality conversion program
CN109346056A (en) Speech synthesis method and device based on deep metric network
Xie et al. A KL divergence and DNN approach to cross-lingual TTS
CN106297765B (en) Speech synthesis method and system
CN102254554A (en) Method for hierarchical modeling and prediction of Mandarin stress
CN109036376A (en) Minnan speech synthesis method
CN108172211A (en) Adjustable waveform concatenation system and method
CN106297766B (en) Speech synthesis method and system
Shah et al. Nonparallel emotional voice conversion for unseen speaker-emotion pairs using dual domain adversarial network & virtual domain pairing
CN104575488A (en) Text information-based waveform concatenation voice synthesizing method
CN104916282A (en) Speech synthesis method and apparatus
Kayte et al. A Marathi Hidden-Markov Model Based Speech Synthesis System
CN102511061A (en) Method and apparatus for fusing voiced phoneme units in text-to-speech

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20170602

Address after: No. 405-346, 4th Floor, Building A, No. 1, Courtyard 2, Yongcheng North Road, Haidian District, Beijing 100094

Patentee after: Beijing Rui Heng Heng Xun Technology Co., Ltd.

Address before: No. 95 Zhongguancun East Road, Haidian District, Beijing 100190

Patentee before: Institute of Automation, Chinese Academy of Sciences

TR01 Transfer of patent right

Effective date of registration: 20181224

Address after: No. 95 Zhongguancun East Road, Haidian District, Beijing 100190

Patentee after: Institute of Automation, Chinese Academy of Sciences

Address before: No. 405-346, 4th Floor, Building A, No. 1, Courtyard 2, Yongcheng North Road, Haidian District, Beijing 100094

Patentee before: Beijing Rui Heng Heng Xun Technology Co., Ltd.

TR01 Transfer of patent right

Effective date of registration: 20190528

Address after: Room 1105, 11/F, Building 4, No. 9 Jiuhuan Road, Jianggan District, Hangzhou, Zhejiang 310019

Patentee after: Extreme Element (Hangzhou) Intelligent Technology Co., Ltd.

Address before: No. 95 Zhongguancun East Road, Haidian District, Beijing 100190

Patentee before: Institute of Automation, Chinese Academy of Sciences

CP01 Change in the name or title of a patent holder

Address after: Room 1105, 11/F, Building 4, No. 9 Jiuhuan Road, Jianggan District, Hangzhou, Zhejiang 310019

Patentee after: Zhongke Extreme Element (Hangzhou) Intelligent Technology Co., Ltd.

Address before: Room 1105, 11/F, Building 4, No. 9 Jiuhuan Road, Jianggan District, Hangzhou, Zhejiang 310019

Patentee before: Extreme Element (Hangzhou) Intelligent Technology Co., Ltd.