CN1320482C

CN1320482C - Identifying natural speech pauses in a text string

Info

Publication number: CN1320482C
Application number: CNB031327087A
Authority: CN
Inventors: 陈桂林; 祖漪清
Original assignee: Motorola Inc
Current assignee: Nuance Communications Inc
Priority date: 2003-09-29
Filing date: 2003-09-29
Publication date: 2007-06-06
Anticipated expiration: 2023-09-29
Also published as: CN1604183A; WO2005034085A1; KR20060056403A; RU2319221C1; EP1668631A4; EP1668631A1

Abstract

The present invention discloses a method (400) for automatically identifying natural speech pauses in a text string, the pauses being used for text-to-speech conversion in an electronic apparatus (100). The method (400) comprises: obtaining (420) a text string that has two ends, a starting end and a finishing end; analysing (440) at least one word in the text string to determine whether a natural speech pause exists adjacent to the word, the analysis being based on at least one preset threshold value for the word, and the threshold value being associated with the number of syllables between the word and one of the two ends of the text string; and inserting (460) the natural speech pause into a synthesized speech signal output representation of the text string.

Description

Method for identifying natural speech pauses in a text string

Technical field

The present invention relates generally to text-to-speech (TTS) synthesis. The invention is particularly useful for determining natural pauses in the synthesized speech of a text segment.

Background art

Text-to-speech (TTS) conversion, often also referred to as concatenated text-to-speech synthesis, allows an electronic device to receive an input text string and provide a converted representation of the string in the form of synthesized speech. However, a device that must synthesize speech from a non-deterministic number of received text strings has difficulty providing high-quality, realistic synthesized speech. This is because the utterance of each word or syllable (for Chinese characters and similar scripts) to be synthesized depends on context and position. For example, the utterance of a word at the end of a sentence (an input text string) may be drawn out or lengthened. The utterance of the same word may be lengthened even further if it occurs in the middle of a sentence at a natural speech pause where emphasis is required.

In most languages, the utterance of a word depends on prosodic parameters, which comprise pitch (pitch period), volume (power or amplitude) and duration. The prosodic parameter values of a word depend on the position of the word in a phrase and on the positions of natural speech pauses. However, prior-art TTS synthesis does not readily provide for identifying natural speech pauses in arbitrary input text.

In this specification, including the claims, the terms "comprises", "comprising" and similar terms are intended to mean a non-exclusive inclusion, such that a method or apparatus that comprises a list of elements does not include those elements solely, but may well include other elements not listed.

Summary of the invention

According to one aspect of the present invention, there is provided a method for automatically identifying natural speech pauses in a text string, the pauses being used for text-to-speech conversion performed on an electronic device, the method comprising:

obtaining a text string that has two ends, the two ends being a starting end and a finishing end;

analysing at least one word in the text string to determine whether a natural speech pause exists adjacent to the word, the analysis being based on at least one predetermined threshold for the word, the predetermined threshold being associated with the number of syllables between the word and one of the two ends of the text string; and

inserting the natural speech pause into a synthesized speech signal output representation of the text string.

Preferably, the at least one predetermined threshold comprises a P word (P_word) threshold, which is based on the number of syllables between the starting end and the word.

Preferably, the at least one predetermined threshold comprises an F word (F_word) threshold, which is based on the number of syllables between the finishing end and the word.

Preferably, the at least one predetermined threshold is determined by the following steps:

providing a training set of transcriptions, each with at least one natural speech pause identified by an inserted identifier;

identifying words in each transcription as P words and F words;

statistically analysing the P words and F words in the training set; and

determining the F word threshold and the P word threshold from the results of the statistical analysis.

Preferably, the inserted natural speech pauses may also include pauses identified as part-of-speech (POS) pattern natural pauses.

Preferably, the inserted natural speech pauses may also include pauses identified as compound word natural pauses.

Brief description of the drawings

In order that the invention may be readily understood and put into practical effect, reference will now be made to a preferred embodiment as illustrated in the accompanying drawings, in which:

Fig. 1 is a schematic block diagram of an electronic device in accordance with the present invention;

Fig. 2 illustrates a method 200 for determining thresholds associated with natural speech pauses in a text string;

Figs. 3A to 3D illustrate examples of transcriptions used in the method of Fig. 2;

Fig. 4 illustrates a method for automatically identifying natural speech pauses in a text string; and

Fig. 5 illustrates the analysing step of Fig. 4 in more detail.

Detailed description

Referring to Fig. 1, there is illustrated an electronic device 100 in the form of a radio-telephone. The electronic device 100 comprises a device processor 102 operatively coupled by a bus 103 to a user interface 104, which is typically a touch screen or, alternatively, a display screen and keypad. The electronic device 100 also has an utterance corpus 106, a speech synthesizer 110, a non-volatile memory 120, a read-only memory 118 and a radio communications module 116, all operatively coupled to the processor 102 by the bus 103. The speech synthesizer 110 has an output coupled to drive a speaker 112. The corpus 106 comprises representations of words or phonemes and the associated sampled, digitized and processed utterance waveforms PUW. As described below, the non-volatile memory 120 (memory module) provides text strings for text-to-speech (TTS) synthesis (the text may be received by the module 116 or by other means). The utterance corpus also comprises transcriptions representing phrases, the corresponding sampled and digitized utterance waveforms, and the positions in the text strings associated with natural pause boundaries, as described below.

As will be apparent to a person skilled in the art, the radio communications module 116 is typically a combined receiver and transmitter having a common antenna. The module 116 has a transceiver coupled to the antenna via a radio-frequency amplifier. The transceiver is also coupled to a combined modulator/demodulator that couples the module 116 to the processor 102. In this embodiment, the non-volatile memory 120 (memory module) also stores a programmable phonebook database Db, and the read-only memory 118 stores operating code (OC) for the device processor 102.

Referring to Fig. 2, there is illustrated a method 200 for determining thresholds associated with natural speech pauses in a text string. The thresholds are based on the number of preceding and following syllables in the transcriptions of a training set TS. After a start step 210, the method 200 performs a providing step 220, which provides a training set TS of transcriptions (typically sentences), each with at least one natural speech pause identified by a manually inserted punctuation mark or identifier "|". Figs. 3A to 3D illustrate examples of such transcriptions or sentences. One of these transcriptions, 300, is "Based on our history|in China,", which has a natural speech pause 310 between the words "history" and "in". The transcription 300 has a starting end 305 and a finishing end 315. As will be apparent to a person skilled in the art, each of the transcriptions 300 in Figs. 3A to 3D has at least one natural speech pause 310, a starting end 305 and a finishing end 315. The transcriptions are further analysed by syllable count, as follows:

Based = 2 syllables

on = 1 syllable

our = 1 syllable

history = 3 syllables

in = 1 syllable

China = 2 syllables

In addition, each word in a transcription can be identified as: (i) a P word (P_word), a word immediately preceded by a natural pause identified by the punctuation mark "|"; (ii) an F word (F_word), a word immediately followed by a natural pause identified by the punctuation mark "|"; or (iii) an intermediate word, a word with no natural speech pause adjacent to it. After step 220, an identifying step 230 identifies each word in each transcription as (i) a P word, (ii) an F word or (iii) an intermediate word. Thus, for the transcription "Based on our history|in China,", Table 1 below lists the attributes of each word:

Word	P word	F word	Syllable count	Pause
Based	N	N	0	N
on	N	N	2	N
our	N	N	3	N
history	N	Y	4	After
in	Y	N	7	Before
China	N	N	1	N

Table 1: Analysis of the transcription "Based on our history|in China,"
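The identification of step 230 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `label_words`, the row layout, and the hard-coded syllable dictionary (taken from the counts listed above rather than computed) are assumptions introduced here. For every word, the sketch records the syllables between the starting end and the word, and between the word and the finishing end.

```python
# Syllable counts as stated in the description for this example (not computed).
SYLLABLES = {"Based": 2, "on": 1, "our": 1, "history": 3, "in": 1, "China": 2}

def label_words(transcription, syllables):
    """Label each word as P word and/or F word and record its syllable context."""
    # Give each "|" marker its own token, then strip punctuation from the words.
    tokens = transcription.replace("|", " | ").split()
    words, pause_before = [], set()
    for tok in tokens:
        if tok == "|":
            pause_before.add(len(words))  # the pause precedes the next word
        else:
            words.append(tok.strip(",.;:!?"))
    total = sum(syllables[w] for w in words)
    rows, preceding = [], 0
    for i, w in enumerate(words):
        rows.append({
            "word": w,
            "P": i in pause_before,        # pause immediately before the word
            "F": (i + 1) in pause_before,  # pause immediately after the word
            "preceding": preceding,        # syllables from the starting end
            "following": total - preceding - syllables[w],  # to the finishing end
        })
        preceding += syllables[w]
    return rows

for row in label_words("Based on our history|in China,", SYLLABLES):
    print(row)
```

A word that is neither a P word nor an F word in the output corresponds to an intermediate word.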

The method 200 then performs a statistical analysis step 240. In this step, suppose the training set TS contains 90,000 transcriptions (for example, sentences) and the word "in" occurs 10,000 times in the training set. For these 10,000 instances of "in", the following statistics may be observed:

(i) occurrences of "in" as a P word (OPW) = 8,000 instances;

(ii) occurrences of "in" as an F word (OFW) = 1,000 instances;

(iii) occurrences of "in" as an intermediate word, neither a P word nor an F word (ONW) = 1,000 instances.

Further, for the 8,000 instances of "in" identified as a P word in the training set TS, the following statistics may be observed, where OPS is the number of occurrences with the given number of preceding syllables:

(i) 8 or more preceding syllables: OPS = 0;

(ii) 7 preceding syllables: OPS = 400;

(iii) 6 preceding syllables: OPS = 600;

(iv) 5 preceding syllables: OPS = 2,000;

(v) 4 preceding syllables: OPS = 3,000;

(vi) 3 preceding syllables: OPS = 1,000;

(vii) 2 preceding syllables: OPS = 1,000;

(viii) 1 preceding syllable: OPS = 0.

A heuristic ratio HR of 0.75, selected by intuition and experiment, is used to determine the P word pause threshold PT for the word "in". The threshold PT is determined in a threshold determining step 250 as follows:

Starting from the maximum observed number of preceding syllables and proceeding towards the minimum, sum the OPS values until:

(sum of OPS) / OPW ≥ 0.75

Select PT as the number of preceding syllables identified by the last OPS added to the sum;

End.

The PT for "in" is therefore determined in step 250 as follows:

400 / 8,000 = 0.05 for 7 preceding syllables;

(400 + 600) / 8,000 = 0.125 for 6 preceding syllables;

(400 + 600 + 2,000) / 8,000 = 0.375 for 5 preceding syllables;

(400 + 600 + 2,000 + 3,000) / 8,000 = 0.75 for 4 preceding syllables.

PT is therefore selected as 4.
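The threshold determination of step 250 can be sketched in Python as follows. The function name `pause_threshold` and the dictionary layout of the OPS histogram are assumptions made for illustration; the counts are the ones given above for the word "in".

```python
def pause_threshold(ops, hr=0.75):
    """Return the syllable count at which the cumulative OPS share reaches hr.

    ops maps a number of preceding syllables to its observed occurrences (OPS).
    """
    opw = sum(ops.values())  # total occurrences of the word as a P word (OPW)
    if opw == 0:
        return None          # no observations, so no threshold can be set
    cumulative = 0
    # Walk from the largest observed syllable count down to the smallest.
    for n_syllables in sorted(ops, reverse=True):
        cumulative += ops[n_syllables]
        if cumulative / opw >= hr:
            return n_syllables
    return None

# OPS histogram for "in" as a P word, as listed in the description.
OPS_IN = {8: 0, 7: 400, 6: 600, 5: 2000, 4: 3000, 3: 1000, 2: 1000, 1: 0}
print(pause_threshold(OPS_IN))  # 4
```

Under the same assumptions, the F word threshold FT could be obtained by passing a histogram of following-syllable counts to the same function.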

A similar statistical analysis is used in step 250 to determine the F word pause threshold FT for "in", again using the heuristic ratio HR of 0.75. Likewise, PT and FT values are determined (using the heuristic ratio HR of 0.75) for the P word and F word instances of all other words in the training set TS. The method 200 then ends at step 260, with all P word and F word instances of all words in the training set TS stored in the non-volatile memory 120.

Referring to Fig. 4, there is illustrated a method 400 for automatically identifying natural speech pauses in a text string STR, the pauses being used for text-to-speech conversion performed on the electronic device 100. After a start step 410, the method 400 performs a step 420 of obtaining the text string STR, which has two ends: a starting end SE and a finishing end FE. A selecting step 430 selects a word (or compound word CW), and an analysing step 440 analyses at least one word (or compound word CW) in the text string STR to determine whether a natural speech pause exists adjacent to it. The analysis is based on at least one predetermined threshold (PT or FT) for the word, the threshold being associated with the number of syllables between the word and one of the two ends of the text string. The thresholds comprise the P word threshold PT, which is based on the number of syllables between the starting end and the word, and the F word threshold FT, which is based on the number of syllables between the finishing end and the word.

If a test step 450 determines that step 440 has identified a pause, a natural speech pause is inserted for speech synthesis at step 460. Otherwise, no pause is inserted for the word selected at step 430. Then, at step 470, a check determines whether all words in the text string STR have been analysed; if any words remain, the method returns to step 430. Otherwise, a speech synthesis step 480 performs speech synthesis at the synthesizer 110 using the corpus 106, in which the one or more natural speech pauses (inserted into the text string STR at step 460) are inserted into the synthesized speech signal output representation of the text string STR.
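The word-by-word loop of method 400 can be sketched in Python as follows. This is only an illustration under stated assumptions: `mark_pauses` and the `analyse` callback are hypothetical names, the callback stands in for step 440 and returns "P", "F" or None, and the pauses are rendered as "|" markers in text rather than in a synthesized speech signal (step 480 is not modelled). Skipping a pause at the very start or end of the string, and merging two pauses that fall at the same boundary, are design choices of this sketch.

```python
def mark_pauses(words, analyse):
    """Return the text with "|" at each boundary where analyse reports a pause."""
    out = []
    for i, word in enumerate(words):       # steps 430 and 470: visit every word
        kind = analyse(words, i)           # step 440: "P", "F" or None
        # A P word pause goes before the word (step 460); avoid doubling a marker.
        if kind == "P" and out and out[-1] != "|":
            out.append("|")
        out.append(word)
        # An F word pause goes after the word, unless the word ends the string.
        if kind == "F" and i + 1 < len(words):
            out.append("|")
    return " ".join(out)

# Hypothetical analysis result reproducing the pause of the Table 1 example.
example = lambda ws, i: {"history": "F", "in": "P"}.get(ws[i])
print(mark_pauses("Based on our history in China".split(), example))
```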

Referring to Fig. 5, the analysing step 440 is illustrated in more detail. First, at step 441, the text string STR is checked to determine whether it has a part-of-speech (POS) pattern natural pause. Examples of POS pattern natural pauses are as follows:

1. numeral + noun

For example: two thousand books

2. verb + adverb

For example: look carefully

3. preposition + noun

For example: with telescopes

4. adjective + noun

For example: beautiful city
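A check for the four POS patterns listed above can be sketched in Python as follows. The tag names (NUM, NOUN, VERB, ADV, ADP, ADJ) and the function `matches_pos_pattern` are assumptions made for illustration; the tags are taken to come from an external POS tagger that is not shown. The text does not specify how a matched pattern maps to the exact pause position, so the sketch only reports whether a pattern starts at a given word.

```python
# The four example patterns, expressed as pairs of (assumed) POS tag names.
POS_PAUSE_PATTERNS = {
    ("NUM", "NOUN"),   # numeral + noun, e.g. "two thousand books"
    ("VERB", "ADV"),   # verb + adverb, e.g. "look carefully"
    ("ADP", "NOUN"),   # preposition + noun, e.g. "with telescopes"
    ("ADJ", "NOUN"),   # adjective + noun, e.g. "beautiful city"
}

def matches_pos_pattern(tags, i):
    """True if the tags at positions i and i + 1 form one of the patterns."""
    return i + 1 < len(tags) and (tags[i], tags[i + 1]) in POS_PAUSE_PATTERNS

print(matches_pos_pattern(["VERB", "ADV"], 0))         # look carefully
print(matches_pos_pattern(["NUM", "NUM", "NOUN"], 1))  # (two) thousand books
```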

If a pause is determined at step 441, step 446 is performed and the pause is identified as an F word pause. If no pause is determined at step 441, the text string STR is checked at step 442 to determine whether it has a compound word natural pause. Examples of compound words with natural pauses are as follows:

a bit of

a body of

a few

a fleet of

a flooding of

a fraction of

a function of

a good deal

a good deal of

a great deal

a great deal of

a hint of

a large body of

a large number of

a lot of

a majority of
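The compound word check of step 442 can be sketched in Python as a lookup against a stored list. The function `match_compound`, the longest-match rule and the case-insensitive comparison are assumptions of this sketch, and the list below is only a subset of the examples above.

```python
# A subset of the compound words listed above.
COMPOUND_WORDS = [
    "a bit of", "a few", "a good deal", "a good deal of",
    "a great deal", "a great deal of", "a large number of", "a majority of",
]

def match_compound(words, i, compounds=COMPOUND_WORDS):
    """Return the longest listed compound starting at index i, or None."""
    best = None
    for phrase in compounds:
        parts = tuple(phrase.split())
        window = tuple(w.lower() for w in words[i:i + len(parts)])
        # Prefer "a great deal of" over "a great deal" when both match.
        if window == parts and (best is None or len(parts) > len(best)):
            best = parts
    return best

print(match_compound("We saw a great deal of snow".split(), 2))
```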

If a pause is determined at step 442, step 446 is performed and the pause is identified as an F word pause. If no pause is identified at step 442, a test is performed at step 443 to determine whether the P word threshold PT of the selected word has been reached. This determination is made by comparing the threshold with the number of syllables between the starting end and the selected word in the text string STR. If the P word threshold PT of the selected word has been reached, a natural pause is determined to exist and is identified as a P word pause at step 444. Otherwise, if no pause is identified at step 443, a test is performed at step 445 to determine whether the F word threshold FT of the selected word has been reached. This determination is made by comparing the threshold with the number of syllables between the finishing end and the selected word in the text string STR. If the F word threshold FT of the selected word has been reached, a natural pause is determined to exist and is identified as an F word pause at step 446. Otherwise, no pause is identified, at step 447.
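The decision cascade of Fig. 5 can be sketched in Python as follows. The function name `analyse_word` and its parameters are hypothetical. The sketch also assumes that a threshold is "reached" when the syllable count is greater than or equal to it, and that a word with no stored threshold (None) never triggers a threshold pause; the text does not state either point explicitly.

```python
def analyse_word(has_pos_pause, has_compound_pause, preceding, following, pt, ft):
    """Return "P", "F" or None for the selected word (steps 441 to 447).

    preceding / following: syllables between the word and the starting /
    finishing end of the text string. pt / ft: the word's stored thresholds.
    """
    if has_pos_pause:                        # step 441 -> step 446
        return "F"
    if has_compound_pause:                   # step 442 -> step 446
        return "F"
    if pt is not None and preceding >= pt:   # step 443 -> step 444
        return "P"
    if ft is not None and following >= ft:   # step 445 -> step 446
        return "F"
    return None                              # step 447: no pause identified

# "in" in "Based on our history in China": 7 preceding syllables, PT = 4.
print(analyse_word(False, False, preceding=7, following=2, pt=4, ft=None))  # P
```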

Advantageously, the present invention allows natural speech pauses in a text string to be identified for use in text-to-speech (TTS) synthesis, thereby improving the quality of the synthesized speech.

The detailed description above provides preferred exemplary embodiments only and is not intended to limit the scope, applicability or configuration of the invention. Rather, the detailed description of the preferred exemplary embodiments enables those skilled in the art to implement a preferred exemplary embodiment of the invention. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the invention as set forth in the appended claims.

Claims

1. A method for automatically identifying natural speech pauses in a text string, the pauses being used for text-to-speech conversion performed on an electronic device, the method comprising:

obtaining the text string, which has two ends, the two ends being a starting end and a finishing end;

analysing at least one word in the text string to determine whether a natural speech pause exists adjacent to the word, the analysis being based on at least one predetermined threshold for the word, the predetermined threshold being associated with the number of syllables between the word and one of the two ends of the text string; and

inserting the natural speech pause into a synthesized speech signal output representation of the text string.

2. The method for automatically identifying natural speech pauses in a text string as claimed in claim 1, wherein the at least one predetermined threshold comprises a P word threshold, which is based on the number of syllables between the starting end and the word.

3. The method for automatically identifying natural speech pauses in a text string as claimed in claim 1, wherein the at least one predetermined threshold comprises an F word threshold, which is based on the number of syllables between the finishing end and the word.

4. The method for automatically identifying natural speech pauses in a text string as claimed in claim 1, wherein the at least one predetermined threshold is determined by the following steps:

providing a training set of transcriptions, each with at least one natural speech pause identified by an inserted identifier;

identifying words in each of the transcriptions as P words and F words;

statistically analysing the P words and F words in the training set; and

determining the F word threshold and the P word threshold from the results of the statistical analysis.

5. The method for automatically identifying natural speech pauses in a text string as claimed in claim 1, wherein the inserted natural speech pauses may also include pauses identified as part-of-speech pattern natural pauses.

6. The method for automatically identifying natural speech pauses in a text string as claimed in claim 1, wherein the inserted natural speech pauses may also include pauses identified as compound word natural pauses.