CN103187052B

CN103187052B - Method and device for building a language model for speech recognition

Info

Publication number: CN103187052B
Application number: CN201110451385.XA
Authority: CN
Inventors: 万广鲁; 贾磊
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2011-12-29
Filing date: 2011-12-29
Publication date: 2015-09-02
Anticipated expiration: 2031-12-29
Also published as: CN103187052A

Abstract

The invention provides a method and a device for building a language model for speech recognition. The method comprises: A. performing language model training on the results of recognizing users' voice search queries, used as a voice search corpus, to obtain a speech language model; and performing language model training on users' text search queries, used as a text search corpus, to obtain a text language model; B. fusing the speech language model with the text language model to obtain an identifiable language model. An identifiable language model obtained in this way reflects users' word preferences in voice input well, and applying it to speech recognition can improve recognition accuracy.

Description

Method and device for building a language model for speech recognition

[technical field]

The present invention relates to speech recognition technology, and in particular to a method and device for building a language model for speech recognition.

[background technology]

Search engines have greatly changed the way people obtain information and have increasingly become an indispensable part of daily life. In recent years, with the development of speech recognition technology, voice search has become a more convenient way to search. By entering a voice search request on a mobile terminal, a user can obtain search results that meet his or her needs from a search engine server.

Voice search relies on speech recognition: the information the user wants can be returned only if the user's voice input is recognized correctly. Recognition quality depends on the acoustic model and the language model used. The acoustic model is used to compute speech-to-syllable probabilities, and the language model is used to compute syllable-to-word probabilities. A language model describes the probability distribution of words, and a language model that reliably reflects the distribution of words in users' voice searches is the key to reliable results in a voice search system. Because the probability distribution of words in a language model depends on the corpus used to train it, it is very important to obtain a training corpus consistent with users' wording habits in voice search.

In the prior art, two methods are usually used to obtain a training corpus. The first is to manually annotate users' voice search queries and use the annotated queries as the training corpus; its drawback is that the cost is very high and it is difficult to obtain a sufficient amount of data. The second is to use the queries that users enter as text directly as the training corpus; its drawback is that the words users choose when searching by voice differ from those they choose when typing, so a language model obtained this way can hardly reflect users' word preferences in voice search, and applying such a model to speech recognition reduces recognition accuracy.

[summary of the invention]

The technical problem to be solved by the present invention is to provide a method and device for building a language model for speech recognition, so as to overcome the defect that prior-art language models have difficulty reflecting users' wording habits in spoken expression, which impairs recognition accuracy.

The technical solution adopted by the present invention to solve this problem is a method for building a language model for speech recognition, comprising: A. performing language model training on the results of recognizing users' voice search queries, used as a voice search corpus, to obtain a speech language model; and performing language model training on users' text search queries, used as a text search corpus, to obtain a text language model; B. fusing the speech language model with the text language model to obtain an identifiable language model.

According to a preferred embodiment of the present invention, in step A, an initial identifiable language model is used to recognize the user's voice search query.

According to a preferred embodiment of the present invention, in step B, when the speech language model is fused with the text language model, the parameters of the speech language model and the parameters of the text language model are interpolated to obtain the parameters of the identifiable language model.

According to a preferred embodiment of the present invention, when the parameters of the speech language model and the text language model are interpolated, the parameters of the speech language model or of the text language model are weighted.

According to a preferred embodiment of the present invention, the method further comprises: using the identifiable language model to recognize the user's voice search query, to obtain a recognition result.

According to a preferred embodiment of the present invention, the method further comprises: performing language model training on the recognition result, used as a newly added voice search corpus, to update the speech language model, and returning to step B.

According to a preferred embodiment of the present invention, the step of recognizing the user's voice search query with the identifiable language model comprises: building multiple candidate word sequences from the user's voice search query; and using the identifiable language model to compute the probability of each candidate word sequence under the identifiable language model, and selecting the candidate word sequence with the highest probability as the recognition result for the user's voice search query.

According to a preferred embodiment of the present invention, the method further comprises: returning search results related to the recognition result to the user.

The present invention also provides a device for building a language model for speech recognition, comprising: a first training unit, configured to perform language model training on the results of recognizing users' voice search queries, used as a voice search corpus, to obtain a speech language model; a second training unit, configured to perform language model training on users' text search queries, used as a text search corpus, to obtain a text language model; and an integrated unit, configured to fuse the speech language model with the text language model to obtain an identifiable language model.

According to a preferred embodiment of the present invention, the voice search corpus used by the first training unit for language model training is obtained by recognizing the user's voice search query with an initial identifiable language model.

According to a preferred embodiment of the present invention, when the integrated unit fuses the speech language model with the text language model, it interpolates the parameters of the speech language model and the text language model to obtain the parameters of the identifiable language model.

According to a preferred embodiment of the present invention, when the integrated unit interpolates the parameters of the speech language model and the text language model, it weights the parameters of the speech language model or of the text language model.

According to a preferred embodiment of the present invention, the device further comprises: a recognition unit, configured to recognize the user's voice search query with the identifiable language model, to obtain a recognition result.

According to a preferred embodiment of the present invention, the recognition unit supplies the recognition result to the first training unit, so that the first training unit performs language model training on the recognition result, used as a newly added voice search corpus, to update the speech language model.

According to a preferred embodiment of the present invention, the recognition unit comprises: a word sequence unit, configured to build multiple candidate word sequences from the user's voice search query; and a computing unit, configured to use the identifiable language model to compute the probability of each candidate word sequence under the identifiable language model, and to select the candidate word sequence with the highest probability as the recognition result for the user's voice search query.

According to a preferred embodiment of the present invention, the device further comprises: a retrieval unit, configured to return search results related to the recognition result to the user.

As can be seen from the above technical solutions, an identifiable language model obtained by training a language model on speech recognition results and fusing it with a language model trained on a text corpus reflects users' word preferences in voice input well. Applying such an identifiable language model to speech recognition can improve recognition accuracy.

[brief description of the drawings]

Fig. 1 is a flowchart of an embodiment of the method for building a language model for speech recognition in the present invention;

Fig. 2 is a schematic diagram of an embodiment of obtaining the voice search corpus and the text search corpus in the present invention;

Fig. 3 is a schematic diagram of an embodiment of the word graph in the present invention;

Fig. 4 is a structural block diagram of an embodiment of the device for building a language model for speech recognition in the present invention;

Fig. 5 is a structural block diagram of an embodiment of the recognition unit in the present invention.

[detailed description]

To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described below with reference to the drawings and specific embodiments.

Please refer to Fig. 1, which is a flowchart of an embodiment of the method for building a language model for speech recognition in the present invention. As shown in Fig. 1, the method comprises:

S101: performing language model training on the results of recognizing users' voice search queries, used as a voice search corpus, to obtain a speech language model; and performing language model training on users' text search queries, used as a text search corpus, to obtain a text language model.

S102: fusing the speech language model with the text language model to obtain an identifiable language model.

These steps are described in detail below.

Please refer to Fig. 2, which is a schematic diagram of an embodiment of obtaining the voice search corpus and the text search corpus in step S101. As shown in Fig. 2, a user can search by text input or by voice input. When the user enters a search request with a keyboard, a text collection client sends the collected text search request to the search engine server over the network, and a logging device in the server records the user's typed search requests in a retrieval log; this retrieval log can serve as the text search corpus of the present invention. When the user sends a voice search request from a mobile terminal (such as a mobile phone), a voice collection client transmits the collected speech signal to the search engine server over the network, and a speech recognition device in the search engine server recognizes the user's voice search request to obtain a recognition result; this recognition result can serve as the voice search corpus of the present invention.

In the embodiment of obtaining the voice search corpus shown in Fig. 2, the speech recognition device needs an initial identifiable language model to recognize the user's voice search query. The initial identifiable language model in this embodiment may be an existing identifiable language model, or it may be an identifiable language model built with the method provided by the present invention. In the latter case, the recognition results obtained by the speech recognition device in Fig. 2, which form the voice search corpus used for language model training in step S101, serve to update the identifiable language model in step S102, thereby making the identifiable language model of the present invention adaptive.

The language model here refers to an N-Gram language model. This model rests on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and on no other word, so that the probability of a whole sentence is the product of the occurrence probabilities of its words. Training the language model is the process of counting, in the corpus, how often groups of N words occur together, in order to obtain each N-Gram probability value. The bigram (Bi-Gram) and trigram (Tri-Gram) models are the most commonly used; the present invention places no limitation on this.
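As an illustration only, the counting procedure described above might be sketched in Python for the Bi-Gram case as follows. The function name `train_bigram_model`, the pre-segmented input format and the use of plain relative frequencies without smoothing are assumptions made for this sketch; the patent does not prescribe them.

```python
from collections import Counter


def train_bigram_model(queries):
    """Estimate word and word-pair probabilities from a corpus.

    `queries` is a list of queries, each already segmented into a list of
    words (segmentation itself is outside the scope of this sketch).
    Returns (unigram_probs, bigram_probs) using plain relative frequencies.
    """
    unigram_counts = Counter()
    bigram_counts = Counter()
    for words in queries:
        unigram_counts.update(words)
        # Count every pair of adjacent words (w1, w2).
        bigram_counts.update(zip(words, words[1:]))

    total_words = sum(unigram_counts.values())
    # P(w): share of all word occurrences that are w.
    unigram_probs = {w: c / total_words for w, c in unigram_counts.items()}

    # Number of times each word appears as the first word of a pair.
    history_counts = Counter()
    for (w1, _), c in bigram_counts.items():
        history_counts[w1] += c
    # P(w2 | w1): share of the pairs starting with w1 that continue with w2.
    bigram_probs = {
        (w1, w2): c / history_counts[w1]
        for (w1, w2), c in bigram_counts.items()
    }
    return unigram_probs, bigram_probs
```

The same function could be applied once to the voice search corpus and once to the text search corpus to obtain the two models that are later fused. A practical system would normally add smoothing for unseen word pairs, which this sketch omits.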

The parameters of the speech language model and of the text language model are their N-Gram probability values. When the speech language model is fused with the text language model in step S102, the parameters of the speech language model and the parameters of the text language model are interpolated, which yields the parameters of the identifiable language model, that is, each N-Gram probability value in the identifiable language model.

For example, suppose P(you are good) is 0.5 in the speech language model, where P(X) denotes the probability value of X, and P(you are good) is 0.8 in the text language model. If the parameters (i.e. the probability values) of the speech language model and the text language model are given equal weights, then after interpolation P(you are good) in the identifiable language model is 50%*0.5+50%*0.8=0.65.

In addition, when interpolating the parameters of the speech language model and the text language model, the parameters of the speech language model can be given more weight. In the example above, if the weight of the speech language model is set to 70% and the weight of the text language model to 30%, then P(you are good) is 70%*0.5+30%*0.8=0.59. Weighting the parameters of the speech language model makes the final identifiable language model better reflect users' preferences in voice input. If the final identifiable language model should instead emphasize users' preferences in text input, the text language model can be weighted more heavily.
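The interpolation described above can be illustrated with a minimal Python sketch. The function name `interpolate_models`, the dictionary representation of a model and the treatment of an N-Gram missing from one model as having probability 0 are assumptions of this sketch, not details given in the patent.

```python
def interpolate_models(speech_probs, text_probs, speech_weight=0.5):
    """Linearly interpolate two sets of N-Gram probability values.

    Each model is a dict mapping an N-Gram to its probability. The text
    model receives the remaining weight, 1 - speech_weight.
    """
    if not 0.0 <= speech_weight <= 1.0:
        raise ValueError("speech_weight must lie between 0 and 1")
    text_weight = 1.0 - speech_weight
    fused = {}
    # Cover every N-Gram that appears in either model.
    for ngram in set(speech_probs) | set(text_probs):
        fused[ngram] = (
            speech_weight * speech_probs.get(ngram, 0.0)
            + text_weight * text_probs.get(ngram, 0.0)
        )
    return fused


# The example values from the text: 0.5 in the speech model, 0.8 in the text model.
speech_model = {"you are good": 0.5}
text_model = {"you are good": 0.8}
equal_weights = interpolate_models(speech_model, text_model)
speech_favoured = interpolate_models(speech_model, text_model, speech_weight=0.7)
```

With equal weights the sketch reproduces the value 0.65 from the first example; raising `speech_weight` moves the fused value towards the speech language model's 0.5.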

Once the identifiable language model has been obtained, it can further be used to recognize users' voice search queries and obtain recognition results. These recognition results can be used as a newly added voice search corpus for language model training, to update the speech language model. Fusing the updated speech language model with the text language model then updates the identifiable language model, which makes the process adaptive.

The process of recognizing the user's voice search query comprises:

building multiple candidate word sequences from the user's voice search query;

using the identifiable language model to compute the probability of each candidate word sequence under the identifiable language model, and selecting the candidate word sequence with the highest probability as the recognition result for the user's voice search query.

For example, suppose the syllables of the user's voice search query are "na li de kao ya hao chi". This syllable sequence can be expressed as multiple candidate word sequences, such as "the roast tooth there is delicious", "the roast duck there is delicious" or "where is the roast duck delicious". For each candidate word sequence, the occurrence probability of each word and the transition probabilities between adjacent words can be looked up in the identifiable language model. Multiplying the word probabilities and the transition probabilities together gives the probability of that candidate word sequence under the language model, and the candidate word sequence with the highest probability can then be taken as the recognition result for the user's voice search query. Taking a Bi-Gram identifiable language model as an example, the probability of a candidate word sequence under the identifiable language model can be expressed as follows:

P(where is the roast duck delicious) = P(where) * P(roast duck | where) * P(roast duck) * P(delicious | roast duck) * P(delicious)

Here P(where), P(roast duck) and P(delicious) are the occurrence probabilities of the individual words in the candidate word sequence, and P(roast duck | where) and P(delicious | roast duck) are the transition probabilities between adjacent words.
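A minimal Python sketch of this scoring step is given below for illustration. It follows the product described above, multiplying every word probability and every transition probability between adjacent words. The function names, the toy vocabulary and all probability values are hypothetical and chosen only to make the example run.

```python
def sequence_probability(words, unigram_probs, bigram_probs):
    """Multiply the probability of each word and of each adjacent-word transition."""
    prob = 1.0
    for word in words:
        # Words unknown to the model get probability 0 (an assumption of this sketch).
        prob *= unigram_probs.get(word, 0.0)
    for pair in zip(words, words[1:]):
        prob *= bigram_probs.get(pair, 0.0)
    return prob


def pick_best_candidate(candidates, unigram_probs, bigram_probs):
    """Return the candidate word sequence with the highest probability."""
    return max(
        candidates,
        key=lambda words: sequence_probability(words, unigram_probs, bigram_probs),
    )


# Hypothetical model parameters for a toy vocabulary.
unigram_probs = {
    "where": 0.2, "there": 0.3,
    "roast duck": 0.1, "roast tooth": 0.01,
    "delicious": 0.2,
}
bigram_probs = {
    ("where", "roast duck"): 0.3,
    ("there", "roast duck"): 0.1,
    ("there", "roast tooth"): 0.01,
    ("roast duck", "delicious"): 0.5,
    ("roast tooth", "delicious"): 0.1,
}
candidates = [
    ["there", "roast tooth", "delicious"],
    ["there", "roast duck", "delicious"],
    ["where", "roast duck", "delicious"],
]
best = pick_best_candidate(candidates, unigram_probs, bigram_probs)
```

For long sequences an implementation would more likely sum log-probabilities than multiply raw values, to avoid numerical underflow; the direct product is kept here to mirror the expression in the text.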

When building multiple candidate word sequences from the user's voice search query, the words that correspond to the syllables and occur most frequently in the language model can be selected to form a word graph as shown in Fig. 3. In the word graph, any connected path from front to back can serve as a candidate word sequence. It should be understood that this way of building candidate word sequences is only illustrative. The present invention does not limit the strategy for building candidate word sequences, and any approach that those skilled in the art can implement may be chosen.
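One simple way to picture the word graph is sketched below in Python. It assumes, purely for illustration, that the syllable sequence has already been split into positions and that a hypothetical homophone lookup has supplied the candidate words for each position; a real word graph may also contain paths with different segmentations, which this sketch does not model.

```python
from itertools import product


def build_word_graph(slots, unigram_probs, top_k=2):
    """Keep, for each position, the top_k candidate words most frequent in the model.

    `slots` is a list of positions, each a list of words that match the
    syllables at that position.
    """
    return [
        sorted(words, key=lambda w: unigram_probs.get(w, 0.0), reverse=True)[:top_k]
        for words in slots
    ]


def enumerate_paths(word_graph):
    """Every front-to-back path through the graph is one candidate word sequence."""
    return [list(path) for path in product(*word_graph)]


# Hypothetical candidates and frequencies for the three positions of a query.
unigram_probs = {
    "where": 0.2, "there": 0.3, "that": 0.05,
    "roast duck": 0.1, "roast tooth": 0.01,
    "delicious": 0.2,
}
slots = [
    ["where", "there", "that"],
    ["roast duck", "roast tooth"],
    ["delicious"],
]
word_graph = build_word_graph(slots, unigram_probs, top_k=2)
paths = enumerate_paths(word_graph)
```

Limiting each position to `top_k` words keeps the number of paths, which grows as the product of the position sizes, manageable before the candidates are scored.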

After the recognition result for the user's voice search query has been obtained, the present invention can further return search results related to the recognition result to the user. This process is similar to the way an existing search engine returns search results related to the query content entered by a user, and is not described in detail here. It can be understood that the search results related to the recognition result may be search results that contain the recognition result, or search results obtained by expanding the recognition result. Any existing expansion strategy may be used for such query expansion; the present invention places no limitation on this.

Please refer to Fig. 4, which is a structural block diagram of an embodiment of the device for building a language model for speech recognition in the present invention. As shown in Fig. 4, the device comprises a first training unit 201, a second training unit 202, an integrated unit 203 and a recognition unit 204.

The first training unit 201 is configured to perform language model training on the results of recognizing users' voice search queries, used as a voice search corpus, to obtain a speech language model.

The second training unit 202 is configured to perform language model training on users' text search queries, used as a text search corpus, to obtain a text language model.

The integrated unit 203 is configured to fuse the speech language model with the text language model to obtain an identifiable language model.

The recognition unit 204 is configured to recognize the user's voice search query with the identifiable language model, to obtain a recognition result.

In one embodiment, the voice search corpus used by the first training unit 201 for language model training is obtained by recognizing the user's voice search query with an initial identifiable language model.

The initial identifiable language model may be an existing identifiable language model, or it may be an identifiable language model built with the device provided by the present invention.

The text search corpus used by the second training unit 202 is the retrieval log in which the search engine records users' text search queries.

The language model in the present invention refers to an N-Gram language model. This model rests on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and on no other word, so that the probability of a whole sentence is the product of the occurrence probabilities of its words. Training the language model is the process of counting, in the corpus, how often groups of N words occur together, in order to obtain each N-Gram probability value. The bigram (Bi-Gram) and trigram (Tri-Gram) models are the most commonly used; the present invention places no limitation on this.

The parameters of the speech language model and of the text language model are their N-Gram probability values. When the integrated unit 203 fuses the speech language model with the text language model, it interpolates the parameters of the speech language model and the parameters of the text language model, which yields the parameters of the identifiable language model, that is, each N-Gram probability value in the identifiable language model.

For example, suppose P(you are good) is 0.5 in the speech language model (P denotes a probability value) and P(you are good) is 0.8 in the text language model. If the parameters (i.e. the probability values) of the speech language model and the text language model are given equal weights, then after interpolation P(you are good) in the identifiable language model is 50%*0.5+50%*0.8=0.65.

When interpolating the parameters of the speech language model and the text language model, the integrated unit 203 can give more weight to the parameters of the speech language model. In the example above, if the weight of the speech language model is set to 70% and the weight of the text language model to 30%, then P(you are good) is 70%*0.5+30%*0.8=0.59. Weighting the parameters of the speech language model makes the final identifiable language model better reflect users' preferences in voice input. If the final identifiable language model should instead emphasize users' preferences in text input, the integrated unit 203 can weight the text language model more heavily.

Please refer to Fig. 5, which is a schematic diagram of an embodiment of the recognition unit in the present invention. As shown in Fig. 5, the recognition unit 204 comprises a word sequence unit 2041 and a computing unit 2042. The word sequence unit 2041 is configured to build multiple candidate word sequences from the user's voice search query. The computing unit 2042 is configured to use the identifiable language model to compute the probability of each candidate word sequence under the identifiable language model, and to select the candidate word sequence with the highest probability as the recognition result for the user's voice search query.

After obtaining the user's voice search query, the word sequence unit 2041 builds multiple candidate word sequences from the syllables of the query. For example, if the syllables of the user's voice search query are "na li de kao ya hao chi", the word sequence unit 2041 can build a word graph as shown in Fig. 3, in which any connected path from front to back forms a candidate word sequence, such as "the roast tooth there is delicious", "the roast duck there is delicious" or "where is the roast duck delicious". When building multiple candidate word sequences, the word sequence unit 2041 can select the words that correspond to the syllables and occur most frequently in the identifiable language model to form the word graph; any other approach that those skilled in the art can implement may also be used to build the candidate word sequences.

For each candidate word sequence produced by the word sequence unit 2041, the computing unit 2042 looks up in the identifiable language model the occurrence probability of each word and the transition probabilities between adjacent words, and multiplies the word probabilities and the transition probabilities together to obtain the probability of that candidate word sequence. The candidate word sequence with the highest probability can then be taken as the recognition result for the user's voice search query.

Please continue to refer to Fig. 4. Further, the recognition unit 204 supplies the recognition result to the first training unit 201, so that the first training unit 201 performs language model training on the recognition result, used as a newly added voice search corpus, to update the speech language model. Passing the updated speech language model and the text language model through the integrated unit 203 then updates the identifiable language model, which makes the device adaptive. In addition, the device of the present invention may further comprise a retrieval unit (not shown in Fig. 4), configured to return search results related to the recognition result to the user after the recognition unit 204 has obtained the recognition result for the user's voice search query. The retrieval unit works on the same principle as the retrieval unit of an existing search engine and is not described in detail here. It should be understood that the search results related to the recognition result may be search results that contain the recognition result, or search results obtained by expanding the recognition result. Any existing expansion strategy may be used for such query expansion; the present invention places no limitation on this.
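The feedback loop between the units can be pictured with the following Python sketch, offered only as an illustration. To stay short it uses single-word probabilities instead of a full N-Gram model and retrains after every query; the class name `AdaptiveRecognizer`, the helper names and the default weight of 0.7 are all assumptions rather than details of the device.

```python
from collections import Counter


def train_unigram(corpus):
    """Stand-in for the training units 201 and 202: relative word frequencies."""
    counts = Counter(word for query in corpus for word in query)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def fuse(speech_model, text_model, speech_weight):
    """Stand-in for the integrated unit 203: weighted interpolation."""
    return {
        word: speech_weight * speech_model.get(word, 0.0)
        + (1.0 - speech_weight) * text_model.get(word, 0.0)
        for word in set(speech_model) | set(text_model)
    }


def recognize(candidates, model):
    """Stand-in for the recognition unit 204: pick the most probable candidate."""
    def score(words):
        prob = 1.0
        for word in words:
            prob *= model.get(word, 0.0)
        return prob
    return max(candidates, key=score)


class AdaptiveRecognizer:
    def __init__(self, voice_corpus, text_corpus, speech_weight=0.7):
        self.voice_corpus = list(voice_corpus)
        self.text_model = train_unigram(text_corpus)
        self.speech_weight = speech_weight
        self._refresh()

    def _refresh(self):
        # Retrain the speech language model and fuse it with the text model.
        speech_model = train_unigram(self.voice_corpus)
        self.model = fuse(speech_model, self.text_model, self.speech_weight)

    def handle_query(self, candidates):
        result = recognize(candidates, self.model)
        # Feed the recognition result back as newly added voice search corpus.
        self.voice_corpus.append(result)
        self._refresh()
        return result
```

In practice the speech language model would more plausibly be retrained in batches from logged recognition results rather than after each query; the per-query update here only makes the loop easy to follow.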

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for building a language model for speech recognition, characterized in that the method comprises:

A. performing language model training on the results of recognizing users' voice search queries, used as a voice search corpus, to obtain a speech language model; and performing language model training on users' text search queries, used as a text search corpus, to obtain a text language model;

B. fusing the speech language model with the text language model to obtain an identifiable language model;

wherein in step B, when the speech language model is fused with the text language model, the parameters of the speech language model and the parameters of the text language model are interpolated to obtain the parameters of the identifiable language model.

2. The method according to claim 1, characterized in that in step A, an initial identifiable language model is used to recognize the user's voice search query.

3. The method according to claim 1, characterized in that when the parameters of the speech language model and the text language model are interpolated, the parameters of the speech language model or of the text language model are weighted.

4. The method according to claim 1, characterized in that the method further comprises:

using the identifiable language model to recognize the user's voice search query, to obtain a recognition result.

5. The method according to claim 4, characterized in that the method further comprises: performing language model training on the recognition result, used as a newly added voice search corpus, to update the speech language model, and returning to step B.

6. The method according to claim 4, characterized in that the step of recognizing the user's voice search query with the identifiable language model comprises:

building multiple candidate word sequences from the user's voice search query;

using the identifiable language model to compute the probability of each candidate word sequence under the identifiable language model, and selecting the candidate word sequence with the highest probability as the recognition result for the user's voice search query.

7. The method according to claim 4, characterized in that the method further comprises:

returning search results related to the recognition result to the user.

8. A device for building a language model for speech recognition, characterized in that the device comprises:

a first training unit, configured to perform language model training on the results of recognizing users' voice search queries, used as a voice search corpus, to obtain a speech language model;

a second training unit, configured to perform language model training on users' text search queries, used as a text search corpus, to obtain a text language model;

an integrated unit, configured to fuse the speech language model with the text language model to obtain an identifiable language model;

wherein when the integrated unit fuses the speech language model with the text language model, it interpolates the parameters of the speech language model and the text language model to obtain the parameters of the identifiable language model.

9. The device according to claim 8, characterized in that the voice search corpus used by the first training unit for language model training is obtained by recognizing the user's voice search query with an initial identifiable language model.

10. The device according to claim 8, characterized in that when the integrated unit interpolates the parameters of the speech language model and the text language model, it weights the parameters of the speech language model or of the text language model.

11. The device according to claim 8, characterized in that the device further comprises:

a recognition unit, configured to recognize the user's voice search query with the identifiable language model, to obtain a recognition result.

12. The device according to claim 11, characterized in that the recognition unit supplies the recognition result to the first training unit, so that the first training unit performs language model training on the recognition result, used as a newly added voice search corpus, to update the speech language model.

13. The device according to claim 11, characterized in that the recognition unit comprises:

a word sequence unit, configured to build multiple candidate word sequences from the user's voice search query;

a computing unit, configured to use the identifiable language model to compute the probability of each candidate word sequence under the identifiable language model, and to select the candidate word sequence with the highest probability as the recognition result for the user's voice search query.

14. The device according to claim 11, characterized in that the device further comprises: a retrieval unit, configured to return search results related to the recognition result to the user.