CN101464896B

CN101464896B - Voice fuzzy retrieval method and apparatus

Info

Publication number: CN101464896B
Application number: CN2009100011645A
Authority: CN
Inventors: 王智国; 吴及; 钱胜; 吕萍; 陈志刚; 胡国平; 胡郁; 刘庆峰; 吴晓如; 王仁华
Original assignee: iFlytek Co Ltd
Current assignee: Tsinghua University; iFlytek Co Ltd
Priority date: 2009-01-23
Filing date: 2009-01-23
Publication date: 2010-08-11
Anticipated expiration: 2029-01-23
Also published as: CN101464896A

Abstract

The invention discloses a method and a device for voice fuzzy retrieval. The method comprises the following steps: speech recognition is performed on an obtained speech signal using a preset acoustic model and a preset language model to obtain a recognition result; retrieval is performed in a preset text entry database using a preset index table according to the recognition result to obtain preliminary entries; fuzzy string matching is performed between the preliminary entries and the recognition result, entries whose matching degree lies within a preset matching-degree threshold range are selected as refined entries, and the matching position of each entry is recorded; the posterior probability between the matched portion of the text of each refined entry and the speech signal is calculated; and finally, several entries are selected as the retrieval result for the speech signal using the posterior probability and the matching ratio obtained from the matching positions. With the invention, text entries matching a speech signal can be retrieved quickly and accurately from a large-capacity text entry database.

Description

Voice fuzzy retrieval method and device

Technical field

The present invention relates to the fields of speech recognition and retrieval, and in particular to a voice fuzzy retrieval method and device.

Background technology

Voice fuzzy retrieval is a branch of multimedia retrieval technology. Unlike traditional text retrieval and audio retrieval, it addresses neither the retrieval of text in a text database nor the retrieval of audio in an audio database, but the retrieval of audio in a text database: that is, how to retrieve from a text database the text information relevant to the content of a speech signal submitted by the user.

Speech recognition technology can convert a speech signal into text. Using the converted text with existing text retrieval methods makes it possible to retrieve audio against a text database. However, speech recognition cannot be one hundred percent accurate; for spoken-style speech in particular, recognition accuracy is usually below 90%. When a massive text entry database is searched with inaccurate text, the retrieval results are even less accurate.

Summary of the invention

The invention provides a voice fuzzy retrieval method and device, to solve the problem of inaccurate retrieval in existing speech recognition technology.

To this end, the embodiments of the invention adopt the following technical solution:

A voice fuzzy retrieval method, comprising:

performing speech recognition on an obtained speech signal using a preset acoustic model and a preset language model, to obtain a recognition result;

retrieving in a preset text entry database, using a preset index table and according to the recognition result, to obtain preliminary entries;

performing fuzzy string matching between the preliminary entries and the recognition result, selecting as refined entries those whose matching degree lies within a preset matching-degree threshold range, and recording the matching positions;

calculating the posterior probability between the matched portion of the text of each refined entry and the speech signal, and selecting several entries as the retrieval result for the speech signal using the posterior probability and the matching ratio obtained from the matching positions.

The method further comprises:

building the index table from the text entries to be retrieved, using syllables, characters or words as index units, so as to perform single-level or multi-level indexing.

The method further comprises:

all or part of the language model is trained on the preset text entry database.

Wherein:

the forms of the recognition result include the single most probable text string corresponding to the speech signal, the N most probable text strings corresponding to the speech signal, and the word graph corresponding to the speech signal.

The specific process of retrieving in the preset text entry database, using the preset index table and according to the recognition result, to obtain the preliminary entries is:

voting with the preset index table for each character/word in the recognition result, and selecting the entries whose vote count exceeds a preset vote threshold as the preliminary entries;

wherein voting means looking up the index entry of the index table for a character/word in the recognition result and, once the index entry is found, incrementing by 1 the vote count of every entry listed under that index entry.

The fuzzy matching algorithm uses a dynamic-programming computation of the edit distance between texts based on a confusion matrix, wherein the confusion matrix is obtained by training or set in advance and is used to optimize the substitution, insertion and deletion costs.
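As a rough illustration of an edit distance whose substitution cost comes from a confusion matrix, the following minimal sketch may help. The function name, the dictionary-based matrix, the uniform insertion and deletion costs, and the sample cost of 0.2 are all assumptions made for illustration; the text does not specify how the matrix is stored or trained.

```python
def weighted_edit_distance(recognized, reference, confusion,
                           ins_cost=1.0, del_cost=1.0, default_sub=1.0):
    """Edit distance between two strings with confusion-aware substitution costs.

    `confusion` maps (recognized_char, reference_char) to a substitution cost;
    pairs missing from it fall back to `default_sub` (hypothetical layout).
    """
    m, n = len(recognized), len(reference)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i * del_cost          # drop every recognized character
    for j in range(1, n + 1):
        dist[0][j] = j * ins_cost          # supply every reference character
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = recognized[i - 1], reference[j - 1]
            sub = 0.0 if a == b else confusion.get((a, b), default_sub)
            dist[i][j] = min(dist[i - 1][j - 1] + sub,
                             dist[i - 1][j] + del_cost,
                             dist[i][j - 1] + ins_cost)
    return dist[m][n]


# Hypothetical matrix: two characters that sound alike are cheap to confuse.
confusion = {("工", "共"): 0.2}
cheap = weighted_edit_distance("工和国", "共和国", confusion)
costly = weighted_edit_distance("大和国", "共和国", confusion)
```

With such a matrix, an acoustically plausible misrecognition is penalized less than an arbitrary substitution, which is the effect the confusion matrix is meant to have on the matching degree.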

A voice fuzzy retrieval device, comprising:

a speech signal acquiring unit, configured to obtain a speech signal;

a recognition unit, configured to perform speech recognition on the obtained speech signal using a preset acoustic model and a preset language model, to obtain a recognition result;

a retrieval unit, configured to retrieve in a preset text entry database, using a preset index table and according to the recognition result, to obtain preliminary entries;

a fuzzy matching unit, configured to perform fuzzy string matching between the preliminary entries and the recognition result, select as refined entries those whose matching degree lies within a preset matching-degree threshold range, and record the matching positions;

a result determining unit, configured to calculate the posterior probability between the matched portion of each refined entry and the speech signal, and to select several entries as the retrieval result for the speech signal using the posterior probability and the matching ratio obtained from the matching positions.

The device further comprises:

an index table building unit, configured to build the index table from the preset text entry database to be retrieved, using syllables, characters or words as index units, the index table being used to perform single-level or multi-level indexing.

The device further comprises:

a language model building unit, configured to obtain part or all of the language model by training on the preset text entry database.

The retrieval unit comprises:

an index voting subunit, configured to vote with the preset index table for each character/word in the recognition result, wherein voting means looking up the index entry of the index table for a character/word in the recognition result and, once the index entry is found, incrementing by 1 the vote count of every entry listed under that index entry;

a preliminary entry selection subunit, configured to select the entries whose vote count exceeds a preset vote threshold as the preliminary entries.

As can be seen, the invention proposes a new voice fuzzy retrieval approach. Through steps such as an application-dependent language model, index voting, fuzzy string matching, and calculation of the posterior probability between the refined entries and the speech signal, it overcomes the adverse effect of imperfect speech recognition results on text database retrieval, and achieves fast and accurate retrieval of speech signals against a massive text entry database.

Description of drawings

Fig. 1 is a flow chart of the voice fuzzy retrieval method of the invention;

Fig. 2 is a flow chart of an embodiment of the method of the invention;

Fig. 3 is a structural diagram of the voice fuzzy retrieval device of the invention.

Embodiment

In the voice fuzzy retrieval scheme provided by the invention, a suitable language model is added during recognition to improve accuracy; when the recognition result is used as text for retrieval, fuzzy string matching is performed to reduce the influence of recognition errors; and the posterior probability that a candidate entry is the audio content is calculated for verification, thereby substantially improving the accuracy and reliability of retrieval.

Referring to Fig. 1, the flow chart of the voice fuzzy retrieval method of the invention, the method comprises the following steps:

S101: perform speech recognition on the obtained speech signal using a preset acoustic model and a preset language model, to obtain a recognition result;

S102: retrieve in a preset text entry database, using a preset index table and according to the recognition result, to obtain preliminary entries;

The preset text entry database is generally a massive text entry database containing a large amount of text entry information.

S103: perform fuzzy string matching between the preliminary entries and the recognition result, select as refined entries those whose matching degree lies within a preset matching-degree threshold range, and record the matching positions;

S104: calculate the posterior probability between the matched portion of each refined entry and the speech signal, and select several entries as the retrieval result for the speech signal using the posterior probability and the matching ratio obtained from the matching positions.

The invention is described in detail below with reference to a specific example.

Referring to Fig. 2, a flow chart of a specific embodiment that uses the voice fuzzy retrieval technique to search a massive text entry database by voice, the method comprises:

S201: obtain the speech signal input by the user;

S202: perform speech recognition on the obtained speech signal using an acoustic model and a language model built in advance, to obtain a recognition result;

S203: quickly retrieve in the preset text entry database, using the preset index table and according to the recognition result, to obtain preliminary entries;

Before the voice fuzzy retrieval system is built, a suitable speech model and an index table for the massive text entry database must be established in advance.

Because the goal is to retrieve from the massive text entry database the text that contains the speech content, the speech content very likely exists in the database, either as an entry or as part of an entry. A language model trained with the massive text entry database as its corpus is therefore an application-dependent language model, which adapts better to the retrieval task.
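To make the idea of training the language model on the entry database itself concrete, here is a minimal sketch that assumes a character-level bigram model with add-one smoothing. The model order, the unit (character rather than word), the smoothing and the sample entries are all assumptions chosen for brevity; the text does not specify any of them, and a production recognizer would normally use a dedicated language-modeling toolkit.

```python
from collections import Counter


def train_bigram(entries):
    """Return prob(prev, cur) estimated from the entry texts (add-one smoothed)."""
    unigrams, bigrams = Counter(), Counter()
    for text in entries:
        chars = ["<s>"] + list(text) + ["</s>"]
        unigrams.update(chars[:-1])               # contexts that have a successor
        bigrams.update(zip(chars, chars[1:]))     # adjacent character pairs
    # Symbols that may follow a context: every seen character plus end-of-entry.
    vocab = {ch for text in entries for ch in text} | {"</s>"}

    def prob(prev, cur):
        return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + len(vocab))

    prob.vocab = vocab
    return prob


# Illustrative entries standing in for a (much larger) text entry database.
entries = ["中国共产党", "中华人民共和国", "我们的大中国"]
prob = train_bigram(entries)
seen = prob("中", "国")      # pair that occurs in the entries
unseen = prob("中", "党")    # pair that never occurs
```

Because the counts come from the entries themselves, character sequences that actually occur in the database receive higher probability than sequences that do not, which is what biases recognition toward retrievable text.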

The preset index table consists of two parts: the index entries and the content corresponding to each index entry. In the invention, an index entry of the index table is a character or a word, and the content corresponding to an index entry is the set of texts in the massive text entry database that contain that character or word; one index entry usually corresponds to multiple texts. For example, the content corresponding to the index entry "中" (zhong, the character that begins the Chinese word for "China") includes "Chinese Communist Party", "The People's Republic of China", "our big China" and so on, since the Chinese text of each contains that character.
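The two-part index table described above behaves like an inverted index. The sketch below builds one at the character level; the function name, the use of list positions as entry ids, and the three Chinese strings (stand-ins for the translated examples) are assumptions for illustration only.

```python
from collections import defaultdict


def build_index(entries):
    """Map each character (index entry) to the ids of the texts containing it."""
    index = defaultdict(set)
    for entry_id, text in enumerate(entries):
        for ch in set(text):            # index each character once per entry
            index[ch].add(entry_id)
    return index


# Illustrative entries; ids are simply positions in the list.
entries = ["中国共产党", "中华人民共和国", "我们的大中国"]
index = build_index(entries)
```

Note that the table records only which entries contain a character, not where it occurs, so it is cheap to build and query but cannot by itself confirm a match.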

Thus, in S202, when speech recognition is performed on the input speech, adding the application-dependent language model trained as described under S203 improves recognition accuracy considerably, so that S202 yields a highly accurate recognition result.

The recognition result is the textual representation of the speech signal after decoding. Common forms are: the single most probable text string for the input speech signal (only one recognition result, for example "The People's Republic of China"); the N most probable text strings (multiple recognition results, for example three: "Chinese Communist Party", "The People's Republic of China" and "our big China"); and the word graph corresponding to the speech signal. A word graph represents all possible text strings as a directed acyclic graph; it is the most efficient form of recognition result and also carries the most information.
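A word graph can be pictured as a small directed acyclic graph whose paths spell out the competing recognition hypotheses. The sketch below is only a toy: the adjacency-list layout, the node numbering and the labels are hypothetical, and real recognizers attach acoustic and language-model scores to every edge.

```python
def enumerate_paths(graph, node, end):
    """Yield every text string spelled by a path from `node` to `end`."""
    if node == end:
        yield ""
        return
    for label, next_node in graph.get(node, ()):
        for rest in enumerate_paths(graph, next_node, end):
            yield label + rest


# Hypothetical graph: node -> list of (label, next node); node 3 is the end.
graph = {
    0: [("中", 1)],
    1: [("华", 2), ("国", 2)],   # two competing hypotheses for the same span
    2: [("人民", 3)],
}
hypotheses = set(enumerate_paths(graph, 0, 3))
```

The two alternatives share the edges before and after the point of disagreement, which is why a graph stores many hypotheses more compactly than an explicit list of N text strings.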

In S203, index voting is performed with the preset index table for each character/word in the recognition result obtained in S202. Voting means looking up the index entry of the index table for each character/word in the recognition result; once the index entry is found, the vote count of every text it lists is incremented by 1. For example, if the recognition result contains the character "中" (zhong), the vote counts of all texts containing that character, such as "Chinese Communist Party", "The People's Republic of China" and "our big China", are each incremented by 1. The more votes a text receives, the better it matches the recognition result. Texts whose vote count exceeds the threshold are kept as preliminary entries.
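The voting step can be sketched as follows, assuming a character-level index, list positions as entry ids, and a hypothetical threshold of 3. The sample recognition result deliberately contains one misrecognized character to show that voting tolerates it.

```python
from collections import Counter, defaultdict


def build_index(entries):
    """Character -> ids of the entries that contain it."""
    index = defaultdict(set)
    for entry_id, text in enumerate(entries):
        for ch in set(text):
            index[ch].add(entry_id)
    return index


def select_candidates(recognized, index, min_votes):
    """Vote once per character of the recognition result; keep entries above the threshold."""
    votes = Counter()
    for ch in recognized:
        for entry_id in index.get(ch, ()):   # characters absent from the index cast no vote
            votes[entry_id] += 1
    return {entry_id: n for entry_id, n in votes.items() if n > min_votes}


entries = ["中国共产党", "中华人民共和国", "我们的大中国"]
index = build_index(entries)
# "工" stands in for a misrecognized "共"; the threshold value is illustrative.
candidates = select_candidates("中华人民工和国", index, min_votes=3)
```

Here only the second entry collects enough votes to survive, even though one character of the recognition result is wrong; the surviving entries still need position-aware matching because votes ignore character order.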

S204: perform fuzzy string matching between the preliminary entries and the recognition result, sort the matched entries by matching degree from high to low, and keep only the refined entries whose matching degree lies within the matching-degree threshold range;

Because speech recognition cannot guarantee perfect accuracy, the recognition result contains some errors; moreover, the index table records only which characters/words a text contains, not their positions. The preliminary entries returned by the index therefore cannot be used directly as retrieval results.

Therefore, fuzzy string matching is used to obtain the matching degree between each preliminary entry and the recognition result. In contrast to exact string matching, fuzzy matching allows the substring to differ from the main string. The two main approaches to fuzzy string matching at present are bit-vector methods and filtering methods, and the invention can use any existing method. The simplest fuzzy matching algorithm is edit distance computed by dynamic programming. Three kinds of error occur in matching: deletion, insertion and substitution. Each kind can be assigned a different error cost according to the application, and the error cost of a correctly matched part is usually defined as zero. In the invention, the recognition result and the texts in the massive text entry database can both be regarded as character sequences: the substring is the recognition result, and the main string is an entry in the massive text entry database. The matching degree is inversely proportional to the error cost. Because the speech signal input by the user may correspond to only a text fragment of an entry in the massive text entry database, fuzzy string matching yields the most probable matching position along with the matching degree.
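One way to obtain both an error cost and a matching position is the classic dynamic-programming variant in which the substring may start anywhere in the main string at no cost. The sketch below assumes unit costs and a hypothetical function name; it is one possible realization, not necessarily the one used in the invention.

```python
def fuzzy_locate(pattern, text, sub_cost=1, ins_cost=1, del_cost=1):
    """Return (cost, start, end) so that text[start:end] is the span closest to pattern."""
    m, n = len(pattern), len(text)
    # dist[i][j]: cheapest cost of aligning pattern[:i] with a span of text ending at j.
    # Row 0 is all zeros, so a match may begin at any position of the main string.
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    # start[i][j]: where the span behind dist[i][j] begins.
    start = [list(range(n + 1)) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + del_cost
        start[i][0] = 0
        for j in range(1, n + 1):
            same = 0 if pattern[i - 1] == text[j - 1] else sub_cost
            options = [
                (dist[i - 1][j - 1] + same, start[i - 1][j - 1]),   # match or substitute
                (dist[i - 1][j] + del_cost, start[i - 1][j]),       # pattern char has no partner
                (dist[i][j - 1] + ins_cost, start[i][j - 1]),       # text char has no partner
            ]
            dist[i][j], start[i][j] = min(options)
    end = min(range(n + 1), key=lambda j: dist[m][j])
    return dist[m][end], start[m][end], end


# The recognition result contains one wrong character ("工" instead of "共").
cost, begin, end = fuzzy_locate("人民工和国", "中华人民共和国")
```

A matching degree can then be derived from the cost (lower cost, higher degree), and the returned span gives the matching position from which a matching ratio can be computed.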

S205: for each qualifying refined entry, calculate its posterior probability of being the input audio content; meanwhile, record the matching position;

The refined entries obtained in step S204 result from a character-level comparison with the recognition result, and the recognition result itself contains some errors, so a high matching degree does not necessarily mean a high likelihood of being the actual speech content. Therefore, S205 calculates the posterior probability of each refined entry given the speech signal. This posterior probability is a value between 0 and 1, and the posterior probabilities of all refined entries sum to 1. The larger the posterior probability, the more likely the corresponding entry really is the speech content. A posterior probability is a probability revised after information about the "result" has been obtained; in Bayes' formula it is the probability of the "cause" in the problem of "inferring the cause from the effect". Prior and posterior probabilities are inseparably linked, and the posterior probability is calculated on the basis of the prior probability. Methods for computing posterior probabilities are mature prior art and are not described further here.
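Since the text leaves the posterior computation to prior art, the following is only a generic Bayes-rule sketch. It assumes that each candidate entry already has an acoustic log-likelihood for the speech signal (the numbers below are made up) and a prior probability, and it normalizes over the candidate set so that the posteriors sum to 1.

```python
import math


def posteriors(log_likelihoods, priors):
    """P(entry | speech) for each candidate, via Bayes' rule over the candidate set."""
    joint = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    peak = max(joint)                       # subtract the max for numerical stability
    weights = [math.exp(v - peak) for v in joint]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical log-likelihoods of the speech signal given three candidate entries,
# with a uniform prior over the candidates.
post = posteriors([-120.0, -123.0, -130.0], [1 / 3, 1 / 3, 1 / 3])
```

Working in the log domain and subtracting the largest value before exponentiating avoids underflow, which matters because acoustic likelihoods of whole utterances are extremely small numbers.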

S206: using the posterior probability and the matching ratio obtained from the matching position, select several entries as the retrieval result for the speech signal, and then end the process.

Here, the posterior probability and the matching ratio can be combined by weighting, and the entries for which both are relatively high are finally selected as the retrieval result.
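The weighting can be sketched as a simple linear interpolation. The linear form, the weight of 0.5, the cut-off `top_n`, and the interpretation of the matching ratio as the fraction of an entry covered by the matched span are all assumptions; the text only says that the two quantities are weighted.

```python
def rank_entries(candidates, weight=0.5, top_n=3):
    """candidates: iterable of (entry, posterior, match_ratio); returns the best entries."""
    scored = [(weight * posterior + (1.0 - weight) * ratio, entry)
              for entry, posterior, ratio in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # highest combined score first
    return [entry for _, entry in scored[:top_n]]


# Hypothetical posteriors and matching ratios for three refined entries.
candidates = [("A", 0.6, 0.2), ("B", 0.3, 1.0), ("C", 0.1, 0.5)]
best = rank_entries(candidates, weight=0.5, top_n=2)
```

In this example entry "B" outranks "A" despite a lower posterior, because the spoken content covers all of "B" but only a small part of "A"; shifting `weight` toward 1 would favor the posterior instead.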

Corresponding to the above method, the invention provides a voice fuzzy retrieval device, which can be implemented in software, hardware, or a combination of the two.

Referring to Fig. 3, a schematic diagram of the internal structure of the device, the device comprises: a speech signal acquiring unit 300, a recognition unit 301, a retrieval unit 302, a fuzzy matching unit 303 and a result determining unit 304, wherein:

the speech signal acquiring unit 300 is configured to obtain a speech signal;

the recognition unit 301 is configured to perform speech recognition on the speech signal obtained by the speech signal acquiring unit 300 using a preset acoustic model and a preset language model, to obtain a recognition result;

the retrieval unit 302 is configured to retrieve in a preset text entry database, using a preset index table and according to the recognition result obtained by the recognition unit 301, to obtain preliminary entries;

the fuzzy matching unit 303 is configured to perform fuzzy string matching between the preliminary entries obtained by the retrieval unit 302 and the recognition result obtained by the recognition unit 301, select as refined entries those whose matching degree lies within a preset matching-degree threshold range, and record the matching positions;

the result determining unit 304 is configured to calculate the posterior probability between the refined entries matched by the fuzzy matching unit 303 and the speech signal, and to select several entries as the retrieval result for the speech signal using the posterior probability and the matching ratio obtained from the matching positions.

Preferably, the device further comprises:

an index table building unit 305, configured to build the index table from the preset text entries, using syllables, characters or words as index units.

Preferably, the device further comprises:

a language model building unit 306, configured to obtain the language model by training on the preset text entry database.

Preferably, the retrieval unit 302 further comprises:

an index voting subunit (not shown), configured to vote with the preset index table for each character/word in the recognition result, wherein voting means looking up the index entry of the index table for a character/word in the recognition result and, once the index entry is found, incrementing by 1 the vote count of every entry listed under that index entry;

a preliminary entry selection subunit (not shown), configured to select the entries whose vote count exceeds a preset vote threshold as the preliminary entries.

For implementation details of the device provided by the invention, refer to the method embodiment; they are not repeated here.

As can be seen, the invention proposes a new voice fuzzy retrieval scheme. Through steps such as an application-dependent language model, index voting, fuzzy string matching, and calculation of the posterior probability between candidate texts and the speech signal, it overcomes the adverse effect of imperfect speech recognition results on text database retrieval, and achieves fast and accurate retrieval of speech signals against a massive text entry database.

Those of ordinary skill in the art will understand that the process of implementing the methods of the above embodiments can be carried out by hardware under the control of program instructions. The program can be stored in a readable storage medium and, when executed, performs the corresponding steps of the above method. The storage medium may be, for example, a ROM/RAM, a magnetic disk or an optical disc.

The above is only a preferred embodiment of the invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principle of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.

Claims

1. A voice fuzzy retrieval method, characterized by comprising:

2. The method according to claim 1, characterized in that the index table comprises index entries and the content corresponding to each index entry, wherein each index entry is a character or a word, and the content corresponding to an index entry is the text in the preset text entry database that contains that character or word.

3. The method according to claim 2, characterized by further comprising: all or part of the language model is trained on the preset text entry database.

4. The method according to claim 1, characterized in that the specific process of retrieving in the preset text entry database, using the preset index table and according to the recognition result, to obtain the preliminary entries is:

5. The method according to claim 1, characterized in that the fuzzy matching algorithm uses a dynamic-programming computation of the edit distance between texts based on a confusion matrix, wherein the confusion matrix is obtained by training or set in advance and is used to optimize the substitution, insertion and deletion costs.

6. A voice fuzzy retrieval device, characterized by comprising:

a speech signal acquiring unit, configured to obtain a speech signal;

7. The device according to claim 6, characterized by further comprising:

an index table building unit, configured to build the index table, the index table comprising index entries and the content corresponding to each index entry, wherein each index entry is a character or a word, and the content corresponding to an index entry is the text in the preset text entry database that contains that character or word.

8. The device according to claim 7, characterized by further comprising:

9. The device according to claim 6, 7 or 8, characterized in that the retrieval unit comprises:

an index voting subunit, configured to vote with the preset index table for each character/word in the recognition result, wherein voting means looking up the index entry of the index table for a character/word in the recognition result and, once the index entry is found, incrementing by 1 the vote count of every entry listed under that index entry;