CN104516986A - Statement identification method and device

Info

Publication number: CN104516986A
Application number: CN201510024299.9A
Authority: CN
Inventors: 王金龙; 贾明静; 董日壮
Original assignee: Qingdao University of Technology
Current assignee: Qingdao University of Technology
Priority date: 2015-01-16
Filing date: 2015-01-16
Publication date: 2015-04-15
Anticipated expiration: 2035-01-16
Also published as: CN104516986B

Abstract

The application provides a sentence recognition method and device. The method comprises: for an obtained sentence to be recognized, determining its non-stop words as keywords; selecting, from a preset sentence library, candidate sentences that contain those keywords; and determining a topic classification label and an intent classification label of the sentence to be recognized using a pre-built classification model, where the classification model can recognize an intent of the unknown class. When the recognized intent classification label is the unknown class and there are multiple candidate sentences, the candidate sentences are grouped by their preset intent labels, and the preset information corresponding to the candidate sentences in each group is displayed. Because different groups correspond to different intent types, selecting candidate sentences from each group as target sentences and displaying the preset information corresponding to each target sentence solves the problem that the feedback is limited to a single kind of information or cannot be produced at all.

Description

Sentence recognition method and device

Technical field

The application relates to the technical field of language data processing, and in particular to a sentence recognition method and device.

Background

In natural language processing, it is often necessary to recognize the intent of natural language so that feedback information can be generated automatically. For example, in automatic question answering, when a user inputs the sentence "why is the refrigerator not powering on", intent recognition must be performed on the input so that the reason the refrigerator is not powering on can be fed back.

Because natural language is complex, a sentence often corresponds to several different sub-intents. For example, when the user inputs "the refrigerator is not powering on", the user may want to ask why the refrigerator is not powering on, or may want to ask how to fix it.

At present, for sentences with several different sub-intents, the generated feedback is limited to a single kind, for example only the reason or only the solution, and in some cases no feedback can be generated at all.

Summary of the invention

In view of this, the application provides a sentence recognition method and device to solve the technical problem that existing recognition methods output only a single kind of feedback information or cannot output feedback at all. To achieve this purpose, the technical solution provided by the invention is as follows:

A sentence recognition method, comprising:

Obtaining a sentence to be recognized;

Determining the non-stop words in the sentence to be recognized as keywords;

Selecting, from a preset sentence library, candidate sentences that contain the keywords;

Determining a topic classification label and an intent classification label of the sentence to be recognized using a pre-built classification model;

When the intent classification label is the unknown class and there are multiple candidate sentences, classifying the candidate sentences according to their respective preset intent labels to obtain multiple groups;

Determining candidate sentences in each group as target sentences, where the preset topic label of each target sentence is the same as the topic classification label of the sentence to be recognized;

Displaying the preset information corresponding to each target sentence.

Optionally, the method further comprises:

When the intent classification label is not the unknown class, determining the similarity between the sentence to be recognized and each candidate sentence;

Determining as the target sentence the candidate sentence with the maximum similarity, provided that this similarity exceeds a preset similarity threshold;

Displaying the preset information corresponding to the target sentence.

Optionally, determining candidate sentences in each group as target sentences comprises:

Determining the similarity between the sentence to be recognized and each candidate sentence;

Sorting in descending order of similarity and, in each group, selecting as target sentences a preset number of top-ranked candidate sentences whose similarity exceeds a preset similarity threshold.

Optionally, determining the topic classification label and the intent classification label of the sentence to be recognized using the pre-built classification model comprises:

Extracting multiple classification features from the sentence to be recognized according to a preset feature-word extraction rule;

Inputting the classification features into the classification model to obtain multiple intent probability values and multiple topic probability values;

Determining the classification label with the maximum intent probability value as the intent classification label of the sentence to be recognized, and the classification label with the maximum topic probability value as its topic classification label.

Optionally, the process of building the classification model comprises:

Obtaining a training set containing multiple labeled sentences, where each labeled sentence has its own intent label and topic label;

Training on the training set with a preset training method to obtain the classification model, where the classification model is used to classify the intent and the topic of the sentence to be recognized.

Optionally, determining the similarity between the sentence to be recognized and each candidate sentence comprises:

Calculating, for the sentence to be recognized and each candidate sentence, a semantic similarity, a topic-intent similarity and a syntactic similarity; where the semantic similarity is the semantic similarity between the keywords of the sentence to be recognized and the keywords of the candidate sentence; the topic-intent similarity is the similarity between the topic and intent of the sentence to be recognized and the topic and intent of the candidate sentence; and the syntactic similarity is the similarity between the syntactic structure of the sentence to be recognized and that of the candidate sentence;

Computing a weighted average of the semantic similarity, the topic-intent similarity and the syntactic similarity of each candidate sentence to obtain the similarity between the sentence to be recognized and that candidate sentence.
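
The weighted average of the three component similarities can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `combined_similarity` and the default weights are hypothetical, since the text does not specify weight values.

```python
def combined_similarity(semantic, topic_intent, syntactic, weights=(0.5, 0.3, 0.2)):
    """Weighted average of three component similarities, each expected in [0, 1].

    The default weights are placeholders chosen for illustration only.
    """
    w_sem, w_ti, w_syn = weights
    total = w_sem + w_ti + w_syn
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Dividing by the weight total keeps the result in [0, 1]
    # even when the weights do not sum to exactly 1.
    return (w_sem * semantic + w_ti * topic_intent + w_syn * syntactic) / total
```

In practice the weights would be tuned for the application scenario; normalizing by their sum means they need not add up to one.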

Optionally, calculating the semantic similarity between the sentence to be recognized and the candidate sentence comprises:

Calculating in turn the word similarity between each keyword of the sentence to be recognized and each keyword of the candidate sentence to obtain a similarity matrix;

Summing the maximum word similarity in each row of the similarity matrix and calculating the row mean of this sum;

Summing the maximum word similarity in each column of the similarity matrix and calculating the column mean of this sum;

Calculating the mean of the row mean and the column mean to obtain the semantic similarity between the sentence to be recognized and the candidate sentence.
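
The matrix-based calculation above can be sketched in Python as follows. The word-level similarity measure is not specified in this passage, so the sketch takes it as a parameter; `exact_match` is a hypothetical stand-in used only to make the example runnable.

```python
def semantic_similarity(query_keywords, candidate_keywords, word_similarity):
    """Mean of the row-wise and column-wise averages of best word matches."""
    if not query_keywords or not candidate_keywords:
        return 0.0
    # Rows correspond to keywords of the sentence to be recognized,
    # columns to keywords of the candidate sentence.
    matrix = [[word_similarity(q, c) for c in candidate_keywords]
              for q in query_keywords]
    row_mean = sum(max(row) for row in matrix) / len(query_keywords)
    col_mean = sum(max(col) for col in zip(*matrix)) / len(candidate_keywords)
    return (row_mean + col_mean) / 2


def exact_match(a, b):
    """Hypothetical word similarity: 1.0 for identical words, else 0.0."""
    return 1.0 if a == b else 0.0
```

Averaging the row mean and the column mean makes the measure symmetric: a candidate is penalized both when it misses keywords of the input and when it contains keywords the input lacks.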

Optionally, calculating the topic-intent similarity between the sentence to be recognized and the candidate sentence comprises:

Judging whether the topic classification label of the sentence to be recognized is the same as the preset topic classification label of the candidate sentence, to obtain a first judgment result;

Judging whether the intent classification label of the sentence to be recognized is the unknown class, to obtain a second judgment result;

Judging whether the intent classification label of the sentence to be recognized is the same as the preset intent label of the candidate sentence, to obtain a third judgment result;

When the first judgment result is yes and the second judgment result is yes, determining that the topic-intent similarity is 1;

When the first judgment result is yes, the second judgment result is no and the third judgment result is yes, determining that the topic-intent similarity is 1;

When the first judgment result is yes, the second judgment result is no and the third judgment result is no, determining that the topic-intent similarity is a preset value greater than 0 and less than 1;

When the first judgment result is no, determining that the topic-intent similarity is 0.
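
The four cases above reduce to a short decision function, sketched here in Python. The label string `"unknown"` and the default intermediate value 0.5 are assumptions made for illustration; the text only requires a preset value strictly between 0 and 1.

```python
UNKNOWN = "unknown"  # hypothetical spelling of the unknown-class label


def topic_intent_similarity(query_topic, query_intent,
                            cand_topic, cand_intent, partial=0.5):
    """Topic-intent similarity following the three judgment results."""
    if not 0 < partial < 1:
        raise ValueError("partial must be strictly between 0 and 1")
    if query_topic != cand_topic:      # first judgment result is no
        return 0.0
    if query_intent == UNKNOWN:        # second judgment result is yes
        return 1.0
    if query_intent == cand_intent:    # third judgment result is yes
        return 1.0
    return partial                     # same topic, different definite intent
```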

Optionally, calculating the syntactic similarity between the sentence to be recognized and the candidate sentence comprises:

Performing syntactic analysis on the sentence to be recognized to obtain its first syntactic constituents, and obtaining the preset second syntactic constituents of the candidate sentence;

Calculating a first word similarity over the constituents that the first syntactic constituents and the second syntactic constituents have in common;

Calculating a second word similarity over the modifier constituents that the first syntactic constituents and the second syntactic constituents have in common;

Obtaining a preset penalty factor for the constituents that the first syntactic constituents and the second syntactic constituents do not have in common;

Calculating a weighted average using the first word similarity, the second word similarity and the preset penalty factor to obtain the syntactic similarity.

Optionally, when multiple keywords are determined, selecting from the preset sentence library the candidate sentences that contain the keywords comprises:

Counting, for each sentence in the preset sentence library, the number of keywords of the sentence to be recognized that it contains;

Sorting in descending order of the number of keywords contained, and selecting a preset number of top-ranked sentences as the candidate sentences.

Optionally, determining the non-stop words in the sentence to be recognized as keywords comprises:

Performing word segmentation on the sentence to be recognized to obtain multiple segmented words;

Removing the stop words from the segmented words to obtain the keywords.

The application also provides a sentence recognition device, comprising:

A to-be-recognized sentence acquisition module, configured to obtain a sentence to be recognized;

A keyword determination module, configured to determine the non-stop words in the sentence to be recognized as keywords;

A candidate sentence acquisition module, configured to select, from a preset sentence library, candidate sentences that contain the keywords;

A topic and intent determination module, configured to determine a topic classification label and an intent classification label of the sentence to be recognized using a pre-built classification model;

A candidate sentence grouping module, configured to classify the candidate sentences according to their respective preset intent labels to obtain multiple groups when the intent classification label is the unknown class and there are multiple candidate sentences;

A target sentence determination module, configured to determine candidate sentences in each group as target sentences corresponding to the sentence to be recognized, where the preset topic label of each target sentence is the same as the topic classification label of the sentence to be recognized;

A preset information display module, configured to display the preset information corresponding to each target sentence.

Compared with the prior art, the invention has the following beneficial effects:

The invention provides a sentence recognition method and device. The method comprises: for an obtained sentence to be recognized, determining its non-stop words as keywords; selecting, from a preset sentence library, candidate sentences that contain those keywords; and determining the topic classification label and the intent classification label of the sentence to be recognized using a pre-built classification model. Note that the classification model can recognize an intent of the unknown class. When the recognized intent classification label is the unknown class and there are multiple candidate sentences, the candidate sentences are grouped by their preset intent labels, and the preset information corresponding to the candidate sentences in each group is displayed. Because different groups correspond to different intent types, selecting candidate sentences from each group as target sentences and displaying the preset information corresponding to each target sentence solves the problem that the feedback is limited to a single kind of information or cannot be produced at all.

Brief description of the drawings

To describe the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below are merely embodiments of the invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of the sentence recognition method provided by an embodiment of the invention;

Fig. 2 is a flowchart of the sentence recognition method provided by another embodiment of the invention;

Fig. 3 is a flowchart of determining the semantic similarity, provided by yet another embodiment of the invention;

Fig. 4 is a schematic structural diagram of the sentence recognition device provided by an embodiment of the invention.

Detailed description

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.

Fig. 1 shows the flow of the sentence recognition method provided by an embodiment of the invention, which comprises the following steps:

Step S101: obtain a sentence to be recognized.

For example, the obtained sentence to be recognized is "why is the refrigerator not powering on". Optionally, the sentence to be recognized is a sentence input by a user or a sentence provided by another program.

Step S102: determine the non-stop words in the sentence to be recognized as keywords.

A stop word is a word in a sentence that carries no concrete meaning, such as a function word. The sentence to be recognized contains multiple words, and those that are not stop words are determined as keywords. For example, for the sentence given above, the determined keywords are "why", "refrigerator", "not" and "power on".

Step S103: select, from a preset sentence library, the candidate sentences that contain the keywords.

In this embodiment, a sentence library is provided in advance and contains multiple sentences. Each sentence is made up of multiple words, and the sentences whose words include keywords of the sentence to be recognized are selected as candidate sentences. Note that a candidate sentence is not required to contain all of the keywords; containing any one or more of them is enough.

For example, for the sentence given above, the selected candidate sentences include "why is my refrigerator not powering on" and "how come the refrigerator is not powering on".

Step S104: determine the topic classification label and the intent classification label of the sentence to be recognized using a pre-built classification model.

In this embodiment, a classification model is built in advance by training with a classification training method, and it can recognize a preset number of topic categories and intent categories. When a sentence to be recognized is input into the classification model, the model determines which topic category and which intent category that sentence belongs to. For example, the classification model can recognize topic classes such as the fault class, the pre-sales class and the after-sales class, and intent classes such as the recommendation class, fact class, evaluation class, method class, demand class, enumeration class, yes/no class, comparison class, reason class, description class, relation class and unknown class. These are only examples; the topic classification labels and intent classification labels in this embodiment may differ depending on the actual application scenario.

As can be seen, the topic classification label indicates the area to which the sentence to be recognized belongs, that is, which aspect the user wants to ask about. For example, the input "which brand of refrigerator is more energy-efficient" shows that the user wants to ask about a pre-sales matter; the input "why is my refrigerator not powering on" shows that the user wants to ask about a fault.

The intent classification label indicates the type of feedback information corresponding to the sentence to be recognized, that is, the kind of content the user wants to receive. For example, the input "which brand of refrigerator is more energy-efficient" shows that the user wants a recommendation, so the intent classification label of this sentence is the recommendation class; the input "why is my refrigerator not powering on" shows that the user wants to know the reason the refrigerator is not powering on, so the intent classification label of this sentence is the reason class.

Note that the intent categories the classification model can determine include the unknown class. The unknown class indicates that the intent of the sentence to be recognized is not definite and can be understood as two or more sub-intents. For example, for the sentence "my refrigerator is not powering on", the intent is not definite: the user may be asking how to solve the problem or what the reason is. For such a sentence, the intent classification label determined by the classification model is the "unknown class". For the sentence "why is the refrigerator not powering on", the intent classification label determined by the classification model is the "reason class".

The process of building the classification model is described later in this document.

Step S105: when the intent classification label is the unknown class and there are multiple candidate sentences, classify the candidate sentences according to their respective preset intent labels to obtain multiple groups.

Note that each sentence in the preset sentence library has a preset intent label, which indicates the kind of feedback information being sought. For example, the preset intent label of "how come my refrigerator is not powering on" is the "reason class", the preset intent label of "which brand of refrigerator offers better value for money" is the "recommendation class", and the preset intent label of "how is the energy-saving performance of Haier refrigerators" is the "evaluation class". Each candidate sentence selected from the preset sentence library therefore also has a preset intent label.

The intent classification label of the sentence to be recognized is checked. If it is the unknown class, the intent of the sentence is not definite, and possible candidate sentences are selected for each of the possible intents. Specifically, when multiple candidate sentences are obtained in step S103, they may cover several intent types. They are therefore classified according to their preset intent labels, that is, candidate sentences with the same preset intent label are placed in one group, which yields multiple groups.
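
The grouping in step S105 can be sketched in Python as follows. The dictionary layout of a candidate (keys `"text"` and `"intent"`) is a hypothetical representation chosen for illustration; the text does not prescribe a data structure.

```python
from collections import defaultdict


def group_by_intent(candidates):
    """Group candidate sentences by their preset intent label."""
    groups = defaultdict(list)
    for candidate in candidates:
        groups[candidate["intent"]].append(candidate)
    # Return a plain dict so that looking up a missing label raises KeyError
    # instead of silently creating an empty group.
    return dict(groups)
```

Each resulting group then corresponds to one possible sub-intent of the sentence to be recognized.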

Step S106: determine candidate sentences in each group as target sentences, where the preset topic label of each target sentence is the same as the topic classification label of the sentence to be recognized.

Note that, besides a preset intent label, each sentence in the preset sentence library also has a preset topic classification label. For example, the preset topic label of "how come my refrigerator is not powering on" is the "fault class", the preset topic label of "which brand of refrigerator offers better value for money" is the "pre-sales class", and the preset topic label of "how is the energy-saving performance of Haier refrigerators" is the "pre-sales class". Each candidate sentence selected from the preset sentence library therefore also has a preset topic label.

Target sentences are determined within the groups formed by the candidate sentences, and the preset topic label of a target sentence is the same as the topic label determined for the sentence to be recognized in step S104. When the intent classification label of the sentence to be recognized is determined to be the unknown class, its intent is not definite and may represent several different sub-intents, so a target sentence must be determined in every group. That is, the candidate sentences are divided into as many groups as they have distinct preset intent labels, and target sentences can then be determined for that many intents.

Specifically, a group may contain one or more candidate sentences, and candidate sentences are selected within each group. All candidate sentences in a group may be selected, or only some of them. When only some are selected, the same number may be selected from every group, or different numbers may be selected. The detailed selection manner is described below.

Step S107: display the preset information corresponding to each target sentence.

Each sentence in the preset sentence library is provided with corresponding preset information. The sentence to be recognized can be regarded as an input sentence, and the preset information is then the feedback for that input. Note that when there are multiple target sentences, multiple pieces of preset information are displayed.

The preset information depends on the application scenario of the recognition method, and different scenarios may use different preset information. For example, in a question-answering scenario, the sentence to be recognized is a question input by the user, and the preset information of a target sentence is an answer.

As can be seen from the above technical solution, this embodiment provides a sentence recognition method. For an obtained sentence to be recognized, the non-stop words are determined as keywords; candidate sentences containing those keywords are selected from a preset sentence library; and the topic classification label and the intent classification label of the sentence to be recognized are determined using a pre-built classification model. Note that the classification model can recognize an intent of the unknown class. When the recognized intent classification label is the unknown class and there are multiple candidate sentences, the candidate sentences are grouped by their preset intent labels, and the preset information corresponding to the candidate sentences in each group is displayed. Because different groups correspond to different intent types, selecting candidate sentences from each group as target sentences and displaying the preset information corresponding to each target sentence solves the problem that, after a sentence to be recognized is input, the feedback is limited to a single kind of information or cannot be produced at all, and it meets the user's need for varied feedback.

Note that the question-answering scenario above is only one example, and the invention is not limited to it. Other scenarios are possible: for example, in a chat scenario, the sentence input by the user serves as the sentence to be recognized, and the displayed preset information is the reply to that chat sentence.

As shown in Fig. 2, on the basis of the above embodiment, the method may further comprise:

Step S205: judge whether the intent classification label is the unknown class; if so, perform step S206; otherwise, perform step S207.

Step S207: determine the similarity between the sentence to be recognized and each candidate sentence; determine as the target sentence the candidate sentence with the maximum similarity, provided that this similarity exceeds a preset similarity threshold; display the preset information corresponding to the target sentence.

Step S206: when the intent classification label is the unknown class and there are multiple candidate sentences, classify the candidate sentences according to their respective preset intent labels to obtain multiple groups; determine candidate sentences in each group as target sentences; display the preset information corresponding to each target sentence.

Note that the other steps in this figure are as described above and are not repeated here.

Specifically, if the intent classification label of the sentence to be recognized is not the unknown class, its intent is definite and is one of the definite categories, such as the recommendation class, fact class, evaluation class, method class, demand class, enumeration class, yes/no class, comparison class, reason class, description class or relation class.

The similarity between the sentence to be recognized and each candidate sentence is determined, and each similarity is compared with the preset similarity threshold. Among the similarities that exceed the threshold, the candidate sentence with the maximum similarity is determined as the target sentence, and the preset information corresponding to this target sentence is then displayed.

In this embodiment, definite feedback information can be displayed for a sentence to be recognized whose intent is definite, and determining as the target sentence the candidate sentence with the maximum similarity above the preset similarity threshold improves the accuracy of the feedback information provided.
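
The selection performed in step S207 can be sketched in Python as follows. The input format (a list of `(sentence, similarity)` pairs) and the default threshold of 0.6 are assumptions made for illustration; the text does not give a threshold value.

```python
def best_candidate(scored_candidates, threshold=0.6):
    """Return the candidate with the highest similarity above the threshold.

    Returns None when no candidate exceeds the threshold, in which case
    there is no target sentence to display feedback for.
    """
    above = [(score, sentence) for sentence, score in scored_candidates
             if score > threshold]
    if not above:
        return None
    _, best_sentence = max(above, key=lambda pair: pair[0])
    return best_sentence
```

Applying the threshold before taking the maximum means a weak best match is rejected rather than displayed.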

When the intent classification label of the sentence to be recognized is the unknown class, target sentences must be determined among the candidate sentences, and only some of the candidate sentences may be selected as target sentences. Correspondingly, step S106, in which candidate sentences in each group are determined as target sentences, is implemented as follows:

Determine the similarity between the sentence to be recognized and each candidate sentence; sort in descending order of similarity and, in each group, select as target sentences a preset number of top-ranked candidate sentences whose similarity exceeds a preset similarity threshold.

Specifically, the similarities between the candidate sentences in each group and the sentence to be recognized are sorted in descending order, and a preset number of top-ranked candidate sentences whose similarity exceeds the preset similarity threshold are selected as target sentences. The preset number is, for example, two. This is only an example; the number can be set according to the number of sentences in the sentence library, with a larger preset number when the library contains more sentences.
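
The per-group selection can be sketched in Python as follows. The sketch assumes the candidates have already been grouped by preset intent label and restricted to the matching topic label; the function name, the `similarity` callable and the default values of `k` and `threshold` are hypothetical.

```python
def top_k_per_group(groups, similarity, k=2, threshold=0.6):
    """Pick up to k candidates per intent group, best first, above the threshold."""
    targets = {}
    for intent, members in groups.items():
        ranked = sorted(members, key=similarity, reverse=True)
        kept = [m for m in ranked if similarity(m) > threshold][:k]
        if kept:  # groups with no sufficiently similar candidate are dropped
            targets[intent] = kept
    return targets
```

For example, with `groups = {"reason": ["a", "b", "c"], "method": ["d"]}` and a score table passed as `scores.get`, each intent contributes at most `k` target sentences, so the displayed feedback covers every plausible sub-intent without being dominated by one group.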

In summary, in the two embodiments above, when target sentences are selected from the candidate sentences, the similarity of each candidate sentence is compared with the preset similarity threshold to improve the accuracy of the selection, and only candidate sentences exceeding the threshold are determined as target sentences. Specifically, when the determined intent classification label is the unknown class, a preset number of top-ranked candidate sentences whose similarity exceeds the preset similarity threshold are determined as target sentences; when the determined intent classification label is not the unknown class, the candidate sentence whose similarity exceeds the preset similarity threshold and is the maximum is determined as the target sentence.

The process of determining the non-stop words in the sentence to be recognized as keywords is as follows: perform word segmentation on the sentence to be recognized to obtain multiple segmented words; remove the stop words from the segmented words to obtain the keywords.

When stop words are removed, a preset stop-word list can be used: the segmented words that appear in the stop-word list are removed, which leaves the keywords.
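
Keyword extraction with a stop-word list can be sketched in Python as follows. Whitespace splitting stands in for a real word segmenter, which Chinese text would require, and the stop-word list shown is a hypothetical English placeholder; both are assumptions made for illustration.

```python
STOP_WORDS = {"the", "is", "a", "of", "my"}  # hypothetical stop-word list


def extract_keywords(sentence, stop_words=STOP_WORDS, segment=str.split):
    """Segment the sentence and drop every token found in the stop-word list."""
    tokens = segment(sentence.lower())
    return [token for token in tokens if token not in stop_words]
```

Passing the segmenter as a parameter lets a language-specific word segmentation tool replace `str.split` without changing the filtering step.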

In addition, when multiple keywords are determined, the process of selecting from the preset sentence library the candidate sentences that contain the keywords comprises:

Count, for each sentence in the preset sentence library, the number of keywords of the sentence to be recognized that it contains; sort in descending order of the number of keywords contained, and select a preset number of top-ranked sentences as the candidate sentences.

Ranking by keyword count keeps the sentences that share the most keywords with the sentence to be recognized, so this manner of selecting candidate sentences is more accurate.
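
The ranking by keyword count can be sketched in Python as follows. The library layout (a list of `(keywords, sentence)` pairs with keywords extracted in advance) and the default `limit` are assumptions made for illustration.

```python
def select_candidates(keywords, library, limit=3):
    """Rank library sentences by how many of the given keywords they contain."""
    keyword_set = set(keywords)
    counted = []
    for sentence_keywords, sentence in library:
        hits = len(keyword_set & set(sentence_keywords))
        if hits > 0:  # a candidate must contain at least one keyword
            counted.append((hits, sentence))
    # list.sort is stable, so sentences with equal counts keep library order.
    counted.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in counted[:limit]]
```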

The pre-built classification model used in the above embodiments is described below.

Obtain a training set containing multiple labeled sentences, where each labeled sentence has its own intent label and topic label; train on the training set with a preset training method to obtain the classification model, where the classification model is used to classify the intent and the topic of the sentence to be recognized.

First, a training set is obtained, which contains multiple labeled sentences. These sentences are labeled manually, that is, the intent label and the topic label of each sentence are entered by a person. The intent label indicates the kind of feedback information being sought. For example, for the sentence "why is the refrigerator not powering on", the manually entered intent label is the "reason class", showing that the feedback sought is a reason. The topic label indicates the kind of content the sentence expresses. For the same example, the manually entered topic label is the fault class, showing that the sentence expresses fault-related content.

Then, the training set is trained with a training method to obtain the classification model. The training method can be any training method in the prior art; for example, the libsvm tool can be used. Note that the more labeled sentences the training set contains, the higher the recognition accuracy of the classification model. In addition, the resulting classification model can classify both the topic and the intent.

The specific process of determining the theme classification label and the intention classification label of the statement to be identified using the built classification model is:

Extracting a plurality of classification features from the statement to be identified according to a preset feature-word extraction rule; inputting the classification features into the classification model to obtain a plurality of intention probability values and a plurality of theme probability values; and determining the classification label corresponding to the maximum intention probability value as the intention classification label of the statement to be identified, and the classification label corresponding to the maximum theme probability value as the theme classification label of the statement to be identified.

Specifically, in a question-answering scenario, the classification features extracted by the preset feature-word extraction rule may include any one or more of the following three kinds of features: N-gram word features, interrogative features and syntactic features. The N-gram word features may be unigram, bigram and trigram word features. An interrogative feature is a feature pair composed of an interrogative word and the part of speech of an adjacent keyword. A syntactic feature is a keyword pair depending on the predicate verb or the interrogative word: when a keyword depends on the predicate verb or the interrogative word, that dependency relation is extracted. It should be noted that the extracted classification features fall into three classes: word pairs, part-of-speech pairs, and combinations of a part of speech and a word. For example, the classification features extracted for "why does the refrigerator have no power" are shown in Table 1 below.

Table 1

The classification model has a set of theme types and a set of intention types that it can recognize. Using the input classification features, it calculates the probability that the statement to be identified belongs to each theme type and the probability that it belongs to each intention type. The classification label corresponding to the maximum of the theme probability values (the maximum theme probability value) is determined as the theme classification label of the statement to be identified; likewise, the classification label corresponding to the maximum of the intention probability values (the maximum intention probability value) is determined as the intention classification label of the statement to be identified.
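The label decision amounts to taking the label with the largest probability in each of the two sets of outputs. A minimal sketch, assuming the classification model returns its probability values as dictionaries keyed by label (the function name `pick_labels` and the example labels are illustrative only):

```python
def pick_labels(intent_probs, theme_probs):
    """Return (theme label, intention label) with the highest probabilities."""
    intent = max(intent_probs, key=intent_probs.get)
    theme = max(theme_probs, key=theme_probs.get)
    return theme, intent


theme, intent = pick_labels(
    {"reason": 0.6, "method": 0.3, "unknown": 0.1},
    {"fault": 0.8, "price": 0.2},
)
```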

It should be noted that, in the classification process, interrogative features are most important for classifying statements of the question type. For example, a question containing the interrogative word "why" is usually assigned to the reason class, and one containing "how" is usually assigned to the method class. However, interrogative words are used very flexibly: in the questions "how can the vocals in music be removed" and "how is dance music classified", the position of the interrogative word "how" varies. In addition, interrogative words usually combine with nouns, verbs and the like to express the query jointly.

Therefore, the embodiment of the present invention uses as features the part-of-speech collocations of the words before and after the interrogative word. Taking "how does a mobile phone download music?" as an example, the result after word segmentation and part-of-speech tagging is: mobile phone/n how/r download/v music/n. The part-of-speech features extracted around the interrogative word are n-how, how-v and n-how-v, so <n, how>, <how, v> and <n, how, v> are used as the interrogative features.
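The extraction of part-of-speech collocations around an interrogative word can be sketched as below. The input format (a list of word and part-of-speech pairs produced by some tagger), the function name `interrogative_features` and the set of interrogative words are assumptions for illustration.

```python
def interrogative_features(tagged, question_words):
    """Collect <prev POS, q>, <q, next POS> and <prev POS, q, next POS> tuples.

    tagged: list of (word, part_of_speech) pairs in sentence order.
    question_words: set of interrogative words to look for.
    """
    feats = []
    for i, (word, _pos) in enumerate(tagged):
        if word not in question_words:
            continue
        prev_pos = tagged[i - 1][1] if i > 0 else None
        next_pos = tagged[i + 1][1] if i + 1 < len(tagged) else None
        if prev_pos is not None:
            feats.append((prev_pos, word))
        if next_pos is not None:
            feats.append((word, next_pos))
        if prev_pos is not None and next_pos is not None:
            feats.append((prev_pos, word, next_pos))
    return feats


example = [("mobile phone", "n"), ("how", "r"), ("download", "v"), ("music", "n")]
features = interrogative_features(example, {"how", "why"})
# features == [("n", "how"), ("how", "v"), ("n", "how", "v")]
```

An interrogative word at the start or end of the statement simply yields fewer tuples, since the missing neighbor is skipped.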

In addition, syntactic features also play an important role in classifying statements of the question type. For example, in "recommend a few nice songs", the verb "recommend" is very important for judging the class of the whole question.

Therefore, the embodiment of the present invention may use dependency features: the nouns and interrogative words that have a dependency relation with the predicate verb of the question are extracted to form dependency pairs, which serve as syntactic features. To make the features more general, the parts of speech of the dependency pairs are also added as syntactic features. For example, for the statement to be identified "how to convert the music format in a music player?", the word-pair features obtained are: how-convert, convert-format; the word and part-of-speech features are: how-v, convert-n, r-convert, v-format; and the part-of-speech pair features are: r-v, v-n.
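The three feature classes in this example can be generated mechanically from the dependency pairs. The sketch below assumes the pairs have already been produced by a dependency parser and are given as ordered (word, part of speech) tuples; the function name `dependency_features` and the hyphenated string format are illustrative choices, not something fixed by the embodiment.

```python
def dependency_features(pairs):
    """Build word-pair, word/POS and POS-pair features from dependency pairs.

    pairs: list of ((word1, pos1), (word2, pos2)) tuples.
    """
    word_pairs, word_pos, pos_pairs = [], [], []
    for (w1, p1), (w2, p2) in pairs:
        word_pairs.append(f"{w1}-{w2}")  # word pair
        word_pos.append(f"{w1}-{p2}")    # word with the other word's POS
        word_pos.append(f"{p1}-{w2}")    # POS with the other word
        pos_pairs.append(f"{p1}-{p2}")   # POS pair
    return word_pairs, word_pos, pos_pairs


pairs = [(("how", "r"), ("convert", "v")), (("convert", "v"), ("format", "n"))]
word_pairs, word_pos, pos_pairs = dependency_features(pairs)
```

For the two pairs of the music-format example this yields the same feature strings as listed in the paragraph, although the word/POS features come out in a different order.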

Of course, the above classification features can also be used in other scenarios, although the interrogative feature may be unnecessary there or may be replaced by a feature based on another type of word.

In each of the above embodiments, the similarity between the statement to be identified and each candidate statement needs to be determined. The determination process specifically comprises the following steps:

Specifically, as shown in Figure 3, the semantic similarity Sim_{word}(A, B) between the statement to be identified and each candidate statement is calculated in the following manner:

Step S301: calculating in turn the word similarity between each keyword of the statement to be identified and each keyword of the candidate statement, to obtain a similarity matrix.

The statement to be identified contains a plurality of keywords, which may be called first keywords, and the candidate statement also contains a plurality of keywords, which may be called second keywords; the word similarity between each first keyword and each second keyword is calculated. The word similarity may be calculated by any method in the prior art, e.g. the semantic relatedness calculation method based on HowNet. As before, a keyword is a non-stop word.

For example, the statement to be identified is A, containing the keywords (A1, A2, ..., Am), and the candidate statement is B, containing the keywords (B1, B2, ..., Bn); after the word relatedness is calculated, the similarity matrix S_{A, B} is obtained.

S_{A, B} = \begin{bmatrix} S(A_{1}, B_{1}) & S(A_{1}, B_{2}) & \cdots & S(A_{1}, B_{n}) \\ S(A_{2}, B_{1}) & S(A_{2}, B_{2}) & \cdots & S(A_{2}, B_{n}) \\ \vdots & \vdots & \ddots & \vdots \\ S(A_{m}, B_{1}) & S(A_{m}, B_{2}) & \cdots & S(A_{m}, B_{n}) \end{bmatrix}

Here S(A_{i}, B_{j}) denotes the word relatedness between the i-th keyword of the statement to be identified A and the j-th keyword of the candidate statement B.

Step S302: summing the maximum word similarity of each row of the similarity matrix, and calculating the row mean of this sum.

Step S303: summing the maximum word similarity of each column of the similarity matrix, and calculating the column mean of this sum.

The matrix has multiple rows and multiple columns. The number of rows equals the number m of keywords of A and the number of columns equals the number n of keywords of B; alternatively, the number of rows equals n and the number of columns equals m.

The maximum word similarity in each row of the matrix is determined, the sum of these maxima is calculated, and the mean Sim(A, B) is obtained by dividing the sum by the number of rows. Likewise, the mean Sim(B, A) of the column maxima is obtained by dividing their sum by the number of columns.

Step S304: calculating the mean of the row mean and the column mean, to obtain the semantic relatedness of the statement to be identified and the candidate statement.

The mean Sim_{word}(A, B) of the two means is calculated as

Sim_{word}(A, B) = \frac{Sim(A, B) + Sim(B, A)}{2}.
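Steps S301 to S304 can be sketched in a few lines. The word-level similarity is passed in as a function because the description leaves the method open; the toy lookup table `TOY_SIM`, the helper `toy_word_sim` and the function name `sentence_word_similarity` are assumptions used only to make the example runnable.

```python
def sentence_word_similarity(kw_a, kw_b, word_sim):
    """Semantic similarity of two keyword lists via row and column maxima."""
    if not kw_a or not kw_b:
        return 0.0
    # S301: similarity matrix, rows follow A's keywords, columns follow B's.
    matrix = [[word_sim(a, b) for b in kw_b] for a in kw_a]
    # S302: mean of the per-row maxima, i.e. Sim(A, B).
    row_mean = sum(max(row) for row in matrix) / len(kw_a)
    # S303: mean of the per-column maxima, i.e. Sim(B, A).
    col_mean = sum(max(col) for col in zip(*matrix)) / len(kw_b)
    # S304: average of the two means.
    return (row_mean + col_mean) / 2


# Hypothetical word similarities standing in for a real lexical resource.
TOY_SIM = {("refrigerator", "fridge"): 0.9, ("power", "electricity"): 0.8}


def toy_word_sim(a, b):
    if a == b:
        return 1.0
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))


score = sentence_word_similarity(
    ["refrigerator", "power"], ["fridge", "electricity", "reason"], toy_word_sim
)
```

In this toy case the row mean is (0.9 + 0.8) / 2 and the column mean is (0.9 + 0.8 + 0) / 3, so the extra unmatched keyword in B lowers the overall score.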

In addition, the theme-intention similarity Sim_{style}(A, B) between the statement to be identified and each candidate statement is calculated in the following manner:

Judging whether the theme classification label of the statement to be identified is the same as the preset theme classification label of the candidate statement, to obtain a first judgment result; judging whether the intention classification label of the statement to be identified is the unknown class, to obtain a second judgment result; and judging whether the intention classification label of the statement to be identified is the same as the preset intention label of the candidate statement, to obtain a third judgment result;

When the first judgment result is yes and the second judgment result is yes, determining that the theme-intention similarity is 1; when the first judgment result is yes, the second judgment result is no and the third judgment result is yes, determining that the theme-intention similarity is 1; when the first judgment result is yes, the second judgment result is no and the third judgment result is no, determining that the theme-intention similarity is a preset value greater than 0 and less than 1; and when the first judgment result is no, determining that the theme-intention similarity is 0.

It should be noted that step S104 in the above embodiments determines the theme classification label and the intention classification label of the statement to be identified; the labels determined in that step can be used to determine the theme-intention similarity between the statement to be identified and the candidate statement.

In addition, the three judgments may be performed simultaneously or in sequence. When they are performed in sequence, to maximize execution efficiency, it may first be checked whether the theme label of the statement to be identified is the same as the theme label of the candidate statement. If the theme labels differ, the theme-intention similarity between the statement to be identified and the candidate statement is set to 0. If the theme labels are the same, it is then judged whether the intention classification label of the statement to be identified is the unknown class. If it is the unknown class, the theme-intention similarity is set to 1. If it is a definite intention classification label other than the unknown class, it is judged whether the intention labels of the statement to be identified and the candidate statement are the same: if they are the same, the similarity is set to 1; if they differ, it is set to a preset value greater than 0 and less than 1.
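The sequential form of the three judgments maps directly onto a short function. In this sketch the label string `"unknown"`, the function name `theme_intent_similarity` and the default partial value of 0.5 are assumptions; the description only requires the partial value to lie strictly between 0 and 1.

```python
UNKNOWN = "unknown"  # hypothetical label for the unknown intention class


def theme_intent_similarity(query_theme, query_intent,
                            cand_theme, cand_intent, partial=0.5):
    """Theme-intention similarity following the sequential judgments."""
    if query_theme != cand_theme:
        return 0.0       # first judgment is no
    if query_intent == UNKNOWN:
        return 1.0       # same theme, intention unknown
    if query_intent == cand_intent:
        return 1.0       # same theme, same definite intention
    return partial       # same theme, different definite intention
```

Checking the theme first lets the function return early for most mismatched candidates without looking at the intention labels.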

Furthermore, the syntax similarity Sim_{syntax}(A, B) between the statement to be identified and each candidate statement is calculated in the following manner:

Performing syntactic analysis on the statement to be identified to obtain first syntactic constituents of the statement to be identified, and obtaining preset second syntactic constituents of the candidate statement; calculating a first word similarity of the identical constituents of the first syntactic constituents and the second syntactic constituents; calculating a second word similarity of the identical modifier constituents of the first syntactic constituents and the second syntactic constituents; obtaining a preset penalty factor for the non-identical constituents of the first syntactic constituents and the second syntactic constituents; and calculating a weighted average from the first word similarity, the second word similarity and the preset penalty factor, to obtain the syntax similarity.

It should be noted that the first syntactic constituents are the syntactic constituents of the statement to be identified, and the second syntactic constituents are the syntactic constituents of the candidate statement. The syntax similarity is the similarity between corresponding constituents of the two, such as the word similarity between their subjects, predicates, objects and other dependent constituents. Missing constituents are compensated for with the penalty factor, and finally a weighted average is calculated. The weighted average is computed as:

Sim_{syntax}(A, B) = \frac{w_{S} \cdot s_{S} + w_{V} \cdot s_{V} + w_{O} \cdot s_{O} + w_{A} \cdot s_{A} + \sum_{i=1}^{n} w_{R} \cdot Sim(h_{1i}, h_{2i})}{w_{S} + w_{V} + w_{O} + w_{A} + w_{R} \cdot n} - l \cdot PF

Here s_{S}, s_{V}, s_{O} and s_{A} are the similarities of the subject, predicate, object and adverbial/complement, respectively; w_{S}, w_{V}, w_{O} and w_{A} are the preset weights of the subject, predicate, object and adverbial/complement, respectively; Sim(h_{1i}, h_{2i}) is the word similarity of the i-th identical modifier constituent, i.e. a constituent other than the subject, predicate, object and adverbial/complement. For example, if the first syntactic constituents include the subject "refrigerator" with the subject modifier "Haier", and the second syntactic constituents include the subject "refrigerator" with the subject modifier "beautiful", the word similarity of the two modifiers is calculated. w_{R} is the preset weight of the identical modifier constituents; n is the total number of identical modifier constituents; l is the number of non-identical constituents, i.e. the number of constituents exclusive to one side, which has two parts: constituents present in the first syntactic constituents but absent from the second, and constituents present in the second syntactic constituents but absent from the first; PF is the preset penalty factor.
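The weighted-average formula can be evaluated as in the following sketch, which takes the formula literally. The function name `syntax_similarity`, the dictionary keys "S", "V", "O", "A" and all numeric values in the example are assumptions; how the constituent similarities themselves are obtained is outside the sketch.

```python
CORE = ("S", "V", "O", "A")  # subject, predicate, object, adverbial/complement


def syntax_similarity(core_sims, modifier_sims, n_unshared, weights, w_r, penalty):
    """Weighted average of constituent similarities minus a penalty term.

    core_sims: dict of similarities s_S, s_V, s_O, s_A keyed by CORE.
    modifier_sims: list of Sim(h_1i, h_2i) for the identical modifiers.
    n_unshared: l, the number of constituents found on only one side.
    weights: dict of preset weights w_S, w_V, w_O, w_A keyed by CORE.
    w_r: preset weight of each identical modifier; penalty: PF.
    """
    numerator = sum(weights[k] * core_sims[k] for k in CORE)
    numerator += sum(w_r * s for s in modifier_sims)
    denominator = sum(weights[k] for k in CORE) + w_r * len(modifier_sims)
    return numerator / denominator - n_unshared * penalty


value = syntax_similarity(
    core_sims={"S": 1.0, "V": 1.0, "O": 0.5, "A": 0.0},
    modifier_sims=[0.6],
    n_unshared=1,
    weights={"S": 1.0, "V": 1.0, "O": 1.0, "A": 1.0},
    w_r=0.5,
    penalty=0.05,
)
```

With these illustrative numbers the numerator is 2.8, the denominator is 4.5, and one unshared constituent subtracts 0.05 from the quotient.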

It should be noted that a syntactic analysis tool, such as the LTP syntactic analysis tool, may be used for syntactic analysis. The syntactic structure obtained from the analysis is a syntactic structure tree. Each node in the tree has a corresponding number; when a syntactic constituent depends on another syntactic constituent, its dependency node is set to the number of the constituent it depends on. For example, as shown in Table 2, when syntactic constituent analysis is performed on the statement to be identified, the dependency node of the keyword "refrigerator" is 3, and the keyword corresponding to 3 is "energising", indicating that "refrigerator" depends on "energising".

Table 2

No.	Word	Part of speech	Dependency node	Dependency relation
0	Why	r	3	ADV
1	Refrigerator	n	3	SBV
2	No	d	3	ADV
3	Energising	v	-1	HED
4	?	u	3	RAD
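The rows of Table 2 can be held as plain tuples, and the dependency node column then gives direct access to the head of each word. The sketch below is only an illustration of how such a table may be queried; the names `PARSE`, `head_of` and `dependents_of` are hypothetical, and the data is copied from Table 2 rather than produced by a parser.

```python
# (number, word, part of speech, dependency node, dependency relation)
PARSE = [
    (0, "why", "r", 3, "ADV"),
    (1, "refrigerator", "n", 3, "SBV"),
    (2, "no", "d", 3, "ADV"),
    (3, "energising", "v", -1, "HED"),
    (4, "?", "u", 3, "RAD"),
]


def head_of(parse, word):
    """Return the word that `word` depends on, or None for the root."""
    by_number = {row[0]: row for row in parse}
    for _num, w, _pos, head, _rel in parse:
        if w == word:
            return by_number[head][1] if head in by_number else None
    return None


def dependents_of(parse, head_word):
    """Return (word, relation) pairs for every word depending on head_word."""
    numbers = {row[0] for row in parse if row[1] == head_word}
    return [(w, rel) for _num, w, _pos, head, rel in parse if head in numbers]
```

Here the root row carries the dependency node -1, which matches no row number, so `head_of` reports it as having no head.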

It should be noted that the parts of speech of the keywords need to be tagged during syntactic analysis, i.e. each keyword is tagged as a noun, verb, adverb, etc.; the part-of-speech tagging tool used may be the ansj tool. In addition, the dependency nodes and dependency relations obtained from syntactic analysis help in calculating the syntax similarity. When calculating the syntax similarity Sim_{syntax}(A, B), the identical modifier constituents need to be determined: the dependency nodes are used to judge whether two constituents depend on the same keyword, and if they do, the dependency relations are used to identify the modifier constituents of that keyword.

In addition, syntactic analysis determines the syntactic constituents, which include the part of speech, the dependency node, the dependency relation and so on; with them the target statement can be located more accurately and more accurate information can be fed back. Specifically, the key constituent of the statement to be identified can be determined through the dependency nodes. For example, the key constituent of the input statement "how much is a refrigerator" is "refrigerator", whereas the key constituent of the candidate statement "how much is a refrigerator sealing strip" is "sealing strip"; the two differ. A prior-art recognition method would directly determine a statement containing "refrigerator" to be the target statement, which is inaccurate. In the present invention, however, the two can be determined to be different, so the second statement is not determined to be the target statement, which improves the accuracy of target statement determination.

After the semantic similarity Sim_{word}(A, B), the theme-intention similarity Sim_{style}(A, B) and the syntax similarity Sim_{syntax}(A, B) have been calculated, they are combined by weighted averaging according to the following formula, to give the similarity Sim(A, B) between the statement to be identified and each candidate statement.

Sim(A, B) = \alpha \times Sim_{word}(A, B) + \beta \times Sim_{style}(A, B) + \gamma \times Sim_{syntax}(A, B)

where \alpha + \beta + \gamma = 1.
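The final combination is a convex weighted sum of the three component similarities. A minimal sketch follows; the function name `overall_similarity` and the default weights 0.4, 0.3 and 0.3 are assumptions, since the description only requires the three weights to sum to 1.

```python
def overall_similarity(sim_word, sim_style, sim_syntax,
                       alpha=0.4, beta=0.3, gamma=0.3):
    """Weighted combination of semantic, theme-intention and syntax similarity."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    return alpha * sim_word + beta * sim_style + gamma * sim_syntax


total = overall_similarity(0.8, 1.0, 0.5)
```

Because the weights sum to 1, the result stays within the range spanned by the three inputs.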

The statement recognition device provided by the embodiment of the present invention is described below. It should be noted that, for details of the statement recognition device, reference may be made to the statement recognition method provided above; they are not repeated here.

Referring to Fig. 4, which shows the structure of the statement recognition device provided by the embodiment of the present invention, the device specifically comprises:

A to-be-identified statement acquisition module 100, configured to obtain a statement to be identified;

A keyword determination module 200, configured to determine the non-stop words in the statement to be identified as keywords;

A candidate statement acquisition module 300, configured to select, from a preset statement library, candidate statements containing the keywords;

A theme and intention determination module 400, configured to determine the theme classification label and the intention classification label of the statement to be identified using a classification model built in advance;

A candidate statement grouping module 500, configured to, when the intention classification label is the unknown class and there are a plurality of candidate statements, classify the candidate statements according to their respective preset intention labels to obtain a plurality of groups;

A target statement determination module 600, configured to determine candidate statements in each group as target statements corresponding to the statement to be identified, wherein the preset theme label of each target statement is the same as the theme classification label of the statement to be identified;

A preset information display module 700, configured to display the preset information corresponding to each target statement.

As can be seen from the above technical solution, the statement recognition device provided by the embodiment of the present invention determines the non-stop words of the obtained statement to be identified as keywords, selects from a preset statement library the candidate statements containing the keywords of the statement to be identified, and determines the intention classification label of the statement to be identified using a classification model built in advance. It should be noted that the classification model can recognize an intention of the unknown class. When the recognized intention classification label is the unknown class and there are a plurality of candidate statements, the candidate statements are grouped according to their preset intention labels, and the preset information corresponding to the candidate statements in each group is displayed. Because different groups correspond to different intention types, candidate statements are selected from each group as target statements and the preset information corresponding to each target statement is displayed, which solves the problem that the fed-back information is limited to a single kind or that no feedback can be given at all.

Optionally, the above device embodiment further comprises: a definite intention classification module, configured to, when the intention classification label is not the unknown class, determine the similarity between the statement to be identified and each candidate statement; determine the candidate statement corresponding to the maximum similarity exceeding a preset similarity threshold as the target statement; and display the preset information corresponding to the target statement.

Optionally, the above target statement determination module 600 comprises:

A similarity determination submodule, configured to determine the similarity between the statement to be identified and each candidate statement;

A first target statement determination submodule, configured to sort in descending order of similarity and, in each group, select as target statements a preset number of top-ranked candidate statements whose similarity exceeds the preset similarity threshold.

Optionally, the above theme and intention determination module 400 comprises:

A classification feature extraction submodule, configured to extract a plurality of classification features from the statement to be identified according to a preset feature-word extraction rule;

A probability value obtaining submodule, configured to input the classification features into the classification model to obtain a plurality of intention probability values and a plurality of theme probability values;

An intention label determination submodule, configured to determine the classification label corresponding to the maximum intention probability value as the intention classification label of the statement to be identified, and determine the classification label corresponding to the maximum theme probability value as the theme classification label of the statement to be identified.

Optionally, the above device embodiment includes a classification model building module, configured to obtain a training set comprising a plurality of labeled statements, wherein each labeled statement has its own intention label and theme label; and to train on the training set using a preset training method to obtain the classification model, wherein the classification model is used to classify the intention and the theme of the statement to be identified.

Optionally, the above definite intention classification module and the above similarity determination submodule, both of which determine the similarity between the statement to be identified and each candidate statement, may each comprise:

A similarity calculation unit, configured to calculate the semantic similarity, the theme-intention similarity and the syntax similarity between the statement to be identified and each candidate statement, wherein the semantic similarity is the semantic similarity between the keywords of the statement to be identified and the keywords of the candidate statement; the theme-intention similarity is the similarity between the theme and intention of the statement to be identified and the theme and intention of the candidate statement; and the syntax similarity is the similarity between the syntactic structure of the statement to be identified and the syntactic structure of the candidate statement;

A weighted average calculation unit, configured to calculate, for each candidate statement, the weighted average of its semantic similarity, theme-intention similarity and syntax similarity, to obtain the similarity between the statement to be identified and each candidate statement.

Optionally, the similarity calculation unit for calculating the semantic similarity between the statement to be identified and the candidate statement comprises:

A similarity calculation unit, configured to calculate in turn the word similarity between each keyword of the statement to be identified and each keyword of the candidate statement, to obtain a similarity matrix; sum the maximum word similarity of each row of the similarity matrix and calculate the row mean of this sum; sum the maximum word similarity of each column of the similarity matrix and calculate the column mean of this sum; and calculate the mean of the row mean and the column mean, to obtain the semantic relatedness of the statement to be identified and the candidate statement.

Optionally, the similarity calculation unit for calculating the theme-intention similarity between the statement to be identified and the candidate statement comprises:

A similarity calculation unit, configured to judge whether the theme classification label of the statement to be identified is the same as the preset theme classification label of the candidate statement, to obtain a first judgment result; judge whether the intention classification label of the statement to be identified is the unknown class, to obtain a second judgment result; judge whether the intention classification label of the statement to be identified is the same as the preset intention label of the candidate statement, to obtain a third judgment result; when the first judgment result is yes and the second judgment result is yes, determine that the theme-intention similarity is 1; when the first judgment result is yes, the second judgment result is no and the third judgment result is yes, determine that the theme-intention similarity is 1; when the first judgment result is yes, the second judgment result is no and the third judgment result is no, determine that the theme-intention similarity is a preset value greater than 0 and less than 1; and when the first judgment result is no, determine that the theme-intention similarity is 0.

Optionally, the similarity calculation unit for calculating the syntax similarity between the statement to be identified and the candidate statement comprises:

A similarity calculation unit, configured to perform syntactic analysis on the statement to be identified to obtain first syntactic constituents of the statement to be identified, and obtain preset second syntactic constituents of the candidate statement; calculate a first word similarity of the identical constituents of the first syntactic constituents and the second syntactic constituents; calculate a second word similarity of the identical modifier constituents of the first syntactic constituents and the second syntactic constituents; obtain a preset penalty factor for the non-identical constituents of the first syntactic constituents and the second syntactic constituents; and calculate a weighted average from the first word similarity, the second word similarity and the preset penalty factor, to obtain the syntax similarity.

Optionally, when a plurality of keywords is determined, the above candidate statement acquisition module 300 comprises:

A keyword counting submodule, configured to count the number of keywords contained in each statement in the preset statement library;

A candidate statement selection submodule, configured to sort in descending order of the number of keywords contained and select a preset number of top-ranked statements as candidate statements.

Optionally, the above keyword determination module 200 comprises:

A segmented word obtaining submodule, configured to perform word segmentation on the statement to be identified to obtain a plurality of segmented words;

A stop-word removal submodule, configured to remove the stop words from the plurality of segmented words to obtain the keywords.

It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.

It should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relation or order between these entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device comprising that element.

The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A statement recognition method, characterized by comprising:

Obtaining a statement to be identified;

2. The statement recognition method according to claim 1, characterized by further comprising:

3. The statement recognition method according to claim 1, characterized in that determining the candidate statements in each group as target statements comprises:

4. The statement recognition method according to claim 1, characterized in that determining the theme classification label and the intention classification label of the statement to be identified using the classification model built in advance comprises:

5. The statement recognition method according to claim 1, characterized in that the process of building the classification model comprises:

6. The statement recognition method according to claim 2 or 3, characterized in that determining the similarity between the statement to be identified and each candidate statement comprises:

7. The statement recognition method according to claim 6, characterized in that calculating the semantic similarity between the statement to be identified and the candidate statement comprises:

8. The statement recognition method according to claim 6, characterized in that calculating the theme-intention similarity between the statement to be identified and the candidate statement comprises:

9. The statement recognition method according to claim 6, characterized in that calculating the syntax similarity between the statement to be identified and the candidate statement comprises:

10. The statement recognition method according to claim 1, characterized in that, when a plurality of keywords is determined, selecting from the preset statement library the candidate statements containing the keywords comprises:

11. The statement recognition method according to claim 1, characterized in that determining the non-stop words in the statement to be identified as keywords comprises:

Removing the stop words from the plurality of segmented words to obtain the keywords.

12. A statement recognition device, characterized by comprising: