CN108629019B

CN108629019B - Question-answer field-oriented question sentence similarity calculation method containing names

Info

Publication number: CN108629019B
Application number: CN201810433143.XA
Authority: CN
Inventors: 常亮; 时雨; 宾辰忠; 古天龙; 孙彦鹏; 孙磊; 匡海丽
Original assignee: Guilin University of Electronic Technology
Current assignee: Guilin University of Electronic Technology
Priority date: 2018-05-08
Filing date: 2018-05-08
Publication date: 2021-04-30
Anticipated expiration: 2038-05-08
Also published as: CN108629019A

Abstract

The invention discloses a method for calculating the similarity of questions containing personal names in the question-answering field. The method calculates the similarity of personal names and of non-name words separately, calculates sentence structure similarity from the word order and the length of the sentence, and finally obtains the overall sentence similarity as a weighted combination of the sentence semantic similarity and the structural similarity. This solves the problem that conventional sentence similarity calculation methods give results that do not match human judgment when applied to sentences containing personal names. The method calculates the similarity of such sentences more accurately and can be widely applied in the question-answering field.

Description

Question-answer field-oriented question sentence similarity calculation method containing names

Technical Field

The invention relates to the technical field of question answering, and in particular to a method for calculating the similarity of questions containing personal names in the question-answering field.

Background

Question similarity calculation has long been a basic and important research topic, and a research hotspot, in artificial intelligence and natural language processing. It is widely applied in question-answering systems, information retrieval systems and the like.

Current algorithms for Chinese sentence similarity calculation fall roughly into three categories: methods based on feature words, methods based on sentence structure, and methods based on a semantic dictionary. The feature-word method extracts the feature words of the two questions being compared, compares them, calculates their similarity, and uses the result to express the similarity of the two questions. The syntactic-structure method calculates the similarity of the syntactic structures of two sentences by analyzing those structures: it compares the part-of-speech sequences of the two sentences, finds the best-matching common part-of-speech sequence, and then compares the similarity between parts of speech to reflect the similarity of the two sentences. The semantic-dictionary method reflects the similarity of two sentences through the similarity of the words in them. Word similarity is calculated with a large-scale semantic dictionary such as the HowNet knowledge base: all words in the two questions are matched pairwise, the two words with the highest similarity are taken as a best matching pair, and the weighted average of the similarities of all best matching pairs represents the semantic similarity of the whole sentence.

However, none of the three methods above calculates similarity accurately for questions containing personal names. For example, for the two questions "Zhu Renchang is the fourth-generation prince of the Jingjiang princely house" and "Zhu Rencheng is the fourth-generation prince of the Jingjiang princely house", each of the three methods yields an extremely high similarity. In reality, Zhu Renchang and Zhu Rencheng are two different individuals: both are princes of the Jingjiang princely house and their names are similar, but the names refer to different people. The extremely high similarity calculated by these methods therefore does not match human judgment.

Disclosure of Invention

The invention aims to solve the problem that, in the current question-answering field, sentence similarity calculation cannot reflect the difference between personal names or the importance of names to the whole sentence, so that the results of question similarity calculation are poor.

In order to solve the problems, the invention is realized by the following technical scheme:

A method for calculating the similarity of questions containing personal names in the question-answering field, specifically comprising the following steps:

Step 1, calculating the sentence structure similarity between the current input question L and each question S^z in the corpus, specifically comprising:

Step 1.1, calculating the sentence length similarity Sim_Len(L,S^z):

where Len_L denotes the number of words in the input question L after word segmentation, and its counterpart for the corpus question S^z denotes the number of words in S^z after word segmentation;

Step 1.2, calculating the sentence word order similarity Sim_Ord(L,S^z):

where RevOrd denotes the inversion number of the sequence of shared words in the corpus question S^z relative to the input question L, and MaxRevOrd denotes the maximum inversion number of that shared-word sequence in the corpus question S^z relative to the input question L;

Step 1.3, combining the sentence length similarity Sim_Len(L,S^z) obtained in step 1.1 and the sentence word order similarity Sim_Ord(L,S^z) obtained in step 1.2 to obtain the sentence structure similarity Sim_stru(L,S^z) between the current input question L and each question S^z in the corpus:

Sim_stru(L,S^z)＝μ₁Sim_Len(L,S^z)+μ₂Sim_Ord(L,S^z)

where μ₁ denotes the weight of the sentence length similarity, μ₂ denotes the weight of the sentence word order similarity, and μ₁+μ₂＝1;

Step 2, calculating the sentence semantic similarity between the current input question L and each question S^z in the corpus, specifically comprising:

Step 2.1, calculating the similarity of the personal-name words of the sentences:

where x₁ and y₁ denote the birth year and birth month of the person named in the input question L; x₂ and y₂ denote the birth year and birth month of the person named in the corpus question S^z; p₁ and q₁ denote the birth year and birth month of the spouse of the person named in the input question L; p₂ and q₂ denote the birth year and birth month of the spouse of the person named in the corpus question S^z; α is the adjustment parameter for the person, β is the adjustment parameter for the person's spouse, and α+β＝1;

Step 2.2, calculating the similarity of the non-name words of the sentences:

where C_1i denotes a sense of the word L_v1 in the input question L, C_2j denotes a sense of the paired word in the corpus question S^z, n denotes the number of senses of the word L_v1 in the input question L, m denotes the number of senses of the paired word in the corpus question S^z, and Sim(C_1i,C_2j) denotes the similarity between sense C_1i and sense C_2j;

Step 2.3, combining the similarity of the personal-name words obtained in step 2.1 and the similarity of the non-name words obtained in step 2.2 to obtain the sentence semantic similarity Sim_sem(L,S^z) between the current input question L and each question S^z in the corpus:

where a denotes the number of best matching pairs obtained from the personal-name sets of the input question L and the corpus question S^z, b denotes the number of best matching pairs obtained from the non-name word sets of the input question L and the corpus question S^z, γ₁ denotes the weight of the personal-name similarity, γ₂ denotes the weight of the non-name word similarity, and γ₁+γ₂＝1;

Step 3, combining the sentence structure similarity Sim_stru(L,S^z) obtained in step 1 and the sentence semantic similarity Sim_sem(L,S^z) obtained in step 2 to obtain the overall sentence similarity Sim(L,S^z) between the current input question L and each question S^z in the corpus:

Sim(L,S^z)＝λ₁Sim_stru(L,S^z)+λ₂Sim_sem(L,S^z)

where λ₁ denotes the weight of the sentence structure similarity, λ₂ denotes the weight of the sentence semantic similarity, and λ₁+λ₂＝1;

Step 4, sorting the overall sentence similarities Sim(L,S^z) obtained in step 3, selecting from the corpus the question with the highest overall sentence similarity Sim(L,S^z) to the current input question L, and outputting it as the question similarity calculation result;

In the above, S^z denotes the z-th question in the corpus, z ∈ {1, 2, …, Z}, and Z is the number of questions in the corpus.

In step 2.2, the similarity of the non-name words of the sentences can be calculated using different knowledge bases, but is preferably calculated using the HowNet knowledge base.

First, when calculating the similarity of questions containing personal names, the invention distinguishes personal names from non-name words and calculates their similarities separately. It then considers the structure of the sentence and calculates structural similarity from the word order and the length of the sentence. Finally, it weights the semantic similarity and the structural similarity of the sentence to obtain the overall sentence similarity. The invention thus solves the problem that existing sentence similarity calculation methods cannot produce results consistent with human judgment for sentences containing personal names.

Compared with the prior art, the invention divides each sentence into a personal-name part and a non-name part and calculates their similarities separately, while considering both the semantic and the structural similarity of the sentence. It solves the problem that similarities calculated by the prior art for sentences containing personal names are inaccurate or do not match human judgment, and it has good practicability.

Drawings

Fig. 1 is an example diagram of sentences divided into a personal-name part and a non-name part.

Fig. 2 is a flow chart of the method of the invention for calculating the similarity of questions containing personal names.

Fig. 3a is an example diagram of calculating question similarity with the conventional method based on feature words.

Fig. 3b is an example diagram of calculating question similarity with the conventional method based on syntactic structure.

Fig. 3c is an example diagram of calculating question similarity with the conventional method based on a semantic dictionary.

Fig. 4 is an example diagram of calculating the similarity of questions containing personal names with the method of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings in conjunction with specific examples.

Existing question similarity calculation techniques do not consider the special importance of personal names when calculating the similarity of questions that contain them. If names are not handled properly, the difference between names and the importance of names to the whole sentence cannot be reflected, and the result of the question similarity calculation is poor. The invention takes full account of the role of personal names in question similarity: it divides each question into a personal-name part and a non-name part and calculates their similarities separately. It also considers the influence of sentence structure on sentence similarity. Fig. 1 is an example diagram in which the example sentences "Zhu Rencheng is the first-generation prince of the Jingjiang princely house" and "Zhu Renchang is the first-generation prince of the Jingjiang princely house" are each divided into two parts for similarity calculation. The circles, which contain "Zhu Rencheng" and "Zhu Renchang", represent the personal-name part of the sentences, and the boxes represent the non-name part.

The invention targets the question-answering field. To calculate the similarity between questions involving personal names, it obtains each person's birth year and month and uses them to calculate the similarity between name words. Because two people may share the same birth year and month, the birth year and month of each person's spouse is added as a supplement. When calculating sentence semantic similarity, the importance of names is reflected by giving different weights to names and non-name words. Sentence structure is also considered and is calculated from the length and the word order of the sentence. Finally, different weights are given to the semantic and structural similarities to obtain the overall sentence similarity.

A method for calculating the similarity of questions containing personal names in the question-answering field, whose flow is shown in Fig. 2, comprises the following steps:

Step 1, calculating the sentence structure similarity between the current input question L and each question S^z in the corpus. Sentence structure similarity is calculated mainly from two aspects: the length of the sentence and the word order of the sentence.

Step 1.1, calculating the sentence length similarity Sim_Len(L,S^z):

where Len_L denotes the number of words in the current input question L after word segmentation, and its counterpart for the corpus question S^z denotes the number of words in S^z after word segmentation; S^z denotes the z-th question in the corpus, z ∈ {1, 2, …, Z}, and Z is the number of questions in the corpus.
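The formula for Sim_Len(L,S^z) did not survive in this text. As a minimal sketch only, the code below assumes the commonly used form 1 - |Len_L - Len_S| / (Len_L + Len_S); that form, the function name and the treatment of two empty sentences are assumptions for illustration, not taken from the source.

```python
def sentence_length_similarity(len_l: int, len_s: int) -> float:
    """Length similarity of two segmented sentences.

    Assumed form (not confirmed by the source):
        1 - |len_l - len_s| / (len_l + len_s)
    """
    if len_l < 0 or len_s < 0:
        raise ValueError("word counts must be non-negative")
    total = len_l + len_s
    if total == 0:
        # Illustrative choice: two empty sentences count as equal in length.
        return 1.0
    return 1.0 - abs(len_l - len_s) / total


if __name__ == "__main__":
    # Hypothetical word counts after word segmentation.
    print(sentence_length_similarity(7, 7))
    print(sentence_length_similarity(6, 9))
```

Under this assumed form the score is 1 when both questions have the same number of words and decreases toward 0 as the lengths diverge.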

Step 1.2, calculating the sentence word order similarity Sim_Ord(L,S^z):

where, taking the sequence of words shared by the two sentences to be in natural (ascending) order in the input question L, RevOrd denotes the inversion number of that sequence in the corpus question S^z, and MaxRevOrd denotes the maximum inversion number of a sequence with the same number of words;
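The inversion-number idea can be sketched as follows. The source names RevOrd and MaxRevOrd but its formula is missing here, so the ratio form 1 - RevOrd / MaxRevOrd, the rule of using only words that occur exactly once in both sentences, and the values returned for zero or one shared word are all assumptions made for illustration.

```python
from typing import List


def word_order_similarity(words_l: List[str], words_s: List[str]) -> float:
    """Word order similarity based on the inversion number of shared words.

    Assumed form (not confirmed by the source): 1 - RevOrd / MaxRevOrd.
    """
    # Assumption: only words occurring exactly once in both sentences are used.
    shared = [w for w in words_l
              if words_l.count(w) == 1 and words_s.count(w) == 1]
    n = len(shared)
    if n == 0:
        return 0.0  # assumption: no shared words means no order agreement
    if n == 1:
        return 1.0  # assumption: a single shared word cannot be out of order
    # Positions in S of the shared words, listed in their order in L.
    positions = [words_s.index(w) for w in shared]
    # RevOrd: number of pairs that appear in the opposite order in S.
    rev_ord = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if positions[i] > positions[j]
    )
    # MaxRevOrd: inversion number of a fully reversed sequence of n words.
    max_rev_ord = n * (n - 1) // 2
    return 1.0 - rev_ord / max_rev_ord
```

With this sketch, identical word order gives 1.0 and a fully reversed order gives 0.0.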

Step 1.3, combining the sentence length similarity and the word order similarity gives the structural similarity of the whole sentence, calculated as follows:

Sim_stru(L,S^z)＝μ₁Sim_Len(L,S^z)+μ₂Sim_Ord(L,S^z)

where μ₁ denotes the weight of the sentence length similarity, μ₂ denotes the weight of the sentence word order similarity, and μ₁+μ₂＝1.

Step 2, calculating the sentence semantic similarity between the current input question L and each question S^z in the corpus. Semantic similarity calculation consists mainly of two parts: personal-name similarity calculation and non-name similarity calculation.

Step 2.1, calculating the similarity of the personal names;

Step 2.1.1, using the birth year and month of the person as vector coordinates, with the following calculation formula:

The birth year and birth month of the person are used as the vector coordinates of the personal name L_u1 in the input question L and of the paired personal name in the corpus question S^z, where x is the year coordinate and y is the month coordinate, and the cosine value represents the similarity of the two names in terms of birth year and month;

Step 2.1.2, because two people may be born in the same year and month, the birth year and month of the spouse is added as a supplement, with the following calculation formula:

The birth year and birth month of the person's spouse are used as the vector coordinates for the personal name L_u1 in the input question L and for the paired personal name in the corpus question S^z, where p is the year coordinate and q is the month coordinate, and the cosine value represents the similarity of the two spouses in terms of birth year and month;

Step 2.1.3, combining the two factors above, both of which strongly influence name similarity, to obtain the overall similarity of the personal names, with the following calculation formula:

The birth year-and-month similarities of the persons and of their spouses are weighted and summed, and the result is the overall similarity of the personal names, where α and β are adjustment parameters and α+β＝1;
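Steps 2.1.1 to 2.1.3 can be sketched as below. The cosine of the (year, month) vectors and the weighted sum with α+β＝1 follow the description; the function names, the default weights and the sample dates in the usage note are hypothetical.

```python
import math
from typing import Tuple

YearMonth = Tuple[int, int]  # (birth year, birth month)


def cosine(u: YearMonth, v: YearMonth) -> float:
    """Cosine of the angle between two (year, month) vectors."""
    norm_u = math.hypot(*u)
    norm_v = math.hypot(*v)
    if norm_u == 0 or norm_v == 0:
        raise ValueError("a zero vector has no direction")
    return (u[0] * v[0] + u[1] * v[1]) / (norm_u * norm_v)


def name_similarity(
    person_l: YearMonth,
    person_s: YearMonth,
    spouse_l: YearMonth,
    spouse_s: YearMonth,
    alpha: float = 0.5,  # hypothetical default, not from the source
    beta: float = 0.5,   # hypothetical default, not from the source
) -> float:
    """Weighted sum of the person cosine and the spouse cosine."""
    if not math.isclose(alpha + beta, 1.0):
        raise ValueError("alpha + beta must equal 1")
    return alpha * cosine(person_l, person_s) + beta * cosine(spouse_l, spouse_s)
```

For example, `name_similarity((1500, 6), (1510, 1), (1502, 3), (1520, 12))` compares two hypothetical people and their hypothetical spouses. With raw years in the thousands and months between 1 and 12, both vectors point in nearly the same direction, so the cosine stays very close to 1; the source does not say whether the coordinates are rescaled first.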

Step 2.2, calculating the similarity of the non-name words;

Word similarity is calculated using HowNet, with the following calculation formula:

where L_v1 and its paired word are two words from the input question L and the corpus question S^z respectively, C_1i is a sense of the word L_v1 in the input question L, and C_2j is a sense of the paired word in the corpus question S^z. The maximum similarity over all pairs of senses of the two words is used as the similarity of the non-name words;
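The "maximum over all sense pairs" rule can be sketched as follows. The sense-level similarity Sim(C_1i,C_2j) is passed in as a function, because the HowNet computation itself is not shown in this text; the Jaccard overlap of feature sets used here is only a stand-in and is not HowNet's method, and all names and sample senses are hypothetical.

```python
from typing import Callable, FrozenSet, Sequence, TypeVar

Sense = TypeVar("Sense")


def word_similarity(
    senses_1: Sequence[Sense],
    senses_2: Sequence[Sense],
    sense_sim: Callable[[Sense, Sense], float],
) -> float:
    """Maximum sense-level similarity over all pairs (C_1i, C_2j)."""
    if not senses_1 or not senses_2:
        return 0.0  # assumption: a word with no known senses matches nothing
    return max(sense_sim(c1, c2) for c1 in senses_1 for c2 in senses_2)


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Stand-in sense similarity: overlap of two feature sets (not HowNet)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Hypothetical senses described by hypothetical feature sets.
    word_l = [frozenset({"human", "ruler"}), frozenset({"chess", "piece"})]
    word_s = [frozenset({"human", "ruler", "noble"})]
    print(word_similarity(word_l, word_s, jaccard))
```

Passing the sense-level function as a parameter keeps the maximum rule independent of whichever knowledge base supplies the senses.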

Step 2.3, combining the calculated personal-name similarity and non-name similarity gives the semantic similarity of the whole sentence, calculated as follows:

where a denotes the number of best matching pairs obtained from the personal-name sets of the input question L and the corpus question S^z; within a pair of names, L_u1 denotes the name from sentence L and its counterpart denotes the name from sentence S^z; b denotes the number of best matching pairs obtained from the non-name word sets of the input question and the corpus question; within a pair of words, L_v1 denotes the word from sentence L and its counterpart denotes the word from sentence S^z; γ₁ denotes the weight given to personal-name words, γ₂ denotes the weight given to non-name words, and γ₁+γ₂＝1.
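One way to read step 2.3 is sketched below. The formula itself is missing from this text, so the sketch assumes that the name-pair scores are averaged over the a pairs, the non-name-pair scores are averaged over the b pairs, and the two averages are combined with γ₁ and γ₂. The greedy pairing procedure, the default weights and the choice of 0.0 for an empty set of pairs are likewise assumptions.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def best_pair_scores(
    items_l: Sequence[T],
    items_s: Sequence[T],
    sim: Callable[[T, T], float],
) -> List[float]:
    """Greedy best matching: repeatedly take the highest-scoring unused pair."""
    scored = sorted(
        ((sim(x, y), i, j)
         for i, x in enumerate(items_l)
         for j, y in enumerate(items_s)),
        key=lambda t: t[0],
        reverse=True,
    )
    used_l, used_s, scores = set(), set(), []
    for score, i, j in scored:
        if i not in used_l and j not in used_s:
            used_l.add(i)
            used_s.add(j)
            scores.append(score)
    return scores


def mean_or_zero(scores: Sequence[float]) -> float:
    # Assumption: an empty set of pairs contributes 0.0.
    return sum(scores) / len(scores) if scores else 0.0


def semantic_similarity(
    name_scores: Sequence[float],   # similarities of the a name pairs
    word_scores: Sequence[float],   # similarities of the b non-name pairs
    gamma1: float = 0.6,            # hypothetical weight for names
    gamma2: float = 0.4,            # hypothetical weight for non-name words
) -> float:
    if abs(gamma1 + gamma2 - 1.0) > 1e-9:
        raise ValueError("gamma1 + gamma2 must equal 1")
    return gamma1 * mean_or_zero(name_scores) + gamma2 * mean_or_zero(word_scores)
```

Giving names the larger weight in this sketch mirrors the description's point that names matter more than other words, although the actual weight values are not stated in the source.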

Step 3, combining the sentence structure similarity Sim_stru(L,S^z) obtained in step 1 and the sentence semantic similarity Sim_sem(L,S^z) obtained in step 2 to obtain the overall sentence similarity Sim(L,S^z) between the current input question L and each question S^z in the corpus.

The overall sentence similarity combines the sentence structure similarity with the sentence semantic similarity, which itself integrates the personal-name and non-name similarities. The calculation formula is as follows:

Sim(L,S^z)＝λ₁Sim_stru(L,S^z)+λ₂Sim_sem(L,S^z)

where λ₁ denotes the weight of the sentence structure similarity, λ₂ denotes the weight of the sentence semantic similarity, and λ₁+λ₂＝1.

Step 4, sorting the overall sentence similarities obtained in step 3, selecting from the corpus the question with the highest overall sentence similarity to the current input question L, and outputting it as the question similarity calculation result.
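Steps 1.3, 3 and 4 reduce to two weighted sums followed by a sort. The sketch below applies the formulas Sim_stru＝μ₁Sim_Len+μ₂Sim_Ord and Sim＝λ₁Sim_stru+λ₂Sim_sem to precomputed component scores; the class name, function names, default weights and sample scores are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class CandidateScores:
    """Component similarities of one corpus question against the input question."""
    question: str
    sim_len: float
    sim_ord: float
    sim_sem: float


def overall_similarity(c: CandidateScores, mu1: float, mu2: float,
                       lam1: float, lam2: float) -> float:
    sim_stru = mu1 * c.sim_len + mu2 * c.sim_ord   # step 1.3
    return lam1 * sim_stru + lam2 * c.sim_sem      # step 3


def rank_candidates(
    candidates: Sequence[CandidateScores],
    mu1: float = 0.5, mu2: float = 0.5,    # hypothetical weights
    lam1: float = 0.3, lam2: float = 0.7,  # hypothetical weights
) -> List[Tuple[float, str]]:
    """Step 4: sort corpus questions by overall similarity, best first."""
    if abs(mu1 + mu2 - 1.0) > 1e-9 or abs(lam1 + lam2 - 1.0) > 1e-9:
        raise ValueError("each weight pair must sum to 1")
    scored = [(overall_similarity(c, mu1, mu2, lam1, lam2), c.question)
              for c in candidates]
    return sorted(scored, key=lambda t: t[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical component scores for two corpus questions.
    ranked = rank_candidates([
        CandidateScores("q1", sim_len=1.0, sim_ord=1.0, sim_sem=0.2),
        CandidateScores("q2", sim_len=0.8, sim_ord=1.0, sim_sem=0.9),
    ])
    print(ranked[0][1])  # the question returned as the result
```

In this example the structurally identical candidate q1 still ranks below q2, because the semantic term, which carries the name similarity, has the larger weight.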

Because the method accounts for sentences containing personal names and considers both the semantic and the structural similarity of sentences, the sentence similarity it calculates is more accurate and effective.

Fig. 3 illustrates sentence similarity calculation with the three conventional methods. The feature-word method in Fig. 3a considers only the occurrence frequencies of words such as "Zhu Renchang", "Zhu Rencheng" and "Jingjiang princely house" in the sentences and cannot handle semantics well. The syntactic-structure method in Fig. 3b mainly considers the structural features of sentences and does not analyze semantic information well, so the two sentences "Zhu Renchang is the first-generation prince of the Jingjiang princely house" and "Zhu Renchang is the second-generation prince of the Jingjiang princely house", whose structures are identical, are judged to be the same. The semantic-dictionary method in Fig. 3c can understand semantic information effectively, but the large-scale knowledge base HowNet does not cover all personal names, nor does it clearly distinguish similar names. For the two names "Zhu Renchang" and "Zhu Rencheng", the similarity is therefore extremely high if both names are in the knowledge base and extremely low if they are not. In fact, "Zhu Renchang" and "Zhu Rencheng" are two completely different people.

Fig. 4 considers the case in which the sentences contain personal names. When a question contains names, the similarity between the two names "Zhu Renchang" and "Zhu Rencheng" is calculated separately, and the similarity of the remaining parts of the sentences is calculated at the same time. After this semantic analysis, sentence structure analysis is performed to calculate the word order and length similarity of the sentences.

The invention provides a method for calculating the similarity of questions containing personal names: the similarity of personal names and of non-name words is calculated separately, sentence structure similarity is calculated from the word order and the length of the sentence, and the semantic similarity and the structural similarity are finally weighted to obtain the overall sentence similarity. This solves the problem that conventional sentence similarity calculation methods give results that do not match human judgment for sentences containing personal names. The method calculates the similarity of such sentences more accurately and can be widely applied in the question-answering field.

It should be noted that although the above embodiments are illustrative, they do not limit the invention, and the invention is therefore not restricted to the embodiments described above. Other embodiments obtained by those skilled in the art in light of the teachings of the invention, without departing from its principles, are considered to be within the scope of the invention.

Claims

1. A method for calculating the similarity of questions containing personal names in the question-answering field, characterized by comprising the following steps:

Step 1.1, calculating the sentence length similarity Sim_Len(L,S^z):

denoting the number of words in the corpus question S^z after word segmentation;

Step 1.2, calculating the sentence word order similarity Sim_Ord(L,S^z):

Sim_stru(L,S^z)＝μ₁Sim_Len(L,S^z)+μ₂Sim_Ord(L,S^z)

Step 2.1, calculating the similarity of the personal-name words of the sentences:

where x₁ and y₁ denote the birth year and birth month of the person named in the input question L; x₂ and y₂ denote the birth year and birth month of the person named in the corpus question S^z; p₁ and q₁ denote the birth year and birth month of the spouse of the person named in the input question L; p₂ and q₂ denote the birth year and birth month of the spouse of the person named in the corpus question S^z; α is the adjustment parameter for the person, β is the adjustment parameter for the person's spouse, and α+β＝1;

Step 2.2, calculating the similarity of the non-name words of the sentences:

Sim(L,S^z)＝λ₁Sim_stru(L,S^z)+λ₂Sim_sem(L,S^z)

where λ₁ denotes the weight of the sentence structure similarity, λ₂ denotes the weight of the sentence semantic similarity, and λ₁+λ₂＝1;

2. The method for calculating the similarity of questions containing personal names in the question-answering field according to claim 1, wherein in step 2.2 the similarity of the non-name words in the sentences is calculated using the HowNet knowledge base.