CN108549634A - A Chinese patent text similarity calculation method - Google Patents
- Publication number
- CN108549634A (application number CN201810310198.1A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- similarity
- word
- text
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to a Chinese patent text similarity calculation method, including: segmenting the text; computing TF-IDF values for the segmentation result and extracting the words with the highest TF-IDF values as keywords; locating the sentences containing keywords as key sentences, taking the maximum keyword weight in each key sentence as the weight of that key sentence, and thereby obtaining the key-sentence set of each text; computing each key sentence's weight with respect to the text, selecting in turn the key sentences of the text to be compared and of the reference text, and computing the text similarity from the sentence similarity of the key sentences. The invention uses an existing patent-domain ontology to analyse the semantic relations in patent texts, and computes patent text similarity with a vector space model combined with the domain ontology. The accuracy and recall of the results are high, the degree of similarity between patents is described more precisely, the speed of patent examination can be increased, and the needs of practical applications are well met.
Description
Technical field
The invention belongs to the technical field of text information processing, and in particular relates to a Chinese patent text similarity calculation method.
Background technology
In the current Internet era, patents, as carriers of the record of human achievement, contain a large number of scientific and technological achievements and innovations. The rapid development of science and technology has caused the number of patent applications to increase sharply every year. Traditional retrieval matches and returns results by search terms, usually taking the number of occurrences of the terms as the measure of patent relevance, without considering the semantic information contained in the patents themselves. The essence of patent examination is to retrieve the patents most similar to the unexamined patent, and the most important step in this process is computing patent text similarity. The usual algorithmic approach to text similarity represents the texts with a vector space model and then computes the vector similarity directly in the vector space as the text similarity. In recent years the ontology, as a new form of knowledge representation and description, has been widely applied to the semantic web, information retrieval and other areas, and more and more researchers have begun to use ontologies for semantic analysis.
Text similarity methods can be divided into two main classes: one converts texts into vector form with a vector space model and then computes on the vectors; the other uses semantic-dictionary methods to express the connections between texts of different lengths and reflects the similarity between texts through the number of keyword matches. Prior-art methods for computing the similarity of Chinese patent texts lose semantic information, compute Chinese text similarity inaccurately, and have low accuracy and recall; they cannot accurately reflect the similarity of patent texts and cannot meet the needs of practical applications.
Summary of the invention
In view of the above problems in the prior art, the object of the present invention is to provide a Chinese patent text similarity calculation method that avoids the above technical deficiencies.
In order to achieve the above object, the technical solution provided by the invention is as follows:
A Chinese patent text similarity calculation method, comprising the step of calculating sentence similarity.
Further, the calculation method includes:
Segmenting the text; computing TF-IDF values for the segmentation result and extracting the words with the highest TF-IDF values as keywords; locating the sentences containing the keywords as key sentences, taking the maximum keyword weight in a key sentence as the weight of that key sentence, and obtaining the key-sentence set of each text; computing each key sentence's weight with respect to the text, and selecting in turn the key sentences of the text to be compared and of the reference text.
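The keyword and key-sentence extraction described above can be sketched in Python. The function and variable names (tfidf, key_sentences, top_k) are illustrative assumptions, not names from the patent, and plain TF-IDF with a natural logarithm is assumed:

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists (one per text). Returns one {term: tf-idf} dict per text."""
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({t: (c / total) * math.log(n / df[t]) for t, c in tf.items()})
    return scores

def key_sentences(sentences, doc_scores, top_k=20):
    """sentences: token lists of one text. Keeps the sentences containing a
    top-k keyword; each key sentence's weight is the maximum keyword weight
    among the keywords it contains, as the text specifies."""
    keywords = dict(sorted(doc_scores.items(), key=lambda kv: -kv[1])[:top_k])
    result = []
    for sent in sentences:
        hits = [keywords[w] for w in sent if w in keywords]
        if hits:
            result.append((sent, max(hits)))
    return result
```

The default top_k of 20 mirrors the embodiment, which takes the top 20 words by TF-IDF as keywords.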
Further, word similarity is converted into the similarity of the corresponding concepts in the ontology; the similarity formula for concepts in the ontology is:
where w1 and w2 denote the two words and dis(w1, w2) denotes their semantic distance in the domain ontology.
Further, after adding the position of the lowest common parent node and the local node density, the similarity formula for concepts in the ontology is:
where r denotes the root node of the tree, com denotes the lowest common parent node of w1 and w2, dis(r, com) denotes the depth of the lowest common parent node, and num(w1) denotes the number of sibling nodes of w1.
Further, word similarity is computed based on word2vec. The input layer of the CBOW model consists of the word vectors of the n words before and after the current word; the middle hidden layer sums these 2n word vectors to obtain Wx. The output layer is a Huffman tree, built with the words of the corpus as leaf nodes and the frequency of each word as its weight. Wx is used to predict the current word by stochastic gradient descent so that the value of p(w | context(w)) is maximised, where context(w) denotes the n words before and after w. The corpus is trained with word2vec to obtain the word vectors of all words; computing the similarity between two words is then converted into computing the similarity of their word vectors, with the formula:
sim(w1, w2) = Σ(i=1..K) x1i·x2i / ( sqrt(Σ(i=1..K) x1i²) · sqrt(Σ(i=1..K) x2i²) ),
where w1 and w2 are the word vectors obtained for the two words after training, and x1i and x2i denote the values of the i-th dimension of the two word vectors in the vector space.
Further, the two word similarities simow(w1, w2) and simrw(w1, w2) are computed with the ontology and word2vec respectively and combined into the final word similarity with the formula:
simw(w1, w2) = (simow(w1, w2) + simrw(w1, w2)) / 2 if w1 ∈ S and w2 ∈ S, and simw(w1, w2) = simrw(w1, w2) otherwise,
where S denotes the concept set of the ontology. If either of the two words does not belong to the ontology concept set, the similarity obtained with word2vec is used as the word similarity; if both belong to the ontology concept set, the average of the ontology word similarity and the word2vec word similarity is taken as the final word similarity.
Further, the sentence similarity is computed as follows:
Suppose there are two sentences S1 = (w11, w12, ..., w1n) and S2 = (w21, w22, ..., w2m), where the wij are the content words obtained after segmenting the sentences and removing stop words. Define (w1i, w2j) as a word mapping between S1 and S2: if simw(w1i, w2j) > simw(w1k, w2l) holds for arbitrary k, l, then w1i and w2j are judged to be the most semantically similar word pair in the two sentences. After a most similar word pair is obtained, the two words are removed from their respective sentences and the computation is repeated until the word list of one of the sentences is empty. The formula is:
simws(S1, S2) denotes the entity-word similarity of S2 relative to S1.
Further, the relation similarity within sentences is computed based on the non-taxonomic relations of the patent-domain ontology, with the following steps:
Part-of-speech tagging is applied to the two sentences; stop words are removed with a stop-word list, and all words other than those with verb or noun part of speech are discarded, yielding an ordered word list for each sentence. The ordered word list of the first sentence is defined as S1(w11:pos11, w12:pos12, ..., w1n:pos1n) and that of the other sentence as S2(w21:pos21, w22:pos22, ..., w2m:pos2m). For each sentence's word list, each verb and the nouns before and after it are chosen to form an SAO-structure phrase P(n1, v, n2), converting the ordered word lists into the phrase sets S1 = (P11, P12, ..., P1n) and S2 = (P21, P22, ..., P2m). Given an existing non-taxonomic relation set NR(r1, r2, ..., rl), where each rl is a phrase with SAO structure, the non-taxonomic relation similarity of the sentences is computed from the number of times the phrases of the two sentences appear in NR, with the formula:
where num(S1) denotes the number of phrases in S1 that belong to the non-taxonomic relation set NR, and com(S1, S2) denotes the intersection of the phrase sets S1 and S2, i.e. the phrases shared by S1 and S2.
The overall similarity between sentences is then
sims(S1, S2) = β·simws(S1, S2) + (1 − β)·simps(S1, S2),
where β denotes the proportion of the entity-word similarity in the sentence similarity and sims(S1, S2) denotes the sentence similarity of S2 relative to S1.
Further, the text similarity is computed on the basis of the word and sentence similarities above, with the following steps:
The text is first segmented; TF-IDF values are then computed for the segmentation result, the words with the highest TF-IDF values are extracted as keywords, and the sentences containing keywords are located as key sentences. The weight of each key sentence with respect to the text is computed: the keyword with the maximum TF-IDF value in the key sentence's keyword set is chosen, and its weight is taken as the weight w(S) of the key sentence. This finally yields the sentence sets of the two texts; let
D1(S11:w(S11), S12:w(S12), ..., S1n:w(S1n)) denote the sentence set of text D1 and D2(S21:w(S21), S22:w(S22), ..., S2m:w(S2m)) denote the sentence set of D2.
Define (S1i, S2j) as a sentence correspondence between the two texts: if sims(S1i, S2j) ≥ sims(S1l, S2k) holds for arbitrary l, k, then S1i and S2j are considered the most closely related sentences in the two texts, where sims(S1i, S2j) is computed from the sentence similarity.
Further, the similarity formula for two texts D1 and D2 is:
where sims(S1i, S2j) denotes the overall similarity between sentence S1i and sentence S2j, w(S1i) denotes the key-sentence weight of S1i, and w(S2j) denotes the key-sentence weight of S2j.
The Chinese patent text similarity calculation method provided by the invention proposes a layered way of computing text similarity: the computation is divided into the three levels of word, sentence and text and is carried out from the bottom up. The method computes text similarity with the sentence as the unit of granularity, combines an existing domain ontology with word2vec to compute word similarity, adds the relation similarity obtained from non-taxonomic relations when computing sentence similarity, and finally computes the text similarity according to the weights of the different sentences. The invention uses an existing patent-domain ontology to analyse the semantic relations in patent texts and computes patent text similarity with a vector space model and the domain ontology. The results are accurate, with high accuracy and recall; the degree of similarity between patents is described more precisely, the speed of patent examination can be increased, users can analyse patent resources more efficiently, and the needs of practical applications are well met.
Description of the drawings
Fig. 1 is a diagram of the CBOW model;
Fig. 2 is a diagram of the Skip-gram model.
Specific implementation mode
In order to make the purpose, technical scheme and advantages of the present invention clearer, the present invention is further described below in conjunction with the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here only serve to explain the present invention, not to limit it. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative work shall fall within the protection scope of the present invention.
Word similarity is a measure of the semantic similarity between words. In a domain ontology a word appears in the form of a concept, so the similarity computation for words can be converted into a similarity computation for concepts in the ontology. To use an existing domain ontology while avoiding the problem that similarities cannot be computed for words not covered by the ontology, word2vec is combined with the ontology to compute word similarity.
(1) Computing word similarity based on the ontology
The concepts contained in the domain ontology form a tree-shaped hierarchy through hypernym-hyponym (taxonomic) relations, and the similarity between concepts can generally be obtained by computing their semantic distance in the ontology tree: find the lowest common parent node of the two concepts, and take the sum of the distances from that node to the two concepts as the semantic distance of the concepts in the ontology.
The similarity formula is:
where w1 and w2 denote the two words and dis(w1, w2) denotes their semantic distance in the domain ontology.
The similarity between words is related not only to the semantic distance between the concepts, but also to the position of their lowest common parent node in the ontology tree and to the number of sibling nodes around the concepts. For the same semantic distance between concepts, the deeper the lowest common parent node lies in the tree, the greater the similarity of the words. Likewise, the more sibling nodes surround the concept corresponding to a word, the greater the local density; this indicates a more finely refined concept node, and the similarity of the words is greater. After adding the position of the lowest common parent node and the local node density, the formula becomes:
where r denotes the root node of the tree, com denotes the lowest common parent node of w1 and w2, dis(r, com) denotes the depth of the lowest common parent node, and num(w1) denotes the number of sibling nodes of w1.
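The tree-based concept similarity can be sketched as follows. Because the patent's formula images are not reproduced in this text, the exact combining formula is unknown; the sketch assumes a common form in which similarity decreases with semantic distance and increases with the depth of the lowest common parent node and the local sibling density. The ontology is modelled as a child-to-parent dict, and all names (parent, concept_sim, alpha) are illustrative:

```python
def lca_and_distance(parent, a, b):
    """Lowest common ancestor of a and b plus the path length between them.
    parent: dict mapping each node to its parent (root maps to None)."""
    anc, node, d = {}, a, 0
    while node is not None:
        anc[node] = d
        node = parent.get(node)
        d += 1
    node, d = b, 0
    while node not in anc:
        node = parent[node]
        d += 1
    return node, anc[node] + d

def depth(parent, node):
    d = 0
    while parent.get(node) is not None:
        node = parent[node]
        d += 1
    return d

def siblings(parent, node):
    p = parent.get(node)
    return (sum(1 for n in parent if parent[n] == p) - 1) if p else 0

def concept_sim(parent, w1, w2, alpha=1.0):
    """Assumed form: identical concepts give 1.0; a deeper common parent and
    denser sibling neighbourhoods raise the score, larger distance lowers it."""
    com, dis = lca_and_distance(parent, w1, w2)
    d_com = depth(parent, com)
    dens = 1 + siblings(parent, w1) + siblings(parent, w2)
    return (alpha * (1 + d_com)) / (alpha * (1 + d_com) + dis / dens)
```

With this form, identical words get similarity 1.0, matching the intuition behind the semantic-distance definition, even though the patent's exact constants are unknown.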
(2) Computing word similarity based on word2vec
Word2vec is an open-source tool from Google that converts words into real-valued vectors. Using ideas from deep learning, it reduces a word, through training, to a vector in a K-dimensional vector space, so that semantic similarity between words can be converted into operations on word vectors.
Word2vec uses the distributed representation of words proposed by Hinton in 1986. Its basic idea is to map each word to a K-dimensional vector space with a training model, so that every word is represented by a K-dimensional vector. K is a hyperparameter that must be specified in advance. Once words are converted to vector representations, the semantic similarity between words can be expressed by the distance between vectors. Word2vec provides the CBOW model (Continuous Bag-of-Words Model) and the Skip-gram model; the structures of the two models are shown in Fig. 1 and Fig. 2. The CBOW model uses the 2n words in the context of the current word (n is 2 in Fig. 1) to predict the current word, while the Skip-gram model uses the current word to predict the 2n words of its context (n is 2 in Fig. 2).
The input layer of the CBOW model consists of the word vectors of the n words before and after the current word; the middle hidden layer sums these 2n word vectors to obtain Wx. The output layer is a Huffman tree, built with the words of the corpus as leaf nodes and the frequency of each word as its weight. Wx is used to predict the current word by stochastic gradient descent so that the value of p(w | context(w)) is maximised, where context(w) denotes the n words before and after w. When training is complete, the word vectors of all words are obtained. The training process of the Skip-gram model is similar to that of the CBOW model.
The corpus is trained with word2vec to obtain the word vectors of all words, and computing the similarity between two words is converted into computing the similarity of their word vectors. For the similarity of space vectors the cosine similarity algorithm is generally used:
sim(w1, w2) = Σ(i=1..K) x1i·x2i / ( sqrt(Σ(i=1..K) x1i²) · sqrt(Σ(i=1..K) x2i²) ),
where w1 and w2 are the word vectors obtained for the two words after training, and x1i and x2i denote the values of the i-th dimension of the two word vectors in the vector space.
(3) Word similarity
The two word similarities simow(w1, w2) and simrw(w1, w2) are computed with the ontology and word2vec respectively and combined into the word similarity with the following formula:
simw(w1, w2) = (simow(w1, w2) + simrw(w1, w2)) / 2 if w1 ∈ S and w2 ∈ S, and simw(w1, w2) = simrw(w1, w2) otherwise,
where S denotes the concept set of the ontology. If either of the two words does not belong to the ontology concept set, the similarity obtained with word2vec is used as the word similarity; if both belong to the ontology concept set, the average of the ontology word similarity and the word2vec word similarity is taken as the final word similarity.
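The piecewise combination described above is fully specified by the text and can be sketched as follows; the callables sim_ow and sim_rw stand in for the ontology-based and word2vec-based similarities, and all names are illustrative:

```python
def word_sim(w1, w2, onto_concepts, sim_ow, sim_rw):
    """onto_concepts: set S of concepts in the ontology.
    sim_ow / sim_rw: callables returning ontology and word2vec similarity.
    Average the two when both words are ontology concepts, else fall back
    to the word2vec similarity alone."""
    if w1 in onto_concepts and w2 in onto_concepts:
        return (sim_ow(w1, w2) + sim_rw(w1, w2)) / 2.0
    return sim_rw(w1, w2)
```

The fallback branch is what lets the method cover words missing from the domain ontology, which is the motivation given earlier for introducing word2vec.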
Sentence similarity is generally computed on the basis of the significant content words in the text, obtaining the sentence similarity from the word similarities between content words. In the present invention the sentence similarity is computed on the basis of the word similarity described by formula (4) and the non-taxonomic relations of the patent-domain ontology.
Suppose there are two sentences S1 = (w11, w12, ..., w1n) and S2 = (w21, w22, ..., w2m), where the wij are the content words obtained after segmenting the sentences and removing stop words. Define (w1i, w2j) as a word mapping between S1 and S2: if simw(w1i, w2j) > simw(w1k, w2l) holds for arbitrary k, l, where simw(w1i, w2j) is computed by formula (4), then w1i and w2j are judged to be the most semantically similar word pair in the two sentences. After a most similar word pair is obtained, the two words are removed from their respective sentences and the computation is repeated until the word list of one of the sentences is empty. The formula is as follows:
Since the lengths of the sentences differ, the close word pairs shared by the two sentences contribute differently to the similarity of each sentence; the present invention uses simws(S1, S2) to denote the entity-word similarity of S2 relative to S1.
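The greedy removal of most-similar word pairs can be sketched as below. The patent's normalisation formula is not reproduced in this text, so dividing the accumulated pair similarities by the length of S1 (making the result relative to S1, as the text states) is an assumption; all names are illustrative:

```python
def entity_word_sim(s1, s2, sim_w):
    """s1, s2: lists of content words; sim_w: word-similarity callable.
    Repeatedly extract the globally most similar word pair, remove both
    words, and stop when one sentence's word list is empty."""
    s1, s2 = list(s1), list(s2)
    base = len(s1)                      # assumed normaliser: |S1|
    total = 0.0
    while s1 and s2:
        best = max(((a, b) for a in s1 for b in s2), key=lambda p: sim_w(*p))
        total += sim_w(*best)
        s1.remove(best[0])
        s2.remove(best[1])
    return total / base if base else 0.0
```

Because the normaliser is |S1|, entity_word_sim(s1, s2, ...) and entity_word_sim(s2, s1, ...) can differ, which matches the text's point that the measure is relative to one sentence.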
The formula above considers only the similarity of the most similar entity words in the two sentences, not the similarity of semantically close phrases within them. Based on the non-taxonomic relations of the patent-domain ontology, the relation similarity within the sentences is computed.
Part-of-speech tagging is applied to the two sentences; stop words are removed with a stop-word list, and all words except those with verb or noun part of speech are discarded. This yields an ordered word list for each sentence; the ordered word list of the first sentence is defined as S1(w11:pos11, w12:pos12, ..., w1n:pos1n) and, similarly, that of the other sentence as S2(w21:pos21, w22:pos22, ..., w2m:pos2m). For each sentence's word list, each verb and the nouns before and after it are chosen to form an SAO-structure phrase P(n1, v, n2), converting the ordered word lists into the phrase sets S1 = (P11, P12, ..., P1n) and S2 = (P21, P22, ..., P2m). Given an existing non-taxonomic relation set NR(r1, r2, ..., rl), where each rl is a phrase with SAO structure, the non-taxonomic relation similarity of the sentences is computed from the number of times the phrases of the two sentences appear in NR. The formula is as follows:
where num(S1) denotes the number of phrases in S1 that belong to the non-taxonomic relation set NR, and com(S1, S2) denotes the intersection of the phrase sets S1 and S2, i.e. the phrases shared by S1 and S2.
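The SAO phrase extraction and the relation similarity can be sketched as follows. The exact formula image is missing from the source, so the symmetric ratio used in relation_sim (shared phrases that occur in NR, over the total NR hits of both sentences) is an assumed reading of num() and com(); all names are illustrative:

```python
def sao_phrases(tagged):
    """tagged: list of (word, pos) pairs after filtering, pos in {'n', 'v'}.
    For each verb, pair it with the nearest preceding and following noun
    to form an SAO-structure phrase (n1, v, n2)."""
    phrases = []
    for i, (w, pos) in enumerate(tagged):
        if pos != 'v':
            continue
        prev = next((tagged[j][0] for j in range(i - 1, -1, -1)
                     if tagged[j][1] == 'n'), None)
        nxt = next((tagged[j][0] for j in range(i + 1, len(tagged))
                    if tagged[j][1] == 'n'), None)
        if prev and nxt:
            phrases.append((prev, w, nxt))
    return phrases

def relation_sim(p1, p2, nr):
    """p1, p2: phrase lists of the two sentences; nr: set of SAO phrases.
    Assumed symmetric form: shared NR phrases over total NR hits."""
    n1 = sum(1 for p in p1 if p in nr)
    n2 = sum(1 for p in p2 if p in nr)
    shared = len(set(p1) & set(p2) & nr)
    return 2 * shared / (n1 + n2) if (n1 + n2) else 0.0
```

Representing phrases as tuples makes set intersection with the relation set NR a direct membership test.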
From the entity-word similarity obtained by formula (5) and the non-taxonomic relation similarity obtained by formula (6), the overall similarity between the sentences is obtained as shown in formula (7):
sims(S1, S2) = β·simws(S1, S2) + (1 − β)·simps(S1, S2)   (7),
where β denotes the proportion of the entity-word similarity in the sentence similarity and sims(S1, S2) denotes the sentence similarity of S2 relative to S1.
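Formula (7) itself is a one-line weighted mix; a sketch with an illustrative default β of 0.5 (the patent does not state a value here):

```python
def sentence_sim(sim_ws, sim_ps, beta=0.5):
    """Formula (7): weighted mix of entity-word similarity (sim_ws)
    and non-taxonomic relation similarity (sim_ps); 0 <= beta <= 1."""
    return beta * sim_ws + (1 - beta) * sim_ps
```

Choosing β closer to 1 makes the content words dominate; closer to 0, the SAO relations dominate.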
Text similarity is computed on the basis of the word and sentence similarities above. In a text, not every word and sentence is related to the main meaning the text expresses: the keywords and the key sentences containing them express the meaning of the text best, while the others mainly serve as connectives. In the computation of text similarity, a result obtained from the similarity of keywords and key sentences is closely tied to the meaning expressed by the two documents themselves and can represent the similarity between the texts more accurately.
When choosing the keywords of a text, the importance of each word to the text must be computed. Term frequency-inverse document frequency (TF-IDF) is generally used: the method is relatively simple to compute, has high accuracy and recall, and is widely used for computing weights. The more frequently a word appears in one text and the less frequently it appears in other texts, the better the word represents the topic of that text and the more important it is to that text.
To compute the text similarity, the text is first segmented; TF-IDF values are then computed for the segmentation result, the words with the highest TF-IDF values are extracted as keywords, and the sentences containing keywords are located as key sentences. Different key sentences contain different keywords, and because of differences in the number of keywords they contain and in the keyword weights themselves, the key sentences differ in importance to the text; the weight of each key sentence with respect to the text must therefore be computed. The importance of a key sentence to the text depends on the keywords it contains: the keyword with the maximum TF-IDF value in the key sentence's keyword set is chosen and its weight is taken as the weight w(S) of the key sentence. This finally yields the sentence sets of the two texts; let
D1(S11:w(S11), S12:w(S12), ..., S1n:w(S1n)) denote the sentence set of text D1 and D2(S21:w(S21), S22:w(S22), ..., S2m:w(S2m)) denote the sentence set of D2.
Define (S1i, S2j) as a sentence correspondence between the two texts: if sims(S1i, S2j) ≥ sims(S1l, S2k) holds for arbitrary l, k, then S1i and S2j are considered the most closely related sentences in the two texts, where sims(S1i, S2j) is computed from the sentence similarity.
As with the sentence similarity computation, since the numbers of sentences in the two texts differ, the sets of most similar sentences found when computing the similarity differ for different texts, and so do the similarities obtained; sim(D1, D2) denotes the text similarity of D2 relative to D1, with D1 as the reference.
The similarity formula for the two texts is as follows:
The formula involves not only all the key sentences of the two texts but also the importance of each key sentence to the entire document, and thus reflects the similarity between the texts well.
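The text-level matching can be sketched in the same greedy style as the sentence level. The formula image is not reproduced in the source, so weighting each matched pair's sentence similarity by the key-sentence weight of D1 and normalising by the total weight of D1 (making the result relative to D1, as the text states) is an assumption; all names are illustrative:

```python
def text_sim(d1, d2, sent_sim):
    """d1, d2: lists of (sentence, key-sentence weight); sent_sim: callable.
    Greedily match the most similar sentence pair, accumulate its similarity
    weighted by the D1 sentence's weight, and normalise by D1's total weight."""
    d1, d2 = list(d1), list(d2)
    denom = sum(w for _, w in d1)       # assumed normaliser: total D1 weight
    total = 0.0
    while d1 and d2:
        (s1, w1), (s2, w2) = max(
            ((a, b) for a in d1 for b in d2),
            key=lambda p: sent_sim(p[0][0], p[1][0]))
        total += sent_sim(s1, s2) * w1
        d1.remove((s1, w1))
        d2.remove((s2, w2))
    return total / denom if denom else 0.0
```

As at the sentence level, the measure is asymmetric by construction: text_sim(d1, d2, ...) is the similarity of D2 relative to D1.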
This embodiment uses an existing domain ontology for the new-energy-vehicle field; the corpus consists of 50 patent texts of the same category among Chinese patents in the new-energy-vehicle field.
One of the patents is chosen as the text N to be compared and all other patents as reference texts D. The Chinese patent text similarity is computed with the Chinese patent text similarity calculation method of the present invention, with the following steps:
1) All patent texts are split into sentences;
2) With the Hanlp tool and an added new-energy-vehicle domain dictionary, the sentence-split results are segmented, the part of speech of each word is retained, and stop words are removed with a stop-word list;
3) The TF-IDF values of all patent texts are computed and sorted in descending order; the top 20 words are taken as keywords, the key sentences containing the keywords are marked, the maximum keyword weight in each key sentence is taken as the weight of that key sentence, and the key-sentence set D(S1:w(S1), S2:w(S2), ..., Sn:w(Sn)) of each text is obtained;
4) The key sentences of the text N to be compared and of the reference text D are chosen in turn, and the entity-word similarity and the relation similarity are computed with the sentence-similarity and word-similarity algorithms;
5) The similarity of the two texts is computed from the sentence similarities of the two texts N and D.
Accuracy (P), recall (R) and the F value are generally used as performance indicators for the evaluation of text similarity computation methods. Let T(t) denote the number of items whose annotated value is t, C(t) the number of items whose computed text similarity falls within the range represented by t, and TC(t) the number of items whose annotated value is t and whose computed value also falls within the range represented by t. The evaluation indicators are defined as follows:
Accuracy P: P = TC(t) / C(t)
Recall R: R = TC(t) / T(t)
F value: F = 2PR / (P + R)
The accuracy, recall and F values of the Chinese patent text similarity results obtained with the method of the invention are all very high, far above the accuracy, recall and F values of the results of the prior art.
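The three indicators follow directly from the definitions of T(t), C(t) and TC(t); a sketch with illustrative names:

```python
def evaluate(t_count, c_count, tc_count):
    """t_count: items annotated with value t; c_count: items whose computed
    similarity falls in range t; tc_count: items satisfying both.
    Returns (accuracy P, recall R, F value)."""
    p = tc_count / c_count if c_count else 0.0
    r = tc_count / t_count if t_count else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f
```

The F value is the harmonic mean of P and R, so it penalises a method that trades one indicator for the other.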
The Chinese patent text similarity calculation method provided by the invention proposes a layered way of computing text similarity: the computation is divided into the three levels of word, sentence and text and is carried out from the bottom up. The method computes text similarity with the sentence as the unit of granularity, combines an existing domain ontology with word2vec to compute word similarity, adds the relation similarity obtained from non-taxonomic relations when computing sentence similarity, and finally computes the text similarity according to the weights of the different sentences. The invention uses an existing patent-domain ontology to analyse the semantic relations in patent texts and computes patent text similarity with a vector space model and the domain ontology. The results are accurate, with high accuracy and recall, so that the correlation between texts is described more exactly and the degree of similarity between patents more precisely; the speed of patent examination can be increased, users can analyse patent resources more efficiently, and the needs of practical applications are well met.
The embodiments described above only express implementations of the present invention; their description is relatively specific and detailed, but they must not therefore be interpreted as limiting the scope of the claims of the invention. It should be pointed out that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the inventive concept, and these belong to the protection scope of the present invention. The protection scope of this patent shall therefore be determined by the appended claims.
Claims (10)
1. a kind of Chinese patent text similarity calculating method, which is characterized in that include the steps that calculating sentence similarity.
2. Chinese patent text similarity calculating method according to claim 1, which is characterized in that the computational methods packet
It includes:
Text is segmented;TF-IDF values are calculated to word segmentation result, it is higher as keyword, positioning pass to extract TF-IDF values
Sentence where keyword obtains every as critical sentence, and using the maximum weights of keyword in critical sentence as the weights of critical sentence
The critical sentence set of a text;The weight to text for calculating each critical sentence chooses text to be compared and comparison text successively
Critical sentence.
3. the Chinese patent text similarity calculating method according to claim 1-2, which is characterized in that by Words similarity
The similarity of concept in the body is converted into calculate;The calculating formula of similarity of concept in the body is:
Wherein w1And w2Indicate two words, dis (w1, w2) indicate w1And w2Semantic distance in domain body.
4. The Chinese patent text similarity calculation method according to claims 1-3, characterized in that the position of the lowest common ancestor node and the local node density are added, and the concept similarity formula becomes:
where r denotes the root node of the tree, com denotes the lowest common ancestor of w1 and w2, dis(r, com) denotes the depth of the lowest common ancestor, and num(w1) denotes the number of sibling nodes of node w1.
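The tree quantities used in claims 3-4 (semantic distance, lowest common ancestor, its depth) can be computed from a parent-pointer representation of the ontology. The patent's exact formulas are given as images and are not reproduced here; the `(1 + depth) / (1 + depth + dis)` combination below is only an illustrative stand-in that, like the claimed formula, increases with the depth of the common ancestor and decreases with path distance.

```python
class OntologyTree:
    """Toy concept tree represented as a child -> parent map."""

    def __init__(self, parent):
        self.parent = parent

    def path_to_root(self, node):
        # Node itself, then its ancestors up to the root r.
        path = [node]
        while path[-1] in self.parent:
            path.append(self.parent[path[-1]])
        return path

    def lca(self, a, b):
        # Lowest common ancestor: com in the claims' notation.
        ancestors = set(self.path_to_root(a))
        for n in self.path_to_root(b):
            if n in ancestors:
                return n
        return None

    def distance(self, a, b):
        # dis(a, b): edges from a to com plus edges from b to com.
        com = self.lca(a, b)
        pa, pb = self.path_to_root(a), self.path_to_root(b)
        return pa.index(com) + pb.index(com)

def concept_sim(tree, a, b):
    """Illustrative similarity: deeper common ancestors and shorter
    paths give higher scores (assumed form, not the patented formula)."""
    com = tree.lca(a, b)
    depth_com = len(tree.path_to_root(com)) - 1  # dis(r, com)
    dis = tree.distance(a, b)
    return (1 + depth_com) / (1 + depth_com + dis)
```

A sibling count `num(w1)`, as used in claim 4, could be added by counting children of `parent[w1]` in the same map.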
5. The Chinese patent text similarity calculation method according to claims 1-4, characterized in that word similarity is calculated based on word2vec: the input layer of the CBOW model consists of the vectors of the n words before and after the current word, and an intermediate hidden layer sums these 2n word vectors to obtain Wx; the output layer is a Huffman tree built with the words of the corpus as leaf nodes and each word's frequency as its weight; Wx is used for prediction via stochastic gradient descent so that p(w | context(w)) is maximized, where context(w) refers to the n words before and after w; the corpus is trained with word2vec to obtain the vectors of all words; calculating the similarity between two words is then converted into calculating the similarity of their corresponding word vectors, with the formula:
where w1 and w2 are the word vectors obtained for the two words after training, and x1i and x2i denote the values of the i-th dimension of the two word vectors in the vector space.
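The dimension-wise formula in claim 5 is the standard cosine similarity between the two trained word vectors; a minimal sketch (the training itself would typically be done with a library such as gensim, which is an assumption here):

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity of two word vectors:
    sum(x1i * x2i) / (||v1|| * ||v2||)."""
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(x * x for x in v2))
    return dot / (norm1 * norm2)
```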
6. The Chinese patent text similarity calculation method according to claims 1-5, characterized in that two word similarities, simow(w1, w2) and simrw(w1, w2), are calculated separately using the ontology and word2vec and then combined into the word similarity formula:
where S denotes the concept set in the ontology; if one of the two words does not belong to the ontology concept set, the similarity obtained with word2vec is used as the word similarity; if both belong to the ontology concept set, the average of the ontology word similarity and the word2vec word similarity is taken as the final word similarity.
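The case split of claim 6 translates directly into code. The function name and the callable interface for the two underlying similarity measures are illustrative assumptions:

```python
def combined_word_sim(w1, w2, ontology_concepts, sim_ontology, sim_w2v):
    """Claim 6's combination: average the ontology and word2vec
    similarities when both words are in the ontology concept set S,
    otherwise fall back to the word2vec similarity alone."""
    if w1 in ontology_concepts and w2 in ontology_concepts:
        return (sim_ontology(w1, w2) + sim_w2v(w1, w2)) / 2
    return sim_w2v(w1, w2)
```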
7. The Chinese patent text similarity calculation method according to claims 1-6, characterized in that the sentence similarity is calculated as follows:
suppose there are two sentences S1 = (w11, w12, ..., w1n) and S2 = (w21, w22, ..., w2m), where w11, w12, w21, w22 are the content words obtained after segmenting the sentences and removing stop words; define (w1i, w2j) as a word mapping between sentences S1 and S2; if simw(w1i, w2j) > simw(w1k, w2l) holds for arbitrary k and l, then w1i and w2j are judged to be the most semantically similar word pair in the two sentences; after one most similar word pair is obtained, the two words are removed from their respective sentences and the calculation is repeated until the vocabulary of one of the sentences is empty; the calculation formula is:
where simws(S1, S2) denotes the content-word similarity of S2 relative to S1.
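The greedy pairing loop of claim 7 can be sketched as follows. The final averaging over pair scores is an assumption on my part, since the patent's aggregation formula is given as an image:

```python
def sentence_word_sim(s1, s2, sim_w):
    """Repeatedly find the most similar word pair across the two
    sentences, record its score, remove both words, and stop when
    either word list is empty (claim 7's procedure)."""
    s1, s2 = list(s1), list(s2)
    scores = []
    while s1 and s2:
        score, a, b = max((sim_w(a, b), a, b) for a in s1 for b in s2)
        scores.append(score)
        s1.remove(a)
        s2.remove(b)
    # Assumed aggregation: mean of the matched-pair similarities.
    return sum(scores) / len(scores) if scores else 0.0
```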
8. The Chinese patent text similarity calculation method according to claims 1-7, characterized in that the relation similarity within sentences is calculated using non-taxonomic relations based on the patent-domain ontology, with the following steps:
perform part-of-speech tagging on the two sentences, remove stop words using a stop-word list, remove words of other parts of speech so that only verbs and nouns are retained, and obtain an ordered vocabulary for each of the two sentences; define the ordered vocabulary of the first sentence as S1(w11:pos11, w12:pos12, ..., w1n:pos1n) and that of the other sentence as S2(w21:pos21, w22:pos22, ..., w2m:pos2m); from each sentence's vocabulary, select each verb together with the nouns before and after it to form an SAO-structure phrase P(n1, v, n2); convert the ordered vocabulary of each sentence into the phrase sets S1 = (P11, P12, ..., P1n) and S2 = (P21, P22, ..., P2m); obtain the non-taxonomic relation set, assuming the existing non-taxonomic relation set is NR(r1, r2, ..., rl), where rl is a phrase with SAO structure in the non-taxonomic set; calculate the non-taxonomic relation similarity of the sentences by counting the occurrences of the two sentences' phrases in the non-taxonomic relation set NR, with the formula:
where num(S1) denotes the number of phrases in set S1 that belong to the non-taxonomic relation set NR, and com(S1, S2) denotes the intersection of the sentence phrase sets S1 and S2, i.e., the set of phrases shared by S1 and S2.
The overall similarity between sentences is
sims(S1, S2) = β·simws(S1, S2) + (1-β)·simps(S1, S2),
where β denotes the proportion of content-word similarity in the sentence similarity, and sims(S1, S2) denotes the sentence similarity of S2 relative to S1.
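The SAO extraction and the weighted combination of claim 8 can be sketched as below. The `"n"`/`"v"` tags, the nearest-noun pairing, and the exact combination inside `relation_sim` are assumptions (the patent's relation-similarity formula is an image); only the β-weighted sentence formula is taken verbatim from the claim:

```python
def extract_sao(tagged):
    """Build SAO triples P(n1, v, n2) from a POS-tagged ordered word
    list, pairing each verb with the nearest noun before and after it."""
    triples = []
    for i, (w, pos) in enumerate(tagged):
        if pos == "v":
            before = next((tagged[j][0] for j in range(i - 1, -1, -1)
                           if tagged[j][1] == "n"), None)
            after = next((tagged[j][0] for j in range(i + 1, len(tagged))
                          if tagged[j][1] == "n"), None)
            if before and after:
                triples.append((before, w, after))
    return triples

def relation_sim(p1, p2, nr):
    """Assumed non-taxonomic relation similarity: shared phrases
    relative to each sentence's phrases that fall in the set NR."""
    num1 = len([p for p in p1 if p in nr])  # num(S1)
    num2 = len([p for p in p2 if p in nr])  # num(S2)
    com = len(set(p1) & set(p2))            # |com(S1, S2)|
    total = num1 + num2
    return (2 * com) / total if total else 0.0

def sentence_sim(sim_ws, sim_ps, beta=0.5):
    # Overall sentence similarity from claim 8.
    return beta * sim_ws + (1 - beta) * sim_ps
```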
9. The Chinese patent text similarity calculation method according to claims 1-8, characterized in that the text similarity is calculated on the basis of the existing word similarity and sentence similarity, with the following steps:
first segment the text into words, then calculate TF-IDF values for the segmentation results and extract the words with higher TF-IDF values as keywords, taking the sentences in which the keywords occur as key sentences; calculate each key sentence's weight with respect to the text; for each key sentence, choose the keyword with the maximum TF-IDF value in its keyword set and take that word's weight as the key sentence's weight w(S); finally obtain the sentence sets of the two texts, letting
D1(S11:w(S11), S12:w(S12), ..., S1n:w(S1n)) denote the sentence set of text D1, and D2(S21:w(S21), S22:w(S22), ..., S2m:w(S2m)) denote the sentence set of D2;
define (S1i, S2j) as a sentence correspondence between the two texts; if sims(S1i, S2j) ≥ sims(S1l, S2k) holds for arbitrary l and k, then S1i and S2j are considered the most closely related sentences in the two texts, where sims(S1i, S2j) is obtained through the sentence similarity calculation.
10. The Chinese patent text similarity calculation method according to claims 1-9, characterized in that the similarity of two texts D1 and D2 is calculated as:
where sims(S1i, S1j) denotes the overall similarity between sentence phrase set S1i and sentence phrase set S1j, w(S1i) denotes the key-sentence weight of S1i, and w(S1j) denotes the key-sentence weight of S1j.
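Claims 9-10 describe matching each key sentence of one text to its most similar key sentence in the other and aggregating with the key-sentence weights. The claim-10 formula itself is an image, so the weighted-average aggregation sketched here is an assumption, as are the function name and interface:

```python
def text_sim(sents1, sents2, sentence_sim):
    """Text-level similarity: for each key sentence of D1 (paired with
    its weight w(S)), find its most similar key sentence in D2, then
    take the weight-weighted average of the best pair similarities.
    sents1/sents2: lists of (sentence, weight) pairs."""
    total, weight_sum = 0.0, 0.0
    for s1, w1 in sents1:
        best = max((sentence_sim(s1, s2) for s2, _ in sents2), default=0.0)
        total += w1 * best
        weight_sum += w1
    return total / weight_sum if weight_sum else 0.0
```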
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810310198.1A CN108549634A (en) | 2018-04-09 | 2018-04-09 | A kind of Chinese patent text similarity calculating method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108549634A true CN108549634A (en) | 2018-09-18 |
Family
ID=63514291
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810310198.1A Pending CN108549634A (en) | 2018-04-09 | 2018-04-09 | A kind of Chinese patent text similarity calculating method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108549634A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100250547A1 (en) * | 2001-08-13 | 2010-09-30 | Xerox Corporation | System for Automatically Generating Queries |
CN105678327A (en) * | 2016-01-05 | 2016-06-15 | 北京信息科技大学 | Method for extracting non-taxonomy relations between entities for Chinese patents |
CN106407182A (en) * | 2016-09-19 | 2017-02-15 | 国网福建省电力有限公司 | A method for automatic abstracting for electronic official documents of enterprises |
Non-Patent Citations (2)
Title |
---|
WEI LU et al.: "Joint semantic similarity assessment with raw corpus and structured ontology for semantic-oriented service discovery", Personal and Ubiquitous Computing |
WANG Jin et al.: "Text similarity algorithm based on domain ontology", Journal of Soochow University (Engineering Science Edition) |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110929022A (en) * | 2018-09-18 | 2020-03-27 | 阿基米德(上海)传媒有限公司 | Text abstract generation method and system |
CN109657227A (en) * | 2018-10-08 | 2019-04-19 | 平安科技(深圳)有限公司 | Contract feasibility determination method, equipment, storage medium and device |
CN109597878A (en) * | 2018-11-13 | 2019-04-09 | 北京合享智慧科技有限公司 | A kind of method and relevant apparatus of determining text similarity |
CN109657213A (en) * | 2018-12-21 | 2019-04-19 | 北京金山安全软件有限公司 | Text similarity detection method and device and electronic equipment |
CN109657213B (en) * | 2018-12-21 | 2023-07-28 | 北京金山安全软件有限公司 | Text similarity detection method and device and electronic equipment |
CN109918670A (en) * | 2019-03-12 | 2019-06-21 | 重庆誉存大数据科技有限公司 | A kind of article duplicate checking method and system |
CN110516216A (en) * | 2019-05-15 | 2019-11-29 | 北京信息科技大学 | A kind of automatic writing template base construction method of sports news |
CN110134792A (en) * | 2019-05-22 | 2019-08-16 | 北京金山数字娱乐科技有限公司 | Text recognition method, device, electronic equipment and storage medium |
CN110134792B (en) * | 2019-05-22 | 2022-03-08 | 北京金山数字娱乐科技有限公司 | Text recognition method and device, electronic equipment and storage medium |
CN110309263A (en) * | 2019-06-06 | 2019-10-08 | 中国人民解放军军事科学院军事科学信息研究中心 | A kind of semantic-based working attributes content of text judgement method for confliction detection and device |
CN110532396A (en) * | 2019-06-11 | 2019-12-03 | 福建奇点时空数字科技有限公司 | A kind of entity similarity calculating method based on vector space model |
CN110209822A (en) * | 2019-06-11 | 2019-09-06 | 中译语通科技股份有限公司 | Sphere of learning data dependence prediction technique based on deep learning, computer |
CN110209822B (en) * | 2019-06-11 | 2021-12-21 | 中译语通科技股份有限公司 | Academic field data correlation prediction method based on deep learning and computer |
CN110297918A (en) * | 2019-06-25 | 2019-10-01 | 深圳市酷开网络科技有限公司 | A kind of method, intelligent terminal and storage medium calculating movie and television contents degree of correlation |
CN110457435A (en) * | 2019-07-26 | 2019-11-15 | 南京邮电大学 | A kind of patent novelty analysis system and its analysis method |
CN112651221A (en) * | 2019-10-10 | 2021-04-13 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
CN111027306A (en) * | 2019-12-23 | 2020-04-17 | 园宝科技(武汉)有限公司 | Intellectual property matching technology based on keyword extraction and word shifting distance |
CN111814456A (en) * | 2020-05-25 | 2020-10-23 | 国网上海市电力公司 | Verb-based Chinese text similarity calculation method |
CN111767724A (en) * | 2020-06-11 | 2020-10-13 | 安徽旅贲科技有限公司 | Text similarity calculation method and system |
CN112380830A (en) * | 2020-06-18 | 2021-02-19 | 达而观信息科技(上海)有限公司 | Method, system and computer readable storage medium for matching related sentences in different documents |
CN112380830B (en) * | 2020-06-18 | 2024-05-17 | 达观数据有限公司 | Matching method, system and computer readable storage medium for related sentences in different documents |
CN111930946A (en) * | 2020-08-18 | 2020-11-13 | 哈尔滨工程大学 | Patent classification method based on similarity measurement |
CN112163418A (en) * | 2020-08-31 | 2021-01-01 | 深圳市修远文化创意有限公司 | Text comparison method and related device |
CN115563515A (en) * | 2022-12-07 | 2023-01-03 | 粤港澳大湾区数字经济研究院(福田) | Text similarity detection method, device and equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108549634A (en) | A kind of Chinese patent text similarity calculating method | |
CN109344236B (en) | Problem similarity calculation method based on multiple characteristics | |
CN107229610B (en) | A kind of analysis method and device of affection data | |
US10867256B2 (en) | Method and system to provide related data | |
CN107247780A (en) | A kind of patent document method for measuring similarity of knowledge based body | |
CN110020189A (en) | A kind of article recommended method based on Chinese Similarity measures | |
CN103324700B (en) | Noumenon concept attribute learning method based on Web information | |
CN104834747A (en) | Short text classification method based on convolution neutral network | |
CN109190117A (en) | A kind of short text semantic similarity calculation method based on term vector | |
CN108874896B (en) | Humor identification method based on neural network and humor characteristics | |
CN110704621A (en) | Text processing method and device, storage medium and electronic equipment | |
CN110134925A (en) | A kind of Chinese patent text similarity calculating method | |
CN103646112A (en) | Dependency parsing field self-adaption method based on web search | |
CN114997288B (en) | Design resource association method | |
US12124802B2 (en) | System and method for analyzing similarity of natural language data | |
CN109408802A (en) | A kind of method, system and storage medium promoting sentence vector semanteme | |
CN114818717B (en) | Chinese named entity recognition method and system integrating vocabulary and syntax information | |
JP4534666B2 (en) | Text sentence search device and text sentence search program | |
CN116304748B (en) | Text similarity calculation method, system, equipment and medium | |
CN116644148A (en) | Keyword recognition method and device, electronic equipment and storage medium | |
Song et al. | Improving embedding-based unsupervised keyphrase extraction by incorporating structural information | |
CN109086443A (en) | Social media short text on-line talking method based on theme | |
CN114757184B (en) | Method and system for realizing knowledge question and answer in aviation field | |
CN113420127B (en) | Threat information processing method, threat information processing device, computing equipment and storage medium | |
Van Tu | A deep learning model of multiple knowledge sources integration for community question answering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20180918