
CN108549634A - Chinese patent text similarity calculation method - Google Patents

A Chinese patent text similarity calculation method

Info

Publication number
CN108549634A
Authority
CN
China
Prior art keywords
sentence
similarity
word
text
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810310198.1A
Other languages
Chinese (zh)
Inventor
吕学强
董志安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University filed Critical Beijing Information Science and Technology University
Priority to CN201810310198.1A priority Critical patent/CN108549634A/en
Publication of CN108549634A publication Critical patent/CN108549634A/en
Pending legal-status Critical Current

Classifications

    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking (G: Physics; G06: Computing, calculating or counting; G06F: Electric digital data processing; G06F40/00: Handling natural language data; G06F40/20: Natural language analysis; G06F40/279: Recognition of textual entities)
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting (G06F18/00: Pattern recognition; G06F18/20: Analysing; G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation)
    • G06F18/22 Matching criteria, e.g. proximity measures (G06F18/00: Pattern recognition; G06F18/20: Analysing)
    • G06F40/30 Semantic analysis (G06F40/00: Handling natural language data)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a Chinese patent text similarity calculation method, comprising: segmenting the text into words; computing TF-IDF values for the segmentation result and extracting the words with the highest TF-IDF values as keywords; locating the sentences containing keywords as key sentences, taking the maximum keyword weight in each key sentence as the weight of that key sentence, and thereby obtaining the key-sentence set of each text; computing the weight of each key sentence with respect to the text, selecting in turn the key sentences of the text to be compared and of the reference text, and computing the text similarity from the similarities of the key sentences. The present invention uses an existing patent-domain ontology to analyse the semantic relations in patent text and computes patent text similarity with a vector space model combined with the domain ontology. The accuracy and recall of the results are high, the degree of similarity between patents is described more precisely, the speed of patent examination can be increased, and the needs of practical applications are well met.

Description

A Chinese patent text similarity calculation method
Technical field
The invention belongs to the technical field of text information processing, and in particular relates to a Chinese patent text similarity calculation method.
Background technology
Current Internet era, carrier of the patent as record mankind's achievement contain a large amount of scientific and technological achievement and innovation Technology.The fast development of science and technology makes annual amount of the application for patent sharply increase.Traditional retrieval mode passes through term Carry out matching return as a result, being usually correlation using the quantity that term occurs as patent, not in view of patent The semantic information for itself being included.The essence of patent examination is the high related patents of unexamined patent similarity, among these, most heavy What is wanted is exactly a little to calculate patent text similarity.Text similarity, general algorithmic method are using vector space model to text This expression calculates vector similarity as text similarity directly in vector space later.In recent years, ontology, as one kind The new representation of knowledge and description form is widely applied to the various aspects such as semantic net, information retrieval, more and more researchers Start to pay attention to carrying out semantic analysis using ontology.
Text similarity methods fall mainly into two classes: one converts the text into vector form with a vector space model and then performs the calculation; the other uses a semantic dictionary to express the relations between texts of different lengths and reflects the similarity between texts by the number of keyword matches. Prior-art methods for computing the similarity of Chinese patent texts lose semantic information; their calculation of Chinese text similarity is inaccurate, the accuracy and recall of the results are low, they cannot accurately reflect the similarity of patent texts, and they cannot meet the needs of practical applications.
Summary of the invention
In view of the above problems of the prior art, the object of the present invention is to provide a Chinese patent text similarity calculation method that avoids the above technical defects.
In order to achieve the above object, the technical solution provided by the present invention is as follows:
A Chinese patent text similarity calculation method, comprising the step of calculating sentence similarity.
Further, the calculation method comprises:
Segmenting the text; computing TF-IDF values for the segmentation result and extracting the words with the highest TF-IDF values as keywords; locating the sentences containing keywords as key sentences and taking the maximum keyword weight in each key sentence as the weight of that key sentence, thereby obtaining the key-sentence set of each text; computing the weight of each key sentence with respect to the text, and selecting in turn the key sentences of the text to be compared and of the reference text.
Further, word similarity is converted into the similarity of concepts in the ontology; the formula for the similarity of concepts in the ontology is:
where w1 and w2 denote two words and dis(w1, w2) denotes the semantic distance between w1 and w2 in the domain ontology.
Further, after adding the position of the lowest common parent node and the local node density, the formula for the similarity of concepts in the ontology is:
where r denotes the root node of the tree, com denotes the lowest common parent node of w1 and w2, dis(r, com) denotes the depth of the lowest common parent node, and num(w1) denotes the number of sibling nodes of w1.
Further, word similarity is computed with word2vec. The input layer of the CBOW model consists of the word vectors of the n words before and after the current word; the intermediate hidden layer sums these 2n word vectors to obtain Wx. The output layer is a Huffman tree built with the words of the corpus as leaf nodes and the frequency of each word as the weight. Wx is predicted with a stochastic gradient algorithm so that p(w | context(w)) is maximised, where context(w) denotes the n words before and after w. The corpus is trained with word2vec to obtain the word vectors of all words; computing the similarity between two words is then converted into computing the similarity of their word vectors, with the formula:
where w1 and w2 are the word vectors of the two words obtained after training, and x1i and x2i denote the values of the i-th dimension of the two word vectors in the vector space.
Further, the two word similarities simow(w1, w2) and simrw(w1, w2) are computed separately with the ontology and with word2vec and are combined into the final word similarity with the formula:
where S denotes the concept set of the ontology. If one of the two words does not belong to the ontology concept set, the similarity obtained with word2vec is used as the word similarity; if both belong to the ontology concept set, the average of the ontology word similarity and the word2vec word similarity is taken as the final word similarity.
Further, the sentence similarity is computed as follows:
Assume two sentences S1 = (w11, w12, ..., w1n) and S2 = (w21, w22, ..., w2m), where w11, w12, w21, w22 are the content words obtained after segmenting the sentences and removing stop words. Define (w1i, w2j) as a word mapping between S1 and S2; if simw(w1i, w2j) > simw(w1k, w2l) holds for arbitrary k and l, then w1i and w2j are judged to be the most semantically similar word pair in the two sentences. After the most similar word pair is obtained, the two words are removed from their respective sentences and the calculation is repeated until one of the sentence word lists is empty. The calculation formula is:
where simws(S1, S2) denotes the content-word similarity of S2 relative to S1.
Further, the relation similarity within sentences is computed on the basis of the non-taxonomic relations of the patent-domain ontology, with the following steps:
Part-of-speech tagging is applied to the two sentences; stop words are removed with a stop-word list; words of other parts of speech are removed and only verbs and nouns are retained, yielding the ordered word list of each sentence. The ordered word list of the first sentence is defined as S1(w11:pos11, w12:pos12, ..., w1n:pos1n) and that of the other sentence as S2(w21:pos21, w22:pos22, ..., w2m:pos2m). For the word list of each sentence, each verb together with the nouns before and after it forms an SAO-structure phrase P(n1, v, n2), so that the ordered word list of each sentence is converted into the phrase sets S1 = (P11, P12, ..., P1n) and S2 = (P21, P22, ..., P2m). A non-taxonomic relation set is obtained; assume the existing non-taxonomic relation set is NR(r1, r2, ..., rl), where rl is a phrase with SAO structure in the set. The non-taxonomic relation similarity of the sentences is computed from the number of phrases of the two sentences that occur in the non-taxonomic relation set NR, with the formula:
where num(S1) denotes the number of phrases in the set S1 that belong to the non-taxonomic relation set NR, and com(S1, S2) denotes the intersection of the phrase sets S1 and S2, i.e. the phrases shared by S1 and S2.
The overall similarity between two sentences is computed as
sims(S1, S2) = β·simws(S1, S2) + (1-β)·simps(S1, S2),
where β denotes the proportion of the content-word similarity in the sentence similarity and sims(S1, S2) denotes the similarity of S2 relative to S1.
Further, the text similarity is computed on the basis of the word similarity and the sentence similarity, with the following steps:
The text is first segmented; TF-IDF values are then computed for the segmentation result, the words with the highest TF-IDF values are extracted as keywords, and the sentences containing keywords are located as key sentences. The weight of each key sentence with respect to the text is computed: the keyword with the maximum TF-IDF value in the keyword set of a key sentence is selected and its weight is used as the weight w(S) of the key sentence. Finally the sentence sets of the two texts are obtained; let
D1(S11:w(S11), S12:w(S12), ..., S1n:w(S1n)) denote the sentence set of text D1 and D2(S21:w(S21), S22:w(S22), ..., S2m:w(S2m)) denote the sentence set of D2.
Define (S1i, S2j) as a sentence correspondence between the two texts; if sims(S1i, S2j) ≥ sims(S1l, S2k) holds for arbitrary l and k, then S1i and S2j are regarded as the most closely related sentence pair in the two texts, where sims(S1i, S2j) is obtained from the sentence similarity calculation.
Further, the similarity formula for the two texts D1 and D2 is:
where sims(S1i, S2j) denotes the overall similarity between sentence S1i and sentence S2j, w(S1i) denotes the key-sentence weight of S1i, and w(S2j) denotes the key-sentence weight of S2j.
The Chinese patent text similarity calculation method provided by the present invention proposes a layered approach to computing text similarity: the calculation is divided into the three levels of word, sentence and text and proceeds from the bottom up. The method computes text similarity at sentence granularity, combines an existing domain ontology with word2vec to compute word similarity, adds the relation similarity obtained from non-taxonomic relations when computing sentence similarity, and finally computes text similarity from the weights of the different sentences. The present invention uses an existing patent-domain ontology to analyse the semantic relations in patent text and computes patent text similarity with a vector space model and the domain ontology. The results are accurate, with high accuracy and recall; the degree of similarity between patents is described more precisely; the speed of patent examination can be increased; and users can analyse patent resources more efficiently, so the needs of practical applications are well met.
Description of the drawings
Fig. 1 is a diagram of the CBOW model;
Fig. 2 is a diagram of the Skip-gram model.
Specific embodiments
In order to make the object, technical solution and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative work fall within the protection scope of the present invention.
Word similarity is a measure of the semantic similarity between words. In a domain ontology words appear as concepts, so the similarity calculation for words can be converted into a similarity calculation for concepts in the ontology. When an existing domain ontology is used, words that are not covered by the ontology cannot be given a similarity; to avoid this problem, word2vec is combined with the ontology to compute word similarity.
(1) Computing word similarity based on the ontology
The concepts contained in the domain ontology form a tree-shaped hierarchy through hypernym-hyponym relations, and the similarity between concepts can generally be obtained by computing the semantic distance of the concepts in the ontology tree: the lowest common parent node of the two concepts is found, and the sum of the distances from this lowest common parent node to the two concepts is taken as the semantic distance between the concepts in the ontology.
The similarity formula is:
where w1 and w2 denote two words and dis(w1, w2) denotes the semantic distance between w1 and w2 in the domain ontology.
The similarity between words is related not only to the semantic distance between the concepts, but also to the position of their lowest common parent node in the tree constructed by the domain ontology and to the number of sibling nodes around the concepts. For the same semantic distance between concepts, the deeper the lowest common parent node lies in the tree, the greater the similarity of the words. Similarly, the more sibling nodes there are around the concept corresponding to a word, the greater the local density, which indicates a more finely divided concept node and therefore a greater word similarity. After adding the position of the lowest common parent node and the local node density, the formula is as follows:
where r denotes the root node of the tree, com denotes the lowest common parent node of w1 and w2, dis(r, com) denotes the depth of the lowest common parent node, and num(w1) denotes the number of sibling nodes of w1.
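By way of illustration, the following Python sketch computes a path-based concept similarity over a small taxonomy stored as a child-to-parent map. The toy taxonomy, the 1/(1 + distance) base form and the depth/density weighting are assumptions made for illustration only; the exact formula of the invention is not reproduced in the text above.

    # Hedged sketch: ontology-based word similarity over a taxonomy tree.
    # The parent map, the 1/(1+dist) form and the depth/density weighting
    # are illustrative assumptions, not the patent's exact formula.

    def ancestors(node, parent):
        """Return the ancestors of `node`, from itself up to the root."""
        chain = [node]
        while node in parent:
            node = parent[node]
            chain.append(node)
        return chain

    def lowest_common_parent(w1, w2, parent):
        a1, a2 = ancestors(w1, parent), set(ancestors(w2, parent))
        for node in a1:                       # first shared ancestor on the way up
            if node in a2:
                return node
        return None

    def semantic_distance(w1, w2, parent):
        com = lowest_common_parent(w1, w2, parent)
        a1, a2 = ancestors(w1, parent), ancestors(w2, parent)
        return a1.index(com) + a2.index(com)  # edges from each word up to com

    def ontology_similarity(w1, w2, parent, siblings):
        dist = semantic_distance(w1, w2, parent)
        com = lowest_common_parent(w1, w2, parent)
        depth = len(ancestors(com, parent)) - 1        # dis(r, com)
        density = siblings.get(w1, 1)                  # num(w1): sibling count
        base = 1.0 / (1.0 + dist)                      # assumed distance-to-similarity form
        return base * (depth / (depth + 1)) * (density / (density + 1))

    # toy taxonomy: child -> parent
    parent = {"battery": "energy_storage", "fuel_cell": "energy_storage",
              "energy_storage": "powertrain", "motor": "powertrain"}
    siblings = {"battery": 2, "fuel_cell": 2}
    print(ontology_similarity("battery", "fuel_cell", parent, siblings))
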
(2) Computing word similarity based on word2vec
Word2vec is an open-source tool from Google that converts words into real-valued vectors. Using ideas from deep learning, it reduces each word, through training, to a vector in a K-dimensional vector space, so that semantic similarity between words can be converted into operations on word vectors.
Word2vec uses the distributed representation of words proposed by Hinton in 1986. The basic idea is to map every word to a K-dimensional vector space with a training model, so that each word can be represented by a K-dimensional vector. K is a hyperparameter and has to be specified in advance. Once words are converted into vector representations, the distance between the vectors can be used to express the semantic similarity between the words. Word2vec provides the CBOW model (Continuous Bag-of-Words Model) and the Skip-gram model; the two model structures are shown in Fig. 1 and Fig. 2. The CBOW model uses the 2n words in the context of the current word (n is 2 in Fig. 1) to predict the current word, while the Skip-gram model uses the current word to predict the 2n words of its context (n is 2 in Fig. 2).
The input layer of the CBOW model consists of the word vectors of the n words before and after the current word; the intermediate hidden layer sums these 2n word vectors to obtain Wx. The output layer is a Huffman tree built with the words of the corpus as leaf nodes and the frequency of each word as the weight. Wx is predicted with a stochastic gradient algorithm so that p(w | context(w)) is maximised, where context(w) denotes the n words before and after w. When training is complete, the word vectors of all words are obtained. The training process of the Skip-gram model is similar to that of the CBOW model.
The corpus is trained with word2vec to obtain the word vectors of all words. Computing the similarity between two words is then converted into computing the similarity of their word vectors. For space vectors the cosine similarity is generally used; the specific formula is as follows:
where w1 and w2 are the word vectors of the two words obtained after training, and x1i and x2i denote the values of the i-th dimension of the two word vectors in the vector space.
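A minimal sketch of this step, assuming the gensim library (4.x API, where sg=0 selects the CBOW model) and a small toy tokenised corpus; the cosine similarity is written out explicitly to mirror the formula described above.

    # Hedged sketch: train CBOW word vectors and compare two words by cosine
    # similarity. gensim 4.x and the toy corpus are assumptions for illustration.
    import numpy as np
    from gensim.models import Word2Vec

    # toy segmented corpus: each sentence is a list of words
    corpus = [["电动", "汽车", "电池", "充电"],
              ["新能源", "汽车", "电池", "管理"],
              ["电动", "汽车", "驱动", "电机"]]

    # sg=0 selects the CBOW model; window=2 corresponds to n=2 context words
    model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

    def cosine_similarity(v1, v2):
        """cos(v1, v2) = sum(x1i * x2i) / (||v1|| * ||v2||)."""
        return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

    v_battery, v_motor = model.wv["电池"], model.wv["电机"]
    print(cosine_similarity(v_battery, v_motor))
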
(3) Word similarity
The two word similarities simow(w1, w2) and simrw(w1, w2) are computed separately with the ontology and with word2vec and are combined into the final word similarity with the following formula:
where S denotes the concept set of the ontology. If one of the two words does not belong to the ontology concept set, the similarity obtained with word2vec is used as the word similarity; if both belong to the ontology concept set, the average of the ontology word similarity and the word2vec word similarity is taken as the final word similarity.
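The combination rule just described can be written as a small dispatch function; the helper names below refer to the earlier sketches and are illustrative, not part of the patent text.

    # Hedged sketch: combine ontology and word2vec word similarities as described.
    def word_similarity(w1, w2, concept_set, sim_ow, sim_rw):
        """sim_ow / sim_rw are callables returning the ontology and word2vec
        similarities; concept_set is the set S of ontology concepts."""
        if w1 in concept_set and w2 in concept_set:
            # both words are ontology concepts: average the two similarities
            return 0.5 * (sim_ow(w1, w2) + sim_rw(w1, w2))
        # at least one word is outside the ontology: fall back to word2vec
        return sim_rw(w1, w2)
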
Sentence similarity is generally computed from the significant content words in the text, by computing the word similarities between content words. In the present invention, sentence similarity is computed on the basis of the word similarity described by formula (4) and the non-taxonomic relations of the patent-domain ontology.
Assume two sentences S1 = (w11, w12, ..., w1n) and S2 = (w21, w22, ..., w2m), where w11, w12, w21, w22 are the content words obtained after segmentation and stop-word removal. Define (w1i, w2j) as a word mapping between S1 and S2; if simw(w1i, w2j) > simw(w1k, w2l) holds for arbitrary k and l, where simw(w1i, w2j) is computed with formula (4), then w1i and w2j are judged to be the most semantically similar word pair in the two sentences. After the most similar word pair is obtained, the two words are removed from their respective sentences and the calculation is repeated until one of the sentence word lists is empty. The calculation formula is as follows:
Because the lengths of the sentences differ, a shared pair of similar words does not contribute equally to the similarity of each sentence; in the present invention simws(S1, S2) denotes the content-word similarity of S2 relative to S1.
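A sketch of the greedy matching loop described above, assuming a word_similarity function such as the one sketched earlier; the normalisation by the length of S1 mirrors the asymmetric "relative to S1" reading and is an assumption, since the exact formula is not reproduced in the text.

    # Hedged sketch: greedy most-similar word pairing between two sentences.
    # The normalisation by len(s1) is an assumed reading of the asymmetric
    # "similarity of S2 relative to S1"; the patent's exact formula is not shown.
    def content_word_similarity(s1, s2, sim_w):
        s1, s2 = list(s1), list(s2)
        n = len(s1)
        total = 0.0
        while s1 and s2:
            # find the most similar remaining word pair across the two sentences
            best_i, best_j, best = 0, 0, -1.0
            for i, w1 in enumerate(s1):
                for j, w2 in enumerate(s2):
                    s = sim_w(w1, w2)
                    if s > best:
                        best_i, best_j, best = i, j, s
            total += best
            del s1[best_i], s2[best_j]       # remove the matched pair and repeat
        return total / n if n else 0.0
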
The formula above only considers the similarity of the most similar content words in the two sentences and does not consider the similarity of semantically similar phrases within the sentences. The relation similarity within the sentences is therefore computed on the basis of the non-taxonomic relations of the patent-domain ontology.
Part-of-speech tagging is applied to the two sentences; stop words are removed with a stop-word list; words of other parts of speech are removed and only verbs and nouns are retained. This yields the ordered word list of each sentence: the ordered word list of the first sentence is defined as S1(w11:pos11, w12:pos12, ..., w1n:pos1n) and, similarly, that of the other sentence as S2(w21:pos21, w22:pos22, ..., w2m:pos2m). For the word list of each sentence, each verb together with the nouns before and after it forms an SAO-structure phrase P(n1, v, n2), so that the ordered word list of each sentence is converted into the phrase sets S1 = (P11, P12, ..., P1n) and S2 = (P21, P22, ..., P2m). A non-taxonomic relation set is obtained; assume the existing non-taxonomic relation set is NR(r1, r2, ..., rl), where rl is a phrase with SAO structure in the set. The non-taxonomic relation similarity of the sentences is computed from the number of phrases of the two sentences that occur in the non-taxonomic relation set NR. The formula is as follows:
where num(S1) denotes the number of phrases in the set S1 that belong to the non-taxonomic relation set NR, and com(S1, S2) denotes the intersection of the phrase sets S1 and S2, i.e. the phrases shared by S1 and S2.
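The following sketch extracts SAO-style (noun, verb, noun) phrases from a POS-tagged sentence and scores them against a non-taxonomic relation set; the final ratio (shared in-NR phrases over num(S1)) is an assumed reading of the formula described above.

    # Hedged sketch: extract SAO (noun, verb, noun) phrases from a POS-tagged
    # sentence and score the non-taxonomic relation similarity. The scoring
    # formula is an assumed reading; the patent's exact formula is not shown.
    def sao_phrases(tagged):
        """tagged: list of (word, pos) with pos 'n' or 'v' only, in sentence order."""
        phrases = []
        for i, (word, pos) in enumerate(tagged):
            if pos == "v":
                left = next((w for w, p in reversed(tagged[:i]) if p == "n"), None)
                right = next((w for w, p in tagged[i + 1:] if p == "n"), None)
                if left and right:
                    phrases.append((left, word, right))   # P(n1, v, n2)
        return set(phrases)

    def relation_similarity(tagged1, tagged2, non_taxonomic_relations):
        p1, p2 = sao_phrases(tagged1), sao_phrases(tagged2)
        in_nr = len(p1 & p2 & non_taxonomic_relations)    # shared phrases found in NR
        denom = len(p1 & non_taxonomic_relations)         # assumed num(S1)
        return in_nr / denom if denom else 0.0
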
From the content-word similarity obtained with formula (5) and the non-taxonomic relation similarity obtained with formula (6), the overall similarity between the sentences is obtained, as shown in formula (7):
sims(S1, S2) = β·simws(S1, S2) + (1-β)·simps(S1, S2) (7),
where β denotes the proportion of the content-word similarity in the sentence similarity and sims(S1, S2) denotes the similarity of S2 relative to S1.
The text similarity is computed on the basis of the word similarity and the sentence similarity. For a text, not every word and sentence is relevant to its main meaning: the keywords and the key sentences containing them express the meaning of the text better, while the other words and sentences mainly serve a connecting function. Computing text similarity from the similarities of keywords and key sentences therefore gives results that are closely tied to the meaning expressed by the two documents and represents the similarity between texts more accurately.
When selecting the keywords of a text, the importance of each word to the text has to be computed. Term frequency-inverse document frequency (TF-IDF) is generally used; it is relatively simple to compute, has fairly high accuracy and recall, and is widely used for weight calculation. The more frequently a word occurs in one text and the less frequently it occurs in other texts, the better it represents the topic of that text and the more important it is for that text.
When computing text similarity, the text is first segmented; TF-IDF values are then computed for the segmentation result, the words with the highest TF-IDF values are extracted as keywords, and the sentences containing keywords are located as key sentences. Different key sentences contain different keywords, and because of differences in the number of keywords they contain and in the weights of those keywords, each key sentence has a different importance for the text, so the weight of each key sentence with respect to the text has to be computed. The importance of a key sentence for the text depends on the keywords it contains: the keyword with the maximum TF-IDF value in the keyword set of the key sentence is selected and its weight is used as the weight w(S) of the key sentence. Finally the sentence sets of the two texts are obtained; let
D1(S11:w(S11), S12:w(S12), ..., S1n:w(S1n)) denote the sentence set of text D1 and D2(S21:w(S21), S22:w(S22), ..., S2m:w(S2m)) denote the sentence set of D2.
Define (S1i, S2j) as a sentence correspondence between the two texts; if sims(S1i, S2j) ≥ sims(S1l, S2k) holds for arbitrary l and k, then S1i and S2j are regarded as the most closely related sentence pair in the two texts, where sims(S1i, S2j) is obtained from the sentence similarity calculation.
As with the sentence similarity calculation, because the two texts have different numbers of sentences, the most similar sentences found in the two texts yield different similarities for different texts; sim(D1, D2) denotes the text similarity of D2 relative to D1, with D1 as the reference.
The similarity formula for the two texts is as follows:
The formula above not only involves all the key sentences of the two texts but also takes into account the importance of each key sentence for the whole document, so it reflects the similarity between texts well.
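The sketch below illustrates the key-sentence extraction with TF-IDF (the top-20 keyword cut-off follows the embodiment described later) and a weighted, greedy text-level matching. The weighted-average combination at the end is an assumed reading of the formula, which is not reproduced above.

    # Hedged sketch: key-sentence extraction with TF-IDF and a weighted,
    # greedy text-level similarity. The weighted-average combination is an
    # assumed reading of the formula, which is not reproduced in the text.
    import math
    from collections import Counter

    def tf_idf(doc_tokens, all_docs_tokens):
        """TF-IDF per word for one document against a small corpus."""
        tf = Counter(doc_tokens)
        n_docs = len(all_docs_tokens)
        scores = {}
        for w, f in tf.items():
            df = sum(1 for d in all_docs_tokens if w in d)
            scores[w] = (f / len(doc_tokens)) * math.log(n_docs / (1 + df))
        return scores

    def key_sentences(sentences, all_docs_tokens, top_k=20):
        """Return [(sentence_tokens, weight)]; weight is the max keyword TF-IDF."""
        doc_tokens = [w for s in sentences for w in s]
        scores = tf_idf(doc_tokens, all_docs_tokens)
        keywords = dict(sorted(scores.items(), key=lambda x: -x[1])[:top_k])
        out = []
        for s in sentences:
            hits = [keywords[w] for w in s if w in keywords]
            if hits:
                out.append((s, max(hits)))   # weight w(S) = max keyword TF-IDF in S
        return out

    def text_similarity(key1, key2, sentence_sim):
        """Greedy best-pair matching of key sentences, weighted by w(S1i)."""
        total, weight_sum = 0.0, 0.0
        for s1, w1 in key1:
            best = max((sentence_sim(s1, s2) for s2, _ in key2), default=0.0)
            total += w1 * best
            weight_sum += w1
        return total / weight_sum if weight_sum else 0.0
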
This embodiment uses an existing ontology of the new-energy-vehicle domain; the corpus consists of 50 Chinese patent texts belonging to the same category in the new-energy-vehicle field.
One patent is selected as the text N to be compared and all other patents serve as the comparison texts D. The Chinese patent text similarity is computed with the Chinese patent text similarity calculation method of the present invention, with the following specific steps:
1) All patent texts are split into sentences;
2) With the Hanlp tool and a dictionary of the new-energy-vehicle domain added, the sentence-split results are segmented into words, the part of speech of each word is retained, and stop words are removed with a stop-word list (a sketch of these two steps follows the list below);
3) The TF-IDF values of all patent texts are computed and sorted in descending order; the top 20 words are taken as keywords, the key sentences containing the keywords are marked, and the maximum keyword weight in each key sentence is used as the weight of the key sentence, yielding the key-sentence set D(S1:w(S1), S2:w(S2), ..., Sn:w(Sn)) of each text;
4) The key sentences of the text N to be compared and of the comparison text D are selected in turn, and the content-word similarity and the relation similarity are computed according to the sentence similarity algorithm and the word similarity algorithm;
5) The similarity of the two texts is computed from the sentence similarities of the two texts N and D.
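A minimal sketch of steps 1) and 2). The embodiment uses the Hanlp tool; the jieba library is substituted here purely for illustration because the exact Hanlp calls are not shown, and the dictionary file name and stop-word list are assumptions.

    # Hedged sketch of steps 1)-2): sentence splitting, domain-dictionary loading,
    # segmentation with POS tags and stop-word removal. The embodiment uses Hanlp;
    # jieba is substituted here purely for illustration.
    import re
    import jieba
    import jieba.posseg as pseg

    try:
        jieba.load_userdict("new_energy_vehicle_dict.txt")   # assumed domain dictionary file
    except FileNotFoundError:
        pass
    STOP_WORDS = {"的", "了", "和", "在", "是"}                # assumed stop-word list

    def split_sentences(text):
        # split on Chinese sentence-ending punctuation
        return [s for s in re.split(r"[。！？；]", text) if s.strip()]

    def segment(sentence):
        """Return [(word, pos)] with stop words removed, keeping the POS tag."""
        return [(w, pos) for w, pos in pseg.cut(sentence)
                if w.strip() and w not in STOP_WORDS]

    for sent in split_sentences("本发明涉及一种中文专利文本相似度计算方法。该方法对文本进行分词。"):
        print(segment(sent))
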
The performance of text similarity calculation methods is generally evaluated with accuracy (P), recall (R) and the F value. Define T(t) as the number of items whose annotated value is t, C(t) as the number of items whose computed text similarity falls within the range represented by t, and TC(t) as the number of items whose annotated value is t and whose computed value also falls within the range represented by t. The evaluation indices are defined as follows:
Accuracy P:
Recall R:
F value:
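A small sketch of these indices under the standard definitions (P = TC/C, R = TC/T, F = 2PR/(P+R)); the exact formulas are not reproduced in the text above, so these forms and the example counts are assumptions.

    # Hedged sketch: precision, recall and F value from the counts defined above.
    # P = TC(t)/C(t), R = TC(t)/T(t), F = 2PR/(P+R) are the assumed standard forms.
    def evaluate(tc, c, t):
        p = tc / c if c else 0.0        # accuracy: correct results among computed
        r = tc / t if t else 0.0        # recall: correct results among annotated
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        return p, r, f

    print(evaluate(tc=18, c=20, t=22))   # illustrative counts only
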
The accuracy, recall and F values of the Chinese patent text similarity results obtained with the method of the present invention are all very high, far above the accuracy, recall and F values of prior-art results.
The Chinese patent text similarity calculation method provided by the present invention proposes a layered approach to computing text similarity: the calculation is divided into the three levels of word, sentence and text and proceeds from the bottom up. The method computes text similarity at sentence granularity, combines an existing domain ontology with word2vec to compute word similarity, adds the relation similarity obtained from non-taxonomic relations when computing sentence similarity, and finally computes text similarity from the weights of the different sentences. The present invention uses an existing patent-domain ontology to analyse the semantic relations in patent text and computes patent text similarity with a vector space model and the domain ontology. The results are accurate, with high accuracy and recall, so that the correlation between texts is described more accurately and the degree of similarity between patents is characterised more precisely; the speed of patent examination can be increased, users can analyse patent resources more efficiently, and the needs of practical applications are well met.
The embodiments described above only express several implementations of the present invention; their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the claims of the present invention. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. The protection scope of this patent shall therefore be determined by the appended claims.

Claims (10)

1. A Chinese patent text similarity calculation method, characterized in that it comprises the step of calculating sentence similarity.
2. The Chinese patent text similarity calculation method according to claim 1, characterized in that the calculation method comprises:
segmenting the text; computing TF-IDF values for the segmentation result and extracting the words with the highest TF-IDF values as keywords; locating the sentences containing keywords as key sentences and taking the maximum keyword weight in each key sentence as the weight of that key sentence, thereby obtaining the key-sentence set of each text; computing the weight of each key sentence with respect to the text, and selecting in turn the key sentences of the text to be compared and of the reference text.
3. The Chinese patent text similarity calculation method according to any one of claims 1-2, characterized in that word similarity is converted into the similarity of concepts in the ontology; the formula for the similarity of concepts in the ontology is:
where w1 and w2 denote two words and dis(w1, w2) denotes the semantic distance between w1 and w2 in the domain ontology.
4. The Chinese patent text similarity calculation method according to any one of claims 1-3, characterized in that, after adding the position of the lowest common parent node and the local node density, the formula for the similarity of concepts in the ontology is:
where r denotes the root node of the tree, com denotes the lowest common parent node of w1 and w2, dis(r, com) denotes the depth of the lowest common parent node, and num(w1) denotes the number of sibling nodes of w1.
5. The Chinese patent text similarity calculation method according to any one of claims 1-4, characterized in that word similarity is computed with word2vec: the input layer of the CBOW model consists of the word vectors of the n words before and after the current word; the intermediate hidden layer sums these 2n word vectors to obtain Wx; the output layer is a Huffman tree built with the words of the corpus as leaf nodes and the frequency of each word as the weight; Wx is predicted with a stochastic gradient algorithm so that p(w | context(w)) is maximised, where context(w) denotes the n words before and after w; the corpus is trained with word2vec to obtain the word vectors of all words; computing the similarity between two words is then converted into computing the similarity of their word vectors, with the formula:
where w1 and w2 are the word vectors of the two words obtained after training, and x1i and x2i denote the values of the i-th dimension of the two word vectors in the vector space.
6. The Chinese patent text similarity calculation method according to any one of claims 1-5, characterized in that the two word similarities simow(w1, w2) and simrw(w1, w2) are computed separately with the ontology and with word2vec and are combined into the final word similarity with the formula:
where S denotes the concept set of the ontology; if one of the two words does not belong to the ontology concept set, the similarity obtained with word2vec is used as the word similarity; if both belong to the ontology concept set, the average of the ontology word similarity and the word2vec word similarity is taken as the final word similarity.
7. The Chinese patent text similarity calculation method according to any one of claims 1-6, characterized in that the sentence similarity is computed as follows:
assume two sentences S1 = (w11, w12, ..., w1n) and S2 = (w21, w22, ..., w2m), where w11, w12, w21, w22 are the content words obtained after segmenting the sentences and removing stop words; define (w1i, w2j) as a word mapping between S1 and S2; if simw(w1i, w2j) > simw(w1k, w2l) holds for arbitrary k and l, then w1i and w2j are judged to be the most semantically similar word pair in the two sentences; after the most similar word pair is obtained, the two words are removed from their respective sentences and the calculation is repeated until one of the sentence word lists is empty; the calculation formula is:
where simws(S1, S2) denotes the content-word similarity of S2 relative to S1.
8. The Chinese patent text similarity calculation method according to any one of claims 1-7, characterized in that the relation similarity within sentences is computed on the basis of the non-taxonomic relations of the patent-domain ontology, with the following steps:
part-of-speech tagging is applied to the two sentences; stop words are removed with a stop-word list; words of other parts of speech are removed and only verbs and nouns are retained, yielding the ordered word list of each sentence; the ordered word list of the first sentence is defined as S1(w11:pos11, w12:pos12, ..., w1n:pos1n) and that of the other sentence as S2(w21:pos21, w22:pos22, ..., w2m:pos2m); for the word list of each sentence, each verb together with the nouns before and after it forms an SAO-structure phrase P(n1, v, n2); the ordered word list of each sentence is converted into the phrase sets S1 = (P11, P12, ..., P1n) and S2 = (P21, P22, ..., P2m); a non-taxonomic relation set is obtained; assume the existing non-taxonomic relation set is NR(r1, r2, ..., rl), where rl is a phrase with SAO structure in the set; the non-taxonomic relation similarity of the sentences is computed from the number of phrases of the two sentences that occur in the non-taxonomic relation set NR, with the formula:
where num(S1) denotes the number of phrases in the set S1 that belong to the non-taxonomic relation set NR, and com(S1, S2) denotes the intersection of the phrase sets S1 and S2, i.e. the phrases shared by S1 and S2;
the overall similarity between the sentences is computed as
sims(S1, S2) = β·simws(S1, S2) + (1-β)·simps(S1, S2),
where β denotes the proportion of the content-word similarity in the sentence similarity and sims(S1, S2) denotes the similarity of S2 relative to S1.
9. The Chinese patent text similarity calculation method according to any one of claims 1-8, characterized in that the text similarity is computed on the basis of the word similarity and the sentence similarity, with the following steps:
the text is first segmented; TF-IDF values are then computed for the segmentation result, the words with the highest TF-IDF values are extracted as keywords, and the sentences containing keywords are located as key sentences; the weight of each key sentence with respect to the text is computed: the keyword with the maximum TF-IDF value in the keyword set of the key sentence is selected and its weight is used as the weight w(S) of the key sentence; finally the sentence sets of the two texts are obtained; let
D1(S11:w(S11), S12:w(S12), ..., S1n:w(S1n)) denote the sentence set of text D1 and D2(S21:w(S21), S22:w(S22), ..., S2m:w(S2m)) denote the sentence set of D2;
define (S1i, S2j) as a sentence correspondence between the two texts; if sims(S1i, S2j) ≥ sims(S1l, S2k) holds for arbitrary l and k, then S1i and S2j are regarded as the most closely related sentence pair in the two texts, where sims(S1i, S2j) is obtained from the sentence similarity calculation.
10. The Chinese patent text similarity calculation method according to any one of claims 1-9, characterized in that the similarity formula for the two texts D1 and D2 is:
where sims(S1i, S2j) denotes the overall similarity between sentence S1i and sentence S2j, w(S1i) denotes the key-sentence weight of S1i, and w(S2j) denotes the key-sentence weight of S2j.
CN201810310198.1A 2018-04-09 2018-04-09 A kind of Chinese patent text similarity calculating method Pending CN108549634A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810310198.1A CN108549634A (en) 2018-04-09 2018-04-09 A kind of Chinese patent text similarity calculating method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810310198.1A CN108549634A (en) 2018-04-09 2018-04-09 A kind of Chinese patent text similarity calculating method

Publications (1)

Publication Number Publication Date
CN108549634A true CN108549634A (en) 2018-09-18

Family

ID=63514291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810310198.1A Pending CN108549634A (en) 2018-04-09 2018-04-09 A kind of Chinese patent text similarity calculating method

Country Status (1)

Country Link
CN (1) CN108549634A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100250547A1 (en) * 2001-08-13 2010-09-30 Xerox Corporation System for Automatically Generating Queries
CN105678327A (en) * 2016-01-05 2016-06-15 北京信息科技大学 Method for extracting non-taxonomy relations between entities for Chinese patents
CN106407182A (en) * 2016-09-19 2017-02-15 国网福建省电力有限公司 A method for automatic abstracting for electronic official documents of enterprises

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEI LU et al.: "Joint semantic similarity assessment with raw corpus and structured ontology for semantic-oriented service discovery", Personal and Ubiquitous Computing *
WANG JIN et al.: "Text similarity algorithm based on domain ontology" (基于领域本体的文本相似度算法), Journal of Soochow University (Engineering Science Edition) (苏州大学学报(工科版)) *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929022A (en) * 2018-09-18 2020-03-27 阿基米德(上海)传媒有限公司 Text abstract generation method and system
CN109657227A (en) * 2018-10-08 2019-04-19 平安科技(深圳)有限公司 Contract feasibility determination method, equipment, storage medium and device
CN109597878A (en) * 2018-11-13 2019-04-09 北京合享智慧科技有限公司 A kind of method and relevant apparatus of determining text similarity
CN109657213A (en) * 2018-12-21 2019-04-19 北京金山安全软件有限公司 Text similarity detection method and device and electronic equipment
CN109657213B (en) * 2018-12-21 2023-07-28 北京金山安全软件有限公司 Text similarity detection method and device and electronic equipment
CN109918670A (en) * 2019-03-12 2019-06-21 重庆誉存大数据科技有限公司 A kind of article duplicate checking method and system
CN110516216A (en) * 2019-05-15 2019-11-29 北京信息科技大学 A kind of automatic writing template base construction method of sports news
CN110134792A (en) * 2019-05-22 2019-08-16 北京金山数字娱乐科技有限公司 Text recognition method, device, electronic equipment and storage medium
CN110134792B (en) * 2019-05-22 2022-03-08 北京金山数字娱乐科技有限公司 Text recognition method and device, electronic equipment and storage medium
CN110309263A (en) * 2019-06-06 2019-10-08 中国人民解放军军事科学院军事科学信息研究中心 A kind of semantic-based working attributes content of text judgement method for confliction detection and device
CN110532396A (en) * 2019-06-11 2019-12-03 福建奇点时空数字科技有限公司 A kind of entity similarity calculating method based on vector space model
CN110209822A (en) * 2019-06-11 2019-09-06 中译语通科技股份有限公司 Sphere of learning data dependence prediction technique based on deep learning, computer
CN110209822B (en) * 2019-06-11 2021-12-21 中译语通科技股份有限公司 Academic field data correlation prediction method based on deep learning and computer
CN110297918A (en) * 2019-06-25 2019-10-01 深圳市酷开网络科技有限公司 A kind of method, intelligent terminal and storage medium calculating movie and television contents degree of correlation
CN110457435A (en) * 2019-07-26 2019-11-15 南京邮电大学 A kind of patent novelty analysis system and its analysis method
CN112651221A (en) * 2019-10-10 2021-04-13 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN111027306A (en) * 2019-12-23 2020-04-17 园宝科技(武汉)有限公司 Intellectual property matching technology based on keyword extraction and word shifting distance
CN111814456A (en) * 2020-05-25 2020-10-23 国网上海市电力公司 Verb-based Chinese text similarity calculation method
CN111767724A (en) * 2020-06-11 2020-10-13 安徽旅贲科技有限公司 Text similarity calculation method and system
CN112380830A (en) * 2020-06-18 2021-02-19 达而观信息科技(上海)有限公司 Method, system and computer readable storage medium for matching related sentences in different documents
CN112380830B (en) * 2020-06-18 2024-05-17 达观数据有限公司 Matching method, system and computer readable storage medium for related sentences in different documents
CN111930946A (en) * 2020-08-18 2020-11-13 哈尔滨工程大学 Patent classification method based on similarity measurement
CN112163418A (en) * 2020-08-31 2021-01-01 深圳市修远文化创意有限公司 Text comparison method and related device
CN115563515A (en) * 2022-12-07 2023-01-03 粤港澳大湾区数字经济研究院(福田) Text similarity detection method, device and equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108549634A (en) A kind of Chinese patent text similarity calculating method
CN109344236B (en) Problem similarity calculation method based on multiple characteristics
CN107229610B (en) A kind of analysis method and device of affection data
US10867256B2 (en) Method and system to provide related data
CN107247780A (en) A kind of patent document method for measuring similarity of knowledge based body
CN110020189A (en) A kind of article recommended method based on Chinese Similarity measures
CN103324700B (en) Noumenon concept attribute learning method based on Web information
CN104834747A (en) Short text classification method based on convolution neutral network
CN109190117A (en) A kind of short text semantic similarity calculation method based on term vector
CN108874896B (en) Humor identification method based on neural network and humor characteristics
CN110704621A (en) Text processing method and device, storage medium and electronic equipment
CN110134925A (en) A kind of Chinese patent text similarity calculating method
CN103646112A (en) Dependency parsing field self-adaption method based on web search
CN114997288B (en) Design resource association method
US12124802B2 (en) System and method for analyzing similarity of natural language data
CN109408802A (en) A kind of method, system and storage medium promoting sentence vector semanteme
CN114818717B (en) Chinese named entity recognition method and system integrating vocabulary and syntax information
JP4534666B2 (en) Text sentence search device and text sentence search program
CN116304748B (en) Text similarity calculation method, system, equipment and medium
CN116644148A (en) Keyword recognition method and device, electronic equipment and storage medium
Song et al. Improving embedding-based unsupervised keyphrase extraction by incorporating structural information
CN109086443A (en) Social media short text on-line talking method based on theme
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN113420127B (en) Threat information processing method, threat information processing device, computing equipment and storage medium
Van Tu A deep learning model of multiple knowledge sources integration for community question answering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180918