CN108549634A - A Chinese patent text similarity calculation method - Google Patents
- Publication number
- CN108549634A (application number CN201810310198.1A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- similarity
- word
- text
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to a Chinese patent text similarity calculation method, including: segmenting the text; computing TF-IDF values for the segmentation result and extracting the words with the highest TF-IDF values as keywords; locating the sentences containing keywords as key sentences, taking the maximum keyword weight in each key sentence as the weight of that key sentence, and thereby obtaining the key-sentence set of each text; computing each key sentence's weight with respect to the text, selecting in turn the key sentences of the text to be compared and of the reference text, and computing the text similarity from the sentence similarity of the key sentences. The invention uses an existing patent-domain ontology to analyse the semantic relations in patent texts, and computes patent text similarity with a vector space model combined with the domain ontology. The accuracy and recall of the results are high, the degree of similarity between patents is described more precisely, the speed of patent examination can be increased, and the needs of practical applications are well met.
Description
Technical field
The invention belongs to the technical field of text information processing, and in particular relates to a Chinese patent text similarity calculation method.
Background technology
In the current Internet era, patents, as carriers of the record of human achievement, contain a large number of scientific and technological achievements and innovations. The rapid development of science and technology has caused the number of patent applications to increase sharply every year. Traditional retrieval matches and returns results by search terms, usually taking the number of occurrences of the terms as the measure of patent relevance, without considering the semantic information contained in the patents themselves. The essence of patent examination is to retrieve the patents most similar to the unexamined patent, and the most important step in this process is computing patent text similarity. The usual algorithmic approach to text similarity represents the texts with a vector space model and then computes the vector similarity directly in the vector space as the text similarity. In recent years the ontology, as a new form of knowledge representation and description, has been widely applied to the semantic web, information retrieval and other areas, and more and more researchers have begun to use ontologies for semantic analysis.
Text similarity methods can be divided into two main classes: one converts texts into vector form with a vector space model and then computes on the vectors; the other uses semantic-dictionary methods to express the connections between texts of different lengths and reflects the similarity between texts through the number of keyword matches. Prior-art methods for computing the similarity of Chinese patent texts lose semantic information, compute Chinese text similarity inaccurately, and have low accuracy and recall; they cannot accurately reflect the similarity of patent texts and cannot meet the needs of practical applications.
Summary of the invention
In view of the above problems in the prior art, the object of the present invention is to provide a Chinese patent text similarity calculation method that avoids the above technical deficiencies.
In order to achieve the above object, the technical solution provided by the invention is as follows:
A Chinese patent text similarity calculation method, comprising the step of calculating sentence similarity.
Further, the calculation method includes:
Segmenting the text; computing TF-IDF values for the segmentation result and extracting the words with the highest TF-IDF values as keywords; locating the sentences containing the keywords as key sentences, taking the maximum keyword weight in a key sentence as the weight of that key sentence, and obtaining the key-sentence set of each text; computing each key sentence's weight with respect to the text, and selecting in turn the key sentences of the text to be compared and of the reference text.
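The keyword and key-sentence extraction described above can be sketched in Python. The function and variable names (tfidf, key_sentences, top_k) are illustrative assumptions, not names from the patent, and plain TF-IDF with a natural logarithm is assumed:

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists (one per text). Returns one {term: tf-idf} dict per text."""
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        scores.append({t: (c / total) * math.log(n / df[t]) for t, c in tf.items()})
    return scores

def key_sentences(sentences, doc_scores, top_k=20):
    """sentences: token lists of one text. Keeps the sentences containing a
    top-k keyword; each key sentence's weight is the maximum keyword weight
    among the keywords it contains, as the text specifies."""
    keywords = dict(sorted(doc_scores.items(), key=lambda kv: -kv[1])[:top_k])
    result = []
    for sent in sentences:
        hits = [keywords[w] for w in sent if w in keywords]
        if hits:
            result.append((sent, max(hits)))
    return result
```

The default top_k of 20 mirrors the embodiment, which takes the top 20 words by TF-IDF as keywords.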
Further, word similarity is converted into the similarity of the corresponding concepts in the ontology; the similarity formula for concepts in the ontology is:
where w1 and w2 denote the two words and dis(w1, w2) denotes their semantic distance in the domain ontology.
Further, after adding the position of the lowest common parent node and the local node density, the similarity formula for concepts in the ontology is:
where r denotes the root node of the tree, com denotes the lowest common parent node of w1 and w2, dis(r, com) denotes the depth of the lowest common parent node, and num(w1) denotes the number of sibling nodes of w1.
Further, word similarity is computed based on word2vec. The input layer of the CBOW model consists of the word vectors of the n words before and after the current word; the middle hidden layer sums these 2n word vectors to obtain Wx. The output layer is a Huffman tree, built with the words of the corpus as leaf nodes and the frequency of each word as its weight. Wx is used to predict the current word by stochastic gradient descent so that the value of p(w | context(w)) is maximised, where context(w) denotes the n words before and after w. The corpus is trained with word2vec to obtain the word vectors of all words; computing the similarity between two words is then converted into computing the similarity of their word vectors, with the formula:
sim(w1, w2) = Σ(i=1..K) x1i·x2i / ( sqrt(Σ(i=1..K) x1i²) · sqrt(Σ(i=1..K) x2i²) ),
where w1 and w2 are the word vectors obtained for the two words after training, and x1i and x2i denote the values of the i-th dimension of the two word vectors in the vector space.
Further, the two word similarities simow(w1, w2) and simrw(w1, w2) are computed with the ontology and word2vec respectively and combined into the final word similarity with the formula:
simw(w1, w2) = (simow(w1, w2) + simrw(w1, w2)) / 2 if w1 ∈ S and w2 ∈ S, and simw(w1, w2) = simrw(w1, w2) otherwise,
where S denotes the concept set of the ontology. If either of the two words does not belong to the ontology concept set, the similarity obtained with word2vec is used as the word similarity; if both belong to the ontology concept set, the average of the ontology word similarity and the word2vec word similarity is taken as the final word similarity.
Further, the sentence similarity is computed as follows:
Suppose there are two sentences S1 = (w11, w12, ..., w1n) and S2 = (w21, w22, ..., w2m), where the wij are the content words obtained after segmenting the sentences and removing stop words. Define (w1i, w2j) as a word mapping between S1 and S2: if simw(w1i, w2j) > simw(w1k, w2l) holds for arbitrary k, l, then w1i and w2j are judged to be the most semantically similar word pair in the two sentences. After a most similar word pair is obtained, the two words are removed from their respective sentences and the computation is repeated until the word list of one of the sentences is empty. The formula is:
simws(S1, S2) denotes the entity-word similarity of S2 relative to S1.
Further, the relation similarity within sentences is computed based on the non-taxonomic relations of the patent-domain ontology, with the following steps:
Part-of-speech tagging is applied to the two sentences; stop words are removed with a stop-word list, and all words other than those with verb or noun part of speech are discarded, yielding an ordered word list for each sentence. The ordered word list of the first sentence is defined as S1(w11:pos11, w12:pos12, ..., w1n:pos1n) and that of the other sentence as S2(w21:pos21, w22:pos22, ..., w2m:pos2m). For each sentence's word list, each verb and the nouns before and after it are chosen to form an SAO-structure phrase P(n1, v, n2), converting the ordered word lists into the phrase sets S1 = (P11, P12, ..., P1n) and S2 = (P21, P22, ..., P2m). Given an existing non-taxonomic relation set NR(r1, r2, ..., rl), where each rl is a phrase with SAO structure, the non-taxonomic relation similarity of the sentences is computed from the number of times the phrases of the two sentences appear in NR, with the formula:
where num(S1) denotes the number of phrases in S1 that belong to the non-taxonomic relation set NR, and com(S1, S2) denotes the intersection of the phrase sets S1 and S2, i.e. the phrases shared by S1 and S2.
The overall similarity between sentences is then
sims(S1, S2) = β·simws(S1, S2) + (1 − β)·simps(S1, S2),
where β denotes the proportion of the entity-word similarity in the sentence similarity and sims(S1, S2) denotes the sentence similarity of S2 relative to S1.
Further, the text similarity is computed on the basis of the word and sentence similarities above, with the following steps:
The text is first segmented; TF-IDF values are then computed for the segmentation result, the words with the highest TF-IDF values are extracted as keywords, and the sentences containing keywords are located as key sentences. The weight of each key sentence with respect to the text is computed: the keyword with the maximum TF-IDF value in the key sentence's keyword set is chosen, and its weight is taken as the weight w(S) of the key sentence. This finally yields the sentence sets of the two texts; let
D1(S11:w(S11), S12:w(S12), ..., S1n:w(S1n)) denote the sentence set of text D1 and D2(S21:w(S21), S22:w(S22), ..., S2m:w(S2m)) denote the sentence set of D2.
Define (S1i, S2j) as a sentence correspondence between the two texts: if sims(S1i, S2j) ≥ sims(S1l, S2k) holds for arbitrary l, k, then S1i and S2j are considered the most closely related sentences in the two texts, where sims(S1i, S2j) is computed from the sentence similarity.
Further, the similarity formula for two texts D1 and D2 is:
where sims(S1i, S2j) denotes the overall similarity between sentence S1i and sentence S2j, w(S1i) denotes the key-sentence weight of S1i, and w(S2j) denotes the key-sentence weight of S2j.
The Chinese patent text similarity calculation method provided by the invention proposes a layered way of computing text similarity: the computation is divided into the three levels of word, sentence and text and is carried out from the bottom up. The method computes text similarity with the sentence as the unit of granularity, combines an existing domain ontology with word2vec to compute word similarity, adds the relation similarity obtained from non-taxonomic relations when computing sentence similarity, and finally computes the text similarity according to the weights of the different sentences. The invention uses an existing patent-domain ontology to analyse the semantic relations in patent texts and computes patent text similarity with a vector space model and the domain ontology. The results are accurate, with high accuracy and recall; the degree of similarity between patents is described more precisely, the speed of patent examination can be increased, users can analyse patent resources more efficiently, and the needs of practical applications are well met.
Description of the drawings
Fig. 1 is a diagram of the CBOW model;
Fig. 2 is a diagram of the Skip-gram model.
Specific implementation mode
In order to make the purpose, technical scheme and advantages of the present invention clearer, the present invention is further described below in conjunction with the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here only serve to explain the present invention, not to limit it. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative work shall fall within the protection scope of the present invention.
Word similarity is a measure of the semantic similarity between words. In a domain ontology a word appears in the form of a concept, so the similarity computation for words can be converted into a similarity computation for concepts in the ontology. To use an existing domain ontology while avoiding the problem that similarities cannot be computed for words not covered by the ontology, word2vec is combined with the ontology to compute word similarity.
(1) Computing word similarity based on the ontology
The concepts contained in the domain ontology form a tree-shaped hierarchy through hypernym-hyponym (taxonomic) relations, and the similarity between concepts can generally be obtained by computing their semantic distance in the ontology tree: find the lowest common parent node of the two concepts, and take the sum of the distances from that node to the two concepts as the semantic distance of the concepts in the ontology.
The similarity formula is:
where w1 and w2 denote the two words and dis(w1, w2) denotes their semantic distance in the domain ontology.
The similarity between words is related not only to the semantic distance between the concepts, but also to the position of their lowest common parent node in the ontology tree and to the number of sibling nodes around the concepts. For the same semantic distance between concepts, the deeper the lowest common parent node lies in the tree, the greater the similarity of the words. Likewise, the more sibling nodes surround the concept corresponding to a word, the greater the local density; this indicates a more finely refined concept node, and the similarity of the words is greater. After adding the position of the lowest common parent node and the local node density, the formula becomes:
where r denotes the root node of the tree, com denotes the lowest common parent node of w1 and w2, dis(r, com) denotes the depth of the lowest common parent node, and num(w1) denotes the number of sibling nodes of w1.
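The tree-based concept similarity can be sketched as follows. Because the patent's formula images are not reproduced in this text, the exact combining formula is unknown; the sketch assumes a common form in which similarity decreases with semantic distance and increases with the depth of the lowest common parent node and the local sibling density. The ontology is modelled as a child-to-parent dict, and all names (parent, concept_sim, alpha) are illustrative:

```python
def lca_and_distance(parent, a, b):
    """Lowest common ancestor of a and b plus the path length between them.
    parent: dict mapping each node to its parent (root maps to None)."""
    anc, node, d = {}, a, 0
    while node is not None:
        anc[node] = d
        node = parent.get(node)
        d += 1
    node, d = b, 0
    while node not in anc:
        node = parent[node]
        d += 1
    return node, anc[node] + d

def depth(parent, node):
    d = 0
    while parent.get(node) is not None:
        node = parent[node]
        d += 1
    return d

def siblings(parent, node):
    p = parent.get(node)
    return (sum(1 for n in parent if parent[n] == p) - 1) if p else 0

def concept_sim(parent, w1, w2, alpha=1.0):
    """Assumed form: identical concepts give 1.0; a deeper common parent and
    denser sibling neighbourhoods raise the score, larger distance lowers it."""
    com, dis = lca_and_distance(parent, w1, w2)
    d_com = depth(parent, com)
    dens = 1 + siblings(parent, w1) + siblings(parent, w2)
    return (alpha * (1 + d_com)) / (alpha * (1 + d_com) + dis / dens)
```

With this form, identical words get similarity 1.0, matching the intuition behind the semantic-distance definition, even though the patent's exact constants are unknown.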
(2) Computing word similarity based on word2vec
Word2vec is an open-source tool from Google that converts words into real-valued vectors. Using ideas from deep learning, it reduces a word, through training, to a vector in a K-dimensional vector space, so that semantic similarity between words can be converted into operations on word vectors.
Word2vec uses the distributed representation of words proposed by Hinton in 1986. Its basic idea is to map each word to a K-dimensional vector space with a training model, so that every word is represented by a K-dimensional vector. K is a hyperparameter that must be specified in advance. Once words are converted to vector representations, the semantic similarity between words can be expressed by the distance between vectors. Word2vec provides the CBOW model (Continuous Bag-of-Words Model) and the Skip-gram model; the structures of the two models are shown in Fig. 1 and Fig. 2. The CBOW model uses the 2n words in the context of the current word (n is 2 in Fig. 1) to predict the current word, while the Skip-gram model uses the current word to predict the 2n words of its context (n is 2 in Fig. 2).
The input layer of the CBOW model consists of the word vectors of the n words before and after the current word; the middle hidden layer sums these 2n word vectors to obtain Wx. The output layer is a Huffman tree, built with the words of the corpus as leaf nodes and the frequency of each word as its weight. Wx is used to predict the current word by stochastic gradient descent so that the value of p(w | context(w)) is maximised, where context(w) denotes the n words before and after w. When training is complete, the word vectors of all words are obtained. The training process of the Skip-gram model is similar to that of the CBOW model.
The corpus is trained with word2vec to obtain the word vectors of all words, and computing the similarity between two words is converted into computing the similarity of their word vectors. For the similarity of space vectors the cosine similarity algorithm is generally used:
sim(w1, w2) = Σ(i=1..K) x1i·x2i / ( sqrt(Σ(i=1..K) x1i²) · sqrt(Σ(i=1..K) x2i²) ),
where w1 and w2 are the word vectors obtained for the two words after training, and x1i and x2i denote the values of the i-th dimension of the two word vectors in the vector space.
(3) Word similarity
The two word similarities simow(w1, w2) and simrw(w1, w2) are computed with the ontology and word2vec respectively and combined into the word similarity with the following formula:
simw(w1, w2) = (simow(w1, w2) + simrw(w1, w2)) / 2 if w1 ∈ S and w2 ∈ S, and simw(w1, w2) = simrw(w1, w2) otherwise,
where S denotes the concept set of the ontology. If either of the two words does not belong to the ontology concept set, the similarity obtained with word2vec is used as the word similarity; if both belong to the ontology concept set, the average of the ontology word similarity and the word2vec word similarity is taken as the final word similarity.
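The piecewise combination described above is fully specified by the text and can be sketched as follows; the callables sim_ow and sim_rw stand in for the ontology-based and word2vec-based similarities, and all names are illustrative:

```python
def word_sim(w1, w2, onto_concepts, sim_ow, sim_rw):
    """onto_concepts: set S of concepts in the ontology.
    sim_ow / sim_rw: callables returning ontology and word2vec similarity.
    Average the two when both words are ontology concepts, else fall back
    to the word2vec similarity alone."""
    if w1 in onto_concepts and w2 in onto_concepts:
        return (sim_ow(w1, w2) + sim_rw(w1, w2)) / 2.0
    return sim_rw(w1, w2)
```

The fallback branch is what lets the method cover words missing from the domain ontology, which is the motivation given earlier for introducing word2vec.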
Sentence similarity is generally computed on the basis of the significant content words in the text, obtaining the sentence similarity from the word similarities between content words. In the present invention the sentence similarity is computed on the basis of the word similarity described by formula (4) and the non-taxonomic relations of the patent-domain ontology.
Suppose there are two sentences S1 = (w11, w12, ..., w1n) and S2 = (w21, w22, ..., w2m), where the wij are the content words obtained after segmenting the sentences and removing stop words. Define (w1i, w2j) as a word mapping between S1 and S2: if simw(w1i, w2j) > simw(w1k, w2l) holds for arbitrary k, l, where simw(w1i, w2j) is computed by formula (4), then w1i and w2j are judged to be the most semantically similar word pair in the two sentences. After a most similar word pair is obtained, the two words are removed from their respective sentences and the computation is repeated until the word list of one of the sentences is empty. The formula is as follows:
Since the lengths of the sentences differ, the close word pairs shared by the two sentences contribute differently to the similarity of each sentence; the present invention uses simws(S1, S2) to denote the entity-word similarity of S2 relative to S1.
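The greedy removal of most-similar word pairs can be sketched as below. The patent's normalisation formula is not reproduced in this text, so dividing the accumulated pair similarities by the length of S1 (making the result relative to S1, as the text states) is an assumption; all names are illustrative:

```python
def entity_word_sim(s1, s2, sim_w):
    """s1, s2: lists of content words; sim_w: word-similarity callable.
    Repeatedly extract the globally most similar word pair, remove both
    words, and stop when one sentence's word list is empty."""
    s1, s2 = list(s1), list(s2)
    base = len(s1)                      # assumed normaliser: |S1|
    total = 0.0
    while s1 and s2:
        best = max(((a, b) for a in s1 for b in s2), key=lambda p: sim_w(*p))
        total += sim_w(*best)
        s1.remove(best[0])
        s2.remove(best[1])
    return total / base if base else 0.0
```

Because the normaliser is |S1|, entity_word_sim(s1, s2, ...) and entity_word_sim(s2, s1, ...) can differ, which matches the text's point that the measure is relative to one sentence.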
The formula above considers only the similarity of the most similar entity words in the two sentences, not the similarity of semantically close phrases within them. Based on the non-taxonomic relations of the patent-domain ontology, the relation similarity within the sentences is computed.
Part-of-speech tagging is applied to the two sentences; stop words are removed with a stop-word list, and all words except those with verb or noun part of speech are discarded. This yields an ordered word list for each sentence; the ordered word list of the first sentence is defined as S1(w11:pos11, w12:pos12, ..., w1n:pos1n) and, similarly, that of the other sentence as S2(w21:pos21, w22:pos22, ..., w2m:pos2m). For each sentence's word list, each verb and the nouns before and after it are chosen to form an SAO-structure phrase P(n1, v, n2), converting the ordered word lists into the phrase sets S1 = (P11, P12, ..., P1n) and S2 = (P21, P22, ..., P2m). Given an existing non-taxonomic relation set NR(r1, r2, ..., rl), where each rl is a phrase with SAO structure, the non-taxonomic relation similarity of the sentences is computed from the number of times the phrases of the two sentences appear in NR. The formula is as follows:
where num(S1) denotes the number of phrases in S1 that belong to the non-taxonomic relation set NR, and com(S1, S2) denotes the intersection of the phrase sets S1 and S2, i.e. the phrases shared by S1 and S2.
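The SAO phrase extraction and the relation similarity can be sketched as follows. The exact formula image is missing from the source, so the symmetric ratio used in relation_sim (shared phrases that occur in NR, over the total NR hits of both sentences) is an assumed reading of num() and com(); all names are illustrative:

```python
def sao_phrases(tagged):
    """tagged: list of (word, pos) pairs after filtering, pos in {'n', 'v'}.
    For each verb, pair it with the nearest preceding and following noun
    to form an SAO-structure phrase (n1, v, n2)."""
    phrases = []
    for i, (w, pos) in enumerate(tagged):
        if pos != 'v':
            continue
        prev = next((tagged[j][0] for j in range(i - 1, -1, -1)
                     if tagged[j][1] == 'n'), None)
        nxt = next((tagged[j][0] for j in range(i + 1, len(tagged))
                    if tagged[j][1] == 'n'), None)
        if prev and nxt:
            phrases.append((prev, w, nxt))
    return phrases

def relation_sim(p1, p2, nr):
    """p1, p2: phrase lists of the two sentences; nr: set of SAO phrases.
    Assumed symmetric form: shared NR phrases over total NR hits."""
    n1 = sum(1 for p in p1 if p in nr)
    n2 = sum(1 for p in p2 if p in nr)
    shared = len(set(p1) & set(p2) & nr)
    return 2 * shared / (n1 + n2) if (n1 + n2) else 0.0
```

Representing phrases as tuples makes set intersection with the relation set NR a direct membership test.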
From the entity-word similarity obtained by formula (5) and the non-taxonomic relation similarity obtained by formula (6), the overall similarity between the sentences is obtained as shown in formula (7):
sims(S1, S2) = β·simws(S1, S2) + (1 − β)·simps(S1, S2)   (7),
where β denotes the proportion of the entity-word similarity in the sentence similarity and sims(S1, S2) denotes the sentence similarity of S2 relative to S1.
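Formula (7) itself is a one-line weighted mix; a sketch with an illustrative default β of 0.5 (the patent does not state a value here):

```python
def sentence_sim(sim_ws, sim_ps, beta=0.5):
    """Formula (7): weighted mix of entity-word similarity (sim_ws)
    and non-taxonomic relation similarity (sim_ps); 0 <= beta <= 1."""
    return beta * sim_ws + (1 - beta) * sim_ps
```

Choosing β closer to 1 makes the content words dominate; closer to 0, the SAO relations dominate.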
Text similarity is computed on the basis of the word and sentence similarities above. In a text, not every word and sentence is related to the main meaning the text expresses: the keywords and the key sentences containing them express the meaning of the text best, while the others mainly serve as connectives. In the computation of text similarity, a result obtained from the similarity of keywords and key sentences is closely tied to the meaning expressed by the two documents themselves and can represent the similarity between the texts more accurately.
When choosing the keywords of a text, the importance of each word to the text must be computed. Term frequency-inverse document frequency (TF-IDF) is generally used: the method is relatively simple to compute, has high accuracy and recall, and is widely used for computing weights. The more frequently a word appears in one text and the less frequently it appears in other texts, the better the word represents the topic of that text and the more important it is to that text.
To compute the text similarity, the text is first segmented; TF-IDF values are then computed for the segmentation result, the words with the highest TF-IDF values are extracted as keywords, and the sentences containing keywords are located as key sentences. Different key sentences contain different keywords, and because of differences in the number of keywords they contain and in the keyword weights themselves, the key sentences differ in importance to the text; the weight of each key sentence with respect to the text must therefore be computed. The importance of a key sentence to the text depends on the keywords it contains: the keyword with the maximum TF-IDF value in the key sentence's keyword set is chosen and its weight is taken as the weight w(S) of the key sentence. This finally yields the sentence sets of the two texts; let
D1(S11:w(S11), S12:w(S12), ..., S1n:w(S1n)) denote the sentence set of text D1 and D2(S21:w(S21), S22:w(S22), ..., S2m:w(S2m)) denote the sentence set of D2.
Define (S1i, S2j) as a sentence correspondence between the two texts: if sims(S1i, S2j) ≥ sims(S1l, S2k) holds for arbitrary l, k, then S1i and S2j are considered the most closely related sentences in the two texts, where sims(S1i, S2j) is computed from the sentence similarity.
As with the sentence similarity computation, since the numbers of sentences in the two texts differ, the sets of most similar sentences found when computing the similarity differ for different texts, and so do the similarities obtained; sim(D1, D2) denotes the text similarity of D2 relative to D1, with D1 as the reference.
The similarity formula for the two texts is as follows:
The formula involves not only all the key sentences of the two texts but also the importance of each key sentence to the entire document, and thus reflects the similarity between the texts well.
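The text-level matching can be sketched in the same greedy style as the sentence level. The formula image is not reproduced in the source, so weighting each matched pair's sentence similarity by the key-sentence weight of D1 and normalising by the total weight of D1 (making the result relative to D1, as the text states) is an assumption; all names are illustrative:

```python
def text_sim(d1, d2, sent_sim):
    """d1, d2: lists of (sentence, key-sentence weight); sent_sim: callable.
    Greedily match the most similar sentence pair, accumulate its similarity
    weighted by the D1 sentence's weight, and normalise by D1's total weight."""
    d1, d2 = list(d1), list(d2)
    denom = sum(w for _, w in d1)       # assumed normaliser: total D1 weight
    total = 0.0
    while d1 and d2:
        (s1, w1), (s2, w2) = max(
            ((a, b) for a in d1 for b in d2),
            key=lambda p: sent_sim(p[0][0], p[1][0]))
        total += sent_sim(s1, s2) * w1
        d1.remove((s1, w1))
        d2.remove((s2, w2))
    return total / denom if denom else 0.0
```

As at the sentence level, the measure is asymmetric by construction: text_sim(d1, d2, ...) is the similarity of D2 relative to D1.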
This embodiment uses an existing domain ontology for the new-energy-vehicle field; the corpus consists of 50 patent texts of the same category among Chinese patents in the new-energy-vehicle field.
One of the patents is chosen as the text N to be compared and all other patents as reference texts D. The Chinese patent text similarity is computed with the Chinese patent text similarity calculation method of the present invention, with the following steps:
1) All patent texts are split into sentences;
2) With the Hanlp tool and an added new-energy-vehicle domain dictionary, the sentence-split results are segmented, the part of speech of each word is retained, and stop words are removed with a stop-word list;
3) The TF-IDF values of all patent texts are computed and sorted in descending order; the top 20 words are taken as keywords, the key sentences containing the keywords are marked, the maximum keyword weight in each key sentence is taken as the weight of that key sentence, and the key-sentence set D(S1:w(S1), S2:w(S2), ..., Sn:w(Sn)) of each text is obtained;
4) The key sentences of the text N to be compared and of the reference text D are chosen in turn, and the entity-word similarity and the relation similarity are computed with the sentence-similarity and word-similarity algorithms;
5) The similarity of the two texts is computed from the sentence similarities of the two texts N and D.
Accuracy (P), recall (R) and the F value are generally used as performance indicators for the evaluation of text similarity computation methods. Let T(t) denote the number of items whose annotated value is t, C(t) the number of items whose computed text similarity falls within the range represented by t, and TC(t) the number of items whose annotated value is t and whose computed value also falls within the range represented by t. The evaluation indicators are defined as follows:
Accuracy P: P = TC(t) / C(t)
Recall R: R = TC(t) / T(t)
F value: F = 2PR / (P + R)
The accuracy, recall and F values of the Chinese patent text similarity results obtained with the method of the invention are all very high, far above the accuracy, recall and F values of the results of the prior art.
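The three indicators follow directly from the definitions of T(t), C(t) and TC(t); a sketch with illustrative names:

```python
def evaluate(t_count, c_count, tc_count):
    """t_count: items annotated with value t; c_count: items whose computed
    similarity falls in range t; tc_count: items satisfying both.
    Returns (accuracy P, recall R, F value)."""
    p = tc_count / c_count if c_count else 0.0
    r = tc_count / t_count if t_count else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f
```

The F value is the harmonic mean of P and R, so it penalises a method that trades one indicator for the other.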
The Chinese patent text similarity calculation method provided by the invention proposes a layered way of computing text similarity: the computation is divided into the three levels of word, sentence and text and is carried out from the bottom up. The method computes text similarity with the sentence as the unit of granularity, combines an existing domain ontology with word2vec to compute word similarity, adds the relation similarity obtained from non-taxonomic relations when computing sentence similarity, and finally computes the text similarity according to the weights of the different sentences. The invention uses an existing patent-domain ontology to analyse the semantic relations in patent texts and computes patent text similarity with a vector space model and the domain ontology. The results are accurate, with high accuracy and recall, so that the correlation between texts is described more exactly and the degree of similarity between patents more precisely; the speed of patent examination can be increased, users can analyse patent resources more efficiently, and the needs of practical applications are well met.
The embodiments described above only express implementations of the present invention; their description is relatively specific and detailed, but they must not therefore be interpreted as limiting the scope of the claims of the invention. It should be pointed out that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the inventive concept, and these belong to the protection scope of the present invention. The protection scope of this patent shall therefore be determined by the appended claims.
Claims (10)
1. a kind of Chinese patent text similarity calculating method, which is characterized in that include the steps that calculating sentence similarity.
2. Chinese patent text similarity calculating method according to claim 1, which is characterized in that the computational methods packet
It includes:
Text is segmented;TF-IDF values are calculated to word segmentation result, it is higher as keyword, positioning pass to extract TF-IDF values
Sentence where keyword obtains every as critical sentence, and using the maximum weights of keyword in critical sentence as the weights of critical sentence
The critical sentence set of a text;The weight to text for calculating each critical sentence chooses text to be compared and comparison text successively
Critical sentence.
3. the Chinese patent text similarity calculating method according to claim 1-2, which is characterized in that by Words similarity
The similarity of concept in the body is converted into calculate;The calculating formula of similarity of concept in the body is:
Wherein w1And w2Indicate two words, dis (w1, w2) indicate w1And w2Semantic distance in domain body.
4. The Chinese patent text similarity calculation method according to claims 1-3, characterized in that the position of the lowest common ancestor node and the local node density are added, and the concept similarity formula becomes:
where r denotes the root node of the tree, com denotes the lowest common ancestor of w1 and w2, dis(r, com) denotes the depth of the lowest common ancestor, and num(w1) denotes the number of sibling nodes of node w1.
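The tree quantities used in claims 3-4 (semantic distance, lowest common ancestor, its depth) can be computed from a parent-pointer representation of the ontology. The patent's exact formulas are given as images and are not reproduced here; the `(1 + depth) / (1 + depth + dis)` combination below is only an illustrative stand-in that, like the claimed formula, increases with the depth of the common ancestor and decreases with path distance.

```python
class OntologyTree:
    """Toy concept tree represented as a child -> parent map."""

    def __init__(self, parent):
        self.parent = parent

    def path_to_root(self, node):
        # Node itself, then its ancestors up to the root r.
        path = [node]
        while path[-1] in self.parent:
            path.append(self.parent[path[-1]])
        return path

    def lca(self, a, b):
        # Lowest common ancestor: com in the claims' notation.
        ancestors = set(self.path_to_root(a))
        for n in self.path_to_root(b):
            if n in ancestors:
                return n
        return None

    def distance(self, a, b):
        # dis(a, b): edges from a to com plus edges from b to com.
        com = self.lca(a, b)
        pa, pb = self.path_to_root(a), self.path_to_root(b)
        return pa.index(com) + pb.index(com)

def concept_sim(tree, a, b):
    """Illustrative similarity: deeper common ancestors and shorter
    paths give higher scores (assumed form, not the patented formula)."""
    com = tree.lca(a, b)
    depth_com = len(tree.path_to_root(com)) - 1  # dis(r, com)
    dis = tree.distance(a, b)
    return (1 + depth_com) / (1 + depth_com + dis)
```

A sibling count `num(w1)`, as used in claim 4, could be added by counting children of `parent[w1]` in the same map.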
5. The Chinese patent text similarity calculation method according to claims 1-4, characterized in that word similarity is calculated based on word2vec: the input layer of the CBOW model consists of the vectors of the n words before and after the current word, and an intermediate hidden layer sums these 2n word vectors to obtain Wx; the output layer is a Huffman tree built with the words of the corpus as leaf nodes and each word's frequency as its weight; Wx is used for prediction via stochastic gradient descent so that p(w | context(w)) is maximized, where context(w) refers to the n words before and after w; the corpus is trained with word2vec to obtain the vectors of all words; calculating the similarity between two words is then converted into calculating the similarity of their corresponding word vectors, with the formula:
where w1 and w2 are the word vectors obtained for the two words after training, and x1i and x2i denote the values of the i-th dimension of the two word vectors in the vector space.
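The dimension-wise formula in claim 5 is the standard cosine similarity between the two trained word vectors; a minimal sketch (the training itself would typically be done with a library such as gensim, which is an assumption here):

```python
import math

def cosine_similarity(v1, v2):
    """Cosine similarity of two word vectors:
    sum(x1i * x2i) / (||v1|| * ||v2||)."""
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = math.sqrt(sum(x * x for x in v1))
    norm2 = math.sqrt(sum(x * x for x in v2))
    return dot / (norm1 * norm2)
```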
6. The Chinese patent text similarity calculation method according to claims 1-5, characterized in that two word similarities, simow(w1, w2) and simrw(w1, w2), are calculated separately using the ontology and word2vec and then combined into the word similarity formula:
where S denotes the concept set in the ontology; if one of the two words does not belong to the ontology concept set, the similarity obtained with word2vec is used as the word similarity; if both belong to the ontology concept set, the average of the ontology word similarity and the word2vec word similarity is taken as the final word similarity.
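The case split of claim 6 translates directly into code. The function name and the callable interface for the two underlying similarity measures are illustrative assumptions:

```python
def combined_word_sim(w1, w2, ontology_concepts, sim_ontology, sim_w2v):
    """Claim 6's combination: average the ontology and word2vec
    similarities when both words are in the ontology concept set S,
    otherwise fall back to the word2vec similarity alone."""
    if w1 in ontology_concepts and w2 in ontology_concepts:
        return (sim_ontology(w1, w2) + sim_w2v(w1, w2)) / 2
    return sim_w2v(w1, w2)
```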
7. The Chinese patent text similarity calculation method according to claims 1-6, characterized in that the sentence similarity is calculated as follows:
suppose there are two sentences S1 = (w11, w12, ..., w1n) and S2 = (w21, w22, ..., w2m), where w11, w12, w21, w22 are the content words obtained after segmenting the sentences and removing stop words; define (w1i, w2j) as a word mapping between sentences S1 and S2; if simw(w1i, w2j) > simw(w1k, w2l) holds for arbitrary k and l, then w1i and w2j are judged to be the most semantically similar word pair in the two sentences; after one most similar word pair is obtained, the two words are removed from their respective sentences and the calculation is repeated until the vocabulary of one of the sentences is empty; the calculation formula is:
where simws(S1, S2) denotes the content-word similarity of S2 relative to S1.
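The greedy pairing loop of claim 7 can be sketched as follows. The final averaging over pair scores is an assumption on my part, since the patent's aggregation formula is given as an image:

```python
def sentence_word_sim(s1, s2, sim_w):
    """Repeatedly find the most similar word pair across the two
    sentences, record its score, remove both words, and stop when
    either word list is empty (claim 7's procedure)."""
    s1, s2 = list(s1), list(s2)
    scores = []
    while s1 and s2:
        score, a, b = max((sim_w(a, b), a, b) for a in s1 for b in s2)
        scores.append(score)
        s1.remove(a)
        s2.remove(b)
    # Assumed aggregation: mean of the matched-pair similarities.
    return sum(scores) / len(scores) if scores else 0.0
```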
8. The Chinese patent text similarity calculation method according to claims 1-7, characterized in that the relation similarity within sentences is calculated using non-taxonomic relations based on the patent-domain ontology, with the following steps:
perform part-of-speech tagging on the two sentences, remove stop words using a stop-word list, remove words of other parts of speech so that only verbs and nouns are retained, and obtain an ordered vocabulary for each of the two sentences; define the ordered vocabulary of the first sentence as S1(w11:pos11, w12:pos12, ..., w1n:pos1n) and that of the other sentence as S2(w21:pos21, w22:pos22, ..., w2m:pos2m); from each sentence's vocabulary, select each verb together with the nouns before and after it to form an SAO-structure phrase P(n1, v, n2); convert the ordered vocabulary of each sentence into the phrase sets S1 = (P11, P12, ..., P1n) and S2 = (P21, P22, ..., P2m); obtain the non-taxonomic relation set, assuming the existing non-taxonomic relation set is NR(r1, r2, ..., rl), where rl is a phrase with SAO structure in the non-taxonomic set; calculate the non-taxonomic relation similarity of the sentences by counting the occurrences of the two sentences' phrases in the non-taxonomic relation set NR, with the formula:
where num(S1) denotes the number of phrases in set S1 that belong to the non-taxonomic relation set NR, and com(S1, S2) denotes the intersection of the sentence phrase sets S1 and S2, i.e., the set of phrases shared by S1 and S2.
The overall similarity between sentences is
sims(S1, S2) = β·simws(S1, S2) + (1-β)·simps(S1, S2),
where β denotes the proportion of content-word similarity in the sentence similarity, and sims(S1, S2) denotes the sentence similarity of S2 relative to S1.
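The SAO extraction and the weighted combination of claim 8 can be sketched as below. The `"n"`/`"v"` tags, the nearest-noun pairing, and the exact combination inside `relation_sim` are assumptions (the patent's relation-similarity formula is an image); only the β-weighted sentence formula is taken verbatim from the claim:

```python
def extract_sao(tagged):
    """Build SAO triples P(n1, v, n2) from a POS-tagged ordered word
    list, pairing each verb with the nearest noun before and after it."""
    triples = []
    for i, (w, pos) in enumerate(tagged):
        if pos == "v":
            before = next((tagged[j][0] for j in range(i - 1, -1, -1)
                           if tagged[j][1] == "n"), None)
            after = next((tagged[j][0] for j in range(i + 1, len(tagged))
                          if tagged[j][1] == "n"), None)
            if before and after:
                triples.append((before, w, after))
    return triples

def relation_sim(p1, p2, nr):
    """Assumed non-taxonomic relation similarity: shared phrases
    relative to each sentence's phrases that fall in the set NR."""
    num1 = len([p for p in p1 if p in nr])  # num(S1)
    num2 = len([p for p in p2 if p in nr])  # num(S2)
    com = len(set(p1) & set(p2))            # |com(S1, S2)|
    total = num1 + num2
    return (2 * com) / total if total else 0.0

def sentence_sim(sim_ws, sim_ps, beta=0.5):
    # Overall sentence similarity from claim 8.
    return beta * sim_ws + (1 - beta) * sim_ps
```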
9. The Chinese patent text similarity calculation method according to claims 1-8, characterized in that the text similarity is calculated on the basis of the existing word similarity and sentence similarity, with the following steps:
first segment the text into words, then calculate TF-IDF values for the segmentation results and extract the words with higher TF-IDF values as keywords, taking the sentences in which the keywords occur as key sentences; calculate each key sentence's weight with respect to the text; for each key sentence, choose the keyword with the maximum TF-IDF value in its keyword set and take that word's weight as the key sentence's weight w(S); finally obtain the sentence sets of the two texts, letting
D1(S11:w(S11), S12:w(S12), ..., S1n:w(S1n)) denote the sentence set of text D1, and D2(S21:w(S21), S22:w(S22), ..., S2m:w(S2m)) denote the sentence set of D2;
define (S1i, S2j) as a sentence correspondence between the two texts; if sims(S1i, S2j) ≥ sims(S1l, S2k) holds for arbitrary l and k, then S1i and S2j are considered the most closely related sentences in the two texts, where sims(S1i, S2j) is obtained through the sentence similarity calculation.
10. The Chinese patent text similarity calculation method according to claims 1-9, characterized in that the similarity of two texts D1 and D2 is calculated as:
where sims(S1i, S1j) denotes the overall similarity between sentence phrase set S1i and sentence phrase set S1j, w(S1i) denotes the key-sentence weight of S1i, and w(S1j) denotes the key-sentence weight of S1j.
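Claims 9-10 describe matching each key sentence of one text to its most similar key sentence in the other and aggregating with the key-sentence weights. The claim-10 formula itself is an image, so the weighted-average aggregation sketched here is an assumption, as are the function name and interface:

```python
def text_sim(sents1, sents2, sentence_sim):
    """Text-level similarity: for each key sentence of D1 (paired with
    its weight w(S)), find its most similar key sentence in D2, then
    take the weight-weighted average of the best pair similarities.
    sents1/sents2: lists of (sentence, weight) pairs."""
    total, weight_sum = 0.0, 0.0
    for s1, w1 in sents1:
        best = max((sentence_sim(s1, s2) for s2, _ in sents2), default=0.0)
        total += w1 * best
        weight_sum += w1
    return total / weight_sum if weight_sum else 0.0
```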
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810310198.1A CN108549634A (en) | 2018-04-09 | 2018-04-09 | A kind of Chinese patent text similarity calculating method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108549634A true CN108549634A (en) | 2018-09-18 |
Family
ID=63514291
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810310198.1A Pending CN108549634A (en) | 2018-04-09 | 2018-04-09 | A kind of Chinese patent text similarity calculating method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108549634A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100250547A1 (en) * | 2001-08-13 | 2010-09-30 | Xerox Corporation | System for Automatically Generating Queries |
CN105678327A (en) * | 2016-01-05 | 2016-06-15 | 北京信息科技大学 | Method for extracting non-taxonomy relations between entities for Chinese patents |
CN106407182A (en) * | 2016-09-19 | 2017-02-15 | 国网福建省电力有限公司 | A method for automatic abstracting for electronic official documents of enterprises |
Non-Patent Citations (2)
Title |
---|
WEI LU et al.: "Joint semantic similarity assessment with raw corpus and structured ontology for semantic-oriented service discovery", Personal and Ubiquitous Computing |
WANG Jin et al.: "Text similarity algorithm based on domain ontology", Journal of Soochow University (Engineering Science Edition) |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110929022A (en) * | 2018-09-18 | 2020-03-27 | 阿基米德(上海)传媒有限公司 | Text abstract generation method and system |
CN109657227A (en) * | 2018-10-08 | 2019-04-19 | 平安科技(深圳)有限公司 | Contract feasibility determination method, equipment, storage medium and device |
CN109597878A (en) * | 2018-11-13 | 2019-04-09 | 北京合享智慧科技有限公司 | A kind of method and relevant apparatus of determining text similarity |
CN109657213A (en) * | 2018-12-21 | 2019-04-19 | 北京金山安全软件有限公司 | Text similarity detection method and device and electronic equipment |
CN109657213B (en) * | 2018-12-21 | 2023-07-28 | 北京金山安全软件有限公司 | Text similarity detection method and device and electronic equipment |
CN109918670A (en) * | 2019-03-12 | 2019-06-21 | 重庆誉存大数据科技有限公司 | A kind of article duplicate checking method and system |
CN110516216A (en) * | 2019-05-15 | 2019-11-29 | 北京信息科技大学 | A kind of automatic writing template base construction method of sports news |
CN110134792A (en) * | 2019-05-22 | 2019-08-16 | 北京金山数字娱乐科技有限公司 | Text recognition method, device, electronic equipment and storage medium |
CN110134792B (en) * | 2019-05-22 | 2022-03-08 | 北京金山数字娱乐科技有限公司 | Text recognition method and device, electronic equipment and storage medium |
CN110309263A (en) * | 2019-06-06 | 2019-10-08 | 中国人民解放军军事科学院军事科学信息研究中心 | A kind of semantic-based working attributes content of text judgement method for confliction detection and device |
CN110532396A (en) * | 2019-06-11 | 2019-12-03 | 福建奇点时空数字科技有限公司 | A kind of entity similarity calculating method based on vector space model |
CN110209822A (en) * | 2019-06-11 | 2019-09-06 | 中译语通科技股份有限公司 | Sphere of learning data dependence prediction technique based on deep learning, computer |
CN110209822B (en) * | 2019-06-11 | 2021-12-21 | 中译语通科技股份有限公司 | Academic field data correlation prediction method based on deep learning and computer |
CN110297918A (en) * | 2019-06-25 | 2019-10-01 | 深圳市酷开网络科技有限公司 | A kind of method, intelligent terminal and storage medium calculating movie and television contents degree of correlation |
CN110457435A (en) * | 2019-07-26 | 2019-11-15 | 南京邮电大学 | A kind of patent novelty analysis system and its analysis method |
CN112651221A (en) * | 2019-10-10 | 2021-04-13 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
CN111027306A (en) * | 2019-12-23 | 2020-04-17 | 园宝科技(武汉)有限公司 | Intellectual property matching technology based on keyword extraction and word shifting distance |
CN111814456A (en) * | 2020-05-25 | 2020-10-23 | 国网上海市电力公司 | Verb-based Chinese text similarity calculation method |
CN111767724A (en) * | 2020-06-11 | 2020-10-13 | 安徽旅贲科技有限公司 | Text similarity calculation method and system |
CN112380830A (en) * | 2020-06-18 | 2021-02-19 | 达而观信息科技(上海)有限公司 | Method, system and computer readable storage medium for matching related sentences in different documents |
CN112380830B (en) * | 2020-06-18 | 2024-05-17 | 达观数据有限公司 | Matching method, system and computer readable storage medium for related sentences in different documents |
CN111930946A (en) * | 2020-08-18 | 2020-11-13 | 哈尔滨工程大学 | Patent classification method based on similarity measurement |
CN112163418A (en) * | 2020-08-31 | 2021-01-01 | 深圳市修远文化创意有限公司 | Text comparison method and related device |
CN115563515A (en) * | 2022-12-07 | 2023-01-03 | 粤港澳大湾区数字经济研究院(福田) | Text similarity detection method, device and equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108549634A (en) | A kind of Chinese patent text similarity calculating method | |
CN109344236B (en) | Problem similarity calculation method based on multiple characteristics | |
CN107229610B (en) | A kind of analysis method and device of affection data | |
US10867256B2 (en) | Method and system to provide related data | |
CN107247780A (en) | A kind of patent document method for measuring similarity of knowledge based body | |
CN110020189A (en) | A kind of article recommended method based on Chinese Similarity measures | |
CN103324700B (en) | Noumenon concept attribute learning method based on Web information | |
CN104834747A (en) | Short text classification method based on convolution neutral network | |
CN109190117A (en) | A kind of short text semantic similarity calculation method based on term vector | |
CN108874896B (en) | Humor identification method based on neural network and humor characteristics | |
CN110704621A (en) | Text processing method and device, storage medium and electronic equipment | |
CN110134925A (en) | A kind of Chinese patent text similarity calculating method | |
CN103646112A (en) | Dependency parsing field self-adaption method based on web search | |
CN114997288B (en) | Design resource association method | |
US12124802B2 (en) | System and method for analyzing similarity of natural language data | |
CN109408802A (en) | A kind of method, system and storage medium promoting sentence vector semanteme | |
CN114818717B (en) | Chinese named entity recognition method and system integrating vocabulary and syntax information | |
JP4534666B2 (en) | Text sentence search device and text sentence search program | |
CN116304748B (en) | Text similarity calculation method, system, equipment and medium | |
CN116644148A (en) | Keyword recognition method and device, electronic equipment and storage medium | |
Song et al. | Improving embedding-based unsupervised keyphrase extraction by incorporating structural information | |
CN109086443A (en) | Social media short text on-line talking method based on theme | |
CN114757184B (en) | Method and system for realizing knowledge question and answer in aviation field | |
CN113420127B (en) | Threat information processing method, threat information processing device, computing equipment and storage medium | |
Van Tu | A deep learning model of multiple knowledge sources integration for community question answering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20180918