CN108920456A - Automatic keyword extraction method - Google Patents
Automatic keyword extraction method
- Publication number: CN108920456A
- Application number: CN201810611476.7A
- Authority
- CN
- China
- Prior art keywords
- keywords
- candidate
- word
- technical standard
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/216—Parsing using statistical methods
Abstract
The present invention relates to an automatic keyword extraction method, comprising: extracting common words in the technical standards; extracting candidate keywords; after filtering the common words from the candidate keywords, calculating candidate keyword weight scores by integrating position features, word co-occurrence features, and context semantic features; computing a dynamic threshold from the range of candidate keyword weight scores; and determining the result keywords using the dynamic threshold. The automatic keyword extraction method provided by the invention fuses position features, word co-occurrence features, and context semantic features to extract keywords, comprehensively considering the influence of in-document position and context semantics on keyword weights. It achieves higher accuracy and recall, improves the retrieval quality of the 3GPP technical standards, reduces labor cost, and can well meet the needs of practical application.
Description
Technical Field
The invention belongs to the technical field of automatic keyword extraction, and particularly relates to an automatic keyword extraction method oriented to the 3GPP technical standards.
Background
The explosive development of mobile communication technology has brought epoch-making changes to human society. As the standard maker of the leading technology in the communications field, the 3rd Generation Partnership Project (3GPP) is dedicated to developing 3G standards based on the evolved Global System for Mobile Communications (GSM) core network, including WCDMA, TD-SCDMA, EDGE, etc.
In recent years, there are many cases of patent infringement litigation disputes among large-scale communication technology companies, and the stability of patent rights of the invention is challenged unprecedentedly. The 3GPP technical standards play an irreplaceable important role in the work of examining communication patents.
The 3GPP technical standard is a scientific non-patent document specific to the patent examination work in the communication field, and is usually used as a comparison document to measure the creativity and novelty of the patent application in the communication field.
The Cover of a typical 3GPP technical standard mainly includes the standard number, release number, document title, and version number information; the Foreword explains the version number; the Scope declares the scope of application; the Reference part gives the reference list; the Definitions and Abbreviations parts list the important definitions and abbreviations of the document; the body specifically introduces the technical background and details; and the Annex mainly records the version change history.
In addition, there is a correlation between the 3GPP technical standard and the patent literature, and the difference between the 3GPP technical standard and the patent literature is shown in table 1.
Table 1 Differences between patent documents and the 3GPP technical standards
As can be seen from table 1, the 3GPP technical standards have their own unique organization and types. Of major interest in actual patent examination are Technical Specifications (TS), Technical Reports (TR), and conference files. The technical specifications and technical reports together describe the relevant regulations, principles, simulation and experimental results of the technology, while the conference files mainly record the specific conference information of each working group. In contrast, the technical specifications and technical reports are similar in content format, carry richer core technical information, and have greater mining value.
In actual patent review, the search of 3GPP technical standards is mainly based on the key words manually selected by the examiner. The quality of the retrieval result often depends on the quality of the keywords, and the traditional mode not only consumes time and labor, but also is difficult to ensure the hit rate of the comparison files. Compared with patent documents, the 3GPP technical standard has the characteristics of wide coverage, large information amount, irregular format and weak readability, and the characteristics directly determine that the 3GPP technical standard has higher difficulty in automatically extracting keywords than the patent documents. Therefore, the automatic extraction effect of the keywords of the 3GPP technical standard is improved, which is not only beneficial to improving the examination efficiency of the communication patents, but also has great significance for maintaining the authorization stability of the patents.
There has been a great deal of research on automatic keyword extraction at home and abroad, generally falling into two major branches: supervised and unsupervised learning methods. Supervised methods usually convert the keyword extraction problem into a binary or multi-class classification problem in machine learning, mainly involving classification models such as Naive Bayes, Maximum Entropy, and Support Vector Machines (SVM). Although such methods achieve good prediction to a certain extent, the extraction effect often depends on the labeling quality and scale of the training corpus, so excessive human input cannot be avoided, and the methods are difficult to adapt to the massive-data scenarios of practical applications. The most obvious advantage of unsupervised methods over supervised ones is that they greatly save labor cost; by algorithmic idea they can be divided into statistics-based, topic-model-based, and word-graph-based extraction methods. Statistics-based extraction generally measures the weight of candidate keywords with statistical indicators such as term frequency, term frequency-inverse document frequency (TF-IDF), and the χ² statistic; such methods are frequency-sensitive and easily miss some important low-frequency words. The most classical representative of topic-model-based extraction is the LDA (Latent Dirichlet Allocation) model, which, by analyzing a training corpus, infers the "document-topic" and "topic-term" probability distributions from the known "document-term" matrix; its extraction effect depends on the topic distribution characteristics of the training set.
Among word-graph-based extraction methods, the TextRank algorithm is the most widely applied. Its idea derives from Google's PageRank algorithm: the sentences or words of a text set form the nodes of a graph, the similarity between nodes (sentences or words) serves as the edge weight, and an iterative voting mechanism ranks the nodes of the graph model by importance. The method does not depend on the number of texts, but its limitation is that it considers only the internal information of a text and ignores the distribution characteristics of vocabulary across different texts. Mainstream methods at the present stage generally extract keywords by fusing the advantages of different methods according to the specific problem, but still suffer from defects such as lack of consideration of semantic features and poor recognition of low-frequency keywords.
Disclosure of Invention
In view of the above problems in the prior art, an object of the present invention is to provide an automatic keyword extraction method that can avoid the above technical drawbacks.
In order to achieve the above object, the present invention provides the following technical solutions:
an automatic keyword extraction method comprises the following steps: extracting common words in the technical standards; extracting candidate keywords; filtering the common words from the candidate keywords; calculating candidate keyword weight scores by integrating position features, word co-occurrence features, and context semantic features; calculating a dynamic threshold according to the range of the candidate keyword weight scores; and determining the result keywords using the dynamic threshold.
Further, the automatic keyword extraction method comprises the following steps:
step 1) removing text noise in the 3GPP technical standard;
step 2) extracting common words in the technical standard;
step 3) extracting candidate keywords and filtering common words based on the syntactic analysis tree;
and 4) comprehensively considering the position characteristics, word co-occurrence characteristics and context semantic characteristics of the candidate keywords in the document, calculating weight scores and sequencing, finally calculating a dynamic threshold according to the actual score range of the technical standard, and adding the candidate keywords with the scores exceeding the threshold into a result keyword set.
Further, the step 1) is specifically: parsing the 3GPP technical standard with Apache POI and removing the text noise in it.
Further, the text noise includes pictures, tables, formulas, special symbols, and illegal characters.
Further, step 2) comprises: extracting common words in the technical standards based on the word frequency-document distribution entropy, which is a measure of the uncertainty of the distribution of a word w over the technical standard set. Let the document set consisting of n technical standards be denoted D = {d_1, d_2, …, d_i, …, d_n}, and let the word frequency-document distribution entropy of the word w be H(w); then H(w) is calculated as
H(w) = −Σ_{i=1}^{n} P(w, d_i) · log P(w, d_i),
where P(w, d_i) is the probability that the word w occurs in technical standard d_i, 1 ≤ i ≤ n. According to the maximum likelihood estimation method, P(w, d_i) is calculated as
P(w, d_i) = f(w, d_i) / Σ_{j=1}^{n} f(w, d_j),
where f(w, d_i) is the number of occurrences of the word w in technical standard d_i.
Further, extracting candidate keywords based on the dependency parse tree includes:
Step 1: traverse the technical standard set D; split each technical standard d_i in D into sentences at punctuation marks, and denote the resulting sentence set as Sentences(d_i) = {s_1, s_2, …, s_ns}, 1 ≤ i ≤ n, where ns is the number of sentences in document d_i;
Step 2: perform dependency syntactic analysis on each sentence in Sentences(d_i) with the Stanford Parser to obtain the corresponding set of dependency parse trees Trees(d_i) = {T_1, T_2, …, T_ns}, where T_i denotes the parse tree corresponding to the i-th sentence of technical standard d_i;
Step 3: cyclically read the dependency parse tree set Trees(d_i); for any dependency parse tree T_i ∈ Trees(d_i), take each word together with its part of speech as a whole as a leaf node, and traverse T_i in order. If the current node is a leaf node, judge whether its part of speech is a noun, verb, or adjective; if the condition is met, add the node to the candidate keyword set, otherwise jump to the next node. If the current node is not a leaf node, judge whether it is a noun phrase; if so, continue recursively traversing its right subtree until the subtree contains no non-leaf node with a noun phrase as parent, at which point add the child nodes of the noun phrase to the candidate keyword set as a whole;
Step 4: further filter the candidate keyword set with the extracted common words: if an element of the candidate keyword set contains a common word, remove that element from the set.
Further, the position feature weight is calculated as follows: for the text parts corresponding to the different-level headings of the 3GPP technical standard, split sentence sets with punctuation marks as boundaries, and number the sentences in each set sequentially from 1. Denote the candidate keyword set of technical standard d_i as CK(d_i) = {ck_1, ck_2, …, ck_i, …, ck_n}, where ck_i is any candidate keyword in the set and n is the number of candidate keywords, and record the special position set as
SP = {Title, Scope, Reference, Definitions, Abbreviations, NOTE}.
Let locate(ck_i) denote the position where candidate keyword ck_i appears, and define the feature function Pos(ck_i) as the weight of ck_i in the occurrence-position dimension:
Pos(ck_i) = 1, if locate(ck_i) ∈ SP;
Pos(ck_i) = 1 − Sno_{ck_i} / (Snu_{ck_i} + len(ck_i)), otherwise,
where Sno_{ck_i} is the number of the sentence containing ck_i, Snu_{ck_i} is the number of sentences in its text paragraph, and len(ck_i) is the number of words ck_i contains. The weights of occurrences at different positions are averaged; denoting the average position weight as W(Pos(ck_i)),
W(Pos(ck_i)) = (1 / fre(ck_i)) · Σ Pos(ck_i),
where fre(ck_i) is the frequency of ck_i in the same technical standard and the sum runs over its occurrences.
Further, the word co-occurrence feature weight is calculated as follows:
let the candidate keyword sets of all technical standards be CK = {CK(d_1), CK(d_2), …, CK(d_i), …, CK(d_n)}. For any candidate keyword ck_i of technical standard d_i, denote its constituent words as cw_1, cw_2, …, cw_i, …, cw_m, where m is the number of words ck_i contains. Let the co-occurrence set of cw_i be Cooccur_i = {wco_1, wco_2, …, wco_j, …, wco_p}, where p is the size of the co-occurrence set, each wco_j is a co-occurrence of the word cw_i with wco_j ∈ CK(d_i), and wco_1 ∩ wco_2 ∩ … ∩ wco_j ∩ … ∩ wco_p = {cw_i}, 1 ≤ j ≤ p. The contribution of cw_i to candidate keyword ck_i in the co-occurrence dimension is expressed as
Coo(cw_i) = Σ_{j=1}^{p} fre(wco_j) / len(wco_j),
where fre(wco_j) is the frequency of occurrence of the co-occurrence wco_j and len(wco_j) is the number of words it contains. When candidate keyword ck_i comprises several words, its weight in the word co-occurrence dimension is calculated as
W(Coo(ck_i)) = (1/m) · Σ_{i=1}^{m} Coo(cw_i).
Further, the context semantic feature weight is calculated as follows:
the calculation task is decomposed into independently predicting, from the current word w, each word composing its context Context(w), maximizing the probability; the objective function is
L(θ) = Σ_{w ∈ D} Σ_{c_i ∈ Context(w)} log P(c_i | w),
where c_i ∈ Context(w), D is the technical standard corpus, and θ is the model parameter. The conditional probability P(c_i | w) is expressed as
P(c_i | w) = exp(v′_{c_i} · v_w) / Σ_{c′} exp(v′_{c′} · v_w),
where v′_{c_i} and v_w are the vectors of the words c_i and w respectively, c′ ranges over all non-repeating words in the corpus, and v′_{c′} is the vector of c′. Each technical standard d_i in the technical standard set D is viewed as composed of a series of words w_1 … w_i … w_n; assuming mutual independence between the words, for each candidate keyword ck_i of d_i, if it is of word type the prediction probability is calculated as
P(w_1 … w_i … w_n | ck_i) = Π_{i=1}^{n} P(w_i | ck_i);
if it is of phrase type, the same calculation is applied with the vector of ck_i composed from the vectors of its constituent words.
Taking the logarithm of both sides of the above formula, log P(w_1 … w_i … w_n | ck_i) on the left is used as the measure of candidate keyword ck_i in the semantic dimension, denoted W(Sem(ck_i)); log P(w_1 … w_i … w_n | ck_i) is approximated as log P(c_1 … c_i … c_n | ck_i), where w_1 … w_i … w_n is the context of ck_i within the model window, abbreviated Context(ck_i). Then W(Sem(ck_i)) is calculated as
W(Sem(ck_i)) = Σ_{c_i ∈ Context(ck_i)} log P(c_i | ck_i).
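The semantic score can be sketched with toy vectors standing in for trained Skip-gram embeddings. The softmax form of P(c|w) and the sum of log-probabilities over the context window follow the formulas above; the vector values and function names below are illustrative assumptions.

```python
import numpy as np

def context_log_prob(center_vec, context_vecs, vocab_vecs):
    """Semantic weight W(Sem(ck)) = sum over the context of log P(c_i | ck).

    center_vec: vector v_ck of the candidate keyword;
    context_vecs: vectors of the words in Context(ck);
    vocab_vecs: matrix whose rows are the output vectors v'_c of every
    distinct word c' in the corpus (the softmax normalizer).
    """
    scores = vocab_vecs @ center_vec          # v'_c . v_ck for every c'
    log_z = np.log(np.exp(scores).sum())      # log of the softmax denominator
    return float(sum(cv @ center_vec - log_z for cv in context_vecs))
```

Because each term is a log of a probability, the result is always negative; candidates whose embedding sits close to its context words score higher (less negative).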
Further, the step 4) comprises:
for any candidate keyword ck_i of technical standard d_i, comprehensively considering the position feature, the word co-occurrence feature and the context semantic feature, the weight score of ck_i over the three feature dimensions is calculated as
W(ck_i) = W(Pos(ck_i)) + W(Coo(ck_i)) + W(Sem(ck_i)).
Denote the scores corresponding to the candidate keywords ck_i of d_i as
Score(d_i) = {W(ck_1) … W(ck_i) … W(ck_n)}; the scores in Score(d_i) are ranked from high to low, and the dynamic threshold λ is set to the mean of all scores:
λ = (1/n) · Σ_{i=1}^{n} W(ck_i).
If a candidate keyword of d_i satisfies W(ck_i) ≥ λ, then ck_i is added to the result keyword set.
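The final ranking and dynamic-threshold selection of step 4) reduce to a few lines. The sketch below assumes the three feature weights have already been computed and summed into a single score per candidate; names are illustrative.

```python
def select_keywords(scores):
    """Keep candidates whose combined score W(ck) = W(Pos)+W(Coo)+W(Sem)
    is at or above the dynamic threshold λ, the mean of all scores.

    scores: dict mapping candidate keyword -> combined weight score.
    Returns the selected keywords ranked from high to low.
    """
    lam = sum(scores.values()) / len(scores)        # dynamic threshold λ
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [ck for ck in ranked if scores[ck] >= lam]
```

Because λ is recomputed per document from the actual score range, documents with uniformly low scores still yield keywords, unlike a fixed global cutoff.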
The automatic extraction method of the keywords, provided by the invention, integrates the position characteristics, the word co-occurrence characteristics and the context semantic characteristics to extract the keywords, comprehensively considers the weight influence of the internal positions of the documents and the context semantic characteristics on the keywords, achieves higher accuracy and recall rate, improves the retrieval quality of the 3GPP technical standard, reduces the labor cost and can well meet the requirements of practical application.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a dependency parse tree;
FIG. 3 is a comparison of the CBOW model and Skip-gram model frameworks.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the scope of protection of the present invention.
The invention provides an automatic keyword extraction method, which comprises the steps of firstly extracting common words in a 3GPP technical standard based on a word frequency-document distribution entropy method, then extracting candidate keywords based on an algorithm of a dependency syntax analysis tree, calculating a candidate keyword weight score by integrating position characteristics, word co-occurrence characteristics and context semantic characteristics after filtering the common words aiming at the candidate keywords, calculating a dynamic threshold according to a candidate keyword weight score range of each technical standard, and finally determining result keywords by using the dynamic threshold. Specifically, as shown in fig. 1, an automatic keyword extraction method includes the following steps:
step 1) preprocessing the 3GPP technical standard, which mainly comprises parsing the technical standard with Apache POI to remove text noise such as pictures, tables, formulas, special symbols, and illegal characters;
step 2) extracting common words in all technical standards based on the word frequency-document distribution entropy;
step 3) segmenting each technical standard into sentence subsets, performing dependency syntax analysis on each sentence, extracting candidate keywords based on a dependency syntax analysis tree and filtering common words;
and 4) comprehensively considering the position characteristics, word co-occurrence characteristics and context semantic characteristics of the candidate keywords in the document, calculating weight scores and sequencing, finally calculating a dynamic threshold according to the actual score range of the technical standard, and adding the candidate keywords with the scores exceeding the threshold into a result keyword set.
The 3GPP technical standards include not only simple stop words such as "if", "at", "not", "or", but also general words that run through most technical standards, such as "Figure", "version", "general", "given"; these are specific to the technical standards and lack representativeness and importance. It has been observed that both simple stop words and standard-specific general words appear with different frequencies across different versions and types of technical standards, are highly common, and generally cannot summarize or abstract the content of a specific technical standard. These words are collectively referred to as common words.
Clearly, coverage is not comprehensive enough if only manually compiled common stop-word lists are used. Therefore, to reduce the interference of common words with the keyword extraction task as far as possible, the concept of word frequency-document distribution entropy is introduced, combining the information entropy principle to obtain the technical-standard common words automatically. Information entropy was first introduced into information theory by Shannon to measure the uncertainty of a discrete random variable; the larger the entropy value, the greater the uncertainty of the corresponding random variable. Similarly, regarding the word w as a random variable, the definition of the word frequency-document distribution entropy is given as follows.
Definition 1 word frequency-document distribution entropy refers to a measure of uncertainty in the state of distribution of a word w in a set of technical standards.
Let the document set consisting of n technical standards be denoted D = {d_1, d_2, …, d_i, …, d_n}, and let the word frequency-document distribution entropy of the word w be H(w); then H(w) is calculated as shown in formula (1):
H(w) = −Σ_{i=1}^{n} P(w, d_i) · log P(w, d_i),   (1)
where P(w, d_i) is the probability that the word w occurs in technical standard d_i, 1 ≤ i ≤ n. According to the maximum likelihood estimation method, P(w, d_i) can be calculated from formula (2):
P(w, d_i) = f(w, d_i) / Σ_{j=1}^{n} f(w, d_j),   (2)
where f(w, d_i) is the number of occurrences of the word w in technical standard d_i. It can be seen that the more technical standards contain w and the more uniformly w is distributed over the technical standard set, the larger the word frequency-document distribution entropy H(w), indicating greater uncertainty of the distribution of w in the technical standard set D; hence w is more likely a generic word of no significance in the set.
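Formulas (1) and (2) can be computed directly over a tokenized corpus. The sketch below uses a base-2 logarithm and toy data, both assumptions not fixed by the text; names are illustrative.

```python
import math

def distribution_entropy(word, documents):
    """Word frequency-document distribution entropy H(w), formulas (1)-(2).

    documents: one token list per technical standard.
    """
    counts = [doc.count(word) for doc in documents]   # f(w, d_i)
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for f in counts:
        if f > 0:
            p = f / total                # maximum-likelihood P(w, d_i)
            entropy -= p * math.log2(p)  # formula (1)
    return entropy

# A word spread evenly over the corpus gets high entropy (likely generic);
# a word concentrated in one document gets low entropy (likely meaningful).
docs = [["rrc", "figure"], ["mac", "figure"], ["rlc", "figure"]]
```

On this toy corpus, "figure" scores log2(3) ≈ 1.58 while "rrc" scores 0, so "figure" would be flagged as a common word first.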
Statistics show that most keywords are content-word phrases such as nouns, verbs, and adjectives, and generally contain neither stop words without practical meaning nor general words distributed uniformly across the technical standards. Thus, after removal of the common words, the keyword categories are defined as verbs, adjectives, nouns, and noun phrases. To extract candidate keywords with semantic consistency and syntactic-modification integrity, dependency syntactic analysis is performed on the sentences of the 3GPP technical standard, and noun phrases, verbs, adjectives, and nouns satisfying syntactic-modification consistency are extracted with the help of the dependency parse tree and added to the candidate keyword set. For noun phrases, the minimum-granularity NP in the parse tree is taken as the candidate keyword. Finally, the general words are filtered from the candidate keyword set. For example, the parse of the sentence "Logical channels are SAPs between MAC and RLC" is shown in FIG. 2.
As can be seen from FIG. 2, the adjective "logical" modifies the noun "channels", and "logical channels" forms a noun phrase (NP); "SAPs" and "MAC and RLC" are noun phrases, and "SAPs between MAC and RLC" is also a noun phrase as a whole. In the dependency parse tree, however, the noun phrase "MAC and RLC" and "between" form a prepositional phrase (PP) that, together with "SAPs", is a child of the NP, the two being sibling nodes. Clearly, the noun phrase "MAC and RLC" has smaller granularity than "SAPs between MAC and RLC". Therefore, "logical", "channels", "logical channels", "are", "SAPs", "MAC", "RLC", and "MAC and RLC" are selected as candidate keywords of the example sentence, and are then filtered using the extracted common words. According to the above analysis, the candidate keyword extraction algorithm based on the dependency parse tree comprises the following steps:
step 1: traversing skillSet of technical criteria D, for each technical criterion D in DiDividing into sentences according to punctuations, and representing the divided sentence sets as
1≤i≤ns,nsAs a document diThe number of Chinese sentences.
Step 2: for set sequences (d)i) Each sentence in the tree is subjected to dependency syntax analysis by using a Stanford Parser syntax analyzer to obtain a corresponding dependency syntax analysis tree set Trees (d)i) Memory for recordingWherein T isiIndicates technical standard diThe ith sentence in the sentence is corresponding to the stored syntax parse tree.
Step 3: cyclic read dependency parse tree set Trees (d)i) For any dependency syntax tree Ti∈Trees(di) Taking the words and corresponding parts of speech in the syntactic dependency tree as a whole as leaf nodes, and traversing the T in a medium-order and orderly modeiIf the current node is a leaf node (not the last leaf node), judging whether the part of speech of the node is a noun, a verb and an adjective, adding the node into the candidate keyword set if the conditions are met, and otherwise, jumping to the next node; if the current node is not a leaf node, judging whether the current node is a Noun Phrase (NP) or not, if the current node is the noun phrase and the right subtree is not empty, continuing to recursively traverse the right subtree of the current node until no non-leaf node taking the NP as a parent node exists in the subtree, and adding the child nodes of the NP into the candidate keyword set as a whole.
Step 4: since some technical standard common words still exist in the candidate keywords extracted in the previous step, the candidate keyword set needs to be further filtered by using the extracted common words, and if an element containing the common words exists in the candidate keyword set, the element is removed from the candidate keyword set.
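A minimal runnable sketch of this traversal follows, with a hand-built tuple tree standing in for the Stanford Parser output and an illustrative POS-tag set; both, and the simplification of keeping only NPs whose children are all leaves as the minimum-granularity case, are assumptions.

```python
# A node is (label, children) for phrases and (word, pos_tag) at the leaves.
KEY_POS = {"NN", "NNS", "NNP", "VB", "VBP", "VBZ", "JJ"}  # nouns/verbs/adjectives

def extract_candidates(node, out):
    """In-order traversal keeping real-word leaves and minimal-granularity NPs."""
    label, children = node
    if label == "NP" and all(isinstance(c[1], str) for c in children):
        # all children are leaves: smallest-granularity noun phrase, kept whole
        out.append(" ".join(word for word, _ in children))
        return
    for child in children:
        if isinstance(child[1], str):        # leaf: (word, POS tag)
            if child[1] in KEY_POS:
                out.append(child[0])
        else:                                # phrase node: recurse
            extract_candidates(child, out)

# Hand-parsed fragment of the FIG. 2 example sentence
tree = ("S", [("NP", [("Logical", "JJ"), ("channels", "NNS")]),
              ("VP", [("are", "VBP"), ("NP", [("SAPs", "NNS")])])])
```

Running it on the fragment yields the phrase "Logical channels" plus the content words "are" and "SAPs", matching the candidates discussed above for the example sentence.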
By analyzing the characteristics of the 3GPP technical standard, it can be found that, apart from the body text, the Scope, Reference, Definitions, and Abbreviations parts have important reference value for the whole document and should be considered key positions. The content of each chapter of the body usually expands around its nearest heading, so a heading can be regarded as a condensation of the core content of the corresponding paragraph, and candidate keywords appearing at that position should be given higher weight. Similarly, content appearing in NOTEs generally serves as additional emphasis or supplementary description, and should therefore also be treated as a special position.
Therefore, the position at which a candidate keyword appears in a 3GPP technical standard is taken as a weight influence factor. For the text corresponding to each title level of the standard, sentence subsets are divided with punctuation marks as boundaries and the sentences in each set are numbered sequentially from 1; the smaller the number of the sentence containing a candidate keyword, the closer the candidate is to the title and the more likely it is to be a keyword that closely reflects the theme. For technical standard d_i, record the candidate keyword set CK(d_i) = {ck_1, ck_2, …, ck_i, …, ck_n}, where ck_i is any candidate keyword in the set and n is the number of candidate keywords, and record the special position set as
SP = {Title, Scope, Reference, Definitions, Abbreviations, NOTE},
where locate(ck_i) denotes the position at which candidate keyword ck_i appears. A characteristic function Pos(ck_i) is defined to represent the weight of ck_i in the position dimension; Pos(ck_i) can then be expressed as shown in equation (3).
Here Sno_ck_i denotes the number of the sentence in which candidate keyword ck_i appears, Snu_ck_i the number of sentences in the text paragraph, and len(ck_i) the number of words ck_i contains; len(ck_i) is added to the denominator to prevent the position weight from reaching 0. Since candidate keyword ck_i may occur multiple times at different positions in technical standard d_i, the weights at the different positions are averaged; W(Pos(ck_i)) denotes the average position weight, calculated as shown in equation (4).
Here fre(ck_i) denotes the frequency with which candidate keyword ck_i occurs within the same technical standard. Averaging strengthens the weight of candidate keywords that occur infrequently but appear in special positions, and weakens the bias introduced by weighting candidate keywords on frequency alone.
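Since equations (3) and (4) are figures in the original patent and are not reproduced here, the following is only a plausible reconstruction of the position weight from the surrounding description: special positions get weight 1, earlier sentences in a paragraph score higher, len(ck_i) in the denominator keeps the weight above 0, and the weights of a candidate's occurrences are averaged. The exact functional form is an assumption.

```python
# Plausible reconstruction of the position weight (the exact forms of
# equations (3)-(4) are assumptions): special positions score 1.0, and in
# body text a smaller sentence number (closer to the title) scores higher.

SPECIAL = {"Title", "Scope", "Reference", "Definitions", "Abbreviations", "NOTE"}

def pos_weight(location, sno, snu, length):
    if location in SPECIAL:
        return 1.0
    # len(ck) in the denominator prevents the weight from reaching 0
    return (snu - sno + 1) / (snu + length)

def avg_pos_weight(occurrences, length):
    # occurrences: (location, sentence_no, sentences_in_paragraph) per occurrence,
    # averaged over the fre(ck) occurrences as in equation (4)
    weights = [pos_weight(loc, sno, snu, length) for loc, sno, snu in occurrences]
    return sum(weights) / len(weights)

# a 2-word candidate seen once in body text (sentence 1 of 5) and once in a NOTE
occ = [("body", 1, 5), ("NOTE", 0, 0)]
print(round(avg_pos_weight(occ, 2), 3))
```

Averaging lets the NOTE occurrence lift the overall position weight even though the candidate is infrequent, matching the motivation stated above.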
Word co-occurrence is a factor that cannot be neglected in keyword extraction. Observing the candidate keywords extracted from the 3GPP technical standards shows that the constituent words of one candidate repeatedly appear in other candidates of different lengths. For example, among the three candidate keywords "MCH", "MCH transmission" and "MCH subframe allocation", the word "MCH" occurs in the two other candidates of different lengths, so "MCH transmission" and "MCH subframe allocation" may be regarded as co-occurring words of "MCH"; such co-occurring words often express more specific information than the individual constituent word. Therefore, if a word constituting a candidate keyword has many co-occurring words, that word is considered to carry richer meaning and should be given higher weight. Accordingly, the co-occurring word frequencies and word lengths of a candidate keyword's constituent words are used as the word co-occurrence feature when calculating the candidate's weight.
Record the candidate keyword sets of all technical standards as CK = {CK(d_1), CK(d_2), …, CK(d_i), …, CK(d_n)}. For any candidate keyword ck_i of technical standard d_i, record its constituent words as cw_1, cw_2, …, cw_i, …, cw_m, where m is the number of words ck_i contains. Let the co-occurring word set of cw_i be cooccur_i = {wco_1, wco_2, …, wco_i, …, wco_p}, where p is the size of the co-occurring word set, wco_j denotes a co-occurring word of cw_i with wco_j ∈ CK(d_i), and wco_1 ∩ wco_2 ∩ … ∩ wco_j ∩ … ∩ wco_p = {cw_i}, 1 ≤ j ≤ p. The contribution of cw_i to candidate keyword ck_i can then be expressed by equation (5);
Here fre(wco_j) denotes the frequency of occurrence of wco_j, a co-occurring word of cw_i, and len(wco_j) the number of words wco_j contains. When candidate keyword ck_i contains multiple words, its weight in the word co-occurrence dimension is calculated as shown in equation (6).
It can be seen that the more frequently the constituent words of candidate keyword ck_i occur among co-occurring words, the greater each constituent word's contribution to ck_i, and hence the greater the weight of ck_i in the word co-occurrence dimension.
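Equations (5) and (6) are likewise not reproduced in this text, so the sketch below assumes one plausible form consistent with the description: each constituent word contributes the frequency-weighted lengths of its co-occurring words, and the candidate's co-occurrence weight averages the contributions over its m constituent words.

```python
# Hedged sketch of the word co-occurrence weight; the exact forms of
# equations (5)-(6) are assumptions based on the surrounding description.

def contribution(cooccurring):
    # cooccurring: list of (fre(wco_j), len(wco_j)) for the co-occurring
    # words of one constituent word cw_i  -- assumed form of equation (5)
    return sum(fre * length for fre, length in cooccurring)

def coo_weight(per_word_cooccurrence):
    # average the contributions over the m constituent words of ck_i
    # -- assumed form of equation (6)
    contribs = [contribution(c) for c in per_word_cooccurrence]
    return sum(contribs) / len(contribs)

# candidate "MCH": co-occurs in "MCH transmission" (3 times, 2 words)
# and "MCH subframe allocation" (2 times, 3 words)
print(coo_weight([[(3, 2), (2, 3)]]))
```

Longer and more frequent co-occurring words raise the constituent word's contribution, which is the behavior the paragraph above motivates.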
Keywords generally highly condense the core content of technical standards, and often have the commonality of collectively embodying the gist of technical standards from different semantic levels. Therefore, the influence of semantic features of the candidate keywords in the context on the weights cannot be ignored. Considering that the Word vector can well represent semantic characteristics, Word2vec is introduced to calculate the weight of the candidate keywords in the dimension of the semantic characteristics.
Word2vec is a tool released by Google, based on deep learning, that addresses problems such as poor model generalization and the curse of dimensionality in statistical language model computation. Word2vec comprises two training models, CBOW and Skip_gram; to reduce the complexity of model solving, two training optimization methods, Hierarchical Softmax (HS) and Negative Sampling (NS), are provided, and training frameworks are formed by combining the models with the optimization methods. As shown in fig. 3, the frameworks formed by the two models share an Input Layer, a Projection Layer and an Output Layer; the difference is that the CBOW-based framework predicts the current word w from the semantic context in which words appear, while the Skip_gram-based framework predicts the context semantic information from the current word w.
To predict context(w) (with window c) from the current word w, the Skip_gram model decomposes the calculation into independently maximizing the probability of predicting each word of context(w) from w; the objective function is to maximize, over the model parameters θ, the sum of log P(c_i | w) for all words w in the corpus and all c_i in context(w),
where c_i ∈ context(w), D is the technical standard corpus, and θ is the model parameter; the conditional probability P(c_i | w) is expressed by Softmax normalization, as shown in equation (7);
Here v′_{c_i} and v_w are the vector representations of the words c_i and w respectively, and c′ ranges over all distinct words in the corpus; since their number is large, Hierarchical Softmax or Negative Sampling optimization can be adopted; v_{c′} is the vector representation of c′. Each technical standard d_i in the technical standard set D is viewed as a sequence of words w_1 … w_i … w_n. Assuming independence between words, for each candidate keyword ck_i of d_i: if it is of word type, the prediction probability is calculated with formula (8); if it is of phrase type, with formula (9);
Here P(w_j | ck_i) is computed by equation (7) after variable substitution. It can be seen that the larger the prediction probability P(w_1…w_i…w_n | ck_i), the better candidate keyword ck_i predicts the context information and the more likely it is a keyword characterizing the full text. To avoid the extremely small values that arise when many small conditional probabilities are multiplied, the logarithm of both sides is taken, and log P(w_1…w_i…w_n | ck_i) on the left is used as the measure of ck_i's weight in the semantic dimension, denoted W(Sem(ck_i)). Meanwhile, since Word2vec training relates similar words, to simplify calculation log P(w_1…w_i…w_n | ck_i) is approximated as log P(c_1…c_i…c_n | ck_i), where w_1…w_i…w_n is the context of ck_i within the scope of the model window, abbreviated Context(ck_i). W(Sem(ck_i)) is then calculated as shown in formula (10);
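The chain from equation (7) to formula (10) can be illustrated with a toy numeric example: a softmax conditional probability P(c | w) computed from word vectors, and the semantic weight as the summed log-probability of the context given the candidate. The vectors below are random stand-ins, not a trained Word2vec model, and the tiny vocabulary is an assumption for illustration only.

```python
# Toy illustration of equations (7)-(10): softmax P(c|w) over a tiny
# vocabulary, and W(Sem(ck)) as the sum of log-probabilities of the
# context words given the candidate (independence assumption + log).
import math, random

random.seed(0)
vocab = ["mch", "transmission", "subframe", "allocation", "channel"]
dim = 4
v_in  = {w: [random.gauss(0, 1) for _ in range(dim)] for w in vocab}  # v_w
v_out = {w: [random.gauss(0, 1) for _ in range(dim)] for w in vocab}  # v'_c

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def p_cond(c, w):
    # equation (7): softmax normalization over all words c' in the vocabulary
    scores = {cp: math.exp(dot(v_out[cp], v_in[w])) for cp in vocab}
    return scores[c] / sum(scores.values())

def sem_weight(candidate, context):
    # equations (8)-(10): independence turns the joint probability into a
    # product; taking the log turns it into a sum and avoids underflow
    return sum(math.log(p_cond(c, candidate)) for c in context)

w = sem_weight("mch", ["transmission", "subframe", "allocation"])
print(w < 0)   # log-probabilities are always negative
```

Candidates whose vectors predict their context well get a less negative (larger) W(Sem), which is exactly the ordering the weight needs.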
For any candidate keyword ck_i of technical standard d_i, the position feature, word co-occurrence feature and context semantic feature are considered together, and formula (11) is used to calculate the weight score of ck_i across the three feature dimensions.
W(ck_i) = W(Pos(ck_i)) + W(Coo(ck_i)) + W(Sem(ck_i))   (11).
Fusing three different features avoids the impact that the insufficiency of any single feature factor would have on keyword extraction. Record the score corresponding to each candidate keyword ck_i of d_i as Score(d_i) = {W(ck_1) … W(ck_i) … W(ck_n)}. The scores in Score(d_i) are ranked from high to low, and a dynamic threshold λ is set as the average of all the scores, calculated as shown in formula (12);
If a candidate keyword in d_i satisfies W(ck_i) ≥ λ, then ck_i is added to the result keyword set. A fixed threshold is not chosen because technical standards differ in length and the candidate keyword score ranges they yield differ; a dynamic threshold therefore adapts to the actual score range of each individual technical standard.
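The fusion and thresholding step can be sketched directly. The feature values below are made-up numbers for illustration; only the combination rule (formula (11)) and the mean-valued dynamic threshold (formula (12)) come from the text.

```python
# Sketch of formulas (11)-(12): sum the three feature weights per candidate,
# then keep candidates whose score reaches the dynamic threshold (the mean).

def final_scores(features):
    # features: {candidate: (w_pos, w_coo, w_sem)} -- formula (11)
    return {ck: sum(ws) for ck, ws in features.items()}

def select_keywords(features):
    scores = final_scores(features)
    lam = sum(scores.values()) / len(scores)          # formula (12)
    return sorted([ck for ck, s in scores.items() if s >= lam],
                  key=scores.get, reverse=True)

# made-up feature weights for three candidates of one technical standard
features = {
    "MCH subframe allocation": (1.0, 0.9, -2.0),
    "MCH transmission":        (0.5, 0.6, -3.0),
    "further":                 (0.1, 0.0, -6.0),
}
print(select_keywords(features))
```

Because λ is the mean of this document's own scores, the cut-off adapts to each standard's score range, which is the stated reason for avoiding a fixed threshold.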
The method was tested experimentally. The experimental data were selected from the technical standards (including technical specifications and technical reports) on the 3GPP website as of 2016; after de-noising, 8000 items of experimental data were obtained in total. The valid series numbers of the technical standards range over 01-12, 21-38, 41-46, 48-52 and 55, 42 series in total; each series contains multiple versions, 14 GB in all, and each technical standard consists of Cover, Foreword, Scope, Reference, Definitions and Abbreviations, topic body and Annex parts.
In the experiment, three evaluation indexes commonly used in natural language processing tasks, namely precision (P), recall (R) and F-value (F-Score), were adopted to evaluate the keyword extraction effect; their calculation methods are shown in formulas (13) to (15), respectively.
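Formulas (13) to (15) are figures in the original, but P, R and F-Score have standard definitions over the identified and reference keyword sets, sketched below; the sample keyword sets are invented for illustration.

```python
# Standard definitions behind formulas (13)-(15): precision, recall and
# F-value computed from the identified and reference keyword sets.

def prf(identified, reference):
    correct = len(identified & reference)            # correctly identified
    p = correct / len(identified) if identified else 0.0
    r = correct / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0        # harmonic mean of P and R
    return p, r, f

ident = {"mch", "subframe allocation", "harq"}       # made-up identified set
ref = {"mch", "subframe allocation", "rrc", "paging"}  # made-up reference set
p, r, f = prf(ident, ref)
print(round(p, 3), round(r, 3), round(f, 3))
```

In the paper's setup the sets would be compared after lemmatization, with abbreviation/full-name pairs also counted as matches.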
Technical standard common words were extracted from the preprocessed technical standards using the method based on term frequency-document distribution entropy. Through repeated experiments, the optimal entropy threshold was found to be 5.42; words whose entropy exceeds this threshold were selected as technical standard common words, yielding 13566 common words in total. Part of the common word extraction results are shown in table 2.
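The entropy itself is defined in claim 4: P(w, d_i) is the maximum-likelihood estimate f(w, d_i) / Σ_j f(w, d_j), and H(w) is the Shannon entropy of that distribution. The formula images are not reproduced, so the sketch below assumes the standard form with a base-2 logarithm (plausible given the table 2 values of roughly 9-11 over 8000 documents, since log2(8000) ≈ 12.97).

```python
# Sketch of the term frequency-document distribution entropy from claim 4,
# assuming H(w) = -sum_i P(w,d_i) * log2 P(w,d_i) with
# P(w,d_i) = f(w,d_i) / sum_j f(w,d_j) (maximum likelihood).
import math

def distribution_entropy(counts):
    # counts: occurrences f(w, d_i) of word w in each technical standard
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# a word spread evenly over many documents has high entropy (common word);
# a word concentrated in one document has entropy 0 (likely topical)
print(round(distribution_entropy([1] * 8), 3))    # evenly spread over 8 docs
print(distribution_entropy([8, 0, 0]) == 0.0)     # concentrated in one doc
```

Words with entropy above the tuned threshold (5.42 in the experiments) are distributed so uniformly across standards that they carry little discriminative content, which is why they are treated as common words.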
Table 2 partial common word extraction results
Serial number | Common words | H(W) | Serial number | Common words | H(W) |
1 | version | 10.9665 | 11 | all | 9.9539 |
2 | should | 10.8165 | 12 | possible | 9.8908 |
3 | latest | 10.7022 | 13 | foreword | 9.8543 |
4 | approve | 10.6394 | 14 | through | 9.8097 |
5 | specification | 10.5639 | 15 | modify | 9.7739 |
6 | update | 10.4934 | 16 | restriction | 9.6978 |
7 | present | 10.2963 | 17 | this | 9.6536 |
8 | within | 10.1056 | 18 | available | 9.6281 |
9 | be | 10.0572 | 19 | release | 9.5941 |
10 | further | 10.0188 | 20 | when | 9.5148 |
As can be seen from table 2, the algorithm based on term frequency-document distribution entropy extracts not only common stop words such as "all", "this" and "when", but also words common across the technical standards, for example "version", "specification" and "release". With this method, most technical standard common words can be obtained effectively.
After filtering the candidate keyword set of each technical standard with the common word list, the weights corresponding to the position feature, word co-occurrence feature and context semantic feature were calculated respectively. For the context semantic feature, the Skip-gram model in Word2vec with the Hierarchical Softmax optimization method was selected to train on the 14 GB of technical standards; the context window was set to 10, the vector dimension to 200, and a 965.1 MB model file was obtained after 10 iterations. To analyze the effect of different features on technical standard keyword extraction, the comparison feature combinations set in the experiment are shown in table 3.
TABLE 3 combination of features
Combining formulas (3) to (11), the scores of each technical standard's candidate keywords under the different feature combinations were calculated, the dynamic threshold was computed with formula (12), and the qualifying candidates were screened out as the identified keywords. Meanwhile, 1000 technical standards spanning different series and versions were randomly drawn from the 8000, and 2, 4, 6, 8 and 10 keywords were selected from each as reference keyword sets, taking the intersection of three annotators' cross labeling. The identified keywords and the manually labeled reference sets were each lemmatized and then compared; an identified keyword was counted as correct if it had the same form as a labeled keyword or the two were abbreviation and full name of each other. The precision, recall and F-value of each feature combination under the different keyword counts were recorded, and the experimental results are shown in table 4.
TABLE 4 keyword extraction results under different feature combinations
As can be seen from table 4, when the number of keywords is 2, the recall rates of Feature1, Feature4, Feature5 and Feature7 are higher than those of the other feature combinations. This is because, when few keywords are taken, candidate keywords appearing at special positions are more likely to be correctly recognized as keywords; at the same time, words in special positions provide little context semantic information, so the position feature of keywords appearing in the technical standard is relatively dominant. As the number of keywords grows beyond 2, comparing Feature1 to Feature3: the recall of Feature1 rises slowly and then gradually falls; Feature2 notably improves precision and recall when the number of keywords is 4 to 8, after which precision falls somewhat; and once the number of keywords exceeds 6, Feature3 increases recall. The reason is that the influence of position on keyword weight gradually declines as the number of keywords increases, while the influence of the word co-occurrence and context semantic features gradually grows. Meanwhile, comparing Feature5 with Feature7 shows that precision and recall both rise after the word co-occurrence feature is added. This is because word co-occurrence helps identify more phrase-type keywords, which often correspond to abbreviated keywords that carry a certain general meaning but enjoy no positional advantage; as the number of keywords increases, the keywords identified through the word co-occurrence feature are more likely to fall within the reference keyword set. Comparing Feature4 with Feature7, recall increases markedly from 4 keywords onward once the context semantic feature is added.
The reason is that, as the number of keywords increases, candidate keywords characterized by rich context semantic information are also more likely to be selected as keywords. For the same number of keywords, comparing Feature1, Feature2, Feature3 and Feature7 shows that Feature7, combining the advantages of the different features, achieves a better recognition effect than any single feature.
The automatic keyword extraction method provided by the invention fuses the position feature, word co-occurrence feature and context semantic feature to extract keywords, comprehensively considering the influence of position within the document and of context semantics on keyword weight. It achieves high precision and recall, improves the retrieval quality of the 3GPP technical standards, reduces labor cost, and meets the needs of practical application well.
The above-mentioned embodiments only express the embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.
Claims (10)
1. An automatic keyword extraction method is characterized by comprising the following steps: extracting common words, extracting candidate keywords, filtering the common words aiming at the candidate keywords, calculating the weight score of the candidate keywords by integrating the position characteristics, the word co-occurrence characteristics and the context semantic characteristics, calculating a dynamic threshold according to the weight score range of the candidate keywords, and determining result keywords by utilizing the dynamic threshold.
2. The method for automatically extracting keywords according to claim 1, wherein the method for automatically extracting keywords comprises:
step 1) removing text noise in the 3GPP technical standard;
step 2) extracting common words in the technical standard;
step 3) extracting candidate keywords and filtering common words based on the syntactic analysis tree;
and 4) comprehensively considering the position characteristics, word co-occurrence characteristics and context semantic characteristics of the candidate keywords in the document, calculating weight scores and sequencing, finally calculating a dynamic threshold according to the actual score range of the technical standard, and adding the candidate keywords with the scores exceeding the threshold into a result keyword set.
3. The method for automatically extracting keywords according to claim 1, wherein the step 1) is specifically: parsing the technical standard with Apache POI and removing the text noise in the 3GPP technical standard.
4. The automatic keyword extraction method according to claims 1 to 3, wherein the step 2) comprises: extracting common words in the technical standards based on term frequency-document distribution entropy, where the term frequency-document distribution entropy is an uncertainty measure of the distribution state of a word w over the technical standard set; let the document set consisting of n technical standards be denoted D = {d_1, d_2, …, d_i, …, d_n}, and record the term frequency-document distribution entropy of word w as H(w); then H(w) is calculated by the formula
where P(w, d_i) is the probability that the word w occurs in technical standard d_i, 1 ≤ i ≤ n; by the maximum likelihood estimation method, P(w, d_i) is calculated by the formula
where f(w, d_i) is the number of occurrences of the word w in technical standard d_i.
5. The method for automatically extracting keywords according to claims 1-4, wherein extracting candidate keywords based on the dependency syntax analysis tree comprises:
step 1: traversing the technical standard set D, for each technical standard D in DiDividing into sentences according to punctuations, and representing the divided sentence sets as
1≤i≤ns,nsAs a document diThe number of Chinese sentences;
Step 2: perform dependency syntax analysis on each sentence in Sentences(d_i) using the Stanford Parser, obtaining the corresponding dependency parse tree set Trees(d_i), where T_i denotes the dependency parse tree corresponding to the i-th sentence of technical standard d_i;
Step 3: cyclically read the dependency parse tree set Trees(d_i); for any dependency parse tree T_i ∈ Trees(d_i), take each word together with its part of speech as a whole as a leaf node and traverse T_i in in-order; if the current node is a leaf node, judge whether its part of speech is a noun, verb or adjective, and if so add the node to the candidate keyword set, otherwise jump to the next node; if the current node is not a leaf node, judge whether it is a noun phrase, and if so continue recursively traversing its right subtree until the subtree contains no non-leaf node with a noun phrase as parent, at which point add the child nodes of the noun phrase to the candidate keyword set as a whole;
step 4: and further filtering the candidate keyword set by using the extracted general words, and removing an element containing the general words from the candidate keyword set if the element exists in the candidate keyword set.
6. The method for automatically extracting keywords according to claims 1 to 5, wherein the position feature weight calculation method comprises the following steps: for the text corresponding to each title level of the 3GPP technical standard, dividing sentence subsets with punctuation marks as boundaries and numbering the sentences in each set sequentially from 1; for technical standard d_i, recording the candidate keyword set CK(d_i) = {ck_1, ck_2, …, ck_i, …, ck_n}, where ck_i is any candidate keyword in the set and n is the number of candidate keywords, and recording the special position set as
SP = {Title, Scope, Reference, Definitions, Abbreviations, NOTE},
where locate(ck_i) denotes the position at which candidate keyword ck_i appears; a characteristic function Pos(ck_i) is defined to represent the weight of ck_i in the position dimension, given by
where Sno_ck_i denotes the number of the sentence in which candidate keyword ck_i appears, Snu_ck_i the number of sentences in the text paragraph, and len(ck_i) the number of words ck_i contains; the weights at different positions are averaged, denoted W(Pos(ck_i)), the average position weight, given by
where fre(ck_i) denotes the frequency with which candidate keyword ck_i occurs in the same technical standard.
7. The method for automatically extracting keywords according to claims 1-6, wherein the method for calculating the weight of the co-occurrence features of the words comprises the following steps:
the candidate keyword sets of all technical standards are recorded as CK = {CK(d_1), CK(d_2), …, CK(d_i), …, CK(d_n)}; for any candidate keyword ck_i of technical standard d_i, its constituent words are recorded as cw_1, cw_2, …, cw_i, …, cw_m, with m the number of words ck_i contains; let the co-occurring word set of cw_i be cooccur_i = {wco_1, wco_2, …, wco_i, …, wco_p}, with p the size of the co-occurring word set, where wco_j denotes a co-occurring word of cw_i, wco_j ∈ CK(d_i), and wco_1 ∩ wco_2 ∩ … ∩ wco_j ∩ … ∩ wco_p = {cw_i}, 1 ≤ j ≤ p; the contribution of cw_i to candidate keyword ck_i is then expressed as
where fre(wco_j) denotes the frequency of occurrence of wco_j, a co-occurring word of cw_i, and len(wco_j) the number of words wco_j contains; when candidate keyword ck_i contains multiple words, the weight of ck_i in the word co-occurrence dimension is calculated by the formula
8. The method for automatically extracting keywords according to claims 1-7, wherein the context semantic feature weight calculation method comprises:
the calculation task is decomposed into independently maximizing the probability of predicting each word of context(w) from the current word w, with the objective function
where c_i ∈ context(w), D is the technical standard corpus, and θ is the model parameter; the conditional probability P(c_i | w) is expressed as
where v′_{c_i} and v_w are the vector representations of the words c_i and w respectively, c′ ranges over all distinct words in the corpus, and v_{c′} is the vector representation of c′; each technical standard d_i in the technical standard set D is viewed as a sequence of words w_1 … w_i … w_n; assuming independence between words, for each candidate keyword ck_i of d_i, if it is of word type, the formula for calculating the prediction probability is
and if it is of phrase type, the calculation formula is
after taking the logarithm of both sides of the above formula, log P(w_1…w_i…w_n | ck_i) on the left is taken as the measure of candidate keyword ck_i's weight in the semantic dimension, denoted W(Sem(ck_i)); log P(w_1…w_i…w_n | ck_i) is approximated as log P(c_1…c_i…c_n | ck_i), where w_1…w_i…w_n is the context of ck_i within the scope of the model window, abbreviated Context(ck_i); W(Sem(ck_i)) is then calculated as
9. The automatic keyword extraction method according to claims 1 to 8, wherein the step 4) comprises:
for any candidate keyword ck_i of technical standard d_i, the position feature, word co-occurrence feature and context semantic feature are considered together, and the weight score of ck_i across the three feature dimensions is calculated by the formula
W(ck_i) = W(Pos(ck_i)) + W(Coo(ck_i)) + W(Sem(ck_i));
record the score corresponding to each candidate keyword ck_i of d_i as Score(d_i) = {W(ck_1) … W(ck_i) … W(ck_n)}; rank the scores in Score(d_i) from high to low and set a dynamic threshold λ as the average of all the scores, calculated by the formula
if a candidate keyword in d_i satisfies W(ck_i) ≥ λ, then ck_i is added to the result keyword set.
10. The method for automatically extracting keywords according to claims 1-9, wherein the text noise comprises pictures, tables, formulas, special symbols and illegal characters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810611476.7A CN108920456B (en) | 2018-06-13 | 2018-06-13 | Automatic keyword extraction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108920456A true CN108920456A (en) | 2018-11-30 |
CN108920456B CN108920456B (en) | 2022-08-30 |
Family
ID=64419617
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810611476.7A Active CN108920456B (en) | 2018-06-13 | 2018-06-13 | Automatic keyword extraction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108920456B (en) |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109614626A (en) * | 2018-12-21 | 2019-04-12 | 北京信息科技大学 | Keyword Automatic method based on gravitational model |
CN109960724A (en) * | 2019-03-13 | 2019-07-02 | 北京工业大学 | A kind of text snippet method based on TF-IDF |
CN110134767A (en) * | 2019-05-10 | 2019-08-16 | 云知声(上海)智能科技有限公司 | A kind of screening technique of vocabulary |
CN110147425A (en) * | 2019-05-22 | 2019-08-20 | 华泰期货有限公司 | A kind of keyword extracting method, device, computer equipment and storage medium |
CN110377724A (en) * | 2019-07-01 | 2019-10-25 | 厦门美域中央信息科技有限公司 | A kind of corpus keyword Automatic algorithm based on data mining |
CN111435405A (en) * | 2019-01-15 | 2020-07-21 | 北京行数通科技有限公司 | Method and device for automatically labeling key sentences of article |
CN111552786A (en) * | 2020-04-16 | 2020-08-18 | 重庆大学 | Question-answering working method based on keyword extraction |
CN111597793A (en) * | 2020-04-20 | 2020-08-28 | 中山大学 | Paper innovation measuring method based on SAO-ADV structure |
CN111680509A (en) * | 2020-06-10 | 2020-09-18 | 四川九洲电器集团有限责任公司 | Method and device for automatically extracting text keywords based on co-occurrence language network |
CN111985217A (en) * | 2020-09-09 | 2020-11-24 | 吉林大学 | Keyword extraction method and computing device |
CN112988951A (en) * | 2021-03-16 | 2021-06-18 | 福州数据技术研究院有限公司 | Scientific research project review expert accurate recommendation method and storage device |
CN113191145A (en) * | 2021-05-21 | 2021-07-30 | 百度在线网络技术(北京)有限公司 | Keyword processing method and device, electronic equipment and medium |
CN113657113A (en) * | 2021-08-24 | 2021-11-16 | 北京字跳网络技术有限公司 | Text processing method and device and electronic equipment |
CN113743090A (en) * | 2021-09-08 | 2021-12-03 | 度小满科技(北京)有限公司 | Keyword extraction method and device |
CN113971216A (en) * | 2021-10-22 | 2022-01-25 | 北京百度网讯科技有限公司 | Data processing method and device, electronic equipment and memory |
CN114492433A (en) * | 2022-01-27 | 2022-05-13 | 南京烽火星空通信发展有限公司 | Method for automatically selecting proper keyword combination to extract text |
CN114626361A (en) * | 2020-12-10 | 2022-06-14 | 广州视源电子科技股份有限公司 | Sentence making method, sentence making model training method and device and computer equipment |
CN118657634A (en) * | 2024-08-21 | 2024-09-17 | 青岛中投创新技术转移有限公司 | Patent analysis and evaluation method based on artificial intelligence |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110004465A1 (en) * | 2009-07-02 | 2011-01-06 | Battelle Memorial Institute | Computation and Analysis of Significant Themes |
CN102929937A (en) * | 2012-09-28 | 2013-02-13 | 福州博远无线网络科技有限公司 | Text-subject-model-based data processing method for commodity classification |
CN104281645A (en) * | 2014-08-27 | 2015-01-14 | 北京理工大学 | Method for identifying emotion key sentence on basis of lexical semantics and syntactic dependency |
CN105843795A (en) * | 2016-03-21 | 2016-08-10 | 华南理工大学 | Topic model based document keyword extraction method and system |
-
2018
- 2018-06-13 CN CN201810611476.7A patent/CN108920456B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110004465A1 (en) * | 2009-07-02 | 2011-01-06 | Battelle Memorial Institute | Computation and Analysis of Significant Themes |
CN102929937A (en) * | 2012-09-28 | 2013-02-13 | 福州博远无线网络科技有限公司 | Text-subject-model-based data processing method for commodity classification |
CN104281645A (en) * | 2014-08-27 | 2015-01-14 | 北京理工大学 | Method for identifying emotion key sentence on basis of lexical semantics and syntactic dependency |
CN105843795A (en) * | 2016-03-21 | 2016-08-10 | 华南理工大学 | Topic model based document keyword extraction method and system |
Non-Patent Citations (1)
Title |
---|
都云程等: "基于字同现频率的关键词自动抽取", 《北京信息科技大学学报》 * |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109614626A (en) * | 2018-12-21 | 2019-04-12 | 北京信息科技大学 | Keyword Automatic method based on gravitational model |
CN111435405A (en) * | 2019-01-15 | 2020-07-21 | 北京行数通科技有限公司 | Method and device for automatically labeling key sentences of article |
CN109960724A (en) * | 2019-03-13 | 2019-07-02 | 北京工业大学 | A text summarization method based on TF-IDF |
CN110134767A (en) * | 2019-05-10 | 2019-08-16 | 云知声(上海)智能科技有限公司 | A vocabulary screening method |
CN110147425B (en) * | 2019-05-22 | 2021-04-06 | 华泰期货有限公司 | Keyword extraction method and device, computer equipment and storage medium |
CN110147425A (en) * | 2019-05-22 | 2019-08-20 | 华泰期货有限公司 | Keyword extraction method and device, computer equipment and storage medium |
CN110377724A (en) * | 2019-07-01 | 2019-10-25 | 厦门美域中央信息科技有限公司 | A corpus keyword automatic extraction algorithm based on data mining |
CN111552786A (en) * | 2020-04-16 | 2020-08-18 | 重庆大学 | Question answering method based on keyword extraction |
CN111597793A (en) * | 2020-04-20 | 2020-08-28 | 中山大学 | Paper innovation measuring method based on SAO-ADV structure |
CN111597793B (en) * | 2020-04-20 | 2023-06-16 | 中山大学 | Paper innovation measuring method based on SAO-ADV structure |
CN111680509A (en) * | 2020-06-10 | 2020-09-18 | 四川九洲电器集团有限责任公司 | Method and device for automatically extracting text keywords based on co-occurrence language network |
CN111985217A (en) * | 2020-09-09 | 2020-11-24 | 吉林大学 | Keyword extraction method and computing device |
CN111985217B (en) * | 2020-09-09 | 2022-08-02 | 吉林大学 | Keyword extraction method, computing device and readable storage medium |
CN114626361A (en) * | 2020-12-10 | 2022-06-14 | 广州视源电子科技股份有限公司 | Sentence making method, sentence making model training method and device and computer equipment |
CN112988951A (en) * | 2021-03-16 | 2021-06-18 | 福州数据技术研究院有限公司 | Accurate recommendation method for scientific research project review experts, and storage device |
CN113191145A (en) * | 2021-05-21 | 2021-07-30 | 百度在线网络技术(北京)有限公司 | Keyword processing method and device, electronic equipment and medium |
CN113191145B (en) * | 2021-05-21 | 2023-08-11 | 百度在线网络技术(北京)有限公司 | Keyword processing method and device, electronic equipment and medium |
CN113657113A (en) * | 2021-08-24 | 2021-11-16 | 北京字跳网络技术有限公司 | Text processing method and device and electronic equipment |
CN113743090A (en) * | 2021-09-08 | 2021-12-03 | 度小满科技(北京)有限公司 | Keyword extraction method and device |
CN113743090B (en) * | 2021-09-08 | 2024-04-12 | 度小满科技(北京)有限公司 | Keyword extraction method and device |
CN113971216A (en) * | 2021-10-22 | 2022-01-25 | 北京百度网讯科技有限公司 | Data processing method and device, electronic equipment and storage medium |
CN114492433A (en) * | 2022-01-27 | 2022-05-13 | 南京烽火星空通信发展有限公司 | Method for automatically selecting a suitable keyword combination for text extraction |
CN118657634A (en) * | 2024-08-21 | 2024-09-17 | 青岛中投创新技术转移有限公司 | Patent analysis and evaluation method based on artificial intelligence |
Also Published As
Publication number | Publication date |
---|---|
CN108920456B (en) | 2022-08-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108920456B (en) | Automatic keyword extraction method | |
CN110874531B (en) | Topic analysis method and device and storage medium | |
Beeferman et al. | Statistical models for text segmentation | |
Sridhar | Unsupervised topic modeling for short texts using distributed representations of words | |
CN107229610B (en) | Method and device for analyzing sentiment data | |
US9317498B2 (en) | Systems and methods for generating summaries of documents | |
US10437867B2 (en) | Scenario generating apparatus and computer program therefor | |
US9275115B2 (en) | Correlating corpus/corpora value from answered questions | |
US9892111B2 (en) | Method and device to estimate similarity between documents having multiple segments | |
US8645418B2 (en) | Method and apparatus for word quality mining and evaluating | |
EP3086240A1 (en) | Complex predicate template gathering device, and computer program therefor | |
CN103678316A (en) | Entity relationship classifying device and entity relationship classifying method | |
CN113988053A (en) | Hot word extraction method and device | |
CN109614626A (en) | Automatic keyword extraction method based on a gravitational model |
CN104317783A (en) | SRC calculation method | |
CN111310467B (en) | Topic extraction method and system combining semantic inference in long text | |
Hofmann et al. | Predicting the growth of morphological families from social and linguistic factors | |
CN111930949B (en) | Search string processing method and device, computer readable medium and electronic equipment | |
CN106126501B (en) | A noun word sense disambiguation method and device based on dependency constraints and knowledge | |
Wei et al. | Query based summarization using topic background knowledge | |
CN111899832B (en) | Medical theme management system and method based on context semantic analysis | |
Mendels et al. | Collecting code-switched data from social media | |
CN115455975A (en) | Method and device for extracting topic keywords based on multi-model fusion decision | |
JP5128328B2 (en) | Ambiguity evaluation apparatus and program | |
CN114548113A (en) | Event-based reference resolution system, method, terminal and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||