
CN108920456A - Automatic keyword extraction method - Google Patents

Automatic keyword extraction method

Info

Publication number
CN108920456A
CN108920456A (application CN201810611476.7A; granted as CN108920456B)
Authority
CN
China
Prior art keywords
keywords
candidate
word
technical standard
words
Prior art date
Legal status
Granted
Application number
CN201810611476.7A
Other languages
Chinese (zh)
Other versions
CN108920456B (en)
Inventor
吕学强
董志安
Current Assignee
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date
Filing date
Publication date
Application filed by Beijing Information Science and Technology University filed Critical Beijing Information Science and Technology University
Priority to CN201810611476.7A priority Critical patent/CN108920456B/en
Publication of CN108920456A publication Critical patent/CN108920456A/en
Application granted granted Critical
Publication of CN108920456B publication Critical patent/CN108920456B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to an automatic keyword extraction method, comprising: extracting the general words in the technical standards, extracting candidate keywords, filtering the general words out of the candidate keywords, computing a weighted score for each candidate keyword by integrating position features, term co-occurrence features, and contextual semantic features, computing a dynamic threshold from the range of candidate keyword scores, and determining the result keywords with that threshold. The automatic keyword extraction method provided by the invention fuses position features, term co-occurrence features, and contextual semantic features, and comprehensively weighs the influence of a term's position within the document and of its semantic context on the keyword weights. It achieves higher precision and recall, improves the retrieval quality of the 3GPP technical standards, reduces labor cost, and meets the needs of practical application well.

Description

Automatic keyword extraction method
Technical Field
The invention belongs to the technical field of automatic keyword extraction, and particularly relates to an automatic keyword extraction method oriented to the 3GPP technical standards.
Background
The explosive development of mobile communication technology has brought epoch-making changes to human society. As the maker of the leading technology standards in the communications field, the 3rd Generation Partnership Project (3GPP) is dedicated to producing 3G standards based on the evolved Global System for Mobile Communications (GSM) core network, including WCDMA, TD-SCDMA, EDGE, etc.
In recent years, there are many cases of patent infringement litigation disputes among large-scale communication technology companies, and the stability of patent rights of the invention is challenged unprecedentedly. The 3GPP technical standards play an irreplaceable important role in the work of examining communication patents.
The 3GPP technical standard is a scientific non-patent document specific to the patent examination work in the communication field, and is usually used as a comparison document to measure the creativity and novelty of the patent application in the communication field.
A typical 3GPP technical standard consists of a Cover, which mainly includes the standard number, release number, document title, and version number; a Foreword, which explains the version number; a Scope section, which declares the scope of application; a Reference section, which gives the reference list; Definitions and Abbreviations sections, which list the important definitions and abbreviations of the document; a Topic body, which presents the technical background and details; and an Annex, which mainly records the version change history.
In addition, there is a correlation between the 3GPP technical standard and the patent literature, and the difference between the 3GPP technical standard and the patent literature is shown in table 1.
Table 1: Differences between patent documents and the 3GPP technical standards
As can be seen from Table 1, the 3GPP technical standards have their own organization and document types. Of major interest in actual patent examination are the Technical Specifications (TS), Technical Reports (TR), and conference files. The technical specifications and technical reports together describe the relevant regulations, principles, simulations, and experimental results of a technology, while the conference files mainly record the specific meeting information of each working group. The technical specifications and technical reports are similar in content format, carry richer core technical information, and therefore have greater mining value.
In actual patent examination, the retrieval of 3GPP technical standards relies mainly on keywords manually selected by the examiner. The quality of the retrieval results often depends on the quality of those keywords, and the traditional approach is not only time- and labor-consuming, but also makes it difficult to guarantee the hit rate of the comparison documents. Compared with patent documents, the 3GPP technical standards feature wide coverage, a large amount of information, irregular formatting, and weak readability; these characteristics make automatic keyword extraction from the 3GPP technical standards harder than from patent documents. Therefore, improving automatic keyword extraction for the 3GPP technical standards is not only conducive to improving the efficiency of examining communication patents, but also of great significance for maintaining the stability of patent grants.
Automatic keyword extraction has been studied extensively at home and abroad, and the work generally falls into two major branches: supervised and unsupervised learning methods. Supervised methods usually convert keyword extraction into a binary or multi-class classification problem in machine learning, chiefly involving classification models such as Naive Bayes, Maximum Entropy, and Support Vector Machines. Although such methods predict well to a certain extent, their effectiveness depends on the labeling quality and scale of the training corpus, so excessive human effort cannot be avoided, and they adapt poorly to the massive data of practical applications. The most obvious advantage of unsupervised methods over supervised ones is the great saving in labor cost; by algorithmic idea they can be divided into statistics-based, topic-model-based, and word-graph-based extraction methods. Statistics-based methods generally measure candidate keyword weights by combining indicators such as term frequency, term frequency-inverse document frequency (TF-IDF), and the chi-square (χ²) statistic; being frequency-sensitive, they easily miss important low-frequency words. The classic representative of topic-model-based extraction is the LDA (Latent Dirichlet Allocation) algorithm, which, by analyzing a training corpus, infers the "document-topic" and "topic-term" probability distributions from the known "document-term" matrix; its effectiveness depends on the topic distribution of the training set.
Among word-graph-based methods, the TextRank algorithm is the most widely applied. Its idea derives from Google's PageRank algorithm: sentences or words in the text set form the graph nodes, the similarity between nodes serves as the edge weight, and an iterative voting mechanism ranks the nodes of the graph model by importance. The method does not depend on the number of texts, but it is limited in that it considers only information internal to a text and ignores the distribution of vocabulary across different texts. Mainstream methods at the present stage generally fuse the advantages of different approaches for the specific problem, yet they still suffer from shortcomings such as a lack of semantic features and poor recognition of low-frequency keywords.
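As a concrete instance of the statistics-based branch discussed above, the following is a minimal TF-IDF sketch (one common variant; real implementations differ in smoothing and normalization, so the exact formula here is an assumption for illustration):

```python
import math

def tf_idf(word, doc, docs):
    """TF-IDF weight of `word` in `doc` relative to the document set.

    tf: relative frequency within the document; idf: log of the inverse
    document frequency, with +1 smoothing in the denominator.
    """
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in docs if word in d)   # documents containing the word
    return tf * math.log(len(docs) / (1 + df))

docs = [["mac", "mac", "rlc"], ["rlc", "phy"], ["phy"]]
# "mac" is concentrated in one document, "rlc" is spread over two,
# so "mac" receives the higher weight in docs[0]
assert tf_idf("mac", docs[0], docs) > tf_idf("rlc", docs[0], docs)
```

Note how the frequency sensitivity criticized above is visible: a rare but important word with low tf receives a low score regardless of its meaning.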
Disclosure of Invention
In view of the above problems in the prior art, an object of the present invention is to provide an automatic keyword extraction method that can avoid the above technical drawbacks.
In order to achieve the above object, the present invention provides the following technical solutions:
an automatic keyword extraction method comprises the following steps: extracting the general words in the technical standards; extracting candidate keywords; filtering the general words out of the candidate keywords; computing candidate keyword weight scores by integrating position features, word co-occurrence features, and contextual semantic features; computing a dynamic threshold from the range of candidate keyword weight scores; and determining the result keywords with the dynamic threshold.
Further, the automatic keyword extraction method comprises the following steps:
step 1) removing text noise in the 3GPP technical standard;
step 2) extracting common words in the technical standard;
step 3) extracting candidate keywords and filtering common words based on the syntactic analysis tree;
and 4) comprehensively considering the position features, word co-occurrence features, and contextual semantic features of the candidate keywords in the document, calculating and ranking the weight scores, finally calculating a dynamic threshold from the actual score range of the technical standard, and adding the candidate keywords whose scores exceed the threshold to the result keyword set.
Further, step 1) is specifically: parsing the technical standards with Apache POI and removing the text noise in the 3GPP technical standards.
Further, the text noise includes pictures, tables, formulas, special symbols, and illegal characters.
Further, step 2) comprises: extracting the general words in the technical standards based on the word frequency-document distribution entropy, which measures the uncertainty of the distribution of a word w over the technical standard set. Let the document set composed of n technical standards be denoted D = {d_1, d_2, …, d_i, …, d_n}, and denote the word frequency-document distribution entropy of w as H(w); then H(w) is calculated as

H(w) = −Σ_{i=1..n} P(w, d_i) · log P(w, d_i),

where P(w, d_i) is the probability that the word w occurs in technical standard d_i, 1 ≤ i ≤ n. By the maximum likelihood estimation method, P(w, d_i) is calculated as

P(w, d_i) = f(w, d_i) / Σ_{j=1..n} f(w, d_j),

where f(w, d_i) is the number of occurrences of w in d_i.
Further, extracting candidate keywords based on the dependency parse tree comprises:
Step 1: traverse the technical standard set D; split each technical standard d_i in D into sentences at punctuation marks, and denote the resulting sentence set as Sentences(d_i) = {s_1, s_2, …, s_ns}, 1 ≤ i ≤ n, where ns is the number of sentences in document d_i.
Step 2: perform dependency parsing on each sentence in Sentences(d_i) with the Stanford Parser to obtain the corresponding dependency parse tree set Trees(d_i) = {T_1, T_2, …, T_ns}, where T_i denotes the parse tree of the i-th sentence of technical standard d_i.
Step 3: read the parse tree set Trees(d_i) in a loop. For any parse tree T_i ∈ Trees(d_i), take each word together with its part of speech as a leaf node and traverse T_i in order. If the current node is a leaf node, judge whether its part of speech is a noun, verb, or adjective; if so, add it to the candidate keyword set, otherwise skip to the next node. If the current node is not a leaf node, judge whether it is a noun phrase; if so, continue to recursively traverse its right subtree until the subtree contains no non-leaf node with a noun phrase as parent, at which point add the child nodes of that noun phrase to the candidate keyword set as a whole.
Step 4: further filter the candidate keyword set with the extracted general words; if an element of the candidate keyword set contains a general word, remove that element from the set.
Further, the position feature weight is calculated as follows. For the text under each heading level of a 3GPP technical standard, split the text into sentence sets at punctuation marks and number the sentences consecutively from 1. Denote the candidate keyword set of technical standard d_i as CK(d_i) = {ck_1, ck_2, …, ck_i, …, ck_n}, where ck_i is any candidate keyword in the set and n is the number of candidate keywords, and denote the special position set as
SP = {Title, Scope, Reference, Definitions, Abbreviations, NOTE}.
Let locate(ck_i) denote the position at which candidate keyword ck_i occurs, and define the characteristic function Pos(ck_i) as the weight of ck_i in the position dimension:

Pos(ck_i) = 1, if locate(ck_i) ∈ SP;
Pos(ck_i) = 1 − Sno(ck_i) / (Snu(ck_i) + len(ck_i)), otherwise,

where Sno(ck_i) is the number of the sentence containing ck_i, Snu(ck_i) is the number of sentences in the text paragraph, and len(ck_i) is the number of words ck_i contains. The weights of occurrences at different positions are averaged; denoting the average position weight by W(Pos(ck_i)),

W(Pos(ck_i)) = (1 / fre(ck_i)) · Σ_{j=1..fre(ck_i)} Pos_j(ck_i),

where fre(ck_i) is the frequency with which ck_i occurs in the same technical standard.
Further, the word co-occurrence feature weight is calculated as follows. Let the candidate keyword sets of all technical standards be CK = {CK(d_1), CK(d_2), …, CK(d_i), …, CK(d_n)}. For any candidate keyword ck_i of technical standard d_i, denote the words composing ck_i as cw_1, cw_2, …, cw_i, …, cw_m, where m is the number of words ck_i contains, and denote the co-occurring word set of cw_i as cocur_i = {wco_1, wco_2, …, wco_i, …, wco_p}, where p is the size of the set, each wco_j is a co-occurring word of cw_i with wco_j ∈ CK(d_i), and wco_1 ∩ wco_2 ∩ … ∩ wco_j ∩ … ∩ wco_p = {cw_i}, 1 ≤ j ≤ p. The contribution of cw_i to candidate keyword ck_i is then expressed as

Con(cw_i) = Σ_{j=1..p} fre(wco_j) · len(wco_j),

where fre(wco_j) is the frequency with which the co-occurring word wco_j of cw_i occurs and len(wco_j) is the number of words wco_j contains. When candidate keyword ck_i comprises several words, its weight in the word co-occurrence dimension is calculated as

W(Coo(ck_i)) = (1/m) · Σ_{i=1..m} Con(cw_i).
Further, the contextual semantic feature weight is calculated as follows. The computation is decomposed into maximizing the probability with which the current word w independently predicts each word of its context Context(w); the objective function is

L(θ) = Σ_{w∈D} Σ_{c_i∈Context(w)} log P(c_i | w),

where c_i ∈ Context(w), D is the technical standard corpus, and θ is the model parameter. The conditional probability P(c_i | w) is expressed as

P(c_i | w) = exp(v'_{c_i} · v_w) / Σ_{c'} exp(v'_{c'} · v_w),

where v'_{c_i} and v_w are the vector representations of the words c_i and w respectively, c' ranges over all non-repeating words of the corpus, and v_{c'} is the vector representation of c'. Each technical standard d_i in the set D is viewed as composed of a sequence of words w_1 … w_i … w_n, which are assumed mutually independent. For each candidate keyword ck_i of d_i, if ck_i is of word type, the prediction probability is calculated as

P(w_1 … w_i … w_n | ck_i) = Π_{i=1..n} P(w_i | ck_i);

if ck_i is of phrase type, the prediction probability is computed analogously, taking the constituent words of ck_i in turn.
After taking the logarithm of both sides, log P(w_1 … w_i … w_n | ck_i) on the left serves as the weight measure of candidate keyword ck_i in the semantic dimension, denoted W(Sem(ck_i)); log P(w_1 … w_i … w_n | ck_i) is approximated as log P(c_1 … c_i … c_n | ck_i), where w_1 … w_i … w_n is the context of ck_i within the model window, abbreviated Context(ck_i). Then

W(Sem(ck_i)) = Σ_{c∈Context(ck_i)} log P(c | ck_i).
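The semantic weight above can be sketched with toy vectors in place of trained Skip-gram embeddings. The softmax form of P(c|w) is standard; the 2-d vectors and tiny vocabulary below are illustrative assumptions, not part of the patent:

```python
import math

def softmax_logprob(v_ctx, v_w, vocab_vectors):
    """log P(c|w) under the Skip-gram softmax: v_c·v_w minus log-sum-exp
    over the context vectors of the whole vocabulary."""
    dot = sum(a * b for a, b in zip(v_ctx, v_w))
    z = sum(math.exp(sum(a * b for a, b in zip(vc, v_w)))
            for vc in vocab_vectors)
    return dot - math.log(z)

def semantic_weight(candidate_vec, context_vecs, vocab_vectors):
    """W(Sem(ck_i)): sum of log P(c|ck_i) over the candidate's context
    window, treating context words as conditionally independent."""
    return sum(softmax_logprob(vc, candidate_vec, vocab_vectors)
               for vc in context_vecs)

vocab = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # toy context vectors
w = semantic_weight([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], vocab)
assert w < 0   # log-probabilities are negative
```

A candidate whose vector aligns with its context scores higher, which is exactly what lets low-frequency but semantically central candidates survive.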
Further, the step 4) comprises:
to technical standard diAny one of the candidate keywords ckiComprehensively considering the position feature, the word co-occurrence feature and the context semantic feature, calculating the candidate keywords ckiThe weight scores in the three characteristic dimensions are formulated as
W(cki)=W(Pos(cki))+W(Coo(cki))+W(Sem(cki));
Denote the scores corresponding to the candidate keywords ck_i of d_i as
Score(d_i) = {W(ck_1) … W(ck_i) … W(ck_n)}. The scores in Score(d_i) are ranked from high to low, and the dynamic threshold λ is set to the average of all scores:

λ = (1/n) · Σ_{i=1..n} W(ck_i).

If a candidate keyword of d_i satisfies W(ck_i) ≥ λ, then ck_i is added to the result keyword set.
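The final scoring and dynamic-threshold selection can be sketched as follows, with made-up scores standing in for the combined W(Pos)+W(Coo)+W(Sem) values:

```python
def select_keywords(scores):
    """Keep candidates whose combined score W(ck) is at least the mean
    score lambda, returned in descending score order (a sketch of the
    dynamic-threshold step; scores are assumed precomputed)."""
    lam = sum(scores.values()) / len(scores)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [ck for ck in ranked if scores[ck] >= lam], lam

scores = {"MCH": 0.9, "channel": 0.6, "given": 0.1, "version": 0.2}
keywords, lam = select_keywords(scores)
assert lam == 0.45
assert keywords == ["MCH", "channel"]
```

Because λ is recomputed per technical standard, documents with generally low scores still yield keywords, which a fixed global threshold would not guarantee.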
The automatic keyword extraction method provided by the invention fuses position features, word co-occurrence features, and contextual semantic features to extract keywords, comprehensively weighing the influence of position within the document and of the semantic context on keyword weights. It achieves higher precision and recall, improves the retrieval quality of the 3GPP technical standards, reduces labor cost, and meets the needs of practical application well.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a dependency parse tree;
FIG. 3 is a comparison of the CBOW and Skip-gram model frameworks.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the scope of protection of the present invention.
The invention provides an automatic keyword extraction method. First, the general words in the 3GPP technical standards are extracted based on the word frequency-document distribution entropy; candidate keywords are then extracted by an algorithm based on the dependency parse tree. After the general words are filtered out of the candidates, a weight score is computed for each candidate keyword by integrating position features, word co-occurrence features, and contextual semantic features; a dynamic threshold is computed from the range of candidate keyword weight scores of each technical standard; and finally the result keywords are determined with that threshold. Specifically, as shown in FIG. 1, the automatic keyword extraction method comprises the following steps:
step 1) preprocessing the 3GPP technical standards, mainly by parsing them with Apache POI to remove text noise such as pictures, tables, formulas, special symbols, and illegal characters;
step 2) extracting common words in all technical standards based on the word frequency-document distribution entropy;
step 3) segmenting each technical standard into sentence subsets, performing dependency syntax analysis on each sentence, extracting candidate keywords based on a dependency syntax analysis tree and filtering common words;
and 4) comprehensively considering the position features, word co-occurrence features, and contextual semantic features of the candidate keywords in the document, calculating and ranking the weight scores, finally calculating a dynamic threshold from the actual score range of the technical standard, and adding the candidate keywords whose scores exceed the threshold to the result keyword set.
The 3GPP technical standards contain not only simple stop words such as "if", "at", "not", and "or", but also general words that run through most technical standards, such as "Figure", "version", "general", and "given", which are specific to the technical standards and carry no representativeness or importance. It has been observed that both the simple stop words and the general words specific to the technical standards appear, with differing frequencies, in technical standards of different versions and types; they are highly common and generally cannot summarize or abstract the content of a specific technical standard. These words are collectively referred to as general words.
Clearly, coverage is not comprehensive enough if only a manually collected common stop-word list is used. Therefore, to reduce the interference of general words with the keyword extraction task as far as possible, the concept of word frequency-document distribution entropy is introduced, combining the information entropy principle, to obtain the technical standard general words automatically. Information entropy was first introduced into information theory by Shannon to measure the uncertainty of a discrete random variable: the larger the entropy value, the greater the uncertainty of the corresponding random variable. Similarly, regarding the word w as a random variable, the word frequency-document distribution entropy is defined as follows.
Definition 1 word frequency-document distribution entropy refers to a measure of uncertainty in the state of distribution of a word w in a set of technical standards.
Let the document set composed of n technical standards be denoted D = {d_1, d_2, …, d_i, …, d_n}, and denote the word frequency-document distribution entropy of the word w as H(w); then H(w) is calculated as shown in formula (1):

H(w) = −Σ_{i=1..n} P(w, d_i) · log P(w, d_i),    (1)

where P(w, d_i) is the probability that the word w occurs in technical standard d_i, 1 ≤ i ≤ n. By the maximum likelihood estimation method, P(w, d_i) can be calculated by formula (2):

P(w, d_i) = f(w, d_i) / Σ_{j=1..n} f(w, d_j),    (2)

where f(w, d_i) is the number of occurrences of w in d_i. It can be seen that if more technical standards contain w and w is distributed more evenly over the technical standard set, the word frequency-document distribution entropy H(w) is larger, indicating greater uncertainty in the distribution of w over the technical standard set D, and thus that w is more likely a general word of no significance in the set.
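Formulas (1) and (2) can be sketched as follows, assuming each technical standard has already been tokenized into a list of words (the sample documents are illustrative only):

```python
import math
from collections import Counter

def distribution_entropy(word, documents):
    """Word frequency-document distribution entropy H(w) over a set of
    technical standards, each given as a list of tokens. Words spread
    evenly across many documents get high entropy (likely general words);
    words concentrated in one document get low entropy."""
    freqs = [Counter(doc)[word] for doc in documents]
    total = sum(freqs)
    if total == 0:
        return 0.0
    entropy = 0.0
    for f in freqs:
        if f > 0:
            p = f / total          # maximum-likelihood estimate P(w, d_i)
            entropy -= p * math.log(p, 2)
    return entropy

docs = [["mac", "channel", "the"], ["the", "rlc"], ["the", "mac", "the"]]
# "the" appears in all three documents -> higher entropy than "rlc"
assert distribution_entropy("the", docs) > distribution_entropy("rlc", docs)
```

Ranking all words by H(w) and taking the top of the list then yields the automatically extracted general words.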
Statistics show that most keywords are content-word phrases, generally nouns, verbs, and adjectives, and contain neither stop words without practical meaning nor general words distributed evenly over the technical standards. The keyword categories are therefore defined as verbs, adjectives, nouns, and noun phrases after removal of the general words. To extract candidate keywords with semantic coherence and complete syntactic modification, the sentences of the 3GPP technical standards are dependency-parsed, and the noun phrases, verbs, adjectives, and nouns satisfying syntactic modification consistency are extracted by means of the dependency parse tree and added to the candidate keyword set; for noun phrases, the NP of minimum granularity in the parse tree is taken as the candidate keyword. Finally, the general words are filtered from the candidate keyword set. For example, parsing the sentence "Logical channels are SAPs between MAC and RLC" yields the result shown in FIG. 2.
As can be seen from FIG. 2, the adjective "Logical" modifies the noun "channels", and "Logical" and "channels" form a noun phrase (NP); "SAPs" and "MAC and RLC" are noun phrases (NP), and "SAPs between MAC and RLC" is, as a whole, also a noun phrase (NP). In the parse tree, however, the noun phrase "MAC and RLC" forms with "between" a prepositional phrase (PP) that, together with "SAPs", is a child node of the larger NP, the two being siblings. Clearly, the noun phrase "MAC and RLC" is of smaller granularity than "SAPs between MAC and RLC". Therefore "Logical", "channels", "Logical channels", "are", "SAPs", "MAC", "RLC", and "MAC and RLC" are selected as the candidate keywords of the example sentence, and the candidates are then filtered with the extracted general words. Based on this analysis, the candidate keyword extraction algorithm based on the dependency parse tree comprises the following steps:
step 1: traversing skillSet of technical criteria D, for each technical criterion D in DiDividing into sentences according to punctuations, and representing the divided sentence sets as
1≤i≤ns,nsAs a document diThe number of Chinese sentences.
Step 2: for set sequences (d)i) Each sentence in the tree is subjected to dependency syntax analysis by using a Stanford Parser syntax analyzer to obtain a corresponding dependency syntax analysis tree set Trees (d)i) Memory for recordingWherein T isiIndicates technical standard diThe ith sentence in the sentence is corresponding to the stored syntax parse tree.
Step 3: cyclic read dependency parse tree set Trees (d)i) For any dependency syntax tree Ti∈Trees(di) Taking the words and corresponding parts of speech in the syntactic dependency tree as a whole as leaf nodes, and traversing the T in a medium-order and orderly modeiIf the current node is a leaf node (not the last leaf node), judging whether the part of speech of the node is a noun, a verb and an adjective, adding the node into the candidate keyword set if the conditions are met, and otherwise, jumping to the next node; if the current node is not a leaf node, judging whether the current node is a Noun Phrase (NP) or not, if the current node is the noun phrase and the right subtree is not empty, continuing to recursively traverse the right subtree of the current node until no non-leaf node taking the NP as a parent node exists in the subtree, and adding the child nodes of the NP into the candidate keyword set as a whole.
Step 4: since some technical standard common words still exist in the candidate keywords extracted in the previous step, the candidate keyword set needs to be further filtered by using the extracted common words, and if an element containing the common words exists in the candidate keyword set, the element is removed from the candidate keyword set.
By analyzing the characteristics of the 3GPP technical standards, it can be found that, besides the body text, the Scope, Reference, Definitions, and Abbreviations portions have important reference value for the whole document and should be treated as key positions. The content of each chapter of the body usually develops around its nearest heading, so a heading can be regarded as a distillation of the core content of the corresponding paragraphs, and candidate keywords appearing at that position should be given higher weight. Similarly, what appears in the NOTE sections generally serves as extra emphasis or supplementary description of the text and should therefore also be treated as a special position.
Therefore, the position at which a candidate keyword appears in a 3GPP technical standard is taken as a weight factor. For the text under each heading level of the standard, sentence sets are split at punctuation marks and the sentences numbered consecutively from 1; the smaller the number of the sentence containing a candidate keyword, the closer it lies to the heading, and the likelier it is to be a keyword that captures the theme. Denote the candidate keyword set of technical standard d_i as CK(d_i) = {ck_1, ck_2, …, ck_i, …, ck_n}, where ck_i is any candidate keyword in the set and n is the number of candidate keywords, and denote the special position set as
SP = {Title, Scope, Reference, Definitions, Abbreviations, NOTE}.
Let locate(ck_i) be the position at which candidate keyword ck_i occurs, and define the characteristic function Pos(ck_i) as the weight of ck_i in the position dimension; Pos(ck_i) can then be expressed as shown in formula (3):

Pos(ck_i) = 1, if locate(ck_i) ∈ SP;
Pos(ck_i) = 1 − Sno(ck_i) / (Snu(ck_i) + len(ck_i)), otherwise,    (3)

where Sno(ck_i) is the number of the sentence containing ck_i, Snu(ck_i) is the number of sentences in the text paragraph, and len(ck_i) is the number of words ck_i contains; len(ck_i) is added to the denominator to prevent the position weight from being 0. Because candidate keyword ck_i may occur several times at different positions of technical standard d_i, the weights of the different positions are averaged; denoting the average position weight W(Pos(ck_i)), it is calculated as shown in formula (4):

W(Pos(ck_i)) = (1 / fre(ck_i)) · Σ_{j=1..fre(ck_i)} Pos_j(ck_i),    (4)

where fre(ck_i) is the frequency with which ck_i occurs in the same technical standard. Taking the average strengthens the weight of candidate keywords that occur with low frequency but at special positions, and weakens the bias of computing candidate keyword weights from the frequency feature alone.
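The position weight can be sketched as follows, assuming weight 1 at special positions and a weight that decreases with the sentence number elsewhere, with len(ck_i) keeping the weight above zero; since the patent's formula images are not reproduced here, the exact non-special form is an assumption:

```python
SPECIAL_POSITIONS = {"Title", "Scope", "Reference", "Definitions",
                     "Abbreviations", "NOTE"}

def position_weight(occurrences, num_words):
    """Average position weight W(Pos(ck_i)) over all occurrences.

    Each occurrence is (location, sentence_no, sentences_in_paragraph).
    Assumed form: 1.0 at special positions, otherwise
    1 - sno / (snu + num_words), so earlier sentences score higher and
    the num_words term keeps the weight strictly positive.
    """
    scores = []
    for location, sno, snu in occurrences:
        if location in SPECIAL_POSITIONS:
            scores.append(1.0)
        else:
            scores.append(1.0 - sno / (snu + num_words))
    return sum(scores) / len(scores)    # average over fre(ck_i) occurrences

# one occurrence in Scope, one in the 2nd of 10 body sentences
occ = [("Scope", 1, 4), ("body", 2, 10)]
w = position_weight(occ, num_words=2)
assert abs(w - (1.0 + (1 - 2 / 12)) / 2) < 1e-9
```

Averaging over occurrences is what boosts a low-frequency candidate that happens to sit in a Title or NOTE, as the surrounding text argues.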
Word co-occurrence is a factor that cannot be neglected in keyword extraction. Observation of the candidate keywords extracted from the 3GPP technical standards shows that the constituent words of one candidate word often recur in other candidate words of different lengths. For example, among the three candidate keywords "MCH", "MCH transmission" and "MCH subframe allocation", the word "MCH" occurs in the two other candidate keywords of different lengths, so "MCH transmission" and "MCH subframe allocation" can be regarded as co-occurring words of "MCH"; co-occurring words often express more specific information than the individual constituent word. Therefore, if a word composing a candidate keyword has many co-occurring words, the word is considered to carry richer meaning and should be given a higher weight. Based on this analysis, the co-occurring word frequency and word length of the words composing a candidate keyword are used as the word co-occurrence features to calculate the candidate keyword's weight.
Denote the candidate keyword sets of all technical standards as CK = {CK(d_1), CK(d_2), …, CK(d_i), …, CK(d_n)}. For any candidate keyword ck_i of technical standard d_i, denote the words composing ck_i as cw_1, cw_2, …, cw_i, …, cw_m, where m is the number of words ck_i contains, and let the co-occurring word set of cw_i be cooccur_i = {wco_1, wco_2, …, wco_i, …, wco_p}, where p is the size of the co-occurring word set, wco_j denotes a co-occurring word of cw_i, wco_j ∈ CK(d_i), and wco_1 ∩ wco_2 ∩ … ∩ wco_j ∩ … ∩ wco_p = {cw_i}, 1 ≤ j ≤ p. The contribution of cw_i to candidate keyword ck_i can then be expressed by equation (5);
where fre(wco_j) denotes the frequency of occurrence of wco_j, a co-occurring word of cw_i, and len(wco_j) denotes the number of words wco_j contains. When candidate keyword ck_i contains multiple words, its weight in the word co-occurrence dimension is calculated as shown in equation (6).
It can be seen that the more frequently the constituent words of candidate keyword ck_i occur among co-occurring words, the greater each constituent word's contribution to ck_i, and hence the greater the weight of ck_i in the word co-occurrence feature dimension.
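The co-occurrence weight can be sketched as follows. Equations (5) and (6) are images in the source, so this is a hedged reading of the prose: a constituent word contributes the sum of frequency times word-length over its co-occurring words (equation (5)), and a multi-word candidate averages its constituent words' contributions (equation (6)).

```python
# Word co-occurrence weight sketch; the exact forms of equations (5)-(6) are
# assumptions, since they appear only as images in the source.

def contribution(cooccurring):
    # cooccurring: (frequency, word_count) pairs for the wco_j of one cw_i.
    return float(sum(fre * length for fre, length in cooccurring))

def coo_weight(constituents):
    # constituents: one co-occurring-word list per constituent word of ck_i.
    return sum(contribution(c) for c in constituents) / len(constituents)

# "MCH" co-occurs in "MCH transmission" (2 words, seen 4 times) and in
# "MCH subframe allocation" (3 words, seen 2 times); the counts are invented.
w = coo_weight([[(4, 2), (2, 3)]])
```

Weighting by length rewards longer co-occurring words, consistent with the observation that they express more specific information.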
Keywords generally condense the core content of a technical standard to a high degree, and commonly embody its gist from different semantic levels. The influence on the weights of the candidate keywords' semantic features in context therefore cannot be ignored. Since word vectors represent semantic characteristics well, Word2vec is introduced to calculate the weight of the candidate keywords in the semantic feature dimension.
Word2vec is a tool released by Google, based on deep learning ideas, that addresses problems such as poor model generalization and the curse of dimensionality in statistical language model computation. Word2vec comprises two training models, CBOW and Skip-gram; to reduce the complexity of model solving, two training optimization methods, Hierarchical Softmax (HS) and Negative Sampling (NS), are provided, and a training framework is formed by combining a training model with an optimization method. As shown in Fig. 3, the training frameworks formed by the two models share a common structure of an Input Layer, a Projection Layer and an Output Layer; the difference is that the CBOW-based framework predicts the current word w from the semantic context in which words appear, while the Skip-gram-based framework predicts the context semantic information from the current word w.
To predict Context(w) (with window size c) from the current word w, the Skip-gram model decomposes the calculation into independently maximizing the probability of predicting each word of Context(w) from w; the objective function is
where c_i ∈ Context(w), D is the technical standard corpus, θ is the model parameter, and the conditional probability P(c_i | w) is expressed by Softmax normalization, as shown in equation (7);
where v_{c_i} and v_w are the vector representations of the words c_i and w respectively, c' ranges over all distinct words in the corpus (their number is large, so Hierarchical Softmax or Negative Sampling optimization can be adopted), and v_{c'} is the vector representation of c'. Each technical standard d_i in the technical standard set D is viewed as composed of a series of words w_1 … w_i … w_n. Assuming mutual independence between words, for each candidate keyword ck_i of technical standard d_i, the prediction probability is calculated with equation (8) if ck_i is of word type, and with equation (9) if it is of phrase type;
where P(w_j | ck_i) is calculated by substituting variables into equation (7). It can be seen that the larger the prediction probability P(w_1 … w_i … w_n | ck_i), the better candidate keyword ck_i predicts the context information, and the more likely it is to be a keyword characterizing the full-text information. To avoid, as far as possible, underflow errors caused by multiplying very small conditional probabilities, the logarithm of both sides is taken, and the left-hand side log P(w_1 … w_i … w_n | ck_i) is used as the weight measure of candidate keyword ck_i in the semantic dimension, denoted W(Sem(ck_i)). Meanwhile, considering that Word2vec training relates similar words to each other, to simplify calculation log P(w_1 … w_i … w_n | ck_i) is approximated as log P(c_1 … c_i … c_n | ck_i), where w_1 … w_i … w_n is the context of ck_i within the model window, abbreviated Context(ck_i). W(Sem(ck_i)) is then calculated as shown in equation (10);
For any candidate keyword ck_i of technical standard d_i, the position feature, word co-occurrence feature and context semantic feature are considered together, and the weight score of ck_i over the three feature dimensions is calculated with equation (11).
W(ck_i) = W(Pos(ck_i)) + W(Coo(ck_i)) + W(Sem(ck_i))
(11).
Fusing three different features avoids the impact that any single insufficient feature factor would have on the keyword extraction effect. Denote the scores corresponding to the candidate keywords ck_i of d_i as
Score(d_i) = {W(ck_1) … W(ck_i) … W(ck_n)}; the scores in Score(d_i) are ranked from high to low, and a dynamic threshold λ is set as the average of all the scores, calculated as shown in equation (12);
If a candidate keyword in d_i satisfies W(ck_i) ≥ λ, ck_i is added to the result keyword set. A fixed threshold is not used because technical standards differ in length and therefore yield different candidate keyword score ranges; the dynamic threshold is instead set from the actual score range of each individual technical standard.
Experiments were carried out with the method. The experimental data were taken from the technical standards (including technical specifications and technical reports) on the 3GPP website as of 2016; after de-noising, 8000 documents were obtained in total. The valid series numbers of the technical standards range over 01-12, 21-38, 41-46, 48-52 and 55, 42 series in total; each series contains multiple versions, 14 GB in all, and each technical standard consists of Cover, Foreword, Scope, References, Definitions and Abbreviations, Topic body and Annex parts.
In the experiments, three evaluation indexes commonly used in natural language processing tasks are adopted to evaluate the keyword extraction effect: precision (P), recall (R) and F-score. Their calculation methods are shown in equations (13) to (15) respectively.
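Equations (13) to (15) are the standard precision, recall and F-score definitions; a sketch follows, using plain set overlap as the matching rule (the experiment additionally lemmatizes and matches abbreviation/full-name pairs before comparing).

```python
# Standard P/R/F computation over an extracted set and a reference set.

def prf(extracted, reference):
    correct = len(set(extracted) & set(reference))
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Invented example: 2 of 3 extracted keywords appear in a 4-item reference set.
p, r, f = prf(["MCH", "MBMS", "subframe"], ["MCH", "MBMS", "allocation", "RRC"])
```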
Technical standard common words were extracted from the preprocessed technical standards with the method based on word frequency-document distribution entropy. Through repeated experiments, the optimal word frequency-document distribution entropy threshold was determined to be 5.42; words above this threshold were selected as technical standard common words, 13566 in total. Part of the common word extraction results are shown in Table 2.
Table 2 partial common word extraction results
Serial number Common words H(W) Serial number Common words H(W)
1 version 10.9665 11 all 9.9539
2 should 10.8165 12 possible 9.8908
3 latest 10.7022 13 foreword 9.8543
4 approve 10.6394 14 through 9.8097
5 specification 10.5639 15 modify 9.7739
6 update 10.4934 16 restriction 9.6978
7 present 10.2963 17 this 9.6536
8 within 10.1056 18 available 9.6281
9 be 10.0572 19 release 9.5941
10 further 10.0188 20 when 9.5148
As can be seen from Table 2, the algorithm based on word frequency-document distribution entropy extracts not only common stop words such as "all", "this" and "when", but also words common in the technical standards, for example "version", "specification" and "release". Most technical standard common words can be effectively obtained with this method.
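The word frequency-document distribution entropy behind the H(W) values in Table 2 (defined in claim 4) can be sketched as follows; log base 2 is an assumption, as the source does not state the base.

```python
# Word frequency-document distribution entropy of claim 4: P(w, d_i) is the
# maximum-likelihood share of w's occurrences falling in document d_i, and
# H(w) is the entropy of that distribution. Words spread evenly over many
# documents (common words) score high.
import math

def distribution_entropy(counts):
    # counts: f(w, d_i), the occurrences of word w in each technical standard.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

even = distribution_entropy([3] * 1024)     # spread evenly over 1024 documents
skewed = distribution_entropy([300, 1, 1])  # concentrated in one document
```

A word distributed uniformly over 1024 documents scores log2(1024) = 10, on the order of the Table 2 values for a corpus of 8000 documents, while a topic-specific word concentrated in a few documents scores far lower and survives the 5.42 threshold filter.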
After the candidate keyword set of each technical standard was filtered with the common word list, the weights corresponding to the position feature, word co-occurrence feature and context semantic feature were calculated respectively. For the context semantic feature, the experiment trained on the 14 GB of technical standards with the Skip-gram model of Word2vec and the Hierarchical Softmax optimization method, with the context window set to 10 and the vector dimension to 200, yielding a 965.1 MB model file after 10 iterations. To analyze the effect of different features on technical standard keyword extraction, the feature combinations compared in the experiment are shown in Table 3.
Table 3 Combinations of features
Combining equations (3) to (11), the candidate keyword scores of each technical standard under the different feature combinations were calculated, the dynamic threshold was computed with equation (12), and the candidate keywords satisfying the condition were screened out as the recognized keywords. Meanwhile, 1000 technical standards covering different series and versions were randomly drawn from the 8000, and 2, 4, 6, 8 and 10 keywords were selected from each technical standard as reference keyword sets, taking the intersection of three annotators' cross labeling. The recognized keywords and the manually labeled reference keyword sets were lemmatized and then compared; if a recognized keyword has the same lemma as a labeled keyword, or the two are abbreviation and full name of each other, the recognition is counted as correct. The precision, recall and F-score of the different feature combinations under different numbers of keywords were then counted; the experimental results are shown in Table 4.
Table 4 Keyword extraction results under different feature combinations
As can be seen from Table 4, when the number of keywords is 2, the recall rates of Feature1, Feature4, Feature5 and Feature7 are higher than those of the other feature combinations. This is because, when the number of keywords is small, candidate keywords appearing at special positions are more likely to be correctly recognized as keywords; at the same time, words at special positions provide little context semantic information, so the position feature is relatively dominant for keywords appearing in technical standards. As the number of keywords increases from 2, comparing Feature1 to Feature3: the recall rate of Feature1 rises slowly and then gradually falls; for Feature2, precision and recall increase noticeably when the number of keywords is 4 to 8, after which precision falls somewhat; and the recall rate of Feature3 increases once the number of keywords exceeds 6. The reason is that as the number of keywords grows, the influence of position on keyword weight gradually decreases, while the influence of the word co-occurrence and context semantic features gradually increases. Meanwhile, comparing Feature5 with Feature7 shows that precision and recall increase after the word co-occurrence feature is added. This is because the word co-occurrence factor helps identify more phrase-type keywords, which are likely to correspond to abbreviated keywords that carry a certain general meaning but enjoy no position advantage; as the number of keywords increases, the keywords identified through the word co-occurrence feature are more likely to be included in the reference keyword set. Comparing Feature4 with Feature7 shows that the recall rate increases markedly from 4 keywords onward once the context semantic feature is added.
The reason is that as the number of keywords increases, candidate keywords characterized by rich context semantic information are also more likely to be selected as keywords. For the same number of keywords, comparing Feature1, Feature2, Feature3 and Feature7 shows that Feature7 achieves a better recognition effect than any single feature, owing to the combined advantages of the different features.
The automatic keyword extraction method provided by the invention fuses position features, word co-occurrence features and context semantic features to extract keywords, comprehensively considers the weight influence of positions inside documents and of context semantics on keywords, achieves high precision and recall, improves the retrieval quality of 3GPP technical standards, reduces labor cost, and can well meet the requirements of practical applications.
The above-mentioned embodiments only express embodiments of the present invention; their description is specific and detailed but shall not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. An automatic keyword extraction method, characterized by comprising the following steps: extracting common words; extracting candidate keywords; filtering common words from the candidate keywords; calculating a weight score for each candidate keyword by integrating position features, word co-occurrence features and context semantic features; calculating a dynamic threshold according to the weight score range of the candidate keywords; and determining the result keywords using the dynamic threshold.
2. The automatic keyword extraction method according to claim 1, wherein the method comprises:
step 1) removing text noise from the 3GPP technical standards;
step 2) extracting common words in the technical standards;
step 3) extracting candidate keywords based on the syntactic analysis tree and filtering out common words;
step 4) comprehensively considering the position features, word co-occurrence features and context semantic features of the candidate keywords in the document, calculating weight scores and ranking them, then calculating a dynamic threshold according to the actual score range of the technical standard, and adding candidate keywords whose scores exceed the threshold to the result keyword set.
3. The automatic keyword extraction method according to claim 1, wherein the step 1) is specifically: removing text noise from the 3GPP technical standards by parsing the technical standards with Apache POI.
4. The automatic keyword extraction method according to any one of claims 1 to 3, wherein the step 2) comprises: extracting common words in the technical standards based on the word frequency-document distribution entropy, the word frequency-document distribution entropy being an uncertainty measure of the distribution state of a word w over the technical standard set; let a document set consisting of n technical standards be denoted D = {d_1, d_2, …, d_i, …, d_n} and denote the word frequency-document distribution entropy of word w as H(w); then H(w) is calculated by the formula
where P(w, d_i) is the probability of word w occurring in technical standard d_i, 1 ≤ i ≤ n; according to the maximum likelihood estimation method, P(w, d_i) is calculated by the formula
where f(w, d_i) is the number of occurrences of word w in technical standard d_i.
5. The automatic keyword extraction method according to any one of claims 1 to 4, wherein extracting candidate keywords based on the dependency syntax analysis tree comprises:
Step 1: traversing the technical standard set D; for each technical standard d_i in D, dividing it into sentences according to punctuation, and representing the divided sentence set accordingly, with 1 ≤ i ≤ n_s, where n_s is the number of sentences in document d_i;
Step 2: performing dependency syntax analysis on each sentence in the set with the Stanford Parser to obtain the corresponding dependency parse tree set Trees(d_i), where T_i denotes the dependency parse tree corresponding to the i-th sentence of technical standard d_i;
Step 3: cyclically reading the dependency parse tree set Trees(d_i): for any dependency parse tree T_i ∈ Trees(d_i), taking each word and its corresponding part of speech in the tree together as a leaf node and traversing T_i in in-order; if the current node is a leaf node, judging whether its part of speech is a noun, verb or adjective, adding it to the candidate keyword set if so and otherwise jumping to the next node; if the current node is not a leaf node, judging whether it is a noun phrase, and if so continuing to traverse its right subtree recursively until the subtree contains no non-leaf node with a noun phrase as parent, at which point the child nodes of the noun phrase are added to the candidate keyword set as a whole;
Step 4: further filtering the candidate keyword set with the extracted common words: if an element of the candidate keyword set contains a common word, removing that element from the candidate keyword set.
6. The automatic keyword extraction method according to any one of claims 1 to 5, wherein the position feature weight is calculated as follows: for the text sections corresponding to headings of different levels in the 3GPP technical standard, sentence subsets are delimited by punctuation and the sentences in each set are numbered sequentially from 1; denote the candidate keyword set of technical standard d_i as CK(d_i) = {ck_1, ck_2, …, ck_i, …, ck_n}, where ck_i is any candidate keyword in the set and n is the number of candidate keywords, and denote the special position set as
SP = {Title, Scope, Reference, Definitions, Abbreviations, NOTE},
locate(ck_i) denotes the position at which candidate keyword ck_i appears, and the characteristic function Pos(ck_i) representing the weight assigned to ck_i in the position dimension is defined as
where Sno_{ck_i} denotes the number of the sentence in which candidate keyword ck_i is located, Snu_{ck_i} denotes the number of sentences in the text paragraph, and len(ck_i) denotes the number of words ck_i contains; the weights at different positions are averaged, with W(Pos(ck_i)) denoting the average position weight, giving
where fre(ck_i) denotes the frequency with which candidate keyword ck_i occurs in the same technical standard.
7. The automatic keyword extraction method according to any one of claims 1 to 6, wherein the word co-occurrence feature weight is calculated as follows:
denote the candidate keyword sets of all technical standards as CK = {CK(d_1), CK(d_2), …, CK(d_i), …, CK(d_n)}; for any candidate keyword ck_i of technical standard d_i, denote the words composing ck_i as cw_1, cw_2, …, cw_i, …, cw_m, where m is the number of words ck_i contains; let the co-occurring word set of cw_i be cooccur_i = {wco_1, wco_2, …, wco_i, …, wco_p}, where p is the size of the co-occurring word set, wco_j denotes a co-occurring word of cw_i, wco_j ∈ CK(d_i), and wco_1 ∩ wco_2 ∩ … ∩ wco_j ∩ … ∩ wco_p = {cw_i}, 1 ≤ j ≤ p; the contribution of cw_i to candidate keyword ck_i is then expressed as
where fre(wco_j) denotes the frequency of occurrence of wco_j, a co-occurring word of cw_i, and len(wco_j) denotes the number of words wco_j contains; when candidate keyword ck_i contains multiple words, the weight of ck_i in the word co-occurrence dimension is calculated as
8. The automatic keyword extraction method according to any one of claims 1 to 7, wherein the context semantic feature weight is calculated as follows:
the calculation task is decomposed into independently maximizing the probability of predicting each word of Context(w) from the current word w, with the objective function
where c_i ∈ Context(w), D is the technical standard corpus, θ is the model parameter, and the conditional probability P(c_i | w) is expressed as
where v_{c_i} and v_w are the vector representations of the words c_i and w respectively, c' ranges over all distinct words in the corpus, and v_{c'} is the vector representation of c'; each technical standard d_i in the technical standard set D is viewed as composed of a series of words w_1 … w_i … w_n; assuming mutual independence between words, for each candidate keyword ck_i of technical standard d_i, if it is of word type the prediction probability is calculated by the formula
and if it is of phrase type by the formula
after taking the logarithm of both sides of the above formula, the left-hand side log P(w_1 … w_i … w_n | ck_i) is taken as the weight measure of candidate keyword ck_i in the semantic dimension, denoted W(Sem(ck_i)); log P(w_1 … w_i … w_n | ck_i) is approximated as log P(c_1 … c_i … c_n | ck_i), where w_1 … w_i … w_n is the context of ck_i within the model window, abbreviated Context(ck_i); W(Sem(ck_i)) is then calculated as
9. The automatic keyword extraction method according to any one of claims 1 to 8, wherein the step 4) comprises:
for any candidate keyword ck_i of technical standard d_i, comprehensively considering the position feature, the word co-occurrence feature and the context semantic feature, and calculating the weight score of ck_i over the three feature dimensions by the formula
W(ck_i) = W(Pos(ck_i)) + W(Coo(ck_i)) + W(Sem(ck_i));
denoting the score corresponding to each candidate keyword ck_i of d_i as Score(d_i) = {W(ck_1) … W(ck_i) … W(ck_n)}, ranking the scores in Score(d_i) from high to low, and setting a dynamic threshold λ as the average of all the scores, calculated by the formula
and if a candidate keyword in d_i satisfies W(ck_i) ≥ λ, adding ck_i to the result keyword set.
10. The automatic keyword extraction method according to any one of claims 1 to 9, wherein the text noise comprises pictures, tables, formulas, special symbols and illegal characters.
CN201810611476.7A 2018-06-13 2018-06-13 Automatic keyword extraction method Active CN108920456B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810611476.7A CN108920456B (en) 2018-06-13 2018-06-13 Automatic keyword extraction method


Publications (2)

Publication Number Publication Date
CN108920456A true CN108920456A (en) 2018-11-30
CN108920456B CN108920456B (en) 2022-08-30

Family

ID=64419617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810611476.7A Active CN108920456B (en) 2018-06-13 2018-06-13 Automatic keyword extraction method

Country Status (1)

Country Link
CN (1) CN108920456B (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614626A (en) * 2018-12-21 2019-04-12 北京信息科技大学 Keyword Automatic method based on gravitational model
CN109960724A (en) * 2019-03-13 2019-07-02 北京工业大学 A kind of text snippet method based on TF-IDF
CN110134767A (en) * 2019-05-10 2019-08-16 云知声(上海)智能科技有限公司 A kind of screening technique of vocabulary
CN110147425A (en) * 2019-05-22 2019-08-20 华泰期货有限公司 A kind of keyword extracting method, device, computer equipment and storage medium
CN110377724A (en) * 2019-07-01 2019-10-25 厦门美域中央信息科技有限公司 A kind of corpus keyword Automatic algorithm based on data mining
CN111435405A (en) * 2019-01-15 2020-07-21 北京行数通科技有限公司 Method and device for automatically labeling key sentences of article
CN111552786A (en) * 2020-04-16 2020-08-18 重庆大学 Question-answering working method based on keyword extraction
CN111597793A (en) * 2020-04-20 2020-08-28 中山大学 Paper innovation measuring method based on SAO-ADV structure
CN111680509A (en) * 2020-06-10 2020-09-18 四川九洲电器集团有限责任公司 Method and device for automatically extracting text keywords based on co-occurrence language network
CN111985217A (en) * 2020-09-09 2020-11-24 吉林大学 Keyword extraction method and computing device
CN112988951A (en) * 2021-03-16 2021-06-18 福州数据技术研究院有限公司 Scientific research project review expert accurate recommendation method and storage device
CN113191145A (en) * 2021-05-21 2021-07-30 百度在线网络技术(北京)有限公司 Keyword processing method and device, electronic equipment and medium
CN113657113A (en) * 2021-08-24 2021-11-16 北京字跳网络技术有限公司 Text processing method and device and electronic equipment
CN113743090A (en) * 2021-09-08 2021-12-03 度小满科技(北京)有限公司 Keyword extraction method and device
CN113971216A (en) * 2021-10-22 2022-01-25 北京百度网讯科技有限公司 Data processing method and device, electronic equipment and memory
CN114492433A (en) * 2022-01-27 2022-05-13 南京烽火星空通信发展有限公司 Method for automatically selecting proper keyword combination to extract text
CN114626361A (en) * 2020-12-10 2022-06-14 广州视源电子科技股份有限公司 Sentence making method, sentence making model training method and device and computer equipment
CN118657634A (en) * 2024-08-21 2024-09-17 青岛中投创新技术转移有限公司 Patent analysis and evaluation method based on artificial intelligence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110004465A1 (en) * 2009-07-02 2011-01-06 Battelle Memorial Institute Computation and Analysis of Significant Themes
CN102929937A (en) * 2012-09-28 2013-02-13 福州博远无线网络科技有限公司 Text-subject-model-based data processing method for commodity classification
CN104281645A (en) * 2014-08-27 2015-01-14 北京理工大学 Method for identifying emotion key sentence on basis of lexical semantics and syntactic dependency
CN105843795A (en) * 2016-03-21 2016-08-10 华南理工大学 Topic model based document keyword extraction method and system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Du Yuncheng et al.: "Automatic Keyword Extraction Based on Character Co-occurrence Frequency", Journal of Beijing Information Science and Technology University *


Also Published As

Publication number Publication date
CN108920456B (en) 2022-08-30

Similar Documents

Publication Publication Date Title
CN108920456B (en) Automatic keyword extraction method
CN110874531B (en) Topic analysis method and device and storage medium
Beeferman et al. Statistical models for text segmentation
Sridhar Unsupervised topic modeling for short texts using distributed representations of words
CN107229610B (en) A method and device for analyzing sentiment data
US9317498B2 (en) Systems and methods for generating summaries of documents
US10437867B2 (en) Scenario generating apparatus and computer program therefor
US9275115B2 (en) Correlating corpus/corpora value from answered questions
US9892111B2 (en) Method and device to estimate similarity between documents having multiple segments
US8645418B2 (en) Method and apparatus for word quality mining and evaluating
EP3086240A1 (en) Complex predicate template gathering device, and computer program therefor
CN103678316A (en) Entity relationship classifying device and entity relationship classifying method
CN113988053A (en) Hot word extraction method and device
CN109614626A (en) Keyword automatic extraction method based on a gravitational model
CN104317783A (en) SRC calculation method
CN111310467B (en) Topic extraction method and system combining semantic inference in long text
Hofmann et al. Predicting the growth of morphological families from social and linguistic factors
CN111930949B (en) Search string processing method and device, computer readable medium and electronic equipment
CN106126501B (en) A noun word sense disambiguation method and device based on dependency constraints and knowledge
Wei et al. Query based summarization using topic background knowledge
CN111899832B (en) Medical theme management system and method based on context semantic analysis
Mendels et al. Collecting code-switched data from social media
CN115455975A (en) Method and device for extracting topic keywords based on multi-model fusion decision
JP5128328B2 (en) Ambiguity evaluation apparatus and program
CN114548113A (en) Event-based reference resolution system, method, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant