
CN108920456A - Automatic keyword extraction method - Google Patents

Automatic keyword extraction method

Info

Publication number
CN108920456A
CN108920456A (application CN201810611476.7A; granted as CN108920456B)
Authority
CN
China
Prior art keywords
keywords
candidate
word
technical standard
words
Prior art date
Legal status
Granted
Application number
CN201810611476.7A
Other languages
Chinese (zh)
Other versions
CN108920456B (en)
Inventor
吕学强
董志安
Current Assignee
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date
Filing date
Publication date
Application filed by Beijing Information Science and Technology University filed Critical Beijing Information Science and Technology University
Priority to CN201810611476.7A priority Critical patent/CN108920456B/en
Publication of CN108920456A publication Critical patent/CN108920456A/en
Application granted granted Critical
Publication of CN108920456B publication Critical patent/CN108920456B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to an automatic keyword extraction method, comprising: extracting the general words in the technical standards, extracting candidate keywords, filtering the general words out of the candidate keywords, computing a weighted score for each candidate keyword by integrating position features, term co-occurrence features, and contextual semantic features, computing a dynamic threshold from the range of candidate keyword scores, and determining the result keywords with that threshold. The automatic keyword extraction method provided by the invention fuses position features, term co-occurrence features, and contextual semantic features, and comprehensively weighs the influence of a term's position within the document and of its semantic context on the keyword weights. It achieves higher precision and recall, improves the retrieval quality of the 3GPP technical standards, reduces labor cost, and meets the needs of practical application well.

Description

Automatic keyword extraction method
Technical Field
The invention belongs to the technical field of automatic keyword extraction, and particularly relates to an automatic keyword extraction method oriented to the 3GPP technical standards.
Background
The explosive development of mobile communication technology has brought epoch-making changes to human society. As the maker of the leading technology standards in the communications field, the 3rd Generation Partnership Project (3GPP) is dedicated to producing 3G standards based on the evolved Global System for Mobile Communications (GSM) core network, including WCDMA, TD-SCDMA, EDGE, etc.
In recent years, there are many cases of patent infringement litigation disputes among large-scale communication technology companies, and the stability of patent rights of the invention is challenged unprecedentedly. The 3GPP technical standards play an irreplaceable important role in the work of examining communication patents.
The 3GPP technical standard is a scientific non-patent document specific to the patent examination work in the communication field, and is usually used as a comparison document to measure the creativity and novelty of the patent application in the communication field.
A typical 3GPP technical standard consists of a Cover, which mainly includes the standard number, release number, document title, and version number; a Foreword, which explains the version number; a Scope section, which declares the scope of application; a Reference section, which gives the reference list; Definitions and Abbreviations sections, which list the important definitions and abbreviations of the document; a Topic body, which presents the technical background and details; and an Annex, which mainly records the version change history.
In addition, there is a correlation between the 3GPP technical standard and the patent literature, and the difference between the 3GPP technical standard and the patent literature is shown in table 1.
Table 1: Differences between patent documents and the 3GPP technical standards
As can be seen from Table 1, the 3GPP technical standards have their own organization and document types. Of major interest in actual patent examination are the Technical Specifications (TS), Technical Reports (TR), and conference files. The technical specifications and technical reports together describe the relevant regulations, principles, simulations, and experimental results of a technology, while the conference files mainly record the specific meeting information of each working group. The technical specifications and technical reports are similar in content format, carry richer core technical information, and therefore have greater mining value.
In actual patent examination, the retrieval of 3GPP technical standards relies mainly on keywords manually selected by the examiner. The quality of the retrieval results often depends on the quality of those keywords, and the traditional approach is not only time- and labor-consuming, but also makes it difficult to guarantee the hit rate of the comparison documents. Compared with patent documents, the 3GPP technical standards feature wide coverage, a large amount of information, irregular formatting, and weak readability; these characteristics make automatic keyword extraction from the 3GPP technical standards harder than from patent documents. Therefore, improving automatic keyword extraction for the 3GPP technical standards is not only conducive to improving the efficiency of examining communication patents, but also of great significance for maintaining the stability of patent grants.
Automatic keyword extraction has been studied extensively at home and abroad, and the work generally falls into two major branches: supervised and unsupervised learning methods. Supervised methods usually convert keyword extraction into a binary or multi-class classification problem in machine learning, chiefly involving classification models such as Naive Bayes, Maximum Entropy, and Support Vector Machines. Although such methods predict well to a certain extent, their effectiveness depends on the labeling quality and scale of the training corpus, so excessive human effort cannot be avoided, and they adapt poorly to the massive data of practical applications. The most obvious advantage of unsupervised methods over supervised ones is the great saving in labor cost; by algorithmic idea they can be divided into statistics-based, topic-model-based, and word-graph-based extraction methods. Statistics-based methods generally measure candidate keyword weights by combining indicators such as term frequency, term frequency-inverse document frequency (TF-IDF), and the chi-square (χ²) statistic; being frequency-sensitive, they easily miss important low-frequency words. The classic representative of topic-model-based extraction is the LDA (Latent Dirichlet Allocation) algorithm, which, by analyzing a training corpus, infers the "document-topic" and "topic-term" probability distributions from the known "document-term" matrix; its effectiveness depends on the topic distribution of the training set.
Among word-graph-based methods, the TextRank algorithm is the most widely applied. Its idea derives from Google's PageRank algorithm: sentences or words in the text set form the graph nodes, the similarity between nodes serves as the edge weight, and an iterative voting mechanism ranks the nodes of the graph model by importance. The method does not depend on the number of texts, but it is limited in that it considers only information internal to a text and ignores the distribution of vocabulary across different texts. Mainstream methods at the present stage generally fuse the advantages of different approaches for the specific problem, yet they still suffer from shortcomings such as a lack of semantic features and poor recognition of low-frequency keywords.
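As a concrete instance of the statistics-based branch discussed above, the following is a minimal TF-IDF sketch (one common variant; real implementations differ in smoothing and normalization, so the exact formula here is an assumption for illustration):

```python
import math

def tf_idf(word, doc, docs):
    """TF-IDF weight of `word` in `doc` relative to the document set.

    tf: relative frequency within the document; idf: log of the inverse
    document frequency, with +1 smoothing in the denominator.
    """
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in docs if word in d)   # documents containing the word
    return tf * math.log(len(docs) / (1 + df))

docs = [["mac", "mac", "rlc"], ["rlc", "phy"], ["phy"]]
# "mac" is concentrated in one document, "rlc" is spread over two,
# so "mac" receives the higher weight in docs[0]
assert tf_idf("mac", docs[0], docs) > tf_idf("rlc", docs[0], docs)
```

Note how the frequency sensitivity criticized above is visible: a rare but important word with low tf receives a low score regardless of its meaning.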
Disclosure of Invention
In view of the above problems in the prior art, an object of the present invention is to provide an automatic keyword extraction method that can avoid the above technical drawbacks.
In order to achieve the above object, the present invention provides the following technical solutions:
an automatic keyword extraction method comprises the following steps: extracting the general words in the technical standards; extracting candidate keywords; filtering the general words out of the candidate keywords; computing candidate keyword weight scores by integrating position features, word co-occurrence features, and contextual semantic features; computing a dynamic threshold from the range of candidate keyword weight scores; and determining the result keywords with the dynamic threshold.
Further, the automatic keyword extraction method comprises the following steps:
step 1) removing text noise in the 3GPP technical standard;
step 2) extracting common words in the technical standard;
step 3) extracting candidate keywords and filtering common words based on the syntactic analysis tree;
and 4) comprehensively considering the position features, word co-occurrence features, and contextual semantic features of the candidate keywords in the document, calculating and ranking the weight scores, finally calculating a dynamic threshold from the actual score range of the technical standard, and adding the candidate keywords whose scores exceed the threshold to the result keyword set.
Further, step 1) is specifically: parsing the technical standards with Apache POI and removing the text noise in the 3GPP technical standards.
Further, the text noise includes pictures, tables, formulas, special symbols, and illegal characters.
Further, step 2) comprises: extracting the general words in the technical standards based on the word frequency-document distribution entropy, which measures the uncertainty of the distribution of a word w over the technical standard set. Let the document set composed of n technical standards be denoted D = {d_1, d_2, …, d_i, …, d_n}, and denote the word frequency-document distribution entropy of w as H(w); then H(w) is calculated as

H(w) = −Σ_{i=1..n} P(w, d_i) · log P(w, d_i),

where P(w, d_i) is the probability that the word w occurs in technical standard d_i, 1 ≤ i ≤ n. By the maximum likelihood estimation method, P(w, d_i) is calculated as

P(w, d_i) = f(w, d_i) / Σ_{j=1..n} f(w, d_j),

where f(w, d_i) is the number of occurrences of w in d_i.
Further, extracting candidate keywords based on the dependency parse tree comprises:
Step 1: traverse the technical standard set D; split each technical standard d_i in D into sentences at punctuation marks, and denote the resulting sentence set as Sentences(d_i) = {s_1, s_2, …, s_ns}, 1 ≤ i ≤ n, where ns is the number of sentences in document d_i.
Step 2: perform dependency parsing on each sentence in Sentences(d_i) with the Stanford Parser to obtain the corresponding dependency parse tree set Trees(d_i) = {T_1, T_2, …, T_ns}, where T_i denotes the parse tree of the i-th sentence of technical standard d_i.
Step 3: read the parse tree set Trees(d_i) in a loop. For any parse tree T_i ∈ Trees(d_i), take each word together with its part of speech as a leaf node and traverse T_i in order. If the current node is a leaf node, judge whether its part of speech is a noun, verb, or adjective; if so, add it to the candidate keyword set, otherwise skip to the next node. If the current node is not a leaf node, judge whether it is a noun phrase; if so, continue to recursively traverse its right subtree until the subtree contains no non-leaf node with a noun phrase as parent, at which point add the child nodes of that noun phrase to the candidate keyword set as a whole.
Step 4: further filter the candidate keyword set with the extracted general words; if an element of the candidate keyword set contains a general word, remove that element from the set.
Further, the position feature weight is calculated as follows. For the text under each heading level of a 3GPP technical standard, split the text into sentence sets at punctuation marks and number the sentences consecutively from 1. Denote the candidate keyword set of technical standard d_i as CK(d_i) = {ck_1, ck_2, …, ck_i, …, ck_n}, where ck_i is any candidate keyword in the set and n is the number of candidate keywords, and denote the special position set as
SP = {Title, Scope, Reference, Definitions, Abbreviations, NOTE}.
Let locate(ck_i) denote the position at which candidate keyword ck_i occurs, and define the characteristic function Pos(ck_i) as the weight of ck_i in the position dimension:

Pos(ck_i) = 1, if locate(ck_i) ∈ SP;
Pos(ck_i) = 1 − Sno(ck_i) / (Snu(ck_i) + len(ck_i)), otherwise,

where Sno(ck_i) is the number of the sentence containing ck_i, Snu(ck_i) is the number of sentences in the text paragraph, and len(ck_i) is the number of words ck_i contains. The weights of occurrences at different positions are averaged; denoting the average position weight by W(Pos(ck_i)),

W(Pos(ck_i)) = (1 / fre(ck_i)) · Σ_{j=1..fre(ck_i)} Pos_j(ck_i),

where fre(ck_i) is the frequency with which ck_i occurs in the same technical standard.
Further, the word co-occurrence feature weight is calculated as follows. Let the candidate keyword sets of all technical standards be CK = {CK(d_1), CK(d_2), …, CK(d_i), …, CK(d_n)}. For any candidate keyword ck_i of technical standard d_i, denote the words composing ck_i as cw_1, cw_2, …, cw_i, …, cw_m, where m is the number of words ck_i contains, and denote the co-occurring word set of cw_i as cocur_i = {wco_1, wco_2, …, wco_i, …, wco_p}, where p is the size of the set, each wco_j is a co-occurring word of cw_i with wco_j ∈ CK(d_i), and wco_1 ∩ wco_2 ∩ … ∩ wco_j ∩ … ∩ wco_p = {cw_i}, 1 ≤ j ≤ p. The contribution of cw_i to candidate keyword ck_i is then expressed as

Con(cw_i) = Σ_{j=1..p} fre(wco_j) · len(wco_j),

where fre(wco_j) is the frequency with which the co-occurring word wco_j of cw_i occurs and len(wco_j) is the number of words wco_j contains. When candidate keyword ck_i comprises several words, its weight in the word co-occurrence dimension is calculated as

W(Coo(ck_i)) = (1/m) · Σ_{i=1..m} Con(cw_i).
Further, the contextual semantic feature weight is calculated as follows. The computation is decomposed into maximizing the probability with which the current word w independently predicts each word of its context Context(w); the objective function is

L(θ) = Σ_{w∈D} Σ_{c_i∈Context(w)} log P(c_i | w),

where c_i ∈ Context(w), D is the technical standard corpus, and θ is the model parameter. The conditional probability P(c_i | w) is expressed as

P(c_i | w) = exp(v'_{c_i} · v_w) / Σ_{c'} exp(v'_{c'} · v_w),

where v'_{c_i} and v_w are the vector representations of the words c_i and w respectively, c' ranges over all non-repeating words of the corpus, and v_{c'} is the vector representation of c'. Each technical standard d_i in the set D is viewed as composed of a sequence of words w_1 … w_i … w_n, which are assumed mutually independent. For each candidate keyword ck_i of d_i, if ck_i is of word type, the prediction probability is calculated as

P(w_1 … w_i … w_n | ck_i) = Π_{i=1..n} P(w_i | ck_i);

if ck_i is of phrase type, the prediction probability is computed analogously, taking the constituent words of ck_i in turn.
After taking the logarithm of both sides, log P(w_1 … w_i … w_n | ck_i) on the left serves as the weight measure of candidate keyword ck_i in the semantic dimension, denoted W(Sem(ck_i)); log P(w_1 … w_i … w_n | ck_i) is approximated as log P(c_1 … c_i … c_n | ck_i), where w_1 … w_i … w_n is the context of ck_i within the model window, abbreviated Context(ck_i). Then

W(Sem(ck_i)) = Σ_{c∈Context(ck_i)} log P(c | ck_i).
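The semantic weight above can be sketched with toy vectors in place of trained Skip-gram embeddings. The softmax form of P(c|w) is standard; the 2-d vectors and tiny vocabulary below are illustrative assumptions, not part of the patent:

```python
import math

def softmax_logprob(v_ctx, v_w, vocab_vectors):
    """log P(c|w) under the Skip-gram softmax: v_c·v_w minus log-sum-exp
    over the context vectors of the whole vocabulary."""
    dot = sum(a * b for a, b in zip(v_ctx, v_w))
    z = sum(math.exp(sum(a * b for a, b in zip(vc, v_w)))
            for vc in vocab_vectors)
    return dot - math.log(z)

def semantic_weight(candidate_vec, context_vecs, vocab_vectors):
    """W(Sem(ck_i)): sum of log P(c|ck_i) over the candidate's context
    window, treating context words as conditionally independent."""
    return sum(softmax_logprob(vc, candidate_vec, vocab_vectors)
               for vc in context_vecs)

vocab = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # toy context vectors
w = semantic_weight([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], vocab)
assert w < 0   # log-probabilities are negative
```

A candidate whose vector aligns with its context scores higher, which is exactly what lets low-frequency but semantically central candidates survive.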
Further, the step 4) comprises:
to technical standard diAny one of the candidate keywords ckiComprehensively considering the position feature, the word co-occurrence feature and the context semantic feature, calculating the candidate keywords ckiThe weight scores in the three characteristic dimensions are formulated as
W(cki)=W(Pos(cki))+W(Coo(cki))+W(Sem(cki));
Denote the scores corresponding to the candidate keywords ck_i of d_i as
Score(d_i) = {W(ck_1) … W(ck_i) … W(ck_n)}. The scores in Score(d_i) are ranked from high to low, and the dynamic threshold λ is set to the average of all scores:

λ = (1/n) · Σ_{i=1..n} W(ck_i).

If a candidate keyword of d_i satisfies W(ck_i) ≥ λ, then ck_i is added to the result keyword set.
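The final scoring and dynamic-threshold selection can be sketched as follows, with made-up scores standing in for the combined W(Pos)+W(Coo)+W(Sem) values:

```python
def select_keywords(scores):
    """Keep candidates whose combined score W(ck) is at least the mean
    score lambda, returned in descending score order (a sketch of the
    dynamic-threshold step; scores are assumed precomputed)."""
    lam = sum(scores.values()) / len(scores)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [ck for ck in ranked if scores[ck] >= lam], lam

scores = {"MCH": 0.9, "channel": 0.6, "given": 0.1, "version": 0.2}
keywords, lam = select_keywords(scores)
assert lam == 0.45
assert keywords == ["MCH", "channel"]
```

Because λ is recomputed per technical standard, documents with generally low scores still yield keywords, which a fixed global threshold would not guarantee.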
The automatic keyword extraction method provided by the invention fuses position features, word co-occurrence features, and contextual semantic features to extract keywords, comprehensively weighing the influence of position within the document and of the semantic context on keyword weights. It achieves higher precision and recall, improves the retrieval quality of the 3GPP technical standards, reduces labor cost, and meets the needs of practical application well.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a dependency parse tree;
FIG. 3 is a comparison of the CBOW and Skip-gram model frameworks.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the scope of protection of the present invention.
The invention provides an automatic keyword extraction method. First, the general words in the 3GPP technical standards are extracted based on the word frequency-document distribution entropy; candidate keywords are then extracted by an algorithm based on the dependency parse tree. After the general words are filtered out of the candidates, a weight score is computed for each candidate keyword by integrating position features, word co-occurrence features, and contextual semantic features; a dynamic threshold is computed from the range of candidate keyword weight scores of each technical standard; and finally the result keywords are determined with that threshold. Specifically, as shown in FIG. 1, the automatic keyword extraction method comprises the following steps:
step 1) preprocessing the 3GPP technical standards, mainly by parsing them with Apache POI to remove text noise such as pictures, tables, formulas, special symbols, and illegal characters;
step 2) extracting common words in all technical standards based on the word frequency-document distribution entropy;
step 3) segmenting each technical standard into sentence subsets, performing dependency syntax analysis on each sentence, extracting candidate keywords based on a dependency syntax analysis tree and filtering common words;
and 4) comprehensively considering the position features, word co-occurrence features, and contextual semantic features of the candidate keywords in the document, calculating and ranking the weight scores, finally calculating a dynamic threshold from the actual score range of the technical standard, and adding the candidate keywords whose scores exceed the threshold to the result keyword set.
The 3GPP technical standards contain not only simple stop words such as "if", "at", "not", and "or", but also general words that run through most technical standards, such as "Figure", "version", "general", and "given", which are specific to the technical standards and carry no representativeness or importance. It has been observed that both the simple stop words and the general words specific to the technical standards appear, with differing frequencies, in technical standards of different versions and types; they are highly common and generally cannot summarize or abstract the content of a specific technical standard. These words are collectively referred to as general words.
Clearly, coverage is not comprehensive enough if only a manually collected common stop-word list is used. Therefore, to reduce the interference of general words with the keyword extraction task as far as possible, the concept of word frequency-document distribution entropy is introduced, combining the information entropy principle, to obtain the technical standard general words automatically. Information entropy was first introduced into information theory by Shannon to measure the uncertainty of a discrete random variable: the larger the entropy value, the greater the uncertainty of the corresponding random variable. Similarly, regarding the word w as a random variable, the word frequency-document distribution entropy is defined as follows.
Definition 1 word frequency-document distribution entropy refers to a measure of uncertainty in the state of distribution of a word w in a set of technical standards.
Let the document set composed of n technical standards be denoted D = {d_1, d_2, …, d_i, …, d_n}, and denote the word frequency-document distribution entropy of the word w as H(w); then H(w) is calculated as shown in formula (1):

H(w) = −Σ_{i=1..n} P(w, d_i) · log P(w, d_i),    (1)

where P(w, d_i) is the probability that the word w occurs in technical standard d_i, 1 ≤ i ≤ n. By the maximum likelihood estimation method, P(w, d_i) can be calculated by formula (2):

P(w, d_i) = f(w, d_i) / Σ_{j=1..n} f(w, d_j),    (2)

where f(w, d_i) is the number of occurrences of w in d_i. It can be seen that if more technical standards contain w and w is distributed more evenly over the technical standard set, the word frequency-document distribution entropy H(w) is larger, indicating greater uncertainty in the distribution of w over the technical standard set D, and thus that w is more likely a general word of no significance in the set.
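Formulas (1) and (2) can be sketched as follows, assuming each technical standard has already been tokenized into a list of words (the sample documents are illustrative only):

```python
import math
from collections import Counter

def distribution_entropy(word, documents):
    """Word frequency-document distribution entropy H(w) over a set of
    technical standards, each given as a list of tokens. Words spread
    evenly across many documents get high entropy (likely general words);
    words concentrated in one document get low entropy."""
    freqs = [Counter(doc)[word] for doc in documents]
    total = sum(freqs)
    if total == 0:
        return 0.0
    entropy = 0.0
    for f in freqs:
        if f > 0:
            p = f / total          # maximum-likelihood estimate P(w, d_i)
            entropy -= p * math.log(p, 2)
    return entropy

docs = [["mac", "channel", "the"], ["the", "rlc"], ["the", "mac", "the"]]
# "the" appears in all three documents -> higher entropy than "rlc"
assert distribution_entropy("the", docs) > distribution_entropy("rlc", docs)
```

Ranking all words by H(w) and taking the top of the list then yields the automatically extracted general words.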
Statistics show that most keywords are content-word phrases, generally nouns, verbs, and adjectives, and contain neither stop words without practical meaning nor general words distributed evenly over the technical standards. The keyword categories are therefore defined as verbs, adjectives, nouns, and noun phrases after removal of the general words. To extract candidate keywords with semantic coherence and complete syntactic modification, the sentences of the 3GPP technical standards are dependency-parsed, and the noun phrases, verbs, adjectives, and nouns satisfying syntactic modification consistency are extracted by means of the dependency parse tree and added to the candidate keyword set; for noun phrases, the NP of minimum granularity in the parse tree is taken as the candidate keyword. Finally, the general words are filtered from the candidate keyword set. For example, parsing the sentence "Logical channels are SAPs between MAC and RLC" yields the result shown in FIG. 2.
As can be seen from FIG. 2, the adjective "Logical" modifies the noun "channels", and "Logical" and "channels" form a noun phrase (NP); "SAPs" and "MAC and RLC" are noun phrases (NP), and "SAPs between MAC and RLC" is, as a whole, also a noun phrase (NP). In the parse tree, however, the noun phrase "MAC and RLC" forms with "between" a prepositional phrase (PP) that, together with "SAPs", is a child node of the larger NP, the two being siblings. Clearly, the noun phrase "MAC and RLC" is of smaller granularity than "SAPs between MAC and RLC". Therefore "Logical", "channels", "Logical channels", "are", "SAPs", "MAC", "RLC", and "MAC and RLC" are selected as the candidate keywords of the example sentence, and the candidates are then filtered with the extracted general words. Based on this analysis, the candidate keyword extraction algorithm based on the dependency parse tree comprises the following steps:
step 1: traversing skillSet of technical criteria D, for each technical criterion D in DiDividing into sentences according to punctuations, and representing the divided sentence sets as
1≤i≤ns,nsAs a document diThe number of Chinese sentences.
Step 2: for set sequences (d)i) Each sentence in the tree is subjected to dependency syntax analysis by using a Stanford Parser syntax analyzer to obtain a corresponding dependency syntax analysis tree set Trees (d)i) Memory for recordingWherein T isiIndicates technical standard diThe ith sentence in the sentence is corresponding to the stored syntax parse tree.
Step 3: cyclic read dependency parse tree set Trees (d)i) For any dependency syntax tree Ti∈Trees(di) Taking the words and corresponding parts of speech in the syntactic dependency tree as a whole as leaf nodes, and traversing the T in a medium-order and orderly modeiIf the current node is a leaf node (not the last leaf node), judging whether the part of speech of the node is a noun, a verb and an adjective, adding the node into the candidate keyword set if the conditions are met, and otherwise, jumping to the next node; if the current node is not a leaf node, judging whether the current node is a Noun Phrase (NP) or not, if the current node is the noun phrase and the right subtree is not empty, continuing to recursively traverse the right subtree of the current node until no non-leaf node taking the NP as a parent node exists in the subtree, and adding the child nodes of the NP into the candidate keyword set as a whole.
Step 4: since some technical standard common words still exist in the candidate keywords extracted in the previous step, the candidate keyword set needs to be further filtered by using the extracted common words, and if an element containing the common words exists in the candidate keyword set, the element is removed from the candidate keyword set.
By analyzing the characteristics of the 3GPP technical standards, it can be found that, besides the body text, the Scope, Reference, Definitions, and Abbreviations portions have important reference value for the whole document and should be treated as key positions. The content of each chapter of the body usually develops around its nearest heading, so a heading can be regarded as a distillation of the core content of the corresponding paragraphs, and candidate keywords appearing at that position should be given higher weight. Similarly, what appears in the NOTE sections generally serves as extra emphasis or supplementary description of the text and should therefore also be treated as a special position.
Therefore, the position at which a candidate keyword appears in a 3GPP technical standard is taken as a weight factor. For the text under each heading level of the standard, sentence sets are split at punctuation marks and the sentences numbered consecutively from 1; the smaller the number of the sentence containing a candidate keyword, the closer it lies to the heading, and the likelier it is to be a keyword that captures the theme. Denote the candidate keyword set of technical standard d_i as CK(d_i) = {ck_1, ck_2, …, ck_i, …, ck_n}, where ck_i is any candidate keyword in the set and n is the number of candidate keywords, and denote the special position set as
SP = {Title, Scope, Reference, Definitions, Abbreviations, NOTE}.
Let locate(ck_i) be the position at which candidate keyword ck_i occurs, and define the characteristic function Pos(ck_i) as the weight of ck_i in the position dimension; Pos(ck_i) can then be expressed as shown in formula (3):

Pos(ck_i) = 1, if locate(ck_i) ∈ SP;
Pos(ck_i) = 1 − Sno(ck_i) / (Snu(ck_i) + len(ck_i)), otherwise,    (3)

where Sno(ck_i) is the number of the sentence containing ck_i, Snu(ck_i) is the number of sentences in the text paragraph, and len(ck_i) is the number of words ck_i contains; len(ck_i) is added to the denominator to prevent the position weight from being 0. Because candidate keyword ck_i may occur several times at different positions of technical standard d_i, the weights of the different positions are averaged; denoting the average position weight W(Pos(ck_i)), it is calculated as shown in formula (4):

W(Pos(ck_i)) = (1 / fre(ck_i)) · Σ_{j=1..fre(ck_i)} Pos_j(ck_i),    (4)

where fre(ck_i) is the frequency with which ck_i occurs in the same technical standard. Taking the average strengthens the weight of candidate keywords that occur with low frequency but at special positions, and weakens the bias of computing candidate keyword weights from the frequency feature alone.
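The position weight can be sketched as follows, assuming weight 1 at special positions and a weight that decreases with the sentence number elsewhere, with len(ck_i) keeping the weight above zero; since the patent's formula images are not reproduced here, the exact non-special form is an assumption:

```python
SPECIAL_POSITIONS = {"Title", "Scope", "Reference", "Definitions",
                     "Abbreviations", "NOTE"}

def position_weight(occurrences, num_words):
    """Average position weight W(Pos(ck_i)) over all occurrences.

    Each occurrence is (location, sentence_no, sentences_in_paragraph).
    Assumed form: 1.0 at special positions, otherwise
    1 - sno / (snu + num_words), so earlier sentences score higher and
    the num_words term keeps the weight strictly positive.
    """
    scores = []
    for location, sno, snu in occurrences:
        if location in SPECIAL_POSITIONS:
            scores.append(1.0)
        else:
            scores.append(1.0 - sno / (snu + num_words))
    return sum(scores) / len(scores)    # average over fre(ck_i) occurrences

# one occurrence in Scope, one in the 2nd of 10 body sentences
occ = [("Scope", 1, 4), ("body", 2, 10)]
w = position_weight(occ, num_words=2)
assert abs(w - (1.0 + (1 - 2 / 12)) / 2) < 1e-9
```

Averaging over occurrences is what boosts a low-frequency candidate that happens to sit in a Title or NOTE, as the surrounding text argues.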
Word co-occurrence is a factor that cannot be neglected in keyword extraction. Observation of the candidate keywords extracted from the 3GPP technical standards shows that the constituent words of one candidate word often recur in other candidate words of different lengths. For example, among the three candidate keywords "MCH", "MCH transmission" and "MCH subframe allocation", the word "MCH" occurs in the two other candidate keywords of different lengths, so "MCH transmission" and "MCH subframe allocation" can be regarded as co-occurring words of "MCH"; co-occurring words often express more specific information than the individual constituent word. Therefore, if a word composing a candidate keyword has many co-occurring words, the word is considered to carry richer meaning and should be given a higher weight. Based on this analysis, the co-occurring word frequency and word length of the words composing a candidate keyword are used as the word co-occurrence features to calculate the candidate keyword's weight.
Denote the candidate keyword sets of all technical standards as CK = {CK(d_1), CK(d_2), …, CK(d_i), …, CK(d_n)}. For any candidate keyword ck_i of technical standard d_i, denote the words composing ck_i as cw_1, cw_2, …, cw_i, …, cw_m, where m is the number of words ck_i contains, and let the co-occurring word set of cw_i be cooccur_i = {wco_1, wco_2, …, wco_i, …, wco_p}, where p is the size of the co-occurring word set, wco_j denotes a co-occurring word of cw_i, wco_j ∈ CK(d_i), and wco_1 ∩ wco_2 ∩ … ∩ wco_j ∩ … ∩ wco_p = {cw_i}, 1 ≤ j ≤ p. The contribution of cw_i to candidate keyword ck_i can then be expressed by equation (5);
where fre(wco_j) denotes the frequency of occurrence of wco_j, a co-occurring word of cw_i, and len(wco_j) denotes the number of words wco_j contains. When candidate keyword ck_i contains multiple words, its weight in the word co-occurrence dimension is calculated as shown in equation (6).
It can be seen that the more frequently the constituent words of candidate keyword ck_i occur among co-occurring words, the greater each constituent word's contribution to ck_i, and hence the greater the weight of ck_i in the word co-occurrence feature dimension.
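The co-occurrence weight can be sketched as follows. Equations (5) and (6) are images in the source, so this is a hedged reading of the prose: a constituent word contributes the sum of frequency times word-length over its co-occurring words (equation (5)), and a multi-word candidate averages its constituent words' contributions (equation (6)).

```python
# Word co-occurrence weight sketch; the exact forms of equations (5)-(6) are
# assumptions, since they appear only as images in the source.

def contribution(cooccurring):
    # cooccurring: (frequency, word_count) pairs for the wco_j of one cw_i.
    return float(sum(fre * length for fre, length in cooccurring))

def coo_weight(constituents):
    # constituents: one co-occurring-word list per constituent word of ck_i.
    return sum(contribution(c) for c in constituents) / len(constituents)

# "MCH" co-occurs in "MCH transmission" (2 words, seen 4 times) and in
# "MCH subframe allocation" (3 words, seen 2 times); the counts are invented.
w = coo_weight([[(4, 2), (2, 3)]])
```

Weighting by length rewards longer co-occurring words, consistent with the observation that they express more specific information.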
Keywords generally condense the core content of a technical standard to a high degree, and commonly embody its gist from different semantic levels. The influence on the weights of the candidate keywords' semantic features in context therefore cannot be ignored. Since word vectors represent semantic characteristics well, Word2vec is introduced to calculate the weight of the candidate keywords in the semantic feature dimension.
Word2vec is a tool released by Google, based on deep learning ideas, that addresses problems such as poor model generalization and the curse of dimensionality in statistical language model computation. Word2vec comprises two training models, CBOW and Skip-gram; to reduce the complexity of model solving, two training optimization methods, Hierarchical Softmax (HS) and Negative Sampling (NS), are provided, and a training framework is formed by combining a training model with an optimization method. As shown in Fig. 3, the training frameworks formed by the two models share a common structure of an Input Layer, a Projection Layer and an Output Layer; the difference is that the CBOW-based framework predicts the current word w from the semantic context in which words appear, while the Skip-gram-based framework predicts the context semantic information from the current word w.
To predict Context(w) (with window size c) from the current word w, the Skip-gram model decomposes the calculation into independently maximizing the probability of predicting each word of Context(w) from w; the objective function is
where c_i ∈ Context(w), D is the technical standard corpus, θ is the model parameter, and the conditional probability P(c_i | w) is expressed by Softmax normalization, as shown in equation (7);
where v_{c_i} and v_w are the vector representations of the words c_i and w respectively, c' ranges over all distinct words in the corpus (their number is large, so Hierarchical Softmax or Negative Sampling optimization can be adopted), and v_{c'} is the vector representation of c'. Each technical standard d_i in the technical standard set D is viewed as composed of a series of words w_1 … w_i … w_n. Assuming mutual independence between words, for each candidate keyword ck_i of technical standard d_i, the prediction probability is calculated with equation (8) if ck_i is of word type, and with equation (9) if it is of phrase type;
where P(w_j | ck_i) is calculated by substituting variables into equation (7). It can be seen that the larger the prediction probability P(w_1 … w_i … w_n | ck_i), the better candidate keyword ck_i predicts the context information, and the more likely it is to be a keyword characterizing the full-text information. To avoid, as far as possible, underflow errors caused by multiplying very small conditional probabilities, the logarithm of both sides is taken, and the left-hand side log P(w_1 … w_i … w_n | ck_i) is used as the weight measure of candidate keyword ck_i in the semantic dimension, denoted W(Sem(ck_i)). Meanwhile, considering that Word2vec training relates similar words to each other, to simplify calculation log P(w_1 … w_i … w_n | ck_i) is approximated as log P(c_1 … c_i … c_n | ck_i), where w_1 … w_i … w_n is the context of ck_i within the model window, abbreviated Context(ck_i). W(Sem(ck_i)) is then calculated as shown in equation (10);
For any candidate keyword ck_i of technical standard d_i, the position feature, word co-occurrence feature and context semantic feature are considered together, and the weight score of ck_i over the three feature dimensions is calculated with equation (11).
W(ck_i) = W(Pos(ck_i)) + W(Coo(ck_i)) + W(Sem(ck_i))
(11).
Fusing three different features avoids the impact that any single insufficient feature factor would have on the keyword extraction effect. Denote the scores corresponding to the candidate keywords ck_i of d_i as
Score(d_i) = {W(ck_1) … W(ck_i) … W(ck_n)}; the scores in Score(d_i) are ranked from high to low, and a dynamic threshold λ is set as the average of all the scores, calculated as shown in equation (12);
If a candidate keyword in d_i satisfies W(ck_i) ≥ λ, ck_i is added to the result keyword set. A fixed threshold is not used because technical standards differ in length and therefore yield different candidate keyword score ranges; the dynamic threshold is instead set from the actual score range of each individual technical standard.
Experiments were carried out with the method. The experimental data were taken from the technical standards (including technical specifications and technical reports) on the 3GPP website as of 2016; after de-noising, 8000 documents were obtained in total. The valid series numbers of the technical standards range over 01-12, 21-38, 41-46, 48-52 and 55, 42 series in total; each series contains multiple versions, 14 GB in all, and each technical standard consists of Cover, Foreword, Scope, References, Definitions and Abbreviations, Topic body and Annex parts.
In the experiments, three evaluation indexes commonly used in natural language processing tasks are adopted to evaluate the keyword extraction effect: precision (P), recall (R) and F-score. Their calculation methods are shown in equations (13) to (15) respectively.
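Equations (13) to (15) are the standard precision, recall and F-score definitions; a sketch follows, using plain set overlap as the matching rule (the experiment additionally lemmatizes and matches abbreviation/full-name pairs before comparing).

```python
# Standard P/R/F computation over an extracted set and a reference set.

def prf(extracted, reference):
    correct = len(set(extracted) & set(reference))
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Invented example: 2 of 3 extracted keywords appear in a 4-item reference set.
p, r, f = prf(["MCH", "MBMS", "subframe"], ["MCH", "MBMS", "allocation", "RRC"])
```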
Technical standard common words were extracted from the preprocessed technical standards with the method based on word frequency-document distribution entropy. Through repeated experiments, the optimal word frequency-document distribution entropy threshold was determined to be 5.42; words above this threshold were selected as technical standard common words, 13566 in total. Part of the common word extraction results are shown in Table 2.
Table 2 partial common word extraction results
Serial number Common words H(W) Serial number Common words H(W)
1 version 10.9665 11 all 9.9539
2 should 10.8165 12 possible 9.8908
3 latest 10.7022 13 foreword 9.8543
4 approve 10.6394 14 through 9.8097
5 specification 10.5639 15 modify 9.7739
6 update 10.4934 16 restriction 9.6978
7 present 10.2963 17 this 9.6536
8 within 10.1056 18 available 9.6281
9 be 10.0572 19 release 9.5941
10 further 10.0188 20 when 9.5148
As can be seen from Table 2, the algorithm based on word frequency-document distribution entropy extracts not only common stop words such as "all", "this" and "when", but also words common in the technical standards, for example "version", "specification" and "release". Most technical standard common words can be effectively obtained with this method.
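The word frequency-document distribution entropy behind the H(W) values in Table 2 (defined in claim 4) can be sketched as follows; log base 2 is an assumption, as the source does not state the base.

```python
# Word frequency-document distribution entropy of claim 4: P(w, d_i) is the
# maximum-likelihood share of w's occurrences falling in document d_i, and
# H(w) is the entropy of that distribution. Words spread evenly over many
# documents (common words) score high.
import math

def distribution_entropy(counts):
    # counts: f(w, d_i), the occurrences of word w in each technical standard.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

even = distribution_entropy([3] * 1024)     # spread evenly over 1024 documents
skewed = distribution_entropy([300, 1, 1])  # concentrated in one document
```

A word distributed uniformly over 1024 documents scores log2(1024) = 10, on the order of the Table 2 values for a corpus of 8000 documents, while a topic-specific word concentrated in a few documents scores far lower and survives the 5.42 threshold filter.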
After the candidate keyword set of each technical standard was filtered with the common word list, the weights corresponding to the position feature, word co-occurrence feature and context semantic feature were calculated respectively. For the context semantic feature, the experiment trained on the 14 GB of technical standards with the Skip-gram model of Word2vec and the Hierarchical Softmax optimization method, with the context window set to 10 and the vector dimension to 200, yielding a 965.1 MB model file after 10 iterations. To analyze the effect of different features on technical standard keyword extraction, the feature combinations compared in the experiment are shown in Table 3.
Table 3 Combinations of features
Combining equations (3) to (11), the candidate keyword scores of each technical standard under the different feature combinations were calculated, the dynamic threshold was computed with equation (12), and the candidate keywords satisfying the condition were screened out as the recognized keywords. Meanwhile, 1000 technical standards covering different series and versions were randomly drawn from the 8000, and 2, 4, 6, 8 and 10 keywords were selected from each technical standard as reference keyword sets, taking the intersection of three annotators' cross labeling. The recognized keywords and the manually labeled reference keyword sets were lemmatized and then compared; if a recognized keyword has the same lemma as a labeled keyword, or the two are abbreviation and full name of each other, the recognition is counted as correct. The precision, recall and F-score of the different feature combinations under different numbers of keywords were then counted; the experimental results are shown in Table 4.
Table 4 Keyword extraction results under different feature combinations
As can be seen from Table 4, when the number of keywords is 2, the recall rates of Feature1, Feature4, Feature5 and Feature7 are higher than those of the other feature combinations. This is because, when the number of keywords is small, candidate keywords appearing at special positions are more likely to be correctly recognized as keywords; at the same time, words at special positions provide little context semantic information, so the position feature is relatively dominant for keywords appearing in technical standards. As the number of keywords increases from 2, comparing Feature1 to Feature3: the recall rate of Feature1 rises slowly and then gradually falls; for Feature2, precision and recall increase noticeably when the number of keywords is 4 to 8, after which precision falls somewhat; and the recall rate of Feature3 increases once the number of keywords exceeds 6. The reason is that as the number of keywords grows, the influence of position on keyword weight gradually decreases, while the influence of the word co-occurrence and context semantic features gradually increases. Meanwhile, comparing Feature5 with Feature7 shows that precision and recall increase after the word co-occurrence feature is added. This is because the word co-occurrence factor helps identify more phrase-type keywords, which are likely to correspond to abbreviated keywords that carry a certain general meaning but enjoy no position advantage; as the number of keywords increases, the keywords identified through the word co-occurrence feature are more likely to be included in the reference keyword set. Comparing Feature4 with Feature7 shows that the recall rate increases markedly from 4 keywords onward once the context semantic feature is added.
The reason is that as the number of keywords increases, candidate keywords characterized by rich context semantic information are also more likely to be selected as keywords. For the same number of keywords, comparing Feature1, Feature2, Feature3 and Feature7 shows that Feature7 achieves a better recognition effect than any single feature, owing to the combined advantages of the different features.
The automatic keyword extraction method provided by the invention fuses position features, word co-occurrence features and context semantic features to extract keywords, comprehensively considers the weight influence of positions inside documents and of context semantics on keywords, achieves high precision and recall, improves the retrieval quality of 3GPP technical standards, reduces labor cost, and can well meet the requirements of practical applications.
The above-mentioned embodiments only express embodiments of the present invention; their description is specific and detailed but shall not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. An automatic keyword extraction method, characterized by comprising the following steps: extracting common words; extracting candidate keywords; filtering common words from the candidate keywords; calculating a weight score for each candidate keyword by integrating position features, word co-occurrence features and context semantic features; calculating a dynamic threshold according to the weight score range of the candidate keywords; and determining the result keywords using the dynamic threshold.
2. The automatic keyword extraction method according to claim 1, wherein the method comprises:
step 1) removing text noise from the 3GPP technical standards;
step 2) extracting common words in the technical standards;
step 3) extracting candidate keywords based on the syntactic analysis tree and filtering out common words;
step 4) comprehensively considering the position features, word co-occurrence features and context semantic features of the candidate keywords in the document, calculating weight scores and ranking them, then calculating a dynamic threshold according to the actual score range of the technical standard, and adding candidate keywords whose scores exceed the threshold to the result keyword set.
3. The automatic keyword extraction method according to claim 1, wherein the step 1) is specifically: removing text noise from the 3GPP technical standards by parsing the technical standards with Apache POI.
4. The automatic keyword extraction method according to any one of claims 1 to 3, wherein the step 2) comprises: extracting common words in the technical standards based on the word frequency-document distribution entropy, the word frequency-document distribution entropy being an uncertainty measure of the distribution state of a word w over the technical standard set; let a document set consisting of n technical standards be denoted D = {d_1, d_2, …, d_i, …, d_n} and denote the word frequency-document distribution entropy of word w as H(w); then H(w) is calculated by the formula
where P(w, d_i) is the probability of word w occurring in technical standard d_i, 1 ≤ i ≤ n; according to the maximum likelihood estimation method, P(w, d_i) is calculated by the formula
where f(w, d_i) is the number of occurrences of word w in technical standard d_i.
5. The automatic keyword extraction method according to any one of claims 1 to 4, wherein extracting candidate keywords based on the dependency syntax analysis tree comprises:
Step 1: traversing the technical standard set D; for each technical standard d_i in D, dividing it into sentences according to punctuation, and representing the divided sentence set accordingly, with 1 ≤ i ≤ n_s, where n_s is the number of sentences in document d_i;
Step 2: performing dependency syntax analysis on each sentence in the set with the Stanford Parser to obtain the corresponding dependency parse tree set Trees(d_i), where T_i denotes the dependency parse tree corresponding to the i-th sentence of technical standard d_i;
Step 3: cyclically reading the dependency parse tree set Trees(d_i): for any dependency parse tree T_i ∈ Trees(d_i), taking each word and its corresponding part of speech in the tree together as a leaf node and traversing T_i in in-order; if the current node is a leaf node, judging whether its part of speech is a noun, verb or adjective, adding it to the candidate keyword set if so and otherwise jumping to the next node; if the current node is not a leaf node, judging whether it is a noun phrase, and if so continuing to traverse its right subtree recursively until the subtree contains no non-leaf node with a noun phrase as parent, at which point the child nodes of the noun phrase are added to the candidate keyword set as a whole;
Step 4: further filtering the candidate keyword set with the extracted common words: if an element of the candidate keyword set contains a common word, removing that element from the candidate keyword set.
6. The automatic keyword extraction method according to any one of claims 1 to 5, wherein the position feature weight is calculated as follows: for the text sections corresponding to headings of different levels in the 3GPP technical standard, sentence subsets are delimited by punctuation and the sentences in each set are numbered sequentially from 1; denote the candidate keyword set of technical standard d_i as CK(d_i) = {ck_1, ck_2, …, ck_i, …, ck_n}, where ck_i is any candidate keyword in the set and n is the number of candidate keywords, and denote the special position set as
SP = {Title, Scope, Reference, Definitions, Abbreviations, NOTE},
locate(ck_i) denotes the position at which candidate keyword ck_i appears, and the characteristic function Pos(ck_i) representing the weight assigned to ck_i in the position dimension is defined as
where Sno_{ck_i} denotes the number of the sentence in which candidate keyword ck_i is located, Snu_{ck_i} denotes the number of sentences in the text paragraph, and len(ck_i) denotes the number of words ck_i contains; the weights at different positions are averaged, with W(Pos(ck_i)) denoting the average position weight, giving
where fre(ck_i) denotes the frequency with which candidate keyword ck_i occurs in the same technical standard.
7. The automatic keyword extraction method according to any one of claims 1 to 6, wherein the word co-occurrence feature weight is calculated as follows:
denote the candidate keyword sets of all technical standards as CK = {CK(d_1), CK(d_2), …, CK(d_i), …, CK(d_n)}; for any candidate keyword ck_i of technical standard d_i, denote the words composing ck_i as cw_1, cw_2, …, cw_i, …, cw_m, where m is the number of words ck_i contains; let the co-occurring word set of cw_i be cooccur_i = {wco_1, wco_2, …, wco_i, …, wco_p}, where p is the size of the co-occurring word set, wco_j denotes a co-occurring word of cw_i, wco_j ∈ CK(d_i), and wco_1 ∩ wco_2 ∩ … ∩ wco_j ∩ … ∩ wco_p = {cw_i}, 1 ≤ j ≤ p; the contribution of cw_i to candidate keyword ck_i is then expressed as
where fre(wco_j) denotes the frequency of occurrence of wco_j, a co-occurring word of cw_i, and len(wco_j) denotes the number of words wco_j contains; when candidate keyword ck_i contains multiple words, the weight of ck_i in the word co-occurrence dimension is calculated as
8. The automatic keyword extraction method according to any one of claims 1 to 7, wherein the context semantic feature weight is calculated as follows:
the calculation task is decomposed into independently maximizing the probability of predicting each word of Context(w) from the current word w, with the objective function
where c_i ∈ Context(w), D is the technical standard corpus, θ is the model parameter, and the conditional probability P(c_i | w) is expressed as
where v_{c_i} and v_w are the vector representations of the words c_i and w respectively, c' ranges over all distinct words in the corpus, and v_{c'} is the vector representation of c'; each technical standard d_i in the technical standard set D is viewed as composed of a series of words w_1 … w_i … w_n; assuming mutual independence between words, for each candidate keyword ck_i of technical standard d_i, if it is of word type the prediction probability is calculated by the formula
and if it is of phrase type by the formula
after taking the logarithm of both sides of the above formula, the left-hand side log P(w_1 … w_i … w_n | ck_i) is taken as the weight measure of candidate keyword ck_i in the semantic dimension, denoted W(Sem(ck_i)); log P(w_1 … w_i … w_n | ck_i) is approximated as log P(c_1 … c_i … c_n | ck_i), where w_1 … w_i … w_n is the context of ck_i within the model window, abbreviated Context(ck_i); W(Sem(ck_i)) is then calculated as
9. The automatic keyword extraction method according to any one of claims 1 to 8, wherein the step 4) comprises:
for any candidate keyword ck_i of technical standard d_i, comprehensively considering the position feature, the word co-occurrence feature and the context semantic feature, and calculating the weight score of ck_i over the three feature dimensions by the formula
W(ck_i) = W(Pos(ck_i)) + W(Coo(ck_i)) + W(Sem(ck_i));
denoting the score corresponding to each candidate keyword ck_i of d_i as Score(d_i) = {W(ck_1) … W(ck_i) … W(ck_n)}, ranking the scores in Score(d_i) from high to low, and setting a dynamic threshold λ as the average of all the scores, calculated by the formula
and if a candidate keyword in d_i satisfies W(ck_i) ≥ λ, adding ck_i to the result keyword set.
10. The automatic keyword extraction method according to any one of claims 1 to 9, wherein the text noise comprises pictures, tables, formulas, special symbols and illegal characters.
CN201810611476.7A 2018-06-13 2018-06-13 Automatic keyword extraction method Active CN108920456B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810611476.7A CN108920456B (en) 2018-06-13 2018-06-13 Automatic keyword extraction method


Publications (2)

Publication Number Publication Date
CN108920456A true CN108920456A (en) 2018-11-30
CN108920456B CN108920456B (en) 2022-08-30

Family

ID=64419617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810611476.7A Active CN108920456B (en) 2018-06-13 2018-06-13 Automatic keyword extraction method

Country Status (1)

Country Link
CN (1) CN108920456B (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614626A (en) * 2018-12-21 2019-04-12 北京信息科技大学 Keyword Automatic method based on gravitational model
CN109960724A (en) * 2019-03-13 2019-07-02 北京工业大学 A kind of text snippet method based on TF-IDF
CN110134767A (en) * 2019-05-10 2019-08-16 云知声(上海)智能科技有限公司 A kind of screening technique of vocabulary
CN110147425A (en) * 2019-05-22 2019-08-20 华泰期货有限公司 A kind of keyword extracting method, device, computer equipment and storage medium
CN110377724A (en) * 2019-07-01 2019-10-25 厦门美域中央信息科技有限公司 A kind of corpus keyword Automatic algorithm based on data mining
CN111435405A (en) * 2019-01-15 2020-07-21 北京行数通科技有限公司 Method and device for automatically labeling key sentences of article
CN111552786A (en) * 2020-04-16 2020-08-18 重庆大学 Question-answering working method based on keyword extraction
CN111597793A (en) * 2020-04-20 2020-08-28 中山大学 Paper innovation measuring method based on SAO-ADV structure
CN111680509A (en) * 2020-06-10 2020-09-18 四川九洲电器集团有限责任公司 Method and device for automatically extracting text keywords based on co-occurrence language network
CN111985217A (en) * 2020-09-09 2020-11-24 吉林大学 Keyword extraction method and computing device
CN112988951A (en) * 2021-03-16 2021-06-18 福州数据技术研究院有限公司 Scientific research project review expert accurate recommendation method and storage device
CN113191145A (en) * 2021-05-21 2021-07-30 百度在线网络技术(北京)有限公司 Keyword processing method and device, electronic equipment and medium
CN113657113A (en) * 2021-08-24 2021-11-16 北京字跳网络技术有限公司 Text processing method and device and electronic equipment
CN113743090A (en) * 2021-09-08 2021-12-03 度小满科技(北京)有限公司 Keyword extraction method and device
CN113971216A (en) * 2021-10-22 2022-01-25 北京百度网讯科技有限公司 Data processing method and device, electronic equipment and memory
CN114492433A (en) * 2022-01-27 2022-05-13 南京烽火星空通信发展有限公司 Method for automatically selecting proper keyword combination to extract text
CN114626361A (en) * 2020-12-10 2022-06-14 广州视源电子科技股份有限公司 Sentence making method, sentence making model training method and device and computer equipment
CN118657634A (en) * 2024-08-21 2024-09-17 青岛中投创新技术转移有限公司 Patent analysis and evaluation method based on artificial intelligence

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110004465A1 (en) * 2009-07-02 2011-01-06 Battelle Memorial Institute Computation and Analysis of Significant Themes
CN102929937A (en) * 2012-09-28 2013-02-13 福州博远无线网络科技有限公司 Text-subject-model-based data processing method for commodity classification
CN104281645A (en) * 2014-08-27 2015-01-14 北京理工大学 Method for identifying emotion key sentence on basis of lexical semantics and syntactic dependency
CN105843795A (en) * 2016-03-21 2016-08-10 华南理工大学 Topic model based document keyword extraction method and system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Du Yuncheng et al.: "Automatic Keyword Extraction Based on Character Co-occurrence Frequency", Journal of Beijing Information Science and Technology University *


Also Published As

Publication number Publication date
CN108920456B (en) 2022-08-30

Similar Documents

Publication Publication Date Title
CN108920456B (en) Automatic keyword extraction method
CN110874531B (en) Topic analysis method and device and storage medium
Beeferman et al. Statistical models for text segmentation
Sridhar Unsupervised topic modeling for short texts using distributed representations of words
CN107229610B (en) A method and device for analyzing sentiment data
US9317498B2 (en) Systems and methods for generating summaries of documents
US10437867B2 (en) Scenario generating apparatus and computer program therefor
US9275115B2 (en) Correlating corpus/corpora value from answered questions
US9892111B2 (en) Method and device to estimate similarity between documents having multiple segments
US8645418B2 (en) Method and apparatus for word quality mining and evaluating
EP3086240A1 (en) Complex predicate template gathering device, and computer program therefor
CN103678316A (en) Entity relationship classifying device and entity relationship classifying method
CN113988053A (en) Hot word extraction method and device
CN109614626A (en) Keyword automatic extraction method based on a gravitational model
CN104317783A (en) SRC calculation method
CN111310467B (en) Topic extraction method and system combining semantic inference in long text
Hofmann et al. Predicting the growth of morphological families from social and linguistic factors
CN111930949B (en) Search string processing method and device, computer readable medium and electronic equipment
CN106126501B (en) A noun word sense disambiguation method and device based on dependency constraints and knowledge
Wei et al. Query based summarization using topic background knowledge
CN111899832B (en) Medical theme management system and method based on context semantic analysis
Mendels et al. Collecting code-switched data from social media
CN115455975A (en) Method and device for extracting topic keywords based on multi-model fusion decision
JP5128328B2 (en) Ambiguity evaluation apparatus and program
CN114548113A (en) Event-based reference resolution system, method, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant