
CN102298576B - Method and device for generating document keywords - Google Patents

Method and device for generating document keywords Download PDF

Info

Publication number
CN102298576B
CN102298576B (application CN201010208994.8A)
Authority
CN
China
Prior art keywords
word
words
cluster
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201010208994.8A
Other languages
Chinese (zh)
Other versions
CN102298576A (en)
Inventor
孙军
谢宣松
姜珊珊
赵利军
郑继川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd filed Critical Ricoh Co Ltd
Priority to CN201010208994.8A priority Critical patent/CN102298576B/en
Publication of CN102298576A publication Critical patent/CN102298576A/en
Application granted granted Critical
Publication of CN102298576B publication Critical patent/CN102298576B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and device for generating document keywords. The method comprises the following steps of: acquiring a plurality of initial keyword sets and combining the plurality of initial keyword sets to obtain a candidate keyword set; determining a word score for each word in the candidate keyword set; clustering words based on the word score of each word and the similarity of words; and assigning keywords for a document based on each cluster obtained by clustering. According to the method and the device, existing word sets or word extraction algorithm information can be comprehensively utilized, so that more diversified and balanced results are obtained.

Description

Document keyword generation method and device
Technical Field
The invention relates to information processing and information extraction, in particular to a method and a device for generating document keywords.
Background
Keyword extraction or generation generally refers to how words or phrases are automatically extracted or generated from a document (or multiple related documents) that represent well the subject matter of the document. The keyword extraction technology is widely applied to various intelligent text information processing fields such as information retrieval, text classification/clustering, information filtering, document summarization and the like.
Many keyword extraction or generation techniques have been proposed. For example, U.S. patent application publication No. US20050278325A1, entitled "Graph-based ranking algorithms for text processing", proposes a method for constructing a graph of text units from natural language text and deriving a ranking of the text units from the graph. For another example, the article "Clustering to Find Exemplar Terms for Keyphrase Extraction" (Zhiyuan Liu, Peng Li, Yabin Zheng, Maosong Sun, in Proceedings of EMNLP 2009, pages 257-266) proposes a method for extracting key phrases from documents using clustering. That method first obtains single words (words) and clusters of those words; keyphrases are then selected from the document based on some simple POS (part of speech) patterns. As another example, U.S. patent No. 7478092 (B2), entitled "Key term extraction", proposes a method that attempts to filter a set of candidate terms by some simple exclusion techniques, such as detecting nearly repeated terms by means of a set of common words.
In addition, for a given document, some specified vocabulary may already exist, such as the key terms specified by the authors of academic papers. A user may also assign some "seed" terms to a document according to prior knowledge or personal preference; these user-assigned "seed" terms are terms related to the user's needs, such as the terms entered in a query, so that, for example, for a document with multiple topics, only keywords related to a certain topic are extracted. Further, as noted above, many word extraction algorithms already exist. Applying these algorithms to a given document may yield a number of different term sets, since each existing term extraction algorithm generally approaches the term assignment problem from a different perspective.
Therefore, there is a need for a method and apparatus that can integrate these existing words or word extraction results obtained by existing word extraction algorithms to obtain a more optimal set of key words for a document.
Disclosure of Invention
The present invention has been made in view of the above.
According to an aspect of the present invention, a document keyword generation method is provided, which may include: obtaining a plurality of preliminary key word sets, and combining the plurality of preliminary key word sets to obtain a candidate key word set; determining a word score for each word in the candidate keyword set; clustering the words based on the word scores of the words and the similarity between the words; and assigning key terms to the document based on each cluster obtained by clustering.
The objective function of a cluster may be calculated based on three factors: sum of similarity between words in each cluster; entropy of the distribution of the number of words in each cluster; and entropy of the distribution of word score sums within clusters for each cluster.
The plurality of preliminary keyword sets may include at least one of: the method comprises the steps of obtaining a plurality of corresponding key word sets through a plurality of existing keyword extraction algorithms, obtaining a set consisting of nouns and noun phrases in a document, obtaining a set consisting of key words or index words specified by a document author, and obtaining a set consisting of 'seed' words specified by a user.
Determining a term score for each term in the set of candidate key terms may include: constructing a word recommendation graph from the terms in the initial candidate key term set, where the nodes of the graph correspond to the terms, the links between word nodes are established according to the recommendation relationships among the terms and carry weights corresponding to those relationships, and each word node has an initial score. The score of each word node is then iteratively propagated to adjacent word nodes, with each word node retaining a portion of its original score after each round of propagation, and the propagation process stops at convergence or upon reaching a certain maximum number of iterations.
The score obtained by each word node after each round of propagation from a neighbor node may depend on the score of the neighbor node and the weight of the link between the word node and the neighbor node.
If two words belong to a sentence, the weight of the link between the two words may depend on the grammatical relation between the two words in the grammar tree of the sentence.
The similarity between two words may be determined based on co-occurrence statistics of the two words or words contained by the two words in a predetermined set of documents.
Assigning the document with the key terms may include: for each cluster, calculating a selection score of each word in the cluster based on the word score of the word and the similarity between the word and other words in the cluster; calculating a selection score of each word in the predetermined domain dictionary based on the similarity between the word and each word in the cluster; and selecting the word with the largest selection score from the words of the domain dictionary and the words of the cluster as the key word of the document.
Assigning the document with the key terms may include: for each cluster, calculating the selection score of each word in the cluster based on the word score of the word in the cluster, the proportion of the word score to the sum of word scores in the cluster and the similarity between the word and other words in the cluster; calculating a selection score of each word in the predetermined domain dictionary based on the similarity between the word and each word in the cluster; and selecting a first preset number of words with the largest selection scores from all words of the domain dictionary and all words of the clusters, uniformly sorting the selected first preset number of words aiming at all clusters in a descending order according to the selection scores, and selecting a second preset number of words which are sorted at the top as key words of the document.
According to another aspect of the present invention, there is provided a document keyword generation apparatus, which may include: a candidate keyword set obtaining unit configured to obtain a plurality of preliminary keyword sets, and merge the plurality of preliminary keyword sets to obtain a candidate keyword set; a term score determining component for determining a term score for each term in the set of candidate key terms; the clustering component is used for clustering the words based on the word scores of the words and the similarity among the words; and a keyword assignment section for assigning a keyword to the document based on each cluster obtained by the clustering.
With the method and apparatus of the present invention, because each existing word extraction algorithm generally solves the word assignment problem from a different perspective, the existing word set also has its aspects to be considered with emphasis. Therefore, the method and the device can comprehensively utilize the existing word set or word extraction algorithm information, thereby obtaining more diversified and balanced results.
In addition, the sum of the similarity between words in each cluster is comprehensively considered in the objective function of the cluster; entropy of the distribution of the number of words in each cluster; and entropy of the distribution of the word score sum within each cluster, further more diverse and balanced results can be obtained.
In addition, with the assistance of a domain dictionary, the word assignment method of the present invention can output words that are not in the original document but rather better describe some aspects of the original document.
Drawings
FIG. 1 is an overall flow diagram of a document keyword generation method according to one embodiment of the invention;
FIG. 2 is a flowchart of a specific example of a document keyword generation method according to another embodiment of the present invention;
FIG. 3 is a block diagram illustrating a document keyword generation apparatus according to one embodiment of the present invention; and
fig. 4 is a diagram illustrating an example of a hardware configuration of a document keyword generation apparatus according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the following detailed description of the invention is provided in conjunction with the accompanying drawings and the detailed description of the invention.
In this document, words and single words (word) refer to a word in Chinese or a word in English. The term (term) may refer to a single word or a group of words (a phrase).
In addition, in order to avoid obscuring the gist of the present invention, well-known features or structures are not described in detail in this document. For example, in keyword extraction, text is usually segmented first, and in evaluating the importance of a word, word features such as word frequency, inverse document frequency, word position, word length, and part of speech are considered. There are many well-known techniques for word segmentation (such as the segmentation tool ICTCLAS) and for word feature selection; they are not the focus of the present invention and are therefore not described in detail. It should be understood, however, that this does not imply that the present invention excludes these well-known features or structures; rather, these segmentation and word feature selection techniques may be used together with the present invention.
FIG. 1 is an overall flowchart of a document keyword generation method according to one embodiment of the present invention.
As shown in fig. 1, the document keyword generation method according to one embodiment of the present invention may include an initial candidate keyword set obtaining step S110, a word score determining step S120, a clustering step S130, and a keyword assigning step S140. The steps are specifically described below.
In step S110, a plurality of preliminary keyword sets are obtained, and the plurality of preliminary keyword sets are merged to obtain a candidate keyword set.
The sources of the plurality of preliminary keyword sets are arbitrary and may include, for example, at least one of: a plurality of corresponding key term sets obtained by a plurality of existing keyword extraction algorithms, a set consisting of the nouns and noun phrases in the document, a set consisting of the key terms or index terms specified by the document author, and a set consisting of the "seed" terms specified by a user. The existing keyword extraction algorithms can be arbitrary, such as the extraction algorithms mentioned in the background section. The above sources list only a set of nouns and noun phrases because a noun or noun phrase can, with high probability, represent the subject of a document. However, this is merely exemplary, and sets of words or phrases with other parts of speech, such as a set of verbs and verb phrases, may also be used with the present invention.
Merging the multiple preliminary key term sets herein may include a simple merge operation, e.g., simply removing redundant terms. However, the merging may also include some more complex operations (a code sketch follows this list), such as:
(1) filtering, for example, filtering out words that are unlikely to be key words according to preset conditions, for example, words that appear very frequently in daily life, such as "today", "month and day", etc., or filtering out stop words, such as "but", "however", etc.;
(2) an initial word score is assigned to each candidate keyword. The calculation of the initial term score may vary depending on the source of the candidate keyword. Let T_1, T_2, ..., T_m denote the preliminary key term sets to be merged, where each preliminary set is T_i = {Term_1, Term_2, ..., Term_j, ...}. If a preliminary set T_i is derived from a known keyword extraction algorithm, that algorithm will typically calculate a score or weight Score(Term_j) for each extracted keyword Term_j, and each keyword extraction algorithm k may have a different confidence level Conf_k according to its performance. The initial word score of an initial keyword can then be derived from its score in the preliminary set and the confidence of the corresponding algorithm, e.g. Conf_k * Score(Term_j). If a keyword extraction algorithm does not compute term scores, the scores can be specified manually based on experience. If an initial key term comes from the set of author-specified key terms or index terms, or from the set of user-specified "seed" terms, such a candidate key term may be assigned a higher initial word score; if a candidate keyword comes from the set of nouns and noun phrases in the document, it may be assigned a lower initial word score.
(3) When removing redundant words, the initial word score is also increased according to the number of repetitions of the word.
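For concreteness, the following is a minimal Python sketch of the merge step just described. It is not the patent's reference implementation; the stop-word list, the confidence values, and the score-accumulation rule for repeated terms are illustrative assumptions.

```python
STOP_WORDS = {"but", "however", "today"}   # assumed filter list, not from the patent

def merge_preliminary_sets(preliminary_sets):
    """Merge preliminary key term sets into one candidate set.

    preliminary_sets: list of (conf_k, terms) pairs, where terms maps
    Term_j -> Score(Term_j) from one source (algorithm, author, user).
    Returns a dict mapping each surviving term to its initial word score."""
    candidates = {}
    for conf_k, terms in preliminary_sets:
        for term, score in terms.items():
            if term.lower() in STOP_WORDS:
                continue                    # (1) filter unlikely key words
            # (2) initial score Conf_k * Score(Term_j); (3) repeated terms
            # accumulate score instead of being dropped as redundant
            candidates[term] = candidates.get(term, 0.0) + conf_k * score
    return candidates

# Example: two extraction algorithms plus author-specified key terms,
# the latter given full confidence and score as suggested above.
candidates = merge_preliminary_sets([
    (0.8, {"keyword extraction": 0.9, "clustering": 0.7}),
    (0.6, {"clustering": 0.5, "entropy": 0.4}),
    (1.0, {"document keywords": 1.0}),
])
```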
In step S120, word scores are determined for the respective words in the set of candidate key words.
The word score of each word here may be the initial word score of each word obtained in step S110. However, preferably, the word scores of the words may be further refined according to the recommended relationships between the words. A description will be given below of optimizing the word score by constructing the word recommendation map and the word score propagation process with reference to fig. 2. In addition, some a priori knowledge may also be utilized, for example, where a document is known to be a story about the world cup, its word score may be boosted for words related to soccer, such as "goal", "shoot", etc.
In step S130, words are clustered based on the word scores of the respective words and the similarity between the respective words.
The similarity between two terms may be calculated based on co-occurrence statistics of the two terms or words contained by the two terms in a predetermined set of documents.
Suppose two terms term_1 and term_2 are represented as

$$term_1 = \left(word_1^{(1)}, word_2^{(1)}, \ldots, word_p^{(1)}\right), \quad term_2 = \left(word_1^{(2)}, word_2^{(2)}, \ldots, word_q^{(2)}\right)$$

where word_i^{(j)} is the i-th word of the j-th term; that is, term_1 consists of p words and term_2 consists of q words. A specific implementation of the similarity measure between the two terms term_1 and term_2 is formula (1):

$$\mathrm{similarity}(term_1, term_2) = \frac{1}{p \cdot q} \sum_{i=1}^{p} \sum_{j=1}^{q} \mathrm{similarity}\left(word_i^{(1)}, word_j^{(2)}\right) \qquad (1)$$

where similarity(word_i^{(1)}, word_j^{(2)}) denotes a similarity measure between the single words word_i^{(1)} and word_j^{(2)}. A similarity measure between two single words (words) can be derived from co-occurrence statistics in a certain set of documents.

For example, the similarity between two words word_i and word_j can be calculated by the mutual information method as in formula (2):

$$\mathrm{similarity}(word_i, word_j) = \log \frac{p(word_i, word_j)}{p(word_i)\, p(word_j)} \qquad (2)$$

where p(word_i) and p(word_j) respectively denote the probabilities that word_i and word_j occur in a document, and p(word_i, word_j) denotes the probability that word_i and word_j co-occur in the same sentence or within a window of a predetermined size in the document.
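The following Python sketch shows one way formulas (1) and (2) could be realized from sentence-level co-occurrence counts. It is an illustrative assumption, not code from the patent; in particular, the floor of 0.0 for word pairs that never co-occur is a practical choice, not part of formula (2).

```python
import math
from collections import Counter
from itertools import combinations

def build_word_similarity(sentences):
    """Build similarity(word_i, word_j) per formula (2) from a document set.

    `sentences` is a list of token lists; co-occurrence is counted at the
    sentence level (a fixed-size window would work analogously)."""
    n = len(sentences)
    occur = Counter()     # number of sentences containing each word
    cooccur = Counter()   # number of sentences containing each word pair
    for tokens in sentences:
        vocab = set(tokens)
        occur.update(vocab)
        cooccur.update(combinations(sorted(vocab), 2))

    def similarity(wi, wj):
        p_ij = cooccur[tuple(sorted((wi, wj)))] / n
        if p_ij == 0:
            return 0.0    # practical floor; formula (2) itself would give -inf
        return math.log(p_ij / ((occur[wi] / n) * (occur[wj] / n)))

    return similarity

def term_similarity(similarity, words1, words2):
    """Formula (1): average pairwise word similarity between two terms."""
    return (sum(similarity(a, b) for a in words1 for b in words2)
            / (len(words1) * len(words2)))
```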
The occurrence probabilities of words and the similarities between words may be determined in advance and stored in a word similarity database, or may be computed on the fly from the target document being processed.
The above use of mutual information to calculate word similarity is merely an example; statistical methods such as the Log-Likelihood Ratio or the Chi-squared test, as well as knowledge-based methods using a dictionary (e.g., WordNet), may be used instead.
In addition, the similarity between words can be determined before clustering or computed on the fly during clustering.
Clustering is an unsupervised machine learning technique that groups individuals or samples into classes; each sample can be considered as a point in a feature space. The basic idea is to group points that lie close together and densely in the feature space into one class or cluster.
In the word clustering herein, each word is a sample, and the similarity between words plays the role of (inverse) distance between words. Therefore, various existing clustering algorithms, for example the clustering algorithm mentioned in the article "Clustering to Find Exemplar Terms for Keyphrase Extraction" (EMNLP 2009) cited in the background art, can be applied to the present invention.
The number c of clusters obtained by final clustering may be predetermined, such as the number of key words specified by a user or a system, or may be uncertain and determined according to the final operation result of a clustering algorithm.
According to an embodiment of the present invention, where the number of clusters c is the number of key terms specified by the user or the system, the term clustering process may be as follows (a code sketch follows these steps):
1) randomly dividing the words into c clusters;
2) for each word, tentatively putting the word into a different cluster, calculating the change of the objective function, and putting the word into the cluster which enables the increase of the objective function to be maximum;
3) if the value of the objective function does not increase for any word, terminate the procedure and return the current clustering result.
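A Python sketch of this greedy procedure is given below, under the assumption that the objective function is supplied as a callable (for example, formula (3) defined later); the move-acceptance rule and iteration cap are illustrative.

```python
import random

def cluster_words(words, c, objective, max_iter=100):
    """Greedy clustering per steps 1)-3) above. `objective` maps a list of
    clusters (sets of words) to a real number, e.g. formula (3)."""
    clusters = [set() for _ in range(c)]
    for w in words:
        clusters[random.randrange(c)].add(w)          # step 1: random split
    current = objective(clusters)
    for _ in range(max_iter):
        improved = False
        for w in words:
            src = next(k for k in range(c) if w in clusters[k])
            clusters[src].remove(w)
            best_k, best_val = src, None
            for k in range(c):                        # step 2: try each cluster
                clusters[k].add(w)
                val = objective(clusters)
                clusters[k].remove(w)
                if best_val is None or val > best_val:
                    best_k, best_val = k, val
            clusters[best_k].add(w)                   # keep best placement
            if best_val > current:
                improved = True
            current = best_val
        if not improved:                              # step 3: no increase
            break
    return clusters
```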
The objective function in the clustering algorithm is very important and can be designed according to actual needs. According to an embodiment of the present invention, in clustering words, both similarity between words and word scores of respective words are considered, and it is desirable to obtain a more balanced clustering. The balance here can be embodied in, for example, two aspects: the scores and the size distribution of the words in each cluster are balanced; the number distribution of elements, namely words, of each cluster is balanced.
Therefore, according to an embodiment of the present invention, in designing the objective function of the cluster, the above-described balanced factors may be considered in addition to the general similarity between elements within the cluster and the similarity between elements between clusters. The targets of the clusters according to an embodiment of the present invention include at least one of the following items:
1. the fraction and size distribution of each cluster are balanced;
2. the number distribution of elements of each cluster is balanced;
3. the greater the similarity between the intra-cluster elements of each cluster, the better;
4. the smaller the similarity between the elements between clusters, the better.
According to one embodiment of the invention, the objective function of balanced clustering that considers both node scores and word similarity contains the following three factors: the sum of intra-cluster similarities for each cluster; entropy of the distribution of the number of words in each cluster; entropy of distribution of word score sums within clusters for each cluster.
The specific form of the objective function may be selected in many ways, for example, any of the following equations (3), (4), (5) may be used:
$$\sum_{k=1}^{c} \sum_{i,j \in \pi_k} e(i,j) + \alpha \cdot H(n_{1 \sim c}) + \beta \cdot H(S_{1 \sim c}) \qquad (3)$$

$$\sum_{k=1}^{c} \frac{\sum_{i,j \in \pi_k} e(i,j)}{S_k} + \alpha \cdot H(n_{1 \sim c}) \qquad (4)$$

$$\sum_{k=1}^{c} \frac{\sum_{i,j \in \pi_k} e(i,j)}{\sum_{i \in \pi_k} e(i,j)} + \alpha \cdot H(n_{1 \sim c}) + \beta \cdot H(S_{1 \sim c}) \qquad (5)$$

wherein,
e(i, j) is the similarity between words i and j;
π_k is the k-th word cluster;
S_{1~c} is shorthand for S_1, S_2, ..., S_c, where S_k = Σ_{i∈π_k} s_i, i.e. S_k is the sum of the scores s_i of all words i in the k-th word cluster;
n_{1~c} is shorthand for n_1, n_2, ..., n_c, where n_i is the number of words in the i-th word cluster;
H(S_{1~c}) is the entropy of the distribution of the within-cluster word-score sums. One example of an entropy definition is the classical Shannon entropy, defined as shown in equation (6):

$$H(S_{1 \sim c}) = \sum_{k=1}^{c} -\frac{S_k}{\sum_{i=1}^{c} S_i} \log\left(\frac{S_k}{\sum_{i=1}^{c} S_i}\right) \qquad (6)$$

H(n_{1~c}) is the entropy of the distribution of the number of words within each cluster, defined analogously by the Shannon entropy in equation (7):

$$H(n_{1 \sim c}) = \sum_{k=1}^{c} -\frac{n_k}{\sum_{i=1}^{c} n_i} \log\left(\frac{n_k}{\sum_{i=1}^{c} n_i}\right) \qquad (7)$$

α is a pre-specified parameter indicating the importance of balancing the distribution of the number of words across clusters;
β is a pre-specified parameter indicating the importance of balancing the distribution of the word-score sums across clusters.
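As an illustration, the following Python sketch evaluates objective (3) with the Shannon entropies (6) and (7); `sim` and `score` are assumed accessors for e(i, j) and s_i, and the default α and β are arbitrary. With `sim` and `score` fixed (e.g. via functools.partial), it can serve as the `objective` callable in the clustering sketch shown earlier.

```python
import math

def shannon_entropy(values):
    """Shannon entropy of a distribution given as non-negative weights."""
    total = sum(values)
    if total <= 0:
        return 0.0
    return -sum(v / total * math.log(v / total) for v in values if v > 0)

def objective_3(clusters, sim, score, alpha=1.0, beta=1.0):
    """Formula (3): intra-cluster similarity plus two entropy balance terms."""
    cohesion = sum(sim(i, j)
                   for cl in clusters
                   for i in cl for j in cl if i != j)           # sum of e(i, j)
    sizes = [len(cl) for cl in clusters]                        # n_1..n_c
    sums = [sum(score.get(i, 0.0) for i in cl) for cl in clusters]  # S_1..S_c
    return (cohesion
            + alpha * shannon_entropy(sizes)       # H(n_{1~c}), eq. (7)
            + beta * shannon_entropy(sums))        # H(S_{1~c}), eq. (6)
```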
With the clustering method of this embodiment, the balance of cluster sizes and the balance of word-score sums are considered simultaneously. Compared with prior-art clustering that considers only the similarity between words within clusters, a more balanced result can therefore be obtained, and each resulting cluster better reflects a topic of the document.
The clustering process of the above embodiments is merely an example. In fact, a commonly used technique of aggregating one cluster (e.g., a cluster having a smaller number of elements) to another cluster nearby, or dividing one cluster (e.g., a cluster having a larger number of elements) into two clusters, or the like may be used in the present invention.
In addition, the form of the objective function described above is merely an example, and other objective functions, for example, an objective function that also takes into account the inter-element similarity between clusters, may also be used in the present invention.
In addition, in the specific formulas (3), (4), (5) of the objective function, the balance of the word-score sums of the clusters is characterized by the entropy of the distribution of the word-score sums, and the balance of the word-number distribution of the clusters is characterized by the entropy of the distribution of the word numbers. This is merely exemplary; other statistical indexes, such as the Standard Deviation, may also be used to characterize balance.
In step S140, a keyword is assigned to the document based on each cluster obtained by clustering.
After clusters are obtained, the term with the largest term score may simply be selected from each cluster and assigned to the document as the key term. Alternatively, the word with the greatest similarity to other words in the cluster may be selected from each cluster as the key word. Alternatively, the term scores for a term and its similarity to other terms in the cluster may be considered together to decide whether to assign it as a key term to a document.
In addition, a domain dictionary may be incorporated to select key terms to be assigned to a document from the clusters or from the domain dictionary. An example of assigning the key words in conjunction with the domain dictionary will be described in detail below with reference to fig. 2.
With the method of the above embodiment of the present invention, existing specified terms and the results of various existing term extraction algorithms can all be taken into account. This has a great advantage: because the existing specified terms and existing extraction algorithms may capture diverse characteristics of different aspects of a given document, the assigned terms can cover many aspects of the document.
In addition, with the method of the above embodiment, in the execution process of clustering, both the similarity between words and the word scores are considered, and the output word cluster can cover diversified topics in a specified document. The method of this embodiment not only makes the element number distribution of each cluster relatively balanced, but also makes the score and size distribution of each cluster relatively balanced, so that the output word cluster has the following characteristics: topics that are more important to the document may cover more refined sub-topics, and topics that are less important to the document may cover more coarse-grained sub-topics. This property enables the assigned terms to more effectively represent and describe the original document.
A flowchart of a specific example of a document keyword generation method according to another embodiment of the present invention is described below with reference to fig. 2. In fig. 2, dark gray boxes represent inputs, intermediate results or outputs, and white boxes represent processes or operations.
As shown in fig. 2, in step S210, a plurality of preliminary key term sets are obtained. Specifically, step S210 may consist of sub-steps S201, S202, S203, S204. In sub-step S201, the broadest possible set of preliminary candidate terms is generated, for example composed of all nouns and noun phrases in the document; in sub-step S202, a plurality of preliminary key term sets are generated using a plurality of existing term extraction algorithms; in sub-step S203, the set of key terms or index terms specified by the document author, if any, is obtained; in sub-step S204, the set of user-specified "seed" terms, if any, is obtained.
In step S220, a candidate keyword set with initial word scores is constructed using the plurality of preliminary keyword sets obtained in step S210, i.e., each candidate keyword in the candidate keyword set has an initial word score. As described above with reference to fig. 1, the initial word score may be obtained from the credibility of the keyword extraction algorithm itself and the score of each word in the preliminary keyword set obtained by that algorithm, or may be specified in advance based on experience. In addition, as previously described, operations such as removing redundant words, filtering out stop words, and weighting word scores according to the number of times a word occurs may be performed at this stage.
In step S230, a word recommendation map is constructed using the individual words in the candidate keyword set. Specifically, each node of the word recommendation graph corresponds to each word in the candidate key word set, links between word nodes in the word recommendation graph are established according to recommendation relations among the words and have weights corresponding to the recommendation relations, and each word node has an initial score, that is, an initial word score of the word corresponding to the node.
There may be many kinds of word recommendation relationships; four examples are given below:
The first is a high-weight relationship formed between child nodes of the same parent node in a sentence's grammar tree. Consider the following sentence:
“The My Colors features include positive film, lighter skin tone, darker skin tone, vivid blue, vivid green, vivid red, color accent and color swap.”
in the syntax tree of this sentence, the following terms have a common parent node, and thus this relationship holds among them: "positive film", "lighter skin tone", "darker skin tone", "vivid blue", "vivid green", "vivid red", "color accent", and "color swap".
The second word recommendation relationship exists between any two words in the same sentence, or between two words in the same sentence whose distance is smaller than a certain threshold.
A third word recommendation relationship is a recommendation of words in sentences in the middle of a paragraph to words in the beginning and end subject sentences of the paragraph.
The fourth word recommendation relationship is a recommendation of words in the title of an article by words throughout and a recommendation of words in the content of a chapter by words in the title of the chapter.
Other word recommendation relationships may be considered by those skilled in the art as desired and a predetermined weight may be assigned to each word recommendation relationship.
If one or more word recommendation relationships hold between two words, a link is established between the two words, and the weight of the link may be set to a weighted average of the weights of the different word recommendation relationships.
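A small Python sketch of such graph construction follows. The relationship names and weights are hypothetical, and links are treated as symmetric for simplicity, although some of the relationships above (e.g., title-to-content) are naturally directional.

```python
# Illustrative relationship weights; not values prescribed by the patent.
REL_WEIGHTS = {"same_parent": 1.0, "same_sentence": 0.5, "title": 0.8}

def build_recommendation_graph(words, relationship_pairs):
    """relationship_pairs: dict rel_name -> iterable of (word_a, word_b) pairs.

    Returns an adjacency dict graph[j][i] = w_{j,i}; the link weight is the
    average of the weights of all relationships holding between the pair."""
    graph = {w: {} for w in words}
    hits = {}
    for rel, pairs in relationship_pairs.items():
        for a, b in pairs:
            key = tuple(sorted((a, b)))
            hits.setdefault(key, []).append(REL_WEIGHTS[rel])
    for (a, b), weights in hits.items():
        w = sum(weights) / len(weights)   # weighted average over relationships
        graph[a][b] = w                   # treated as symmetric for simplicity
        graph[b][a] = w
    return graph
```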
In step S240, the word scores are propagated on the word recommendation graph. In a sense, propagating word scores on the word recommendation graph can be seen as a simulation of recommendation in human social relationships. In a social network, a person's standing is determined by two factors: the person's own merit, and the recommendations the person receives from others. The strength of one person's recommendation of another depends in turn on two factors: the importance of the recommender, and the closeness of the relationship between the recommender and the person being recommended.
In particular, the term score propagation may be an iterative process in which each node passes its score to its neighbors (the node acts as a recommender and its neighbors as recommendees) and also receives score from its neighbors (the node acts as a recommendee and its neighbors as recommenders). After each round of propagation, the score of each word node consists of two parts: the first part is the retained portion of its original score, and the second part is the score obtained from neighbor nodes, which may depend on the scores of the neighbor nodes and the weights of the links between the word node and those neighbors.
After all word scores have gradually stabilized or after the maximum number of iterations has been reached, the process of score propagation may terminate and return the score of the current word as the final word score.
According to an embodiment of the present invention, the score of node i after each iteration can be calculated by the following score propagation formula:

$$s_i^{(l+1)} = a + b \cdot s_i^{(0)} + (1 - a - b) \cdot \sum_{j \in N(i)} g\left(w_{j,i},\, d_i,\, d_j,\, D_i,\, D_j,\, s_j^{(l)}\right) \qquad (8)$$

wherein,
a is a pre-specified smoothing parameter;
b is a pre-specified parameter indicating the proportion of the initial score that is retained;
s_i^{(0)} is the initial merged score of word i;
s_i^{(l)} is the score of word i after the l-th iteration;
s_i^{(l+1)} is the score of word i after the (l+1)-th iteration;
N(i) is the set of neighbors of the i-th word node in the word graph;
w_{j,i} is the weight of the link from word node j to word node i in the word graph;
d_i is the degree of word node i, equal to the number of links pointing from node i to other nodes;
D_i is the weighted degree of word node i, equal to the sum of the weights of the links pointing from node i to other nodes.
Note that 0 ≤ a ≤ 1, 0 ≤ b ≤ 1, and a + b ≤ 1.
g(w_{j,i}, d_i, d_j, D_i, D_j, s_j^{(l)}) is a predefined function, herein called the recommendation function, which may satisfy the following properties: 1. it is monotonically increasing in w_{j,i} and s_j^{(l)}; 2. it is monotonically decreasing in d_i, d_j, D_i, D_j. There are various implementations of this function; three specific examples are as follows:

$$g\left(w_{j,i}, d_i, d_j, D_i, D_j, s_j^{(l)}\right) = \frac{w_{j,i}\, s_j^{(l)}}{d_i\, d_j} \qquad (9)$$

$$g\left(w_{j,i}, d_i, d_j, D_i, D_j, s_j^{(l)}\right) = \frac{w_{j,i}\, s_j^{(l)}}{D_i\, D_j} \qquad (10)$$

$$g\left(w_{j,i}, d_i, d_j, D_i, D_j, s_j^{(l)}\right) = \frac{2\, w_{j,i}\, s_j^{(l)}}{D_i + D_j} \qquad (11)$$
however, it should be clear to those skilled in the art that the score propagation formula (8) and the recommendation formulas (9), (10) and (11) are only exemplary and not intended to limit the present invention, and other formula forms may be designed as needed.
Through step S240, word scores of the respective words in the candidate keyword set are finally determined.
Then, in step S250, a word similarity map is constructed to obtain the similarity between the words in the candidate keyword set. In the word similarity graph, each node corresponds to each word in the candidate key word set, every two nodes are linked, and the weight of the link represents the similarity of the words of the two nodes. The calculation of the similarity between words of two nodes is substantially the same as the calculation of the similarity between words described with reference to step S130 of fig. 1, and is not repeated here.
Next, proceeding to step S260, the nodes in the word similarity graph are subjected to balanced clustering, where each node in the word similarity graph may have a word score of a corresponding word based on the word recommendation graph obtained in step S240.
Illustratively, as described above with reference to step 130 of FIG. 1, the objective function of the clustering process herein may employ the aforementioned formula (3), i.e.
$$\sum_{k=1}^{c} \sum_{i,j \in \pi_k} e(i,j) + \alpha \cdot H(n_{1 \sim c}) + \beta \cdot H(S_{1 \sim c})$$
Three terms in the formula respectively represent the sum of the intra-cluster similarity of each cluster; entropy of the distribution of the number of words in each cluster; entropy of distribution of word score sums within clusters for each cluster.
According to the present example, the number of clusters is determined to be equal to the number of key terms to be obtained, and is set as c, and as described above, the specific clustering process is:
1) randomly dividing the words into c clusters;
2) for each word, tentatively putting the word into a different cluster, calculating the change of the objective function, and putting the word into the cluster which enables the increase of the objective function to be maximum;
3) if the value of the objective function does not increase for any word, or the number of iterations reaches a pre-specified maximum, terminate the procedure and return the current clustering result.
The clustering method of the embodiment not only enables the element number distribution of each cluster to be balanced, but also enables the score and size distribution of each cluster to be balanced, so that the output word cluster can cover diversified themes in the appointed document.
After clustering, proceeding to step S270, the document key words are selected based on the clustered individual clusters and the domain dictionary.
A domain dictionary contains the terms most commonly used in a given domain; these terms can be determined by a domain expert or selected through experiments. A domain dictionary is borrowed here based on the following consideration: in some cases, a word in the domain dictionary that has a high similarity or relevance to the existing words in a cluster may describe the topic of the document better than the existing words in the cluster themselves.
Specifically, as a first example, the process of selecting the document key words based on the clustered clusters and the domain dictionary may be performed as follows:
for each cluster,
calculating the selection score of each word in the cluster based on the word score of the word and the similarity between the word and other words in the cluster;
calculating a selection score of each word in the predetermined domain dictionary based on the similarity between the word and each word in the cluster; and
from the respective words of the domain dictionary and the respective words of the cluster, the word having the largest selection score is selected as the key word of the document.
For another example, as a second example, the process of selecting the document key words based on the clustered clusters and the domain dictionary may also be performed as follows:
for each cluster,
calculating the selection score of each word in the cluster based on the word score of each word in the cluster, the proportion of the word score to the sum of word scores in the cluster and the similarity between the word and other words in the cluster;
calculating a selection score of each word in the predetermined domain dictionary based on the similarity between the word and each word in the cluster; and
selecting a first predetermined number of words having a maximum selection score from the respective words of the domain dictionary and the respective words of the cluster,
and uniformly sorting the selected first preset number of terms of each cluster in a descending order according to the selection scores, and selecting a second preset number of terms which are sorted at the top as key terms of the document.
In the first example, assuming that 5 clusters are obtained after clustering, each word cluster is concentrated into one key word by means of the domain dictionary, and finally 5 key words are obtained.
In the second example, also assuming that 5 clusters are obtained after clustering, for each cluster a number of words, for example 2, are selected from the cluster or the domain dictionary, giving 10 words in total. The 10 words are then sorted from high to low by word selection score, and a user- or system-specified number of top-scoring words, for example the 8 highest-scoring words, are selected as the document key words.
The term selection score of a term characterizes how suitable the term is to be selected as a document keyword.
One example of calculating word selection scores for words in the cluster and words in the domain dictionary is given below.
Assume that the set of scored words generated in the preceding step S240 is denoted {(t_1, s_1), (t_2, s_2), ..., (t_u, s_u)}, where t_i is the i-th word, s_i is the score of the i-th word, and u is the number of words in the candidate key word set. Suppose the domain dictionary is written as {t_{u+1}, t_{u+2}, ..., t_{u+v}}, where t_{u+i} is the i-th word in the domain dictionary and v is the number of words in the domain dictionary. Here we assume that the word score of each word in the domain dictionary is s_{u+1} = s_{u+2} = ... = s_{u+v} = 0. For a word cluster k and a word i in the word cluster or the domain dictionary (i ∈ π_k or u + 1 ≤ i ≤ u + v, where π_k denotes the k-th word cluster), one definition of the word selection score can be calculated by the following equation (12):

$$\frac{1}{n_k} \sum_{j \in \pi_k} e(i,j) + \gamma \cdot s_i + \eta \cdot \frac{s_i}{\sum_{j \in \pi_k} s_j} \qquad (12)$$

where n_k is the number of words in word cluster k; γ is a pre-specified parameter representing the importance of the word score in the word selection score; η is a pre-specified parameter representing the importance of the proportion of the word score to the sum of word scores in the cluster.

The above calculation formula (12) of the word selection score is merely an example. Other calculation formulas may be designed as desired. For example, the word selection score may alternatively be calculated using the following formula (13):

$$\frac{1}{n_k} \sum_{j \in \pi_k} e(i,j) + \gamma \cdot s_i + \eta \cdot \frac{s_i}{s_{\max(k)}} \qquad (13)$$

where n_k is the number of words in word cluster k; s_{max(k)} is the maximum word score in word cluster k; γ is a pre-specified parameter representing the importance of the word score in the word selection score; η is a pre-specified parameter representing the importance of the ratio of the word score to the maximum word score in the cluster.
In the above calculation formulas (12) and (13) of the word selection score, three factors are considered, namely the word score of each word, the proportion of the word score to the sum of the word scores in the cluster, and the similarity between the word and other words in the cluster. However, this is merely an example, and only one or two of the factors may be considered, or more other factors may also be considered as desired.
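The following Python sketch combines formula (12) with the two-stage selection of the second example. It assumes `score` is a dict in which domain-dictionary words are absent or zero, as stated above; whether j = i is included in the cohesion sum, and deduplication of the final list, are left as simple illustrative choices.

```python
def selection_score(i, cluster, sim, score, gamma=1.0, eta=1.0):
    """Formula (12); dictionary words absent from `score` default to 0."""
    n_k = len(cluster)
    cohesion = sum(sim(i, j) for j in cluster) / n_k
    total = sum(score.get(j, 0.0) for j in cluster) or 1.0   # avoid /0
    s_i = score.get(i, 0.0)
    return cohesion + gamma * s_i + eta * s_i / total

def pick_keywords(clusters, dictionary, sim, score, per_cluster=2, top_n=8):
    """Two-stage selection: take the `per_cluster` best-scoring words per
    cluster (cluster words plus dictionary words), then sort all picks
    globally by selection score and keep the top `top_n`."""
    ranked = []
    for cluster in clusters:
        pool = set(cluster) | set(dictionary)
        picks = sorted(pool,
                       key=lambda w: selection_score(w, cluster, sim, score),
                       reverse=True)[:per_cluster]
        ranked += [(selection_score(w, cluster, sim, score), w) for w in picks]
    ranked.sort(reverse=True)            # uniform descending sort over clusters
    return [w for _, w in ranked[:top_n]]
```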
In this embodiment, by selecting document key words from a domain dictionary or cluster, words that are not in the original document but that better describe some aspect of the original document can be output as key words.
FIG. 3 illustrates a block diagram of a document keyword generation apparatus according to one embodiment of the present invention.
The document keyword generation means may include: a candidate keyword set obtaining unit 102, configured to obtain a plurality of preliminary keyword sets, and merge the plurality of preliminary keyword sets to obtain a candidate keyword set; a word score determining component 103 for determining a word score for each word in the set of candidate key words; a clustering component 104 for clustering the words based on the word scores of the words and the similarity between the words; and a keyword assignment section 105 for assigning a keyword to the document based on each cluster obtained by the clustering.
The clustering component 104 can calculate an objective function for the cluster based on three factors: sum of similarity between words in each cluster; entropy of the distribution of the number of words in each cluster; and entropy of the distribution of word score sums within clusters for each cluster.
The plurality of preliminary keyword sets may include at least one of: the method comprises the steps of obtaining a plurality of corresponding key word sets through a plurality of existing keyword extraction algorithms, obtaining a set consisting of nouns and noun phrases in a document, obtaining a set consisting of key words or index words specified by a document author, and obtaining a set consisting of 'seed' words specified by a user.
The term score determination component 103 may determine the term scores of the individual terms in the set of candidate key terms as follows: a word recommendation graph is constructed from the terms in the initial candidate key term set, where the nodes of the graph correspond to the terms, the links between word nodes are established according to the recommendation relationships among the terms and carry weights corresponding to those relationships, and each word node has an initial score. The score of each word node is then iteratively propagated to adjacent word nodes, with each word node retaining a portion of its original score after each round of propagation, and the propagation process stops at convergence or upon reaching a certain maximum number of iterations.
The score obtained by each word node after each round of propagation from a neighbor node may depend on the score of the neighbor node and the weight of the link between the word node and the neighbor node.
If two words belong to a sentence, the weight of the link between the two words may depend on the grammatical relation between the two words in the grammar tree of the sentence.
Furthermore, the similarity between two words may be determined based on co-occurrence statistics of the two words or words contained by the two words in a predetermined set of documents.
The keyword assignment component 105 assigning keywords to documents may include: for each cluster, calculating a selection score of each word in the cluster based on the word score of the word and the similarity between the word and other words in the cluster; calculating a selection score of each word in the predetermined domain dictionary based on the similarity between the word and each word in the cluster; and selecting the word with the largest selection score from the words of the domain dictionary and the words of the cluster as the key word of the document.
The keyword assignment component 105 assigning keywords to documents may include: for each cluster, calculating the selection score of each word in the cluster based on the word score of the word in the cluster, the proportion of the word score to the sum of word scores in the cluster and the similarity between the word and other words in the cluster; calculating a selection score of each word in the predetermined domain dictionary based on the similarity between the word and each word in the cluster; and selecting a first preset number of words with the largest selection scores from all words of the domain dictionary and all words of the clusters, uniformly sorting the selected first preset number of words aiming at all clusters in a descending order according to the selection scores, and selecting a second preset number of words which are sorted at the top as key words of the document.
Finally, with reference to fig. 4, a description will be given of an example of a hardware configuration of the document keyword generation apparatus that performs the above-described processing. A CPU (central processing unit) 701 performs various processes in accordance with a program stored in a ROM (read only memory) 702 or a storage section 708. For example, the CPU executes the program of document keyword generation described in the above-described embodiments. A RAM (random access memory) 703 stores, as appropriate, programs, data, and the like executed by the CPU 701. The CPU 701, ROM 702, and RAM 703 are connected to each other via a bus 704.
The CPU 701 is connected to an input/output interface 705 via the bus 704. An input section 706 including a keyboard, a mouse, a microphone, and the like, and an output section 707 including a display, a speaker, and the like are connected to the input/output interface 705. The CPU 701 executes various processes in accordance with instructions input from the input section 706, and outputs the results of the processing to the output section 707.
The storage section 708 connected to the input/output interface 705 includes, for example, a hard disk, and stores programs executed by the CPU701 and various data. The communication section 709 communicates with an external device through a network such as the internet, a local area network, or the like.
A drive 710 connected to the input/output interface 705 drives a removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, and obtains a program, data, and the like recorded therein. The obtained program and data are transferred to the storage section 708 as needed, and stored therein.
While the principles of the invention have been described above in connection with specific embodiments, it will be understood by those skilled in the art that all or any of the steps or components of the method and apparatus of the invention may be implemented in hardware, firmware, software, or any combination thereof, on any computing device (including processors, storage media, and the like) or network of computing devices; this can be accomplished by those skilled in the art using their basic programming skills after reading the description of the invention.
Thus, the objects of the invention may also be achieved by running a program or a set of programs on any computing device, which may be a well-known general-purpose device. Accordingly, the objects of the invention may also be achieved merely by providing a program product containing program code that implements the method or apparatus. That is, such a program product also constitutes the present invention, as does a storage medium storing such a program product. Obviously, the storage medium may be any known storage medium or any storage medium developed in the future.
It should further be noted that, in the apparatus and method of the present invention, the components and steps may obviously be decomposed and/or recombined; such decompositions and/or recombinations are to be regarded as equivalent schemes of the present invention. Also, the steps of the above-described series of processes may naturally be performed chronologically in the order described, but they need not be: some steps may be performed in parallel or independently of one another. For example, the step of determining word scores and the step of calculating inter-word similarity may be performed sequentially in either order, in parallel, or independently.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A document keyword generation method, comprising:
obtaining a plurality of preliminary keyword sets, and merging the plurality of preliminary keyword sets to obtain a candidate keyword set;
determining a word score for each word in the candidate keyword set;
clustering the words based on the word scores of the words and the similarity between the words, such that the distribution of word score sums across the clusters and the distribution of the number of words across the clusters are balanced; and
assigning keywords to the document based on the clusters obtained by the clustering.
2. The keyword generation method according to claim 1, wherein an objective function of the clustering is calculated based on the following three factors (see the sketch after this list):
the sum of the similarities between words within each cluster;
the entropy of the distribution of the number of words across the clusters; and
the entropy of the distribution of the word score sums across the clusters.
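The following minimal sketch illustrates one way to combine these three factors into a single objective. Equal weighting of the terms and the natural-log entropy are assumptions for illustration; the claim does not specify how the factors are combined.

import math

def entropy(values):
    total = sum(values)
    if total <= 0:
        return 0.0
    probs = [v / total for v in values if v > 0]
    return -sum(p * math.log(p) for p in probs)

def clustering_objective(clusters, word_scores, similarity):
    """clusters: list of word lists; higher values are better in this sketch."""
    # Factor 1: total intra-cluster similarity (cluster cohesion).
    cohesion = sum(similarity(a, b)
                   for cluster in clusters
                   for a in cluster for b in cluster if a < b)
    # Factor 2: entropy of the word-count distribution (balanced cluster sizes).
    size_entropy = entropy([len(c) for c in clusters])
    # Factor 3: entropy of the per-cluster word-score sums (balanced score mass).
    score_entropy = entropy([sum(word_scores[w] for w in c) for c in clusters])
    # Equal weighting of the three factors is an illustrative assumption.
    return cohesion + size_entropy + score_entropy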
3. The keyword generation method according to claim 1, wherein the plurality of preliminary keyword sets comprise at least one of: a plurality of keyword sets obtained by a plurality of existing keyword extraction algorithms, a set consisting of the nouns and noun phrases in the document, a set consisting of keywords or index words specified by the document author, and a set consisting of "seed" words specified by a user.
4. The keyword generation method according to claim 1, wherein determining a word score for each word in the candidate keyword set comprises:
constructing a word recommendation graph from the words in the candidate keyword set, with nodes of the graph corresponding to the words, wherein links between word nodes are established according to recommendation relationships among the words and carry weights corresponding to those relationships; giving each word node an initial score; and iteratively propagating the score of each word node to its neighboring word nodes, wherein each word node retains a part of its original score after each round of propagation, and the propagation process stops upon convergence or after a certain maximum number of iterations.
5. The keyword generation method according to claim 4, wherein the score that each word node obtains from a neighboring node after each round of propagation depends on the score of the neighboring node and the weight of the link between the word node and the neighboring node.
6. The keyword generation method according to claim 5, wherein, if two words belong to the same sentence, the weight of the link between the two words depends on their grammatical relationship in the syntax tree of the sentence.
7. The keyword generation method according to claim 1, wherein the similarity between two words is determined based on co-occurrence statistics, in a predetermined document set, of the two words or of the words they contain.
8. The keyword generation method according to claim 1, wherein assigning keywords to the document comprises:
for each cluster,
calculating a selection score for each word in the cluster based on the word score of the word and the similarity between the word and the other words in the cluster;
calculating a selection score for each word in a predetermined domain dictionary based on the similarity between the word and each word in the cluster; and
selecting, from the words of the domain dictionary and the words of the cluster, the word with the largest selection score as a keyword of the document.
9. The keyword generation method according to claim 1, wherein assigning keywords to the document comprises:
for each cluster,
calculating a selection score for each word in the cluster based on the word score of the word, the proportion of that word score to the sum of the word scores in the cluster, and the similarity between the word and the other words in the cluster;
calculating a selection score for each word in the predetermined domain dictionary based on the similarity between the word and each word in the cluster; and
selecting, from the words of the domain dictionary and the words of the cluster, a first predetermined number of words with the largest selection scores; and
sorting the words selected for all of the clusters together in descending order of selection score, and selecting a second predetermined number of top-ranked words as the keywords of the document.
10. A document keyword generation apparatus, comprising:
a candidate keyword set obtaining component configured to obtain a plurality of preliminary keyword sets and merge the plurality of preliminary keyword sets to obtain a candidate keyword set;
a word score determining component configured to determine a word score for each word in the candidate keyword set;
a clustering component configured to cluster the words based on the word scores of the words and the similarity between the words, such that the distribution of word score sums across the clusters and the distribution of the number of words across the clusters are balanced; and
a keyword assignment component configured to assign keywords to the document based on the clusters obtained by the clustering.