
CN102298576B - Method and device for generating document keywords - Google Patents

Method and device for generating document keywords Download PDF

Info

Publication number
CN102298576B
CN102298576B (application CN201010208994.8A)
Authority
CN
China
Prior art keywords
word
words
cluster
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201010208994.8A
Other languages
Chinese (zh)
Other versions
CN102298576A (en)
Inventor
孙军
谢宣松
姜珊珊
赵利军
郑继川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ricoh Co Ltd
Original Assignee
Ricoh Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ricoh Co Ltd filed Critical Ricoh Co Ltd
Priority to CN201010208994.8A priority Critical patent/CN102298576B/en
Publication of CN102298576A publication Critical patent/CN102298576A/en
Application granted granted Critical
Publication of CN102298576B publication Critical patent/CN102298576B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and device for generating document keywords. The method comprises the following steps of: acquiring a plurality of initial keyword sets and combining the plurality of initial keyword sets to obtain a candidate keyword set; determining a word score for each word in the candidate keyword set; clustering words based on the word score of each word and the similarity of words; and assigning keywords for a document based on each cluster obtained by clustering. According to the method and the device, existing word sets or word extraction algorithm information can be comprehensively utilized, so that more diversified and balanced results are obtained.

Description

Document keyword generation method and device
Technical Field
The invention relates to information processing and information extraction, in particular to a method and a device for generating document keywords.
Background
Keyword extraction or generation generally refers to how words or phrases are automatically extracted or generated from a document (or multiple related documents) that represent well the subject matter of the document. The keyword extraction technology is widely applied to various intelligent text information processing fields such as information retrieval, text classification/clustering, information filtering, document summarization and the like.
Many keyword extraction or generation techniques have been proposed. For example, U.S. patent application publication No. US20050278325A1, entitled "Graph-based ranking algorithms for text processing", proposes a method for constructing a graph of text units from natural language text and deriving a ranking of the text units from the graph. For another example, the article "Clustering to Find Exemplar Terms for Keyphrase Extraction" (Zhiyuan Liu, Peng Li, Yabin Zheng, Maosong Sun, in Proceedings of EMNLP 2009, pages 257-266) proposes a method for extracting key phrases from documents using clustering. That method first obtains single words (words) and clusters of those words; keyphrases are then selected from the document based on some simple POS (part of speech) patterns. As another example, U.S. patent No. 7478092 (B2), entitled "Key term extraction", proposes a method that attempts to filter a set of candidate terms by some simple exclusion techniques, such as detecting nearly repeated terms by means of a set of common words.
In addition, for a given document, some specified vocabulary may already exist, such as the key terms specified by the authors of academic papers. A user may also assign some "seed" terms to a document according to prior knowledge or personal preference; these user-assigned "seed" terms are terms related to the user's needs, such as the terms entered in a query, so that, for example, for a document with multiple topics, only keywords related to a certain topic are extracted. Further, as noted above, many word extraction algorithms already exist. Applying these algorithms to a given document may yield a number of different term sets, since each existing term extraction algorithm generally approaches the term assignment problem from a different perspective.
Therefore, there is a need for a method and apparatus that can integrate these existing words or word extraction results obtained by existing word extraction algorithms to obtain a more optimal set of key words for a document.
Disclosure of Invention
The present invention has been made in view of the above.
According to an aspect of the present invention, a document keyword generation method is provided, which may include: obtaining a plurality of preliminary key word sets, and combining the plurality of preliminary key word sets to obtain a candidate key word set; determining a word score for each word in the candidate keyword set; clustering the words based on the word scores of the words and the similarity between the words; and assigning key terms to the document based on each cluster obtained by clustering.
The objective function of a cluster may be calculated based on three factors: sum of similarity between words in each cluster; entropy of the distribution of the number of words in each cluster; and entropy of the distribution of word score sums within clusters for each cluster.
The plurality of preliminary keyword sets may include at least one of: the method comprises the steps of obtaining a plurality of corresponding key word sets through a plurality of existing keyword extraction algorithms, obtaining a set consisting of nouns and noun phrases in a document, obtaining a set consisting of key words or index words specified by a document author, and obtaining a set consisting of 'seed' words specified by a user.
Determining a term score for each term in the set of candidate key terms may include: constructing a word recommendation graph from the terms in the initial candidate key term set, where the nodes of the graph correspond to the terms, the links between word nodes are established according to the recommendation relationships among the terms and carry weights corresponding to those relationships, and each word node has an initial score. The score of each word node is then iteratively propagated to adjacent word nodes, with each word node retaining a portion of its original score after each round of propagation, and the propagation process stops at convergence or upon reaching a certain maximum number of iterations.
The score obtained by each word node after each round of propagation from a neighbor node may depend on the score of the neighbor node and the weight of the link between the word node and the neighbor node.
If two words belong to a sentence, the weight of the link between the two words may depend on the grammatical relation between the two words in the grammar tree of the sentence.
The similarity between two words may be determined based on co-occurrence statistics of the two words or words contained by the two words in a predetermined set of documents.
Assigning the document with the key terms may include: for each cluster, calculating a selection score of each word in the cluster based on the word score of the word and the similarity between the word and other words in the cluster; calculating a selection score of each word in the predetermined domain dictionary based on the similarity between the word and each word in the cluster; and selecting the word with the largest selection score from the words of the domain dictionary and the words of the cluster as the key word of the document.
Assigning the document with the key terms may include: for each cluster, calculating the selection score of each word in the cluster based on the word score of the word in the cluster, the proportion of the word score to the sum of word scores in the cluster and the similarity between the word and other words in the cluster; calculating a selection score of each word in the predetermined domain dictionary based on the similarity between the word and each word in the cluster; and selecting a first preset number of words with the largest selection scores from all words of the domain dictionary and all words of the clusters, uniformly sorting the selected first preset number of words aiming at all clusters in a descending order according to the selection scores, and selecting a second preset number of words which are sorted at the top as key words of the document.
According to another aspect of the present invention, there is provided a document keyword generation apparatus, which may include: a candidate keyword set obtaining unit configured to obtain a plurality of preliminary keyword sets, and merge the plurality of preliminary keyword sets to obtain a candidate keyword set; a term score determining component for determining a term score for each term in the set of candidate key terms; the clustering component is used for clustering the words based on the word scores of the words and the similarity among the words; and a keyword assignment section for assigning a keyword to the document based on each cluster obtained by the clustering.
With the method and apparatus of the present invention, because each existing word extraction algorithm generally solves the word assignment problem from a different perspective, the existing word set also has its aspects to be considered with emphasis. Therefore, the method and the device can comprehensively utilize the existing word set or word extraction algorithm information, thereby obtaining more diversified and balanced results.
In addition, the sum of the similarity between words in each cluster is comprehensively considered in the objective function of the cluster; entropy of the distribution of the number of words in each cluster; and entropy of the distribution of the word score sum within each cluster, further more diverse and balanced results can be obtained.
In addition, with the assistance of a domain dictionary, the word assignment method of the present invention can output words that are not in the original document but rather better describe some aspects of the original document.
Drawings
FIG. 1 is an overall flow diagram of a document keyword generation method according to one embodiment of the invention;
FIG. 2 is a flowchart of a specific example of a document keyword generation method according to another embodiment of the present invention;
FIG. 3 is a block diagram illustrating a document keyword generation apparatus according to one embodiment of the present invention; and
fig. 4 is a diagram illustrating an example of a hardware configuration of a document keyword generation apparatus according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the following detailed description of the invention is provided in conjunction with the accompanying drawings and the detailed description of the invention.
In this document, words and single words (word) refer to a word in Chinese or a word in English. The term (term) may refer to a single word or a group of words (a phrase).
In addition, in order to avoid obscuring the gist of the present invention, well-known features or structures are not described in detail in this document. For example, in keyword extraction, text is usually segmented first, and in evaluating the importance of a word, word features such as word frequency, inverse document frequency, word position, word length, and part of speech are considered. There are many well-known techniques for word segmentation (such as the segmentation tool ICTCLAS) and for word feature selection; they are not the focus of the present invention and are therefore not described in detail. It should be understood, however, that this does not imply that the present invention excludes these well-known features or structures; rather, these segmentation and word feature selection techniques may be used together with the present invention.
FIG. 1 is an overall flowchart of a document keyword generation method according to one embodiment of the present invention.
As shown in fig. 1, the document keyword generation method according to one embodiment of the present invention may include an initial candidate keyword set obtaining step S110, a word score determining step S120, a clustering step S130, and a keyword assigning step S140. The steps are specifically described below.
In step S110, a plurality of preliminary keyword sets are obtained, and the plurality of preliminary keyword sets are merged to obtain a candidate keyword set.
The sources of the plurality of preliminary keyword sets are arbitrary and may include, for example, at least one of: a plurality of corresponding key term sets obtained by a plurality of existing keyword extraction algorithms, a set consisting of the nouns and noun phrases in the document, a set consisting of the key terms or index terms specified by the document author, and a set consisting of the "seed" terms specified by a user. The existing keyword extraction algorithms can be arbitrary, such as the extraction algorithms mentioned in the background section. The above sources list only a set of nouns and noun phrases because a noun or noun phrase can, with high probability, represent the subject of a document. However, this is merely exemplary, and sets of words or phrases with other parts of speech, such as a set of verbs and verb phrases, may also be used with the present invention.
Merging the multiple preliminary key term sets herein may include a simple merge operation, e.g., simply removing redundant terms. However, the merging may also include some more complex operations (a code sketch follows this list), such as:
(1) filtering, for example, filtering out words that are unlikely to be key words according to preset conditions, for example, words that appear very frequently in daily life, such as "today", "month and day", etc., or filtering out stop words, such as "but", "however", etc.;
(2) an initial word score is assigned to each candidate keyword. The calculation of the initial term score may vary depending on the source of the candidate keyword. Let T_1, T_2, ..., T_m denote the preliminary key term sets to be merged, where each preliminary set is T_i = {Term_1, Term_2, ..., Term_j, ...}. If a preliminary set T_i is derived from a known keyword extraction algorithm, that algorithm will typically calculate a score or weight Score(Term_j) for each extracted keyword Term_j, and each keyword extraction algorithm k may have a different confidence level Conf_k according to its performance. The initial word score of an initial keyword can then be derived from its score in the preliminary set and the confidence of the corresponding algorithm, e.g. Conf_k * Score(Term_j). If a keyword extraction algorithm does not compute term scores, the scores can be specified manually based on experience. If an initial key term comes from the set of author-specified key terms or index terms, or from the set of user-specified "seed" terms, such a candidate key term may be assigned a higher initial word score; if a candidate keyword comes from the set of nouns and noun phrases in the document, it may be assigned a lower initial word score.
(3) When removing redundant words, the initial word score is also increased according to the number of repetitions of the word.
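For concreteness, the following is a minimal Python sketch of the merge step just described. It is not the patent's reference implementation; the stop-word list, the confidence values, and the score-accumulation rule for repeated terms are illustrative assumptions.

```python
STOP_WORDS = {"but", "however", "today"}   # assumed filter list, not from the patent

def merge_preliminary_sets(preliminary_sets):
    """Merge preliminary key term sets into one candidate set.

    preliminary_sets: list of (conf_k, terms) pairs, where terms maps
    Term_j -> Score(Term_j) from one source (algorithm, author, user).
    Returns a dict mapping each surviving term to its initial word score."""
    candidates = {}
    for conf_k, terms in preliminary_sets:
        for term, score in terms.items():
            if term.lower() in STOP_WORDS:
                continue                    # (1) filter unlikely key words
            # (2) initial score Conf_k * Score(Term_j); (3) repeated terms
            # accumulate score instead of being dropped as redundant
            candidates[term] = candidates.get(term, 0.0) + conf_k * score
    return candidates

# Example: two extraction algorithms plus author-specified key terms,
# the latter given full confidence and score as suggested above.
candidates = merge_preliminary_sets([
    (0.8, {"keyword extraction": 0.9, "clustering": 0.7}),
    (0.6, {"clustering": 0.5, "entropy": 0.4}),
    (1.0, {"document keywords": 1.0}),
])
```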
In step S120, word scores are determined for the respective words in the set of candidate key words.
The word score of each word here may be the initial word score of each word obtained in step S110. However, preferably, the word scores of the words may be further refined according to the recommended relationships between the words. A description will be given below of optimizing the word score by constructing the word recommendation map and the word score propagation process with reference to fig. 2. In addition, some a priori knowledge may also be utilized, for example, where a document is known to be a story about the world cup, its word score may be boosted for words related to soccer, such as "goal", "shoot", etc.
In step S130, words are clustered based on the word scores of the respective words and the similarity between the respective words.
The similarity between two terms may be calculated based on co-occurrence statistics of the two terms or words contained by the two terms in a predetermined set of documents.
Suppose two terms term_1 and term_2 are represented as

$$term_1 = \left(word_1^{(1)}, word_2^{(1)}, \ldots, word_p^{(1)}\right), \quad term_2 = \left(word_1^{(2)}, word_2^{(2)}, \ldots, word_q^{(2)}\right)$$

where word_i^{(j)} is the i-th word of the j-th term; that is, term_1 consists of p words and term_2 consists of q words. A specific implementation of the similarity measure between the two terms term_1 and term_2 is formula (1):

$$\mathrm{similarity}(term_1, term_2) = \frac{1}{p \cdot q} \sum_{i=1}^{p} \sum_{j=1}^{q} \mathrm{similarity}\left(word_i^{(1)}, word_j^{(2)}\right) \qquad (1)$$

where similarity(word_i^{(1)}, word_j^{(2)}) denotes a similarity measure between the single words word_i^{(1)} and word_j^{(2)}. A similarity measure between two single words (words) can be derived from co-occurrence statistics in a certain set of documents.

For example, the similarity between two words word_i and word_j can be calculated by the mutual information method as in formula (2):

$$\mathrm{similarity}(word_i, word_j) = \log \frac{p(word_i, word_j)}{p(word_i)\, p(word_j)} \qquad (2)$$

where p(word_i) and p(word_j) respectively denote the probabilities that word_i and word_j occur in a document, and p(word_i, word_j) denotes the probability that word_i and word_j co-occur in the same sentence or within a window of a predetermined size in the document.
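The following Python sketch shows one way formulas (1) and (2) could be realized from sentence-level co-occurrence counts. It is an illustrative assumption, not code from the patent; in particular, the floor of 0.0 for word pairs that never co-occur is a practical choice, not part of formula (2).

```python
import math
from collections import Counter
from itertools import combinations

def build_word_similarity(sentences):
    """Build similarity(word_i, word_j) per formula (2) from a document set.

    `sentences` is a list of token lists; co-occurrence is counted at the
    sentence level (a fixed-size window would work analogously)."""
    n = len(sentences)
    occur = Counter()     # number of sentences containing each word
    cooccur = Counter()   # number of sentences containing each word pair
    for tokens in sentences:
        vocab = set(tokens)
        occur.update(vocab)
        cooccur.update(combinations(sorted(vocab), 2))

    def similarity(wi, wj):
        p_ij = cooccur[tuple(sorted((wi, wj)))] / n
        if p_ij == 0:
            return 0.0    # practical floor; formula (2) itself would give -inf
        return math.log(p_ij / ((occur[wi] / n) * (occur[wj] / n)))

    return similarity

def term_similarity(similarity, words1, words2):
    """Formula (1): average pairwise word similarity between two terms."""
    return (sum(similarity(a, b) for a in words1 for b in words2)
            / (len(words1) * len(words2)))
```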
The occurrence probabilities of words and the similarities between words may be determined in advance and stored in a word similarity database, or may be computed on the fly from the target document being processed.
The above use of mutual information to calculate word similarity is merely an example; statistical methods such as the Log-Likelihood Ratio or the Chi-squared test, as well as knowledge-based methods using a dictionary (e.g., WordNet), may be used instead.
In addition, the similarity between words can be determined before clustering or computed on the fly during clustering.
Clustering is an unsupervised machine learning technique that groups individuals or samples into classes; each sample can be considered as a point in a feature space. The basic idea is to group points that lie close together and densely in the feature space into one class or cluster.
In the word clustering herein, each word is a sample, and the similarity between words plays the role of (inverse) distance between words. Therefore, various existing clustering algorithms, for example the clustering algorithm mentioned in the article "Clustering to Find Exemplar Terms for Keyphrase Extraction" (EMNLP 2009) cited in the background art, can be applied to the present invention.
The number c of clusters obtained by final clustering may be predetermined, such as the number of key words specified by a user or a system, or may be uncertain and determined according to the final operation result of a clustering algorithm.
According to an embodiment of the present invention, where the number of clusters c is the number of key terms specified by the user or the system, the term clustering process may be as follows (a code sketch follows these steps):
1) randomly dividing the words into c clusters;
2) for each word, tentatively putting the word into a different cluster, calculating the change of the objective function, and putting the word into the cluster which enables the increase of the objective function to be maximum;
3) if the value of the objective function does not increase for any word, terminate the procedure and return the current clustering result.
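A Python sketch of this greedy procedure is given below, under the assumption that the objective function is supplied as a callable (for example, formula (3) defined later); the move-acceptance rule and iteration cap are illustrative.

```python
import random

def cluster_words(words, c, objective, max_iter=100):
    """Greedy clustering per steps 1)-3) above. `objective` maps a list of
    clusters (sets of words) to a real number, e.g. formula (3)."""
    clusters = [set() for _ in range(c)]
    for w in words:
        clusters[random.randrange(c)].add(w)          # step 1: random split
    current = objective(clusters)
    for _ in range(max_iter):
        improved = False
        for w in words:
            src = next(k for k in range(c) if w in clusters[k])
            clusters[src].remove(w)
            best_k, best_val = src, None
            for k in range(c):                        # step 2: try each cluster
                clusters[k].add(w)
                val = objective(clusters)
                clusters[k].remove(w)
                if best_val is None or val > best_val:
                    best_k, best_val = k, val
            clusters[best_k].add(w)                   # keep best placement
            if best_val > current:
                improved = True
            current = best_val
        if not improved:                              # step 3: no increase
            break
    return clusters
```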
The objective function in the clustering algorithm is very important and can be designed according to actual needs. According to an embodiment of the present invention, in clustering words, both similarity between words and word scores of respective words are considered, and it is desirable to obtain a more balanced clustering. The balance here can be embodied in, for example, two aspects: the scores and the size distribution of the words in each cluster are balanced; the number distribution of elements, namely words, of each cluster is balanced.
Therefore, according to an embodiment of the present invention, in designing the objective function of the cluster, the above-described balanced factors may be considered in addition to the general similarity between elements within the cluster and the similarity between elements between clusters. The targets of the clusters according to an embodiment of the present invention include at least one of the following items:
1. the fraction and size distribution of each cluster are balanced;
2. the number distribution of elements of each cluster is balanced;
3. the greater the similarity between the intra-cluster elements of each cluster, the better;
4. the smaller the similarity between the elements between clusters, the better.
According to one embodiment of the invention, the objective function of balanced clustering that considers both node scores and word similarity contains the following three factors: the sum of intra-cluster similarities for each cluster; entropy of the distribution of the number of words in each cluster; entropy of distribution of word score sums within clusters for each cluster.
The specific form of the objective function may be selected in many ways, for example, any of the following equations (3), (4), (5) may be used:
$$\sum_{k=1}^{c} \sum_{i,j \in \pi_k} e(i,j) + \alpha \cdot H(n_{1 \sim c}) + \beta \cdot H(S_{1 \sim c}) \qquad (3)$$

$$\sum_{k=1}^{c} \frac{\sum_{i,j \in \pi_k} e(i,j)}{S_k} + \alpha \cdot H(n_{1 \sim c}) \qquad (4)$$

$$\sum_{k=1}^{c} \frac{\sum_{i,j \in \pi_k} e(i,j)}{\sum_{i \in \pi_k} e(i,j)} + \alpha \cdot H(n_{1 \sim c}) + \beta \cdot H(S_{1 \sim c}) \qquad (5)$$

wherein,
e(i, j) is the similarity between words i and j;
π_k is the k-th word cluster;
S_{1~c} is shorthand for S_1, S_2, ..., S_c, where S_k = Σ_{i∈π_k} s_i, i.e. S_k is the sum of the scores s_i of all words i in the k-th word cluster;
n_{1~c} is shorthand for n_1, n_2, ..., n_c, where n_i is the number of words in the i-th word cluster;
H(S_{1~c}) is the entropy of the distribution of the within-cluster word-score sums. One example of an entropy definition is the classical Shannon entropy, defined as shown in equation (6):

$$H(S_{1 \sim c}) = \sum_{k=1}^{c} -\frac{S_k}{\sum_{i=1}^{c} S_i} \log\left(\frac{S_k}{\sum_{i=1}^{c} S_i}\right) \qquad (6)$$

H(n_{1~c}) is the entropy of the distribution of the number of words within each cluster, defined analogously by the Shannon entropy in equation (7):

$$H(n_{1 \sim c}) = \sum_{k=1}^{c} -\frac{n_k}{\sum_{i=1}^{c} n_i} \log\left(\frac{n_k}{\sum_{i=1}^{c} n_i}\right) \qquad (7)$$

α is a pre-specified parameter indicating the importance of balancing the distribution of the number of words across clusters;
β is a pre-specified parameter indicating the importance of balancing the distribution of the word-score sums across clusters.
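As an illustration, the following Python sketch evaluates objective (3) with the Shannon entropies (6) and (7); `sim` and `score` are assumed accessors for e(i, j) and s_i, and the default α and β are arbitrary. With `sim` and `score` fixed (e.g. via functools.partial), it can serve as the `objective` callable in the clustering sketch shown earlier.

```python
import math

def shannon_entropy(values):
    """Shannon entropy of a distribution given as non-negative weights."""
    total = sum(values)
    if total <= 0:
        return 0.0
    return -sum(v / total * math.log(v / total) for v in values if v > 0)

def objective_3(clusters, sim, score, alpha=1.0, beta=1.0):
    """Formula (3): intra-cluster similarity plus two entropy balance terms."""
    cohesion = sum(sim(i, j)
                   for cl in clusters
                   for i in cl for j in cl if i != j)           # sum of e(i, j)
    sizes = [len(cl) for cl in clusters]                        # n_1..n_c
    sums = [sum(score.get(i, 0.0) for i in cl) for cl in clusters]  # S_1..S_c
    return (cohesion
            + alpha * shannon_entropy(sizes)       # H(n_{1~c}), eq. (7)
            + beta * shannon_entropy(sums))        # H(S_{1~c}), eq. (6)
```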
With the clustering method of this embodiment, the balance of cluster sizes and the balance of word-score sums are considered simultaneously. Compared with prior-art clustering that considers only the similarity between words within clusters, a more balanced result can therefore be obtained, and each resulting cluster better reflects a topic of the document.
The clustering process of the above embodiments is merely an example. In fact, a commonly used technique of aggregating one cluster (e.g., a cluster having a smaller number of elements) to another cluster nearby, or dividing one cluster (e.g., a cluster having a larger number of elements) into two clusters, or the like may be used in the present invention.
In addition, the form of the objective function described above is merely an example, and other objective functions, for example, an objective function that also takes into account the inter-element similarity between clusters, may also be used in the present invention.
In addition, in the specific formulas (3), (4), (5) of the objective function, the balance of the word-score sums of the clusters is characterized by the entropy of the distribution of the word-score sums, and the balance of the word-number distribution of the clusters is characterized by the entropy of the distribution of the word numbers. This is merely exemplary; other statistical indexes, such as the Standard Deviation, may also be used to characterize balance.
In step S140, a keyword is assigned to the document based on each cluster obtained by clustering.
After clusters are obtained, the term with the largest term score may simply be selected from each cluster and assigned to the document as the key term. Alternatively, the word with the greatest similarity to other words in the cluster may be selected from each cluster as the key word. Alternatively, the term scores for a term and its similarity to other terms in the cluster may be considered together to decide whether to assign it as a key term to a document.
In addition, a domain dictionary may be incorporated to select key terms to be assigned to a document from the clusters or from the domain dictionary. An example of assigning the key words in conjunction with the domain dictionary will be described in detail below with reference to fig. 2.
With the method of the above embodiment of the present invention, existing specified terms and the results of various existing term extraction algorithms can all be taken into account. This has a great advantage: because the existing specified terms and existing extraction algorithms may capture diverse characteristics of different aspects of a given document, the assigned terms can cover many aspects of the document.
In addition, with the method of the above embodiment, in the execution process of clustering, both the similarity between words and the word scores are considered, and the output word cluster can cover diversified topics in a specified document. The method of this embodiment not only makes the element number distribution of each cluster relatively balanced, but also makes the score and size distribution of each cluster relatively balanced, so that the output word cluster has the following characteristics: topics that are more important to the document may cover more refined sub-topics, and topics that are less important to the document may cover more coarse-grained sub-topics. This property enables the assigned terms to more effectively represent and describe the original document.
A flowchart of a specific example of a document keyword generation method according to another embodiment of the present invention is described below with reference to fig. 2. In fig. 2, dark gray boxes represent inputs, intermediate results or outputs, and white boxes represent processes or operations.
As shown in fig. 2, in step S210, a plurality of preliminary key term sets are obtained. Specifically, step S210 may consist of sub-steps S201, S202, S203, S204. In sub-step S201, the broadest possible set of preliminary candidate terms is generated, for example composed of all nouns and noun phrases in the document; in sub-step S202, a plurality of preliminary key term sets are generated using a plurality of existing term extraction algorithms; in sub-step S203, the set of key terms or index terms specified by the document author, if any, is obtained; in sub-step S204, the set of user-specified "seed" terms, if any, is obtained.
In step S220, a candidate keyword set with initial word scores is constructed using the plurality of preliminary keyword sets obtained in step S210, i.e., each candidate keyword in the candidate keyword set has an initial word score. As described above with reference to fig. 1, the initial word score may be obtained from the credibility of the keyword extraction algorithm itself and the score of each word in the preliminary keyword set obtained by that algorithm, or may be specified in advance based on experience. In addition, as previously described, operations such as removing redundant words, filtering out stop words, and weighting word scores according to the number of times a word occurs may be performed at this stage.
In step S230, a word recommendation map is constructed using the individual words in the candidate keyword set. Specifically, each node of the word recommendation graph corresponds to each word in the candidate key word set, links between word nodes in the word recommendation graph are established according to recommendation relations among the words and have weights corresponding to the recommendation relations, and each word node has an initial score, that is, an initial word score of the word corresponding to the node.
There may be many kinds of word recommendation relationships; four examples are given below:
The first is a high-weight relationship formed between child nodes of the same parent node in a sentence's grammar tree. Consider the following sentence:
“The My Colors features include positive film, lighter skin tone, darker skin tone, vivid blue, vivid green, vivid red, color accent and color swap.”
in the syntax tree of this sentence, the following terms have a common parent node, and thus this relationship holds among them: "positive film", "lighter skin tone", "darker skin tone", "vivid blue", "vivid green", "vivid red", "color accent", and "color swap".
The second word recommendation relationship exists between any two words in the same sentence, or between two words in the same sentence whose distance is smaller than a certain threshold.
A third word recommendation relationship is a recommendation of words in sentences in the middle of a paragraph to words in the beginning and end subject sentences of the paragraph.
The fourth word recommendation relationship is a recommendation of words in the title of an article by words throughout and a recommendation of words in the content of a chapter by words in the title of the chapter.
Other word recommendation relationships may be considered by those skilled in the art as desired and a predetermined weight may be assigned to each word recommendation relationship.
If one or more word recommendation relationships hold between two words, a link is established between the two words, and the weight of the link may be set to a weighted average of the weights of the different word recommendation relationships.
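A small Python sketch of such graph construction follows. The relationship names and weights are hypothetical, and links are treated as symmetric for simplicity, although some of the relationships above (e.g., title-to-content) are naturally directional.

```python
# Illustrative relationship weights; not values prescribed by the patent.
REL_WEIGHTS = {"same_parent": 1.0, "same_sentence": 0.5, "title": 0.8}

def build_recommendation_graph(words, relationship_pairs):
    """relationship_pairs: dict rel_name -> iterable of (word_a, word_b) pairs.

    Returns an adjacency dict graph[j][i] = w_{j,i}; the link weight is the
    average of the weights of all relationships holding between the pair."""
    graph = {w: {} for w in words}
    hits = {}
    for rel, pairs in relationship_pairs.items():
        for a, b in pairs:
            key = tuple(sorted((a, b)))
            hits.setdefault(key, []).append(REL_WEIGHTS[rel])
    for (a, b), weights in hits.items():
        w = sum(weights) / len(weights)   # weighted average over relationships
        graph[a][b] = w                   # treated as symmetric for simplicity
        graph[b][a] = w
    return graph
```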
In step S240, the word scores are propagated on the word recommendation graph. In a sense, propagating word scores on the word recommendation graph can be seen as a simulation of recommendation in human social relationships. In a social network, a person's standing is determined by two factors: the person's own merit, and the recommendations the person receives from others. The strength of one person's recommendation of another depends in turn on two factors: the importance of the recommender, and the closeness of the relationship between the recommender and the person being recommended.
In particular, the term score propagation may be an iterative process in which each node passes its score to its neighbors (the node acts as a recommender and its neighbors as recommendees) and also receives score from its neighbors (the node acts as a recommendee and its neighbors as recommenders). After each round of propagation, the score of each word node consists of two parts: the first part is the retained portion of its original score, and the second part is the score obtained from neighbor nodes, which may depend on the scores of the neighbor nodes and the weights of the links between the word node and those neighbors.
After all word scores have gradually stabilized or after the maximum number of iterations has been reached, the process of score propagation may terminate and return the score of the current word as the final word score.
According to an embodiment of the present invention, the score of node i after each iteration can be calculated by the following score propagation formula:

$$s_i^{(l+1)} = a + b \cdot s_i^{(0)} + (1 - a - b) \cdot \sum_{j \in N(i)} g\left(w_{j,i},\, d_i,\, d_j,\, D_i,\, D_j,\, s_j^{(l)}\right) \qquad (8)$$

wherein,
a is a pre-specified smoothing parameter;
b is a pre-specified parameter indicating the proportion of the initial score that is retained;
s_i^{(0)} is the initial merged score of word i;
s_i^{(l)} is the score of word i after the l-th iteration;
s_i^{(l+1)} is the score of word i after the (l+1)-th iteration;
N(i) is the set of neighbors of the i-th word node in the word graph;
w_{j,i} is the weight of the link from word node j to word node i in the word graph;
d_i is the degree of word node i, equal to the number of links pointing from node i to other nodes;
D_i is the weighted degree of word node i, equal to the sum of the weights of the links pointing from node i to other nodes.
Note that 0 ≤ a ≤ 1, 0 ≤ b ≤ 1, and a + b ≤ 1.
g(w_{j,i}, d_i, d_j, D_i, D_j, s_j^{(l)}) is a predefined function, herein called the recommendation function, which may satisfy the following properties: 1. it is monotonically increasing in w_{j,i} and s_j^{(l)}; 2. it is monotonically decreasing in d_i, d_j, D_i, D_j. There are various implementations of this function; three specific examples are as follows:

$$g\left(w_{j,i}, d_i, d_j, D_i, D_j, s_j^{(l)}\right) = \frac{w_{j,i}\, s_j^{(l)}}{d_i\, d_j} \qquad (9)$$

$$g\left(w_{j,i}, d_i, d_j, D_i, D_j, s_j^{(l)}\right) = \frac{w_{j,i}\, s_j^{(l)}}{D_i\, D_j} \qquad (10)$$

$$g\left(w_{j,i}, d_i, d_j, D_i, D_j, s_j^{(l)}\right) = \frac{2\, w_{j,i}\, s_j^{(l)}}{D_i + D_j} \qquad (11)$$
however, it should be clear to those skilled in the art that the score propagation formula (8) and the recommendation formulas (9), (10) and (11) are only exemplary and not intended to limit the present invention, and other formula forms may be designed as needed.
Through step S240, word scores of the respective words in the candidate keyword set are finally determined.
Then, in step S250, a word similarity map is constructed to obtain the similarity between the words in the candidate keyword set. In the word similarity graph, each node corresponds to each word in the candidate key word set, every two nodes are linked, and the weight of the link represents the similarity of the words of the two nodes. The calculation of the similarity between words of two nodes is substantially the same as the calculation of the similarity between words described with reference to step S130 of fig. 1, and is not repeated here.
Next, proceeding to step S260, the nodes in the word similarity graph are subjected to balanced clustering, where each node in the word similarity graph may have a word score of a corresponding word based on the word recommendation graph obtained in step S240.
Illustratively, as described above with reference to step 130 of FIG. 1, the objective function of the clustering process herein may employ the aforementioned formula (3), i.e.
$$\sum_{k=1}^{c} \sum_{i,j \in \pi_k} e(i,j) + \alpha \cdot H(n_{1 \sim c}) + \beta \cdot H(S_{1 \sim c})$$
Three terms in the formula respectively represent the sum of the intra-cluster similarity of each cluster; entropy of the distribution of the number of words in each cluster; entropy of distribution of word score sums within clusters for each cluster.
According to the present example, the number of clusters is determined to be equal to the number of key terms to be obtained, and is set as c, and as described above, the specific clustering process is:
1) randomly dividing the words into c clusters;
2) for each word, tentatively putting the word into a different cluster, calculating the change of the objective function, and putting the word into the cluster which enables the increase of the objective function to be maximum;
3) if the value of the objective function does not increase for any word, or the number of iterations reaches a pre-specified maximum, terminate the procedure and return the current clustering result.
The clustering method of the embodiment not only enables the element number distribution of each cluster to be balanced, but also enables the score and size distribution of each cluster to be balanced, so that the output word cluster can cover diversified themes in the appointed document.
After clustering, proceeding to step S270, the document key words are selected based on the clustered individual clusters and the domain dictionary.
A domain dictionary contains the terms most commonly used in a given domain; these terms can be determined by a domain expert or selected through experiments. A domain dictionary is borrowed here based on the following consideration: in some cases, a word in the domain dictionary that has a high similarity or relevance to the existing words in a cluster may describe the topic of the document better than the existing words in the cluster themselves.
Specifically, as a first example, the process of selecting the document key words based on the clustered clusters and the domain dictionary may be performed as follows:
for each cluster,
calculating the selection score of each word in the cluster based on the word score of the word and the similarity between the word and other words in the cluster;
calculating a selection score of each word in the predetermined domain dictionary based on the similarity between the word and each word in the cluster; and
from the respective words of the domain dictionary and the respective words of the cluster, the word having the largest selection score is selected as the key word of the document.
For another example, as a second example, the process of selecting the document key words based on the clustered clusters and the domain dictionary may also be performed as follows:
for each cluster,
calculating the selection score of each word in the cluster based on the word score of each word in the cluster, the proportion of the word score to the sum of word scores in the cluster and the similarity between the word and other words in the cluster;
calculating a selection score of each word in the predetermined domain dictionary based on the similarity between the word and each word in the cluster; and
selecting a first predetermined number of words having a maximum selection score from the respective words of the domain dictionary and the respective words of the cluster,
and uniformly sorting the selected first preset number of terms of each cluster in a descending order according to the selection scores, and selecting a second preset number of terms which are sorted at the top as key terms of the document.
In the first example, assuming that 5 clusters are obtained after clustering, each word cluster is concentrated into one key word by means of the domain dictionary, and finally 5 key words are obtained.
In the second example, also assuming that 5 clusters are obtained after clustering, for each cluster a number of words, for example 2, are selected from the cluster or the domain dictionary, giving 10 words in total. The 10 words are then sorted from high to low by word selection score, and a user- or system-specified number of top-scoring words, for example the 8 highest-scoring words, are selected as the document key words.
The term selection score of a term characterizes how suitable the term is to be selected as a document keyword.
One example of calculating word selection scores for words in the cluster and words in the domain dictionary is given below.
Assume that the set of scored words generated in the preceding step S240 is denoted {(t_1, s_1), (t_2, s_2), ..., (t_u, s_u)}, where t_i is the i-th word, s_i is the score of the i-th word, and u is the number of words in the candidate key word set. Suppose the domain dictionary is written as {t_{u+1}, t_{u+2}, ..., t_{u+v}}, where t_{u+i} is the i-th word in the domain dictionary and v is the number of words in the domain dictionary. Here we assume that the word score of each word in the domain dictionary is s_{u+1} = s_{u+2} = ... = s_{u+v} = 0. For a word cluster k and a word i in the word cluster or the domain dictionary (i ∈ π_k or u + 1 ≤ i ≤ u + v, where π_k denotes the k-th word cluster), one definition of the word selection score can be calculated by the following equation (12):

$$\frac{1}{n_k} \sum_{j \in \pi_k} e(i,j) + \gamma \cdot s_i + \eta \cdot \frac{s_i}{\sum_{j \in \pi_k} s_j} \qquad (12)$$

where n_k is the number of words in word cluster k; γ is a pre-specified parameter representing the importance of the word score in the word selection score; η is a pre-specified parameter representing the importance of the proportion of the word score to the sum of word scores in the cluster.

The above calculation formula (12) of the word selection score is merely an example. Other calculation formulas may be designed as desired. For example, the word selection score may alternatively be calculated using the following formula (13):

$$\frac{1}{n_k} \sum_{j \in \pi_k} e(i,j) + \gamma \cdot s_i + \eta \cdot \frac{s_i}{s_{\max(k)}} \qquad (13)$$

where n_k is the number of words in word cluster k; s_{max(k)} is the maximum word score in word cluster k; γ is a pre-specified parameter representing the importance of the word score in the word selection score; η is a pre-specified parameter representing the importance of the ratio of the word score to the maximum word score in the cluster.
In the above calculation formulas (12) and (13) of the word selection score, three factors are considered, namely the word score of each word, the proportion of the word score to the sum of the word scores in the cluster, and the similarity between the word and other words in the cluster. However, this is merely an example, and only one or two of the factors may be considered, or more other factors may also be considered as desired.
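The following Python sketch combines formula (12) with the two-stage selection of the second example. It assumes `score` is a dict in which domain-dictionary words are absent or zero, as stated above; whether j = i is included in the cohesion sum, and deduplication of the final list, are left as simple illustrative choices.

```python
def selection_score(i, cluster, sim, score, gamma=1.0, eta=1.0):
    """Formula (12); dictionary words absent from `score` default to 0."""
    n_k = len(cluster)
    cohesion = sum(sim(i, j) for j in cluster) / n_k
    total = sum(score.get(j, 0.0) for j in cluster) or 1.0   # avoid /0
    s_i = score.get(i, 0.0)
    return cohesion + gamma * s_i + eta * s_i / total

def pick_keywords(clusters, dictionary, sim, score, per_cluster=2, top_n=8):
    """Two-stage selection: take the `per_cluster` best-scoring words per
    cluster (cluster words plus dictionary words), then sort all picks
    globally by selection score and keep the top `top_n`."""
    ranked = []
    for cluster in clusters:
        pool = set(cluster) | set(dictionary)
        picks = sorted(pool,
                       key=lambda w: selection_score(w, cluster, sim, score),
                       reverse=True)[:per_cluster]
        ranked += [(selection_score(w, cluster, sim, score), w) for w in picks]
    ranked.sort(reverse=True)            # uniform descending sort over clusters
    return [w for _, w in ranked[:top_n]]
```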
In this embodiment, by selecting document key words from a domain dictionary or cluster, words that are not in the original document but that better describe some aspect of the original document can be output as key words.
FIG. 3 illustrates a block diagram of a document keyword generation apparatus according to one embodiment of the present invention.
The document keyword generation means may include: a candidate keyword set obtaining unit 102, configured to obtain a plurality of preliminary keyword sets, and merge the plurality of preliminary keyword sets to obtain a candidate keyword set; a word score determining component 103 for determining a word score for each word in the set of candidate key words; a clustering component 104 for clustering the words based on the word scores of the words and the similarity between the words; and a keyword assignment section 105 for assigning a keyword to the document based on each cluster obtained by the clustering.
The clustering component 104 can calculate an objective function for the cluster based on three factors: sum of similarity between words in each cluster; entropy of the distribution of the number of words in each cluster; and entropy of the distribution of word score sums within clusters for each cluster.
The plurality of preliminary keyword sets may include at least one of: the method comprises the steps of obtaining a plurality of corresponding key word sets through a plurality of existing keyword extraction algorithms, obtaining a set consisting of nouns and noun phrases in a document, obtaining a set consisting of key words or index words specified by a document author, and obtaining a set consisting of 'seed' words specified by a user.
The term score determination component 103 may determine the term scores of the individual terms in the set of candidate key terms as follows: a word recommendation graph is constructed from the terms in the initial candidate key term set, where the nodes of the graph correspond to the terms, the links between word nodes are established according to the recommendation relationships among the terms and carry weights corresponding to those relationships, and each word node has an initial score. The score of each word node is then iteratively propagated to adjacent word nodes, with each word node retaining a portion of its original score after each round of propagation, and the propagation process stops at convergence or upon reaching a certain maximum number of iterations.
The score obtained by each word node after each round of propagation from a neighbor node may depend on the score of the neighbor node and the weight of the link between the word node and the neighbor node.
If two words belong to a sentence, the weight of the link between the two words may depend on the grammatical relation between the two words in the grammar tree of the sentence.
Furthermore, the similarity between two words may be determined based on co-occurrence statistics of the two words or words contained by the two words in a predetermined set of documents.
The keyword assignment component 105 assigning keywords to documents may include: for each cluster, calculating a selection score of each word in the cluster based on the word score of the word and the similarity between the word and other words in the cluster; calculating a selection score of each word in the predetermined domain dictionary based on the similarity between the word and each word in the cluster; and selecting the word with the largest selection score from the words of the domain dictionary and the words of the cluster as the key word of the document.
The keyword assignment component 105 assigning keywords to documents may include: for each cluster, calculating the selection score of each word in the cluster based on the word score of the word in the cluster, the proportion of the word score to the sum of word scores in the cluster and the similarity between the word and other words in the cluster; calculating a selection score of each word in the predetermined domain dictionary based on the similarity between the word and each word in the cluster; and selecting a first preset number of words with the largest selection scores from all words of the domain dictionary and all words of the clusters, uniformly sorting the selected first preset number of words aiming at all clusters in a descending order according to the selection scores, and selecting a second preset number of words which are sorted at the top as key words of the document.
Finally, with reference to fig. 4, a description will be given of an example of a hardware configuration of the document keyword generation apparatus that performs the above-described processing. A CPU (central processing unit) 701 performs various processes in accordance with a program stored in a ROM (read only memory) 702 or a storage section 708. For example, the CPU executes the program of document keyword generation described in the above-described embodiments. A RAM (random access memory) 703 stores, as appropriate, programs, data, and the like executed by the CPU 701. The CPU 701, ROM 702, and RAM 703 are connected to each other via a bus 704.
The CPU 701 is connected to an input/output interface 705 via the bus 704. An input section 706 including a keyboard, a mouse, a microphone, and the like, and an output section 707 including a display, a speaker, and the like are connected to the input/output interface 705. The CPU 701 executes various processes in accordance with instructions input from the input section 706, and outputs the results of the processing to the output section 707.
The storage section 708 connected to the input/output interface 705 includes, for example, a hard disk, and stores programs executed by the CPU701 and various data. The communication section 709 communicates with an external device through a network such as the internet, a local area network, or the like.
A drive 710 connected to the input/output interface 705 drives a removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, and obtains a program, data, and the like recorded therein. The obtained program and data are transferred to the storage section 708 as needed, and stored therein.
While the principles of the invention have been described above in connection with specific embodiments, it will be understood by those skilled in the art that all or any of the steps or components of the method and apparatus of the invention may be implemented in hardware, firmware, software, or any combination thereof, on any computing device (including processors, storage media, and the like) or network of computing devices; this can be accomplished by those skilled in the art using their basic programming skills after reading the description of the invention.
Thus, the objects of the invention may also be achieved by running a program or a set of programs on any computing device, which may be a well-known general-purpose device. Accordingly, the objects of the invention may also be achieved merely by providing a program product containing program code that implements the method or apparatus. That is, such a program product also constitutes the present invention, as does a storage medium storing such a program product. Obviously, the storage medium may be any known storage medium or any storage medium developed in the future.
It should further be noted that, in the apparatus and method of the present invention, the components and steps may obviously be decomposed and/or recombined; such decompositions and/or recombinations are to be regarded as equivalent schemes of the present invention. Also, the steps of the above-described series of processes may naturally be performed chronologically in the order described, but they need not be: some steps may be performed in parallel or independently of one another. For example, the step of determining word scores and the step of calculating inter-word similarity may be performed sequentially in either order, in parallel, or independently.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A document keyword generation method, comprising:
obtaining a plurality of preliminary keyword sets, and merging the plurality of preliminary keyword sets to obtain a candidate keyword set;
determining a word score for each word in the candidate keyword set;
clustering the words based on the word scores of the words and the similarity between the words, such that the distribution of word score sums across the clusters and the distribution of the number of words across the clusters are balanced; and
assigning keywords to the document based on the clusters obtained by the clustering.
2. The keyword generation method according to claim 1, wherein an objective function of the clustering is calculated based on the following three factors (see the sketch after this list):
the sum of the similarities between words within each cluster;
the entropy of the distribution of the number of words across the clusters; and
the entropy of the distribution of the word score sums across the clusters.
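The following minimal sketch illustrates one way to combine these three factors into a single objective. Equal weighting of the terms and the natural-log entropy are assumptions for illustration; the claim does not specify how the factors are combined.

import math

def entropy(values):
    total = sum(values)
    if total <= 0:
        return 0.0
    probs = [v / total for v in values if v > 0]
    return -sum(p * math.log(p) for p in probs)

def clustering_objective(clusters, word_scores, similarity):
    """clusters: list of word lists; higher values are better in this sketch."""
    # Factor 1: total intra-cluster similarity (cluster cohesion).
    cohesion = sum(similarity(a, b)
                   for cluster in clusters
                   for a in cluster for b in cluster if a < b)
    # Factor 2: entropy of the word-count distribution (balanced cluster sizes).
    size_entropy = entropy([len(c) for c in clusters])
    # Factor 3: entropy of the per-cluster word-score sums (balanced score mass).
    score_entropy = entropy([sum(word_scores[w] for w in c) for c in clusters])
    # Equal weighting of the three factors is an illustrative assumption.
    return cohesion + size_entropy + score_entropy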
3. The keyword generation method according to claim 1, wherein the plurality of preliminary keyword sets comprise at least one of: a plurality of keyword sets obtained by a plurality of existing keyword extraction algorithms, a set consisting of the nouns and noun phrases in the document, a set consisting of keywords or index words specified by the document author, and a set consisting of "seed" words specified by a user.
4. The keyword generation method according to claim 1, wherein determining a word score for each word in the candidate keyword set comprises:
constructing a word recommendation graph from the words in the candidate keyword set, with nodes of the graph corresponding to the words, wherein links between word nodes are established according to recommendation relationships among the words and carry weights corresponding to those relationships; giving each word node an initial score; and iteratively propagating the score of each word node to its neighboring word nodes, wherein each word node retains a part of its original score after each round of propagation, and the propagation process stops upon convergence or after a certain maximum number of iterations.
5. The keyword generation method according to claim 4, wherein the score that each word node obtains from a neighboring node after each round of propagation depends on the score of the neighboring node and the weight of the link between the word node and the neighboring node.
6. The keyword generation method according to claim 5, wherein, if two words belong to the same sentence, the weight of the link between the two words depends on their grammatical relationship in the syntax tree of the sentence.
7. The keyword generation method according to claim 1, wherein the similarity between two words is determined based on co-occurrence statistics, in a predetermined document set, of the two words or of the words they contain.
8. The keyword generation method according to claim 1, wherein assigning keywords to the document comprises:
for each cluster,
calculating a selection score for each word in the cluster based on the word score of the word and the similarity between the word and the other words in the cluster;
calculating a selection score for each word in a predetermined domain dictionary based on the similarity between the word and each word in the cluster; and
selecting, from the words of the domain dictionary and the words of the cluster, the word with the largest selection score as a keyword of the document.
9. The keyword generation method according to claim 1, wherein assigning keywords to the document comprises:
for each cluster,
calculating a selection score for each word in the cluster based on the word score of the word, the proportion of that word score to the sum of the word scores in the cluster, and the similarity between the word and the other words in the cluster;
calculating a selection score for each word in the predetermined domain dictionary based on the similarity between the word and each word in the cluster; and
selecting, from the words of the domain dictionary and the words of the cluster, a first predetermined number of words with the largest selection scores; and
sorting the words selected for all of the clusters together in descending order of selection score, and selecting a second predetermined number of top-ranked words as the keywords of the document.
10. A document keyword generation apparatus, comprising:
a candidate keyword set obtaining component configured to obtain a plurality of preliminary keyword sets and merge the plurality of preliminary keyword sets to obtain a candidate keyword set;
a word score determining component configured to determine a word score for each word in the candidate keyword set;
a clustering component configured to cluster the words based on the word scores of the words and the similarity between the words, such that the distribution of word score sums across the clusters and the distribution of the number of words across the clusters are balanced; and
a keyword assignment component configured to assign keywords to the document based on the clusters obtained by the clustering.