CN104572736A

CN104572736A - Keyword extraction method and device based on social networking services

Info

Publication number: CN104572736A
Application number: CN201310503897.5A
Authority: CN
Inventors: 赵立永; 于晓明; 杨建武
Original assignee: Peking University; Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Current assignee: Peking University; Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Priority date: 2013-10-23
Filing date: 2013-10-23
Publication date: 2015-04-29

Abstract

The present invention provides a method and device for extracting keywords based on a social network. The method includes: segmenting the text to be extracted into words, and counting the word frequency of each word and the number of texts corresponding to the word; calculating the word weight according to the word frequency and the number of texts corresponding to the word; selecting a first preset number of words with larger word weights as candidate keywords; and extracting, from the candidate keywords, a second preset number of candidate keywords that appear more frequently in the text to be extracted as keywords. The present invention performs noise filtering, text deduplication, word segmentation, and word weight calculation on the text to be extracted, and then extracts keywords according to the word weights. Because a large amount of historical search information is not needed, the extraction speed is improved.

Description

Keyword extraction method and device based on social network

Technical Field

The present invention relates to the technical field of keyword extraction, and in particular to a method and device for extracting keywords based on a social network.

Background

Keywords, as the subject words that social network users collectively follow and use, can cover a large amount of information. Extracting keyword information from massive volumes of social text not only makes it possible to learn promptly which topics users are collectively concerned about, but also helps users keep up with current hot topics. Keyword extraction can therefore effectively address the problem of information overload and provide users with fast and convenient information services.

The commonly used keyword extraction method is as follows: obtain the historical search information of a large number of users, and extract keywords according to that historical search information and the subject words that appear frequently in web page content.

However, the current method relies heavily on users' search information. A large amount of historical search information must be obtained before keywords can be extracted accurately, so the extraction speed is low.

Summary of the Invention

(1) Technical problem to be solved

The technical problem solved by the present invention is how to avoid the need to obtain a large amount of historical search information in the process of extracting keywords.

(2) Technical solution

To solve the above technical problem, the present invention provides a method for extracting keywords based on a social network, including:

segmenting the text to be extracted into words, and counting the word frequency of each word and the number of texts corresponding to the word;

calculating the word weight according to the word frequency and the number of texts corresponding to the word, selecting a first preset number of words with larger word weights as candidate keywords, and extracting, from the candidate keywords, a second preset number of candidate keywords that appear more frequently in the text to be extracted as keywords.

Preferably, before the word segmentation of the text to be extracted, the method further includes: performing noise filtering on the text to be extracted, and deduplicating the filtered text;

and/or,

the step of performing word segmentation on the text to be extracted further includes: performing part-of-speech tagging on the segmented words, where the part of speech is either a first part of speech that meets the extraction rules or a second part of speech that does not meet the extraction rules; in that case, selecting the first preset number of words with larger word weights as candidate keywords includes: selecting, from the words whose part of speech is the first part of speech, the first preset number of words with larger word weights as candidate keywords.

Preferably, performing noise filtering on the text to be extracted specifically includes:

traversing the text to be extracted according to the set noise filtering rules and matching the characters in the text to be extracted; if characters in the text to be extracted fall under the noise filtering rules, the match succeeds, and the successfully matched characters are deleted;

and/or,

deduplicating the filtered text specifically includes:

mapping the currently filtered text into fingerprint information, and comparing the currently filtered text with a fingerprint information database; if the number of differing fingerprints in the comparison result is less than or equal to a third preset value, deleting the currently filtered text; otherwise, adding the fingerprint information of the currently filtered text to the fingerprint information database.

Preferably, counting the word frequency of each word and the number of texts corresponding to the word further includes:

assigning a word index number to each distinct word, and saving the word index number and the features of the word corresponding to that index number in a word index table;

assigning a text index number to each deduplicated text, and, following the positional order of the words in the deduplicated text, saving the text index number of the deduplicated text and the word index numbers of the words in that text in a text index table;

where the features of a word include: the word frequency of the word, the number of texts corresponding to the word, the part of speech, and the word weight.

Preferably, calculating the word weight specifically includes:

calculating the word weight of a word according to the following formula:

$weight(term) = b(term) \cdot a(term) \cdot tf(term) \cdot \log \frac{|d|}{1 + df(term)},$

where weight(term) is the word weight, b(term) and a(term) are empirical correction values, tf(term) is the word frequency of the word, df(term) is the number of texts corresponding to the word, and |d| is the total number of texts.
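As a purely illustrative calculation, the formula can be evaluated with made-up numbers. The values a(term) = b(term) = 1, tf(term) = 10, df(term) = 4, and |d| = 100 are hypothetical and do not come from the patent, and a natural logarithm is assumed because the base of the logarithm is not stated:

```latex
weight(term) = 1 \cdot 1 \cdot 10 \cdot \ln\frac{100}{1 + 4} = 10 \ln 20 \approx 29.96
```

Under these assumptions, a word that occurs often (large tf) but in few texts (small df) receives a larger weight than an equally frequent word spread across most texts.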

To solve the above technical problem, the present invention further provides a device for extracting keywords based on a social network, including:

a word segmentation module, configured to perform word segmentation on the text to be extracted and transmit the segmented words to a statistics module;

the statistics module, configured to count the word frequency of each word and the number of texts corresponding to the word, and transmit the statistical result to a calculation module;

the calculation module, configured to calculate the word weight according to the word frequency and the number of texts corresponding to the word, and transmit the calculation result to a selection module;

the selection module, configured to select a first preset number of words with larger word weights as candidate keywords, and transmit the selection result to an extraction module;

the extraction module, configured to extract, from the candidate keywords, a second preset number of candidate keywords that appear more frequently in the text to be extracted as keywords.

Preferably, the device further includes:

a noise filtering module, configured to perform noise filtering on the text to be extracted and transmit the filtered text to a text deduplication module;

the text deduplication module, configured to deduplicate the filtered text;

and/or,

a part-of-speech tagging module, configured to perform part-of-speech tagging on the segmented words, where the part of speech is either a first part of speech that meets the extraction rules or a second part of speech that does not meet the extraction rules, and to transmit the tagging result to the selection module;

the selection module being further configured to select, from the words whose part of speech is the first part of speech, the first preset number of words with larger word weights as candidate keywords.

Preferably, the noise filtering module includes:

a setting submodule, configured to set noise filtering rules and transmit the set noise filtering rules to a matching submodule;

a traversal submodule, configured to traverse the text to be extracted and transmit the traversal result to the matching submodule;

the matching submodule, configured to match the characters in the text to be extracted according to the set noise filtering rules, where the match succeeds if characters in the text to be extracted fall under the noise filtering rules, and to transmit the successfully matched characters to a first deletion submodule;

the first deletion submodule, configured to delete the successfully matched characters;

and/or,

the text deduplication module includes:

a mapping submodule, configured to map the currently filtered text into fingerprint information and transmit the mapping result to a comparison submodule;

the comparison submodule, configured to compare the currently filtered text with the fingerprint information database, to transmit the currently filtered text to a second deletion submodule if the number of differing fingerprints in the comparison result is less than or equal to the third preset value, and to transmit the currently filtered text to a saving submodule if that number is greater than the third preset value;

the second deletion submodule, configured to delete the currently filtered text;

the saving submodule, configured to add the fingerprint information of the currently filtered text to the fingerprint information database.

Preferably, the device further includes:

an allocation module, configured to assign a word index number to each distinct word and a text index number to each deduplicated text, and to transmit the allocation results to a saving module;

the saving module, configured to save the word index number and the features of the word corresponding to that index number in a word index table, and, following the positional order of the words in the deduplicated text, to save the text index number of the deduplicated text and the word index numbers of the words in that text in a text index table.

Preferably, the calculation module is configured to calculate the word weight of a word according to the following formula:

$weight(term) = b(term) \cdot a(term) \cdot tf(term) \cdot \log \frac{|d|}{1 + df(term)}$

(3) Beneficial effects

By providing a method and device for extracting keywords based on a social network, the present invention does not rely on a large amount of historical search information but extracts keywords directly from the text to be extracted: the text is segmented into words, word weights are calculated, and keywords are extracted according to the word weights. Because a large amount of historical search information is not needed, the extraction speed is improved.

Description of Drawings

Fig. 1 is a flowchart of the method provided by Embodiment 1 of the present invention;

Fig. 2 is a flowchart of the method provided by Embodiment 2 of the present invention;

Fig. 3 is a schematic diagram of the word index table provided by Embodiment 2 of the present invention;

Fig. 4 is a schematic diagram of the text index table provided by Embodiment 2 of the present invention;

Fig. 5 is a schematic structural diagram of the device provided by Embodiment 3 of the present invention;

Fig. 6 is a schematic structural diagram of the noise filtering module provided by Embodiment 3 of the present invention;

Fig. 7 is a schematic structural diagram of the text deduplication module provided by Embodiment 3 of the present invention.

Detailed Description

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Embodiment 1

To solve the problem in the prior art that extracting keywords requires a large amount of historical data, this embodiment of the present invention provides a method for extracting keywords based on a social network. As shown in Fig. 1, the method includes:

Step 101: segmenting the text to be extracted into words, and counting the word frequency of each word and the number of texts corresponding to the word;

Step 102: calculating the word weight according to the word frequency and the number of texts corresponding to the word, selecting a first preset number of words with larger word weights as candidate keywords, and extracting, from the candidate keywords, a second preset number of candidate keywords that appear more frequently in the text to be extracted as keywords.

This embodiment does not rely on a large amount of historical search information but extracts keywords directly from the text to be extracted: the text is segmented into words, word weights are calculated, and keywords are extracted according to the word weights. Because a large amount of historical search information is not needed, the extraction speed is improved.

In this embodiment, performing noise filtering and deduplication on the text to be extracted in advance improves the fairness of keyword extraction. According to the tagged parts of speech, the first preset number of words with larger word weights are selected as candidate keywords from the words whose part of speech is the first part of speech that meets the extraction rules, which improves the accuracy of keyword extraction.

In this embodiment, noise filtering rules are set, the text to be extracted is traversed, and the characters in it are matched; if characters in the text to be extracted fall under the noise filtering rules, the match succeeds and the matched characters are deleted. Noise filtering therefore improves processing efficiency and the accuracy of keyword extraction. Because the noise-filtered text still contains some repeated texts, deduplication leaves only non-repeated texts, which reduces the complexity of the text and improves the precision of keyword extraction.

In this embodiment, establishing a word index table and a text index table makes the subsequent keyword extraction more convenient.

Embodiment 2

To solve the problems of the prior art, the second embodiment of the present invention provides a method for extracting keywords based on a social network. As shown in Fig. 2, the method includes:

Step 201: performing noise filtering on the text to be extracted;

In this embodiment, the text to be extracted contains a large amount of invalid information, which not only reduces processing efficiency but also degrades the keyword extraction results. Therefore:

First, noise filtering rules are set;

In this embodiment, the noise filtering rules that are set cover: a. emoticon noise (generally appearing in the form "[text]"); b. "html tag" noise; c. "username" noise; d. "//username" noise.

Second, the text to be extracted is traversed, and the characters in it are matched against the above noise filtering rules;

Finally, if characters in the text to be extracted fall under one of the above noise filtering rules, the match succeeds and the matched characters are deleted.

The filtered text is obtained through the above steps; using the filtered text improves processing efficiency and the accuracy of keyword extraction.
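The four rule types in step 201 could be sketched with regular expressions as follows. This is a minimal sketch, not the patented implementation: the concrete patterns, the function name `filter_noise`, and the reading of "username" noise as an "@name" mention and of "//username" noise as a "//@name:" repost marker are all assumptions, since the text names the rule types without giving their exact form.

```python
import re

# Hypothetical patterns for the four noise types (a-d); the exact rules are
# not given in the source text, so these are illustrative assumptions.
NOISE_PATTERNS = [
    re.compile(r"<[^>]+>"),               # b. HTML tags
    re.compile(r"\[[^\[\]]+\]"),          # a. emoticons written as "[text]"
    re.compile(r"//\s*@[\w-]+[:：]?"),    # d. "//username", assumed "//@name:"
    re.compile(r"@[\w-]+"),               # c. "username", assumed "@name"
]


def filter_noise(text):
    """Delete every span that matches a noise rule, then tidy whitespace."""
    for pattern in NOISE_PATTERNS:
        text = pattern.sub("", text)
    # Collapse the gaps left behind by the deleted spans.
    return " ".join(text.split())


print(filter_noise("Great game [smile] <b>today</b> //@alice: go @bob"))
```

The repost pattern is applied before the mention pattern so that a marker such as "//@alice:" is removed as a whole rather than leaving a stray "//" behind. Note that `\w` also matches Chinese characters, so a mention that is not followed by a space or punctuation would swallow the text after it; a production rule set would need a stricter user-name pattern.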

Step 202: deduplicating the filtered text;

Because texts are forwarded (reposted) from one another, the filtered text contains a large amount of repetition. To reduce the unfairness that repeated content introduces into the word weight calculation, the filtered text needs to be deduplicated.

The deduplication method includes: mapping the currently filtered text into fingerprint information, and comparing the currently filtered text with the fingerprint information database; if the number of differing fingerprints in the comparison result is less than or equal to a preset value, the currently filtered text is deleted; otherwise, the fingerprint information of the currently filtered text is added to the fingerprint information database.

For example, the currently filtered text is mapped into 6-digit fingerprint information; if the number of differing fingerprint digits in the comparison result is less than or equal to 3, the text is deleted; otherwise, the fingerprint information of the currently filtered text is added to the fingerprint information database.
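The deduplication loop of step 202 might look like the sketch below. The text does not say how a text is mapped to a fingerprint, so this sketch swaps in a SimHash-style binary fingerprint over character bigrams and counts differing bits (Hamming distance); the functions `fingerprint`, `hamming`, and `deduplicate`, the use of MD5, and the 64-bit default are all assumptions made for illustration.

```python
import hashlib


def fingerprint(text, bits=64):
    """SimHash-style fingerprint over character bigrams (assumed mapping).

    bits must not exceed 64 because only 8 bytes of each digest are used.
    """
    counts = [0] * bits
    grams = [text[i:i + 2] for i in range(max(len(text) - 1, 1))]
    for gram in grams:
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(texts, max_diff=3, bits=64):
    """Keep a text only if no stored fingerprint is within max_diff bits."""
    library = []  # plays the role of the fingerprint information database
    kept = []
    for text in texts:
        fp = fingerprint(text, bits)
        if any(hamming(fp, seen) <= max_diff for seen in library):
            continue  # near-duplicate: the text is deleted
        library.append(fp)  # otherwise its fingerprint joins the database
        kept.append(text)
    return kept


print(deduplicate(["same post", "same post"]))
```

Here `max_diff` stands in for the preset threshold. Scanning the whole library for every text is linear in the number of stored fingerprints, which is acceptable for a sketch but would likely need an index for large volumes of social text.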

Step 203: performing word segmentation and part-of-speech tagging on the deduplicated text, where the part of speech is either a first part of speech that meets the extraction rules or a second part of speech that does not meet the extraction rules;

Step 204: counting the word frequency of each word and the number of texts corresponding to the word;

Step 205: establishing a word index table and a text index table using the part of speech of each word, its word frequency, and the number of texts corresponding to the word;

First, a word index number is assigned to each distinct word, and the word index number and the features of the word corresponding to that index number are saved in the word index table, as shown in Fig. 3;

The features of a word include: the word frequency of the word, the number of texts corresponding to the word, the part of speech, and the weight. The weight is the value obtained in a subsequent step.

Second, a text index number is assigned to each deduplicated text, and, following the positional order of the words in the deduplicated text, the text index number of the deduplicated text and the word index numbers of the words in that text are saved in the text index table, as shown in Fig. 4.
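Steps 204 and 205 can be illustrated with two dictionaries that play the roles of the word index table and the text index table. This is a sketch under stated assumptions: the input is taken to be already segmented and tagged as (word, part-of-speech) pairs, index numbers start at 0, and the function name `build_indexes` and the field names are hypothetical, since Figs. 3 and 4 are not reproduced in this text.

```python
def build_indexes(tagged_texts):
    """Build a word index table and a text index table.

    tagged_texts: list of deduplicated texts, each a list of (word, pos)
    pairs in their original order.
    """
    word_ids = {}    # word -> word index number
    word_table = {}  # word index number -> features of the word
    text_table = {}  # text index number -> word index numbers, in order

    for text_id, tokens in enumerate(tagged_texts):
        ids = []
        for word, pos in tokens:
            if word not in word_ids:
                word_ids[word] = len(word_ids)
                word_table[word_ids[word]] = {
                    "word": word,
                    "tf": 0,         # word frequency over all texts
                    "df": 0,         # number of texts containing the word
                    "pos": pos,      # part of speech
                    "weight": None,  # filled in by the later weight step
                }
            wid = word_ids[word]
            word_table[wid]["tf"] += 1
            ids.append(wid)
        for wid in set(ids):  # each text counts at most once towards df
            word_table[wid]["df"] += 1
        text_table[text_id] = ids
    return word_table, text_table


texts = [
    [("cup", "nz"), ("final", "n"), ("cup", "nz")],
    [("cup", "nz"), ("win", "v")],
]
word_table, text_table = build_indexes(texts)
print(word_table[0], text_table)
```

Storing only word index numbers in the text index table keeps each text compact and lets later steps count occurrences by integer rather than by string.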

Step 206: calculating the word weight according to the word frequency and the number of texts corresponding to the word;

A user attention dictionary is established according to the different backgrounds of users.

For example, finance-related, sports-related, entertainment-related, and so on.

The word weight is calculated from the user attention dictionary, the word frequency, and the number of texts corresponding to the word, using the following formula:

$weight(term) = b(term) \cdot a(term) \cdot tf(term) \cdot \log \frac{|d|}{1 + df(term)}$

where weight(term) is the word weight, b(term) is an empirical correction value based on the user attention dictionary, a(term) is an empirical correction value based on the part-of-speech judgment, tf(term) is the word frequency of the word, df(term) is the number of texts corresponding to the word, and |d| is the total number of texts.

The value of a(term) is:

where nr denotes a person name and nt denotes an organization name;

The value of b(term) is: 1.5 when the word belongs to the user attention dictionary, and 1 when it does not.
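The weight calculation of step 206 can be sketched as a small function. The values 1.5 and 1 for b(term) come from the text; the values of a(term) are not reproduced in this text, so the sketch takes `a` as a caller-supplied parameter rather than guessing them. The function name `word_weight`, its parameters, and the use of the natural logarithm are assumptions.

```python
import math


def word_weight(tf, df, total_texts, a=1.0, in_attention_dict=False):
    """weight = b * a * tf * log(|d| / (1 + df)).

    a: part-of-speech correction value, supplied by the caller because its
       values are not given in this text.
    The base of the logarithm is not stated; natural log is assumed here.
    """
    b = 1.5 if in_attention_dict else 1.0
    return b * a * tf * math.log(total_texts / (1 + df))


plain = word_weight(tf=10, df=4, total_texts=100)
boosted = word_weight(tf=10, df=4, total_texts=100, in_attention_dict=True)
print(plain, boosted)
```

One consequence of the 1 + df(term) denominator is that a word appearing in every text gets a slightly negative weight, because |d| / (1 + |d|) is below 1; such words therefore rank below all others.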

Step 207: sorting the words by word weight in descending order, and selecting, from the words whose part of speech is the first part of speech, a first preset number of words with larger word weights as candidate keywords;

The first part of speech includes: /a adjective, /v verb, /j abbreviation, /ns place name, /nr person name, /nt organization name, /nz proper noun.

The second part of speech covers all other parts of speech that do not belong to the first part of speech.

Step 208: extracting, from the candidate keywords, a second preset number of candidate keywords that appear more frequently in the text to be extracted as keywords.

This step can be performed using the text index table, that is, a preset number of candidate keywords whose word index numbers appear more frequently in the text index table are extracted as keywords.

As shown in Fig. 4, the word with word index number 7 appears frequently across multiple texts; if that word is a candidate keyword, it is taken as a keyword.
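Steps 207 and 208 can be sketched together as a two-stage selection. The part-of-speech tags are the ones listed for the first part of speech; the table layouts, the function name `select_keywords`, and the sample data are hypothetical and chosen only so that the example is self-contained.

```python
from collections import Counter

# Tags listed in the text as the first part of speech.
FIRST_POS = {"a", "v", "j", "ns", "nr", "nt", "nz"}


def select_keywords(word_table, text_table, n_candidates, n_keywords):
    """Pick candidates by weight, then keywords by occurrence count."""
    # Step 207: keep first-part-of-speech words, sort by weight descending.
    eligible = [wid for wid, f in word_table.items() if f["pos"] in FIRST_POS]
    eligible.sort(key=lambda wid: word_table[wid]["weight"], reverse=True)
    candidates = set(eligible[:n_candidates])

    # Step 208: count how often each candidate's word index number appears
    # in the text index table and keep the most frequent ones.
    occurrences = Counter(
        wid for ids in text_table.values() for wid in ids if wid in candidates
    )
    return [word_table[wid]["word"]
            for wid, _ in occurrences.most_common(n_keywords)]


word_table = {
    0: {"word": "cup", "pos": "nz", "weight": 5.0},
    1: {"word": "the", "pos": "u", "weight": 9.0},
    2: {"word": "win", "pos": "v", "weight": 3.0},
    3: {"word": "Beijing", "pos": "ns", "weight": 1.0},
}
text_table = {0: [0, 1, 2, 0], 1: [1, 2, 2, 2], 2: [3, 3, 3, 3, 3]}
print(select_keywords(word_table, text_table, n_candidates=2, n_keywords=1))
```

In this sample, "the" has the highest weight but is excluded by its part of speech, and "Beijing" is the most frequent word but does not make the candidate cut, so the two stages filter on different criteria.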

The embodiment of the present invention is applicable to all social network platforms, such as Weibo and Space.

By providing a method for extracting keywords based on a social network, this embodiment does not rely on a large amount of historical search information but extracts keywords directly from the text to be extracted: noise filtering, text deduplication, word segmentation, and word weight calculation are performed on the text, and keywords are then extracted according to the word weights. Because a large amount of historical search information is not needed, the extraction speed is improved.

Embodiment 3

This embodiment of the present invention further provides a device for extracting keywords based on a social network. As shown in Fig. 5, the device includes:

a word segmentation module 501, configured to perform word segmentation on the text to be extracted and transmit the segmented words to a statistics module;

the statistics module 502, configured to count the word frequency of each word and the number of texts corresponding to the word, and transmit the statistical result to a calculation module;

the calculation module 503, configured to calculate the word weight according to the word frequency and the number of texts corresponding to the word, and transmit the calculation result to a selection module;

the selection module 504, configured to select a first preset number of words with larger word weights as candidate keywords, and transmit the selection result to an extraction module;

the extraction module 505, configured to extract, from the candidate keywords, a second preset number of candidate keywords that appear more frequently in the text to be extracted as keywords.

Further, the device also includes:

and/or,

Further, the noise filtering module, as shown in Fig. 6, includes:

a setting submodule 601, configured to set noise filtering rules and transmit the set noise filtering rules to a matching submodule;

a traversal submodule 602, configured to traverse the text to be extracted and transmit the traversal result to the matching submodule;

the matching submodule 603, configured to match the characters in the text to be extracted according to the set noise filtering rules, where the match succeeds if characters in the text to be extracted fall under the noise filtering rules, and to transmit the successfully matched characters to a first deletion submodule;

the first deletion submodule 604, configured to delete the successfully matched characters;

Further, the text deduplication module, as shown in Fig. 7, includes:

a mapping submodule 701, configured to map the currently filtered text into fingerprint information and transmit the mapping result to a comparison submodule;

the comparison submodule 702, configured to compare the currently filtered text with the fingerprint information database, to transmit the currently filtered text to a second deletion submodule if the number of differing fingerprints in the comparison result is less than or equal to the third preset value, and to transmit the currently filtered text to a saving submodule if that number is greater than the third preset value;

the second deletion submodule 703, configured to delete the currently filtered text;

the saving submodule 704, configured to add the fingerprint information of the currently filtered text to the fingerprint information database.

Further, the device also includes:

Further, the calculation module is configured to calculate the word weight of a word according to the following formula:

$weight(term) = b(term) \cdot a(term) \cdot tf(term) \cdot \log \frac{|d|}{1 + df(term)}$

By providing a device for extracting keywords based on a social network, the present invention does not rely on a large amount of historical search information but extracts keywords directly from the text to be extracted through the extraction module. The noise filtering module, the text deduplication module, the word segmentation module, and the calculation module respectively perform noise filtering, text deduplication, word segmentation, and word weight calculation on the text to be extracted, and keywords are then extracted according to the word weights. Because a large amount of historical search information is not needed, the extraction speed is improved.

The above embodiments are intended only to illustrate the present invention and not to limit it. Those of ordinary skill in the relevant technical field can make various changes and modifications without departing from the spirit and scope of the present invention. All equivalent technical solutions therefore also fall within the scope of the present invention, and the scope of patent protection of the present invention shall be defined by the claims.

Claims

1. based on a keyword extracting method for social networks, it is characterized in that, comprising:

Participle is carried out to text to be extracted, and the textual data that the word frequency of adding up word is corresponding with this word;

The textual data corresponding according to described word frequency and this word, calculate word weight, choose the first preset value word that word weight is larger alternatively keyword, from candidate keywords, extract the second preset value candidate keywords that the frequency of occurrences is larger in text to be extracted as keyword.

2. the method for claim 1, is characterized in that, described participle is carried out to text to be extracted before, to comprise further: noise filtering is carried out to text to be extracted, and by the text duplicate removal after filtering;

And/or,

Described the step that text to be extracted carries out participle to be comprised further: carry out part-of-speech tagging to participle, this part of speech is meet the first part of speech of extracting rule or do not meet the second part of speech of extracting rule; Choose then the first preset value word that word weight is larger alternatively keyword comprise: be the word of the first part of speech from part of speech, choose the first preset value word that word weight is larger alternatively keyword.
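The part-of-speech restriction in claim 2 might look like the sketch below. The tag set `ALLOWED_POS` and the function `candidates_by_pos` are hypothetical; the claim only states that some parts of speech satisfy the extraction rule and others do not, without naming them.

```python
# Hypothetical extraction rule: only nouns ("n") and verbs ("v") qualify
# as the "first part of speech". The actual rule is not given in the claim.
ALLOWED_POS = {"n", "v"}


def candidates_by_pos(weights, pos_tags, first_n, allowed=ALLOWED_POS):
    """Select the first_n heaviest words among those with an allowed part of speech."""
    eligible = [w for w in weights if pos_tags.get(w) in allowed]
    return sorted(eligible, key=lambda w: (-weights[w], w))[:first_n]
```

Filtering before ranking means a high-weight function word never occupies a candidate slot.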

3. The method of claim 2, characterized in that performing noise filtering on the text to be extracted specifically comprises:

traversing the text to be extracted according to a set noise filtering rule and matching the characters in the text to be extracted; if a character in the text to be extracted falls under the noise filtering rule, the match succeeds, and the successfully matched character is deleted;

and/or,

deduplicating the filtered text specifically comprises:

mapping the currently filtered text to fingerprint information, and comparing the currently filtered text against a fingerprint information library; if the number of differing fingerprints in the comparison result is less than or equal to a third preset value, deleting the currently filtered text; otherwise, adding the fingerprint information of the currently filtered text to the fingerprint information library.
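The filtering and deduplication steps of claim 3 can be illustrated as below. The claim does not name a fingerprinting scheme; this sketch assumes a SimHash-style 64-bit fingerprint compared by Hamming distance, which is one common reading of "number of differing fingerprints". The noise pattern `NOISE`, the functions `filter_noise`, `simhash`, `hamming`, and `deduplicate`, and the default threshold of 3 are all assumptions made for illustration.

```python
import hashlib
import re

# Hypothetical noise rule: URLs, @mentions, and hashtag markers.
NOISE = re.compile(r"https?://\S+|@\w+|#")


def filter_noise(text):
    """Delete every span that matches the noise filtering rule."""
    return NOISE.sub("", text)


def simhash(text, bits=64):
    """Assumed fingerprint: SimHash over character bigrams."""
    counts = [0] * bits
    for i in range(max(len(text) - 1, 1)):
        gram = text[i:i + 2]
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(texts, threshold=3):
    """Drop a text if its fingerprint is within `threshold` bits of a stored one."""
    library, kept = [], []
    for text in texts:
        fp = simhash(text)
        if any(hamming(fp, stored) <= threshold for stored in library):
            continue  # near-duplicate of an earlier text: delete it
        library.append(fp)  # otherwise add its fingerprint to the library
        kept.append(text)
    return kept
```

The linear scan over the library keeps the sketch short; a large corpus would need an index over the fingerprints.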

4. the method for claim 1, is characterized in that, the word frequency of described statistics word and textual data corresponding to this word, comprise further:

For each unduplicated word distributes glossarial index number, and the feature of the call number of word and the word corresponding with call number is saved in glossarial index table;

For the text after duplicate removal distributes text index number, according to the position relationship of word in the text after duplicate removal, the glossarial index number of word in the text after the text index of the text after duplicate removal number and this duplicate removal is saved in text index table;

Wherein, the feature of institute's predicate comprises: the word frequency of word, textual data, part of speech and word weight that this word is corresponding.
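One possible in-memory layout for the two tables of claim 4 is sketched below. The `WordFeatures` dataclass, the function `build_indexes`, and the use of plain dictionaries are assumptions; the claim specifies what is stored, not how.

```python
from dataclasses import dataclass


@dataclass
class WordFeatures:
    """Features kept per word in the (hypothetical) word index table."""
    word: str
    tf: int = 0          # word frequency
    df: int = 0          # number of texts containing the word
    pos: str = ""        # part of speech, filled in by a tagger
    weight: float = 0.0  # word weight, filled in after counting


def build_indexes(segmented_texts):
    """Build a word index table and a text index table from segmented texts."""
    word_ids = {}    # word -> word index number
    word_table = {}  # word index number -> WordFeatures
    text_table = {}  # text index number -> word index numbers in text order
    for text_id, words in enumerate(segmented_texts):
        ids = []
        for w in words:
            wid = word_ids.setdefault(w, len(word_ids))
            features = word_table.setdefault(wid, WordFeatures(w))
            features.tf += 1
            ids.append(wid)
        for wid in set(ids):
            word_table[wid].df += 1  # each text counts once per word
        text_table[text_id] = ids
    return word_table, text_table
```

Storing word index numbers in text order preserves the positional relationship that the claim mentions.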

5. the method for claim 1, is characterized in that, described calculating word weight specifically comprises:

Word weight according to following formulae discovery word:

weight (term) = b (term) * a (term) * tf (term) * \log \frac{| d |}{1 + df (term)},

Wherein, weight (term) is word weight, and b (term) and a (term) is experiential modification value, the word frequency that tf (term) is word, and df (term) is textual data corresponding to word, | d| is text sum.
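The formula of claim 5 translates directly into code. In this sketch the function name `word_weight` is hypothetical, the correction values a and b default to 1.0 because the claim gives no concrete values, and the natural logarithm is assumed because the claim does not state a base.

```python
import math


def word_weight(tf, df, total_texts, a=1.0, b=1.0):
    """weight = b * a * tf * log(|d| / (1 + df)); natural log assumed."""
    return b * a * tf * math.log(total_texts / (1 + df))
```

For example, a word occurring 3 times in 1 of 8 texts gets 3 · log(8/2) = 3 · log 4. Note that the weight becomes negative whenever 1 + df exceeds the total number of texts, for instance for a word that appears in every text.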

6. A keyword extraction device based on a social network, characterized by comprising:

a word segmentation module, configured to segment the text to be extracted and transmit the segmented words to a statistics module;

the statistics module, configured to count the frequency of each word and the number of texts corresponding to that word, and transmit the statistics to a calculation module;

the calculation module, configured to calculate a word weight according to the word frequency and the number of texts corresponding to the word, and transmit the calculation result to a selection module;

the selection module, configured to select a first preset number of words with the largest word weights as candidate keywords, and transmit the selection result to an extraction module;

the extraction module, configured to extract, from the candidate keywords, a second preset number of candidate keywords that occur most frequently in the text to be extracted as keywords.

7. The device of claim 6, characterized in that the device further comprises:

a noise filtering module, configured to perform noise filtering on the text to be extracted and transmit the filtered text to a text deduplication module;

the text deduplication module, configured to deduplicate the filtered text;

and/or,

a part-of-speech tagging module, configured to perform part-of-speech tagging on the segmented words, the part of speech being either a first part of speech that satisfies the extraction rule or a second part of speech that does not satisfy the extraction rule, and to transmit the tagging result to the selection module;

the selection module, further configured to select, from the words whose part of speech is the first part of speech, the first preset number of words with the largest word weights as candidate keywords.

8. The device of claim 7, characterized in that the noise filtering module comprises:

a setting submodule, configured to set a noise filtering rule and transmit the set noise filtering rule to a matching submodule;

a traversal submodule, configured to traverse the text to be extracted and transmit the traversal result to the matching submodule;

the matching submodule, configured to match the characters in the text to be extracted according to the set noise filtering rule; if a character in the text to be extracted falls under the noise filtering rule, the match succeeds, and the successfully matched character is transmitted to a first deletion submodule;

the first deletion submodule, configured to delete the successfully matched character;

and/or,

the text deduplication module comprises:

a mapping submodule, configured to map the currently filtered text to fingerprint information and transmit the mapping result to a comparison submodule;

the comparison submodule, configured to compare the currently filtered text against a fingerprint information library, to transmit to a second deletion submodule the currently filtered text for which the number of differing fingerprints in the comparison result is less than or equal to a third preset value, and to transmit to a saving submodule the currently filtered text for which the number of differing fingerprints in the comparison result is not less than the third preset value;

the second deletion submodule, configured to delete the currently filtered text;

the saving submodule, configured to add the fingerprint information of the currently filtered text to the fingerprint information library.

9. The device of claim 6, characterized in that the device further comprises:

an allocation module, configured to assign a word index number to each distinct word, assign a text index number to each deduplicated text, and transmit the allocation result to a saving module;

the saving module, configured to save the word index number and the features of the word corresponding to that index number in a word index table and, according to the positions of the words in the deduplicated text, to save the text index number of the deduplicated text and the word index numbers of the words in that text in a text index table;

10. The device of claim 6, characterized in that the calculation module is configured to calculate the word weight of a word according to the following formula:

$$weight(term) = b(term) \cdot a(term) \cdot tf(term) \cdot \log \frac{|d|}{1 + df(term)},$$