CN104572736A - Keyword extraction method and device based on social networking services
- Publication number: CN104572736A
- Application number: CN201310503897.5A
- Authority: CN (China)
- Prior art keywords: word, text, term, module, extracted
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Abstract
The present invention provides a method and device for extracting keywords based on a social network. The method includes: segmenting the text to be extracted into words, and counting each word's frequency and the number of texts containing the word; calculating a word weight from the word frequency and the number of texts containing the word; selecting a first preset number of words with the largest word weights as candidate keywords; and extracting, from the candidate keywords, a second preset number of candidate keywords that occur most frequently in the text to be extracted as the keywords. The invention performs noise filtering, text deduplication, word segmentation and word weight calculation on the text to be extracted, and then extracts keywords according to the word weights. Because a large amount of historical search information is not needed, extraction speed is improved.
Description
Technical field
The present invention relates to the technical field of keyword extraction, and in particular to a method and device for extracting keywords based on a social network.
Background
Keywords are the subject words that social network users collectively follow and use, and they can cover a large amount of information. Extracting keyword information from massive volumes of social text makes it possible both to learn promptly which topics users are collectively following and to help users keep up with current hot topics. Keyword extraction can therefore effectively address information overload and provide users with fast, convenient information services.
A common keyword extraction method is to obtain the historical search information of a large number of users and to extract keywords according to that search history and the subject words that appear frequently in web page content.
However, this method relies heavily on users' search information: a large amount of historical search information must be obtained before keywords can be extracted accurately, so extraction is slow.
Summary of the invention
(1) Technical problem to be solved
The technical problem solved by the present invention is how to avoid the need to obtain a large amount of historical search information when extracting keywords.
(2) Technical solution
To solve the above technical problem, the present invention provides a method for extracting keywords based on a social network, including:
segmenting the text to be extracted into words, and counting each word's frequency and the number of texts containing the word;
calculating a word weight according to the word frequency and the number of texts containing the word, selecting a first preset number of words with the largest word weights as candidate keywords, and extracting from the candidate keywords a second preset number of candidate keywords that occur most frequently in the text to be extracted as the keywords.
Preferably, before the text to be extracted is segmented, the method further includes: performing noise filtering on the text to be extracted, and deduplicating the filtered text;
and/or,
the step of segmenting the text to be extracted further includes: tagging each segmented word with a part of speech, the part of speech being either a first part of speech that satisfies the extraction rule or a second part of speech that does not; in that case, selecting the first preset number of words with the largest word weights as candidate keywords includes: selecting, from the words whose part of speech is the first part of speech, the first preset number of words with the largest word weights as candidate keywords.
Preferably, performing noise filtering on the text to be extracted specifically includes:
traversing the text to be extracted and matching its characters against the set noise filtering rules; if characters in the text to be extracted fall under the noise filtering rules, the match succeeds and the matched characters are deleted;
and/or,
deduplicating the filtered text specifically includes:
mapping the current filtered text to fingerprint information and comparing the current filtered text with a fingerprint information library; if the number of differing fingerprint positions in the comparison result is less than or equal to a third preset value, deleting the current filtered text; otherwise, adding the fingerprint information of the current filtered text to the fingerprint information library.
Preferably, counting each word's frequency and the number of texts containing the word further includes:
assigning a word index number to each distinct word, and saving the word index number together with the features of the corresponding word in a word index table;
assigning a text index number to each deduplicated text and, following the positions of the words in the deduplicated text, saving the text index number of the deduplicated text together with the word index numbers of its words in a text index table;
wherein the features of a word include: the word's frequency, the number of texts containing the word, its part of speech and its word weight.
Preferably, calculating the word weight specifically includes:
calculating the word weight of a word according to the following formula:
where weight(term) is the word weight, b(term) and a(term) are empirical correction values, tf(term) is the word's frequency, df(term) is the number of texts containing the word, and |d| is the total number of texts.
To solve the above technical problem, the present invention also provides a device for extracting keywords based on a social network, including:
a word segmentation module, configured to segment the text to be extracted into words and transmit the segmented words to a statistics module;
the statistics module, configured to count each word's frequency and the number of texts containing the word, and transmit the statistics to a calculation module;
the calculation module, configured to calculate the word weight according to the word frequency and the number of texts containing the word, and transmit the calculation result to a selection module;
the selection module, configured to select a first preset number of words with the largest word weights as candidate keywords, and transmit the selection result to an extraction module;
the extraction module, configured to extract from the candidate keywords a second preset number of candidate keywords that occur most frequently in the text to be extracted as the keywords.
Preferably, the device further includes:
a noise filtering module, configured to perform noise filtering on the text to be extracted and transmit the filtered text to a text deduplication module;
the text deduplication module, configured to deduplicate the filtered text;
and/or,
a part-of-speech tagging module, configured to tag each segmented word with a part of speech, the part of speech being either a first part of speech that satisfies the extraction rule or a second part of speech that does not, and to transmit the tagging result to the selection module;
the selection module being further configured to select, from the words whose part of speech is the first part of speech, the first preset number of words with the largest word weights as candidate keywords.
Preferably, the noise filtering module includes:
a setting submodule, configured to set the noise filtering rules and transmit them to a matching submodule;
a traversal submodule, configured to traverse the text to be extracted and transmit the traversal result to the matching submodule;
the matching submodule, configured to match the characters in the text to be extracted against the set noise filtering rules, where the match succeeds if characters in the text fall under the noise filtering rules, and to transmit the matched characters to a first deletion submodule;
the first deletion submodule, configured to delete the matched characters;
and/or,
the text deduplication module includes:
a mapping submodule, configured to map the current filtered text to fingerprint information and transmit the mapping result to a comparison submodule;
the comparison submodule, configured to compare the current filtered text with the fingerprint information library, to transmit the current filtered text to a second deletion submodule when the number of differing fingerprint positions in the comparison result is less than or equal to the third preset value, and to transmit it to a saving submodule when that number is greater than the third preset value;
the second deletion submodule, configured to delete the current filtered text;
the saving submodule, configured to add the fingerprint information of the current filtered text to the fingerprint information library.
Preferably, the device further includes:
an allocation module, configured to assign a word index number to each distinct word and a text index number to each deduplicated text, and to transmit the assignment result to a saving module;
the saving module, configured to save each word index number together with the features of the corresponding word in the word index table and, following the positions of the words in the deduplicated text, to save the text index number of the deduplicated text together with the word index numbers of its words in the text index table;
wherein the features of a word include: the word's frequency, the number of texts containing the word, its part of speech and its word weight.
Preferably, the calculation module is configured to calculate the word weight of a word according to the following formula:
where weight(term) is the word weight, b(term) and a(term) are empirical correction values, tf(term) is the word's frequency, df(term) is the number of texts containing the word, and |d| is the total number of texts.
(3) Beneficial effects
By providing a method and device for extracting keywords based on a social network, the present invention extracts keywords directly from the text to be extracted instead of relying on a large amount of historical search information. The text to be extracted is segmented into words, word weights are calculated, and keywords are extracted according to those weights. Because a large amount of historical search information is not needed, extraction speed is improved.
Description of the drawings
Fig. 1 is a flow chart of the method provided by Embodiment 1 of the present invention;
Fig. 2 is a flow chart of the method provided by Embodiment 2 of the present invention;
Fig. 3 is a schematic diagram of the word index table provided by Embodiment 2 of the present invention;
Fig. 4 is a schematic diagram of the text index table provided by Embodiment 2 of the present invention;
Fig. 5 is a schematic structural diagram of the device provided by Embodiment 3 of the present invention;
Fig. 6 is a schematic structural diagram of the noise filtering module provided by Embodiment 3 of the present invention;
Fig. 7 is a schematic structural diagram of the text deduplication module provided by Embodiment 3 of the present invention.
Detailed description
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Embodiment 1
To solve the prior-art problem that extracting keywords requires a large amount of historical data, this embodiment of the present invention provides a method for extracting keywords based on a social network. As shown in Fig. 1, the method includes:
Step 101: segment the text to be extracted into words, and count each word's frequency and the number of texts containing the word;
Step 102: calculate a word weight according to the word frequency and the number of texts containing the word, select a first preset number of words with the largest word weights as candidate keywords, and extract from the candidate keywords a second preset number of candidate keywords that occur most frequently in the text to be extracted as the keywords.
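The counting in step 101 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes the texts have already been segmented into word lists, it treats the word frequency as a corpus-wide count, and the function name `count_tf_df` is hypothetical.

```python
from collections import Counter


def count_tf_df(segmented_texts):
    """Count word frequency (tf) and the number of texts containing each word (df).

    segmented_texts: list of texts, each already segmented into a list of words
    (the segmenter itself is outside the scope of this sketch).
    """
    tf = Counter()  # total occurrences of each word over all texts
    df = Counter()  # number of texts in which each word appears at least once
    for words in segmented_texts:
        tf.update(words)
        df.update(set(words))  # a set counts each word once per text
    return tf, df


if __name__ == "__main__":
    texts = [["stock", "market", "stock"], ["market", "rally"], ["stock", "news"]]
    tf, df = count_tf_df(texts)
    print(tf["stock"], df["stock"])
```

Keeping tf and df in separate counters mirrors the two statistics that step 102 consumes when it calculates the word weight.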
This embodiment of the present invention extracts keywords directly from the text to be extracted instead of relying on a large amount of historical search information. The text to be extracted is segmented into words, word weights are calculated, and keywords are extracted according to those weights. Because a large amount of historical search information is not needed, extraction speed is improved.
In this embodiment, noise filtering and deduplication are applied to the text to be extracted in advance, which improves the fairness of keyword extraction. According to the tagged parts of speech, the first preset number of words with the largest word weights are selected as candidate keywords from the words whose part of speech is the first part of speech, that is, the part of speech that satisfies the extraction rule. This improves the accuracy of keyword extraction.
In this embodiment, noise filtering rules are set, the text to be extracted is traversed, and its characters are matched against the rules; if characters in the text fall under the noise filtering rules, the match succeeds and the matched characters are deleted. Noise filtering therefore improves both processing efficiency and the accuracy of keyword extraction. Because the noise-filtered texts still contain some duplicates, deduplication leaves only distinct texts, which reduces the complexity of the text and improves the precision of keyword extraction.
In this embodiment, building a word index table and a text index table makes the subsequent keyword extraction more convenient.
Embodiment 2
To solve the problems of the prior art, the second embodiment of the present invention provides a method for extracting keywords based on a social network. As shown in Fig. 2, the method includes:
Step 201: perform noise filtering on the text to be extracted;
In this embodiment, the text to be extracted contains a large amount of invalid information, which both reduces processing efficiency and degrades the quality of keyword extraction. The noise is therefore filtered as follows.
First, set the noise filtering rules;
In this embodiment, the noise filtering rules cover: a. emoticon noise (generally in the form "[text]"); b. "html tag" noise; c. "username" noise; d. "//username" noise.
Second, traverse the text to be extracted and match its characters against the above noise filtering rules;
Finally, if characters in the text to be extracted fall under one of the above noise filtering rules, the match succeeds and the matched characters are deleted.
The above steps yield the filtered text; using the filtered text improves processing efficiency and the accuracy of keyword extraction.
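The four rule types of step 201 can be approximated with regular expressions, as in the Python sketch below. The concrete patterns are assumptions: the description names the rule categories but gives no patterns, and the use of "@" to mark a username follows common microblog conventions rather than anything stated in the text. The names `NOISE_PATTERNS` and `filter_noise` are hypothetical.

```python
import re

# Assumed patterns for the four rule categories; none of them come from the source.
NOISE_PATTERNS = [
    re.compile(r"<[^>]+>"),            # b. html tags such as <b> or </a>
    re.compile(r"\[[^\[\]]{1,8}\]"),   # a. emoticons written as [text]
    re.compile(r"//\s*@[\w\-]+:?"),    # d. forwarding marker "//@username:"
    re.compile(r"@[\w\-]+"),           # c. plain "@username" mentions
]


def filter_noise(text):
    """Delete every span that matches one of the noise rules."""
    for pattern in NOISE_PATTERNS:
        text = pattern.sub("", text)
    # Collapse the whitespace left behind by the deletions.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(filter_noise("Great game [smile] <b>wow</b> //@alice: nice @bob"))
```

The forwarding pattern is applied before the plain mention pattern so that "//@username:" is removed as a unit rather than leaving a stray "//" behind.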
Step 202: deduplicate the filtered text;
Because texts forward one another, the filtered texts contain a large number of duplicates. To reduce the unfairness that duplicated content introduces into the word weight calculation, the filtered texts need to be deduplicated.
The deduplication method includes: mapping the current filtered text to fingerprint information and comparing the current filtered text with a fingerprint information library; if the number of differing fingerprint positions in the comparison result is less than or equal to a preset value, deleting the current filtered text; otherwise, adding the fingerprint information of the current filtered text to the fingerprint information library.
For example, the current filtered text is mapped to a 6-digit fingerprint; if the number of differing digits in the comparison result is less than or equal to 3, the text is deleted; otherwise, the fingerprint information of the current filtered text is added to the fingerprint information library.
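The fingerprint comparison of step 202 can be sketched as below. The description does not say how a text is mapped to a fingerprint, so this Python example substitutes a SimHash-style 64-bit fingerprint over character bigrams and measures the difference as a Hamming distance; the hashing scheme, the 64-bit width and all function names are assumptions made for illustration. Only the "delete if the difference is at most the threshold, otherwise add to the library" logic follows the text.

```python
import hashlib

BITS = 64  # assumed fingerprint width; the source example uses a much shorter fingerprint


def fingerprint(text, bits=BITS):
    """SimHash-style fingerprint over character bigrams (an assumed mapping)."""
    votes = [0] * bits
    grams = [text[i:i + 2] for i in range(max(len(text) - 1, 1))]
    for gram in grams:
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(texts, max_diff=3):
    """Keep a text only if no stored fingerprint is within max_diff positions of its own."""
    library, kept = [], []
    for text in texts:
        fp = fingerprint(text)
        if any(hamming(fp, seen) <= max_diff for seen in library):
            continue  # near-duplicate: delete the current text
        library.append(fp)  # otherwise add its fingerprint to the library
        kept.append(text)
    return kept
```

The linear scan over the library keeps the sketch short; a production system handling massive social text would likely need an index over fingerprints instead.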
Step 203: segment the deduplicated text into words and tag each word with a part of speech, the part of speech being either a first part of speech that satisfies the extraction rule or a second part of speech that does not;
Step 204: count each word's frequency and the number of texts containing the word;
Step 205: build a word index table and a text index table from each word's part of speech, its frequency and the number of texts containing the word;
First, assign a word index number to each distinct word, and save the word index number together with the features of the corresponding word in the word index table, as shown in Fig. 3;
The features of a word include: the word's frequency, the number of texts containing the word, its part of speech and its weight. The weight is the value obtained in a subsequent step.
Second, assign a text index number to each deduplicated text and, following the positions of the words in the deduplicated text, save the text index number of the deduplicated text together with the word index numbers of its words in the text index table, as shown in Fig. 4.
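The two tables of step 205 can be sketched with plain dictionaries. This is an illustrative layout only: the field names, the zero-based numbering and the function name `build_indexes` are assumptions, and the part of speech and weight are left empty because they are filled in by other steps.

```python
def build_indexes(segmented_texts):
    """Build a word index table and a text index table from segmented texts.

    Returns (word_ids, word_table, text_table):
      word_ids   maps each distinct word to its word index number,
      word_table maps a word index number to the word's features,
      text_table maps a text index number to the word index numbers of the
                 text's words, in their original order.
    """
    word_ids, word_table, text_table = {}, {}, {}
    for text_id, words in enumerate(segmented_texts):
        ids = []
        for word in words:
            if word not in word_ids:
                word_ids[word] = len(word_ids)  # next unused index number
                word_table[word_ids[word]] = {
                    "word": word, "tf": 0, "df": 0, "pos": None, "weight": None,
                }
            wid = word_ids[word]
            word_table[wid]["tf"] += 1
            ids.append(wid)
        for wid in set(ids):
            word_table[wid]["df"] += 1  # once per text containing the word
        text_table[text_id] = ids
    return word_ids, word_table, text_table
```

Storing word index numbers instead of the words themselves in the text index table is what later allows keyword frequency to be read off by counting index numbers.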
Step 206: calculate the word weight according to the word frequency and the number of texts containing the word;
A user focus dictionary is built according to users' different backgrounds.
For example, finance-related, sports-related or entertainment-related.
The word weight is calculated from the user focus dictionary, the word frequency, the number of texts containing the word, and the following formula:
where weight(term) is the word weight, b(term) is an empirical correction value based on the user focus dictionary, a(term) is an empirical correction value based on the part of speech, tf(term) is the word's frequency, df(term) is the number of texts containing the word, and |d| is the total number of texts.
The value of a(term) is:
where nr denotes a person name and nt denotes an organization name;
The value of b(term) is: 1.5 when the word belongs to the user focus dictionary, and 1 when it does not.
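The weight formula and the values of a(term) are not reproduced in the text above, so the Python sketch below assumes a TF-IDF-style form, weight = a(term) × b(term) × tf(term) × log(|d| / df(term)). That form, the boost of 1.5 applied to nr and nt tags, and the function names are illustrative assumptions; only the b(term) values (1.5 for words in the user focus dictionary, 1 otherwise) come from the description.

```python
import math


def a_term(pos, boost=1.5):
    """Part-of-speech correction. The boost for nr/nt is a hypothetical value."""
    return boost if pos in ("nr", "nt") else 1.0


def b_term(word, focus_dictionary):
    """Dictionary correction: 1.5 if the word is in the user focus dictionary, else 1."""
    return 1.5 if word in focus_dictionary else 1.0


def weight(word, pos, tf, df, total_texts, focus_dictionary):
    """Assumed TF-IDF-style weight; the source formula may differ."""
    idf = math.log(total_texts / df)
    return a_term(pos) * b_term(word, focus_dictionary) * tf * idf


if __name__ == "__main__":
    finance_words = {"stock", "market"}
    print(weight("stock", "n", tf=4, df=2, total_texts=8, focus_dictionary=finance_words))
```

Under this assumed form, a word that appears in every text receives a weight of zero, which is the usual TF-IDF behaviour and may or may not match the original formula.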
Step 207: sort the word weights in descending order and, from the words whose part of speech is the first part of speech, select the first preset number of words with the largest word weights as candidate keywords;
The first part of speech includes: /a adjective, /v verb, /j abbreviation, /ns place name, /nr person name, /nt organization name, and /nz proper noun.
The second part of speech covers every part of speech that does not belong to the first part of speech.
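Step 207 amounts to a filter followed by a top-N selection, which the Python sketch below illustrates. It assumes each entry of the word index table carries "pos" and "weight" fields and that tags are stored without the leading slash; the names `FIRST_POS` and `select_candidates` are hypothetical.

```python
# Tags listed in the description as the first part of speech (slash omitted).
FIRST_POS = {"a", "v", "j", "ns", "nr", "nt", "nz"}


def select_candidates(word_table, n):
    """Return the index numbers of the n heaviest words whose tag is in FIRST_POS."""
    eligible = [
        (wid, features)
        for wid, features in word_table.items()
        if features["pos"] in FIRST_POS
    ]
    # Sort by weight, largest first, then keep the first n entries.
    eligible.sort(key=lambda item: item[1]["weight"], reverse=True)
    return [wid for wid, _ in eligible[:n]]
```

Filtering before sorting means a heavily weighted word with a second part of speech, such as a particle, can never displace an eligible candidate.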
Step 208: extract from the candidate keywords a second preset number of candidate keywords that occur most frequently in the text to be extracted as the keywords.
This step can be carried out on the text index table: the preset number of candidate keywords whose word index numbers occur most frequently in the text index table are extracted as the keywords.
As shown in Fig. 4, the word with word index number 7 occurs frequently across multiple texts; if that word is a candidate keyword, it is taken as a keyword.
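Counting word index numbers in the text index table, as step 208 describes, can be sketched in Python as follows. The table layout (text index number mapped to a list of word index numbers) and the function name `extract_keywords` are assumptions made for this illustration.

```python
from collections import Counter


def extract_keywords(text_table, candidates, m):
    """Return the m candidate word index numbers that occur most often in text_table."""
    wanted = set(candidates)
    counts = Counter(
        wid
        for word_ids in text_table.values()
        for wid in word_ids
        if wid in wanted  # ignore words that are not candidate keywords
    )
    return [wid for wid, _ in counts.most_common(m)]


if __name__ == "__main__":
    text_table = {0: [7, 1, 7], 1: [7, 2], 2: [2, 3]}
    print(extract_keywords(text_table, candidates=[7, 2, 3], m=2))
```

In this toy table the word with index number 7 occurs three times across the texts, so it ranks first among the candidates, matching the situation described for Fig. 4.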
This embodiment of the present invention is applicable to all social network platforms, such as Weibo and Space.
By providing a method for extracting keywords based on a social network, this embodiment extracts keywords directly from the text to be extracted instead of relying on a large amount of historical search information. Noise filtering, text deduplication, word segmentation and word weight calculation are applied to the text to be extracted, and keywords are then extracted according to the word weights. Because a large amount of historical search information is not needed, extraction speed is improved.
Embodiment 3
This embodiment of the present invention also provides a device for extracting keywords based on a social network. As shown in Fig. 5, the device includes:
a word segmentation module 501, configured to segment the text to be extracted into words and transmit the segmented words to a statistics module;
the statistics module 502, configured to count each word's frequency and the number of texts containing the word, and transmit the statistics to a calculation module;
the calculation module 503, configured to calculate the word weight according to the word frequency and the number of texts containing the word, and transmit the calculation result to a selection module;
the selection module 504, configured to select a first preset number of words with the largest word weights as candidate keywords, and transmit the selection result to an extraction module;
the extraction module 505, configured to extract from the candidate keywords a second preset number of candidate keywords that occur most frequently in the text to be extracted as the keywords.
Further, the device also includes:
a noise filtering module, configured to perform noise filtering on the text to be extracted and transmit the filtered text to a text deduplication module;
the text deduplication module, configured to deduplicate the filtered text;
and/or,
a part-of-speech tagging module, configured to tag each segmented word with a part of speech, the part of speech being either a first part of speech that satisfies the extraction rule or a second part of speech that does not, and to transmit the tagging result to the selection module;
the selection module being further configured to select, from the words whose part of speech is the first part of speech, the first preset number of words with the largest word weights as candidate keywords.
Further, as shown in Figure 6, the noise filtering module includes:
a setting submodule 601, configured to set noise filtering rules and to transmit the set rules to the matching submodule;
a traversal submodule 602, configured to traverse the text to be extracted and to transmit the traversal result to the matching submodule;
the matching submodule 603, configured to match the characters in the text to be extracted against the set noise filtering rules, where a character that falls under a noise filtering rule counts as a successful match, and to transmit the matched characters to the first deletion submodule;
the first deletion submodule 604, configured to delete the matched characters.
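The set, traverse, match, and delete steps above can be sketched with regular expressions. The concrete rules below (URLs, user mentions, bracketed emoticon codes) are hypothetical examples of social-network noise; the source does not list specific rules, and the whitespace cleanup at the end is an added convenience.

```python
import re

# Setting submodule: hypothetical noise rules, not taken from the source.
NOISE_RULES = [
    re.compile(r"https?://\S+"),    # URLs
    re.compile(r"@\w+"),            # user mentions
    re.compile(r"\[[^\[\]]*\]"),    # bracketed emoticon codes such as [smile]
]

def filter_noise(text, rules=NOISE_RULES):
    """Scan the text against each rule and delete every matched span."""
    for rule in rules:
        # Matching and deletion submodules: matched characters are removed.
        text = rule.sub("", text)
    # Collapse the whitespace left behind by deletions (an added convenience).
    return " ".join(text.split())
```

Applying the rules in sequence keeps each rule independent, so a rule can be added or removed without touching the others.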
Further, as shown in Figure 7, the text deduplication module includes:
a mapping submodule 701, configured to map the currently filtered text into fingerprint information and to transmit the mapping result to the comparison submodule;
the comparison submodule 702, configured to compare the currently filtered text with the fingerprint information database, to transmit the currently filtered text to the second deletion submodule when the number of differing fingerprints in the comparison result is less than or equal to a third preset value, and to transmit the currently filtered text to the saving submodule when the number of differing fingerprints is not less than the third preset value;
the second deletion submodule 703, configured to delete the currently filtered text;
the saving submodule 704, configured to add the fingerprint information of the currently filtered text to the fingerprint information database.
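A rough sketch of the map, compare, delete-or-save flow follows. The source does not say how fingerprints are computed, so hashing character 3-grams with MD5 is purely an assumption, as are the function names and the default threshold. Where the number of differing fingerprints exactly equals the third preset value, this sketch treats the text as a duplicate.

```python
import hashlib

def fingerprints(text, n=3):
    """Mapping submodule sketch: map a text to a set of fingerprints,
    here the MD5 digests of its character n-grams (an assumed scheme)."""
    grams = {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}
    return {hashlib.md5(g.encode("utf-8")).hexdigest() for g in grams}

def deduplicate(texts, third_preset=2):
    """Keep a text only if it differs from every stored text by more than
    third_preset fingerprints; otherwise drop it as a duplicate."""
    library = []   # fingerprint information database: one set per kept text
    kept = []
    for text in texts:
        fp = fingerprints(text)
        # Comparison submodule: count the fingerprints that differ.
        if any(len(fp ^ stored) <= third_preset for stored in library):
            continue              # second deletion submodule: discard the text
        library.append(fp)        # saving submodule: store the new fingerprints
        kept.append(text)
    return kept
```

The linear scan over `library` is chosen for clarity only; a real system would need an index over the fingerprints to stay fast on large collections.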
Further, the device also includes:
an allocation module, configured to assign a word index number to each distinct word and a text index number to each deduplicated text, and to transmit the assignment result to the saving module;
the saving module, configured to save each word index number, together with the features of the corresponding word, in a word index table, and to save, according to the positional order of the words in the deduplicated text, the text index number of the deduplicated text together with the word index numbers of its words in a text index table;
wherein the features of a word include: the word frequency, the number of texts corresponding to the word, the part of speech, and the word weight.
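The two index tables described above can be sketched as plain dictionaries. The layout, the field names (`tf`, `df`, `pos`, `weight`), and the choice to number words and texts from zero are assumptions made for illustration; part of speech and word weight are left as placeholders to be filled in by the tagging and calculation modules.

```python
def build_indexes(segmented_texts):
    """Build a word index table and a text index table from deduplicated,
    segmented texts (each text is a list of words in order)."""
    word_ids = {}     # word -> word index number
    word_index = {}   # word index number -> features of that word
    text_index = {}   # text index number -> word index numbers, in text order
    for text_id, words in enumerate(segmented_texts):
        ids = []
        for word in words:
            if word not in word_ids:
                # Allocation module: each distinct word gets a new index number.
                word_ids[word] = len(word_ids)
                word_index[word_ids[word]] = {
                    "word": word, "tf": 0, "df": 0, "pos": None, "weight": None,
                }
            wid = word_ids[word]
            word_index[wid]["tf"] += 1   # word frequency over all texts
            ids.append(wid)
        for wid in set(ids):
            word_index[wid]["df"] += 1   # number of texts containing the word
        text_index[text_id] = ids        # preserves the positional order
    return word_index, text_index
```

Storing word index numbers rather than the words themselves lets the text index table keep word positions compactly while each word's features live in one place.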
Further, the calculation module is configured to calculate the word weight of a word according to the following formula:
[Formula not reproduced in the source text.]
where weight(term) is the word weight, b(term) and a(term) are empirical correction values, tf(term) is the word frequency, df(term) is the number of texts corresponding to the word, and |d| is the total number of texts.
By providing a keyword extraction device based on social networks, the present invention does not rely on a large amount of historical search information; instead, the extraction module extracts keywords directly from the text to be extracted. The noise filtering module, text deduplication module, word segmentation module, and calculation module respectively perform noise filtering, text deduplication, word segmentation, and word weight calculation on the text to be extracted, and keywords are then extracted according to the word weights. Because no large body of historical search information is needed, extraction speed is improved.
The above embodiments are intended only to illustrate the present invention, not to limit it. Those of ordinary skill in the relevant technical field may make various changes and modifications without departing from the spirit and scope of the present invention; all equivalent technical solutions therefore also fall within the scope of the present invention, and the scope of patent protection shall be defined by the claims.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310503897.5A CN104572736A (en) | 2013-10-23 | 2013-10-23 | Keyword extraction method and device based on social networking services |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104572736A (en) | 2015-04-29 |
Family
ID=53088820
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310503897.5A Pending CN104572736A (en) | 2013-10-23 | 2013-10-23 | Keyword extraction method and device based on social networking services |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104572736A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105608627A (en) * | 2016-02-01 | 2016-05-25 | 广东欧珀移动通信有限公司 | Information update method and device based on social network platform |
CN106097113A (en) * | 2016-06-21 | 2016-11-09 | 仲兆满 | A kind of social network user sound interest digging method |
CN106294396A (en) * | 2015-05-20 | 2017-01-04 | 北京大学 | Keyword expansion method and keyword expansion system |
CN107577671A (en) * | 2017-09-19 | 2018-01-12 | 中央民族大学 | A kind of key phrases extraction method based on multi-feature fusion |
CN108628875A (en) * | 2017-03-17 | 2018-10-09 | 腾讯科技(北京)有限公司 | A kind of extracting method of text label, device and server |
CN108984596A (en) * | 2018-06-01 | 2018-12-11 | 阿里巴巴集团控股有限公司 | A kind of keyword excavates and the method, device and equipment of risk feedback |
CN113495954A (en) * | 2020-03-20 | 2021-10-12 | 北京沃东天骏信息技术有限公司 | Text data determination method and device |
CN114782101A (en) * | 2022-04-28 | 2022-07-22 | 重庆锐云科技有限公司 | Customer transaction probability analysis method, system and equipment based on voice recognition |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090300007A1 (en) * | 2008-05-28 | 2009-12-03 | Takuya Hiraoka | Information processing apparatus, full text retrieval method, and computer-readable encoding medium recorded with a computer program thereof |
CN101872363A (en) * | 2010-06-24 | 2010-10-27 | 北京邮电大学 | A method of extracting keywords |
CN102033919A (en) * | 2010-12-07 | 2011-04-27 | 北京新媒传信科技有限公司 | Method and system for extracting text key words |
CN103164471A (en) * | 2011-12-15 | 2013-06-19 | 盛乐信息技术(上海)有限公司 | Recommendation method and system of video text labels |
CN103257957A (en) * | 2012-02-15 | 2013-08-21 | 深圳市腾讯计算机系统有限公司 | Chinese word segmentation based text similarity identifying method and device |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106294396A (en) * | 2015-05-20 | 2017-01-04 | 北京大学 | Keyword expansion method and keyword expansion system |
CN105608627A (en) * | 2016-02-01 | 2016-05-25 | 广东欧珀移动通信有限公司 | Information update method and device based on social network platform |
CN106097113A (en) * | 2016-06-21 | 2016-11-09 | 仲兆满 | A kind of social network user sound interest digging method |
CN106097113B (en) * | 2016-06-21 | 2020-11-27 | 江苏海洋大学 | A method for mining dynamic and static interests of social network users |
CN108628875A (en) * | 2017-03-17 | 2018-10-09 | 腾讯科技(北京)有限公司 | A kind of extracting method of text label, device and server |
CN107577671A (en) * | 2017-09-19 | 2018-01-12 | 中央民族大学 | A kind of key phrases extraction method based on multi-feature fusion |
CN107577671B (en) * | 2017-09-19 | 2020-09-22 | 中央民族大学 | A Keyword Extraction Method Based on Multi-feature Fusion |
CN108984596A (en) * | 2018-06-01 | 2018-12-11 | 阿里巴巴集团控股有限公司 | A kind of keyword excavates and the method, device and equipment of risk feedback |
CN113495954A (en) * | 2020-03-20 | 2021-10-12 | 北京沃东天骏信息技术有限公司 | Text data determination method and device |
CN114782101A (en) * | 2022-04-28 | 2022-07-22 | 重庆锐云科技有限公司 | Customer transaction probability analysis method, system and equipment based on voice recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20150429 |