
CN106776881A - A domain information recommendation system and method based on a microblog platform - Google Patents

A domain information recommendation system and method based on a microblog platform

Info

Publication number
CN106776881A
Authority
CN
China
Prior art keywords
user
keyword
word
microblog
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611075431.XA
Other languages
Chinese (zh)
Inventor
杨燕
王帅
徐良
徐罡
田申
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS
Priority to CN201611075431.XA
Publication of CN106776881A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a domain information recommendation system and method based on a microblog platform, comprising a data acquisition and preprocessing module, a domain keyword extraction module, a user-defined keyword expansion module, a linear merging module, a similarity calculation and personalized recommendation module, and a topic acquisition module. The invention designs and implements a domain information recommendation method tailored to the characteristics of microblog platforms. It combines keyword extraction with keyword expansion, which ensures that domain features are extracted while keeping the recommendation results dynamic. Experiments with the corresponding system on Sina Weibo verified the effectiveness of the method. The invention can support enterprise microblog marketing and improve its efficiency.

Description

A domain information recommendation system and method based on a microblog platform

Technical field

The invention relates to a domain microblog recommendation system and method for microblog platforms. It supports unsupervised domain feature extraction as well as user-defined keywords, and belongs to the field of computer technology.

Background

As the Internet has entered the post-Web 2.0 era, social features have become emblematic of its transformation. Social networking sites have sprung up and quickly come to dominate the Internet: as early as March 2010, the American social networking site Facebook overtook Google in traffic to become the most visited website in the United States. In China, emerging social media such as Sina Weibo have also risen rapidly. As of May 16, 2012, Sina Weibo had 300 million users, second in the domestic Internet market only to Tencent's QQ products, whose user base had accumulated over more than ten years. In addition, in the top ten strategic technologies for the IT industry in 2011 published by Gartner, the IT research and advisory firm, two were directly related to social networking: Social Communications and Collaboration, and Social Analytics. This was the only technology category to hold two places on the list, which shows both the attention that social products attract and their prospects.

At the same time, the spread of Internet use and users' growing engagement have made big data a focus of the IT industry in recent years. Turning big data into something of value, however, requires supporting technologies such as data mining, so interest in data mining and analysis has also risen sharply. This is especially true of enterprise-oriented data mining and analysis, which can bring direct benefits to businesses. Analysis results derived from large volumes of real user data are more reliable and persuasive than those of traditional analysis techniques, and because the analysis is performed in real time it adapts better to market changes and captures market opportunities more effectively.

Although a microblog platform carries information from many domains and reacts quickly to domain events, obtaining reasonably complete domain information from it is still difficult. The rise of microblog platforms and the rapid growth in users have brought information overload: as a user follows more accounts, more and more domain-irrelevant content appears in the subscribed feed; conversely, a user who follows only a few highly relevant accounts will not receive comprehensive domain information in time. In other words, data acquisition on a microblog platform cannot achieve high precision and high recall at once. The platform itself is also characterized by scattered topics and fragmented information. What is needed is a method that can identify the domain interests of an enterprise microblog user and extract and recommend microblog posts according to their domain relevance. Existing social media management and analysis software, however, mostly performs domain information extraction by simple text matching against user-defined keywords, which has serious defects. First, a handful of keywords cannot fully describe the domain information need. Second, the richness of language limits the effectiveness of simple text matching.

The traditional approach of modeling user interest through keyword extraction and the approach of representing user interest through user-defined keywords each have their own shortcomings.

The shortcomings of keyword extraction are as follows. It is an unsupervised extraction algorithm whose results the user cannot adjust dynamically, so it cannot satisfy a user's temporary, changing need for a particular domain topic. In addition, the algorithm cannot model domain interest when the user has few historical microblogs, the so-called cold-start problem.

The shortcomings of user-defined keywords are as follows. First, user-defined keywords can hardly be guaranteed to cover all the information in the domain. Second, in short microblog texts, judging similarity solely by whether a keyword appears means that many posts with the same meaning but different wording cannot be retrieved.

Related research also runs into problems when applied to this scenario. The rise of social networks has given data mining and data analysis technologies new and highly valuable application scenarios and new perspectives, but it also confronts traditional data mining techniques with new challenges.

In summary, applying the above methods in a real system on a microblog platform raises the following major problems:

(1) Keyword extraction and keyword expansion each have their own problems; neither strategy alone adequately meets the need for recommending domain-relevant information on a microblog platform.

(2) In related research, because information on social platforms is fragmented, one class of algorithms has to rely on an external corpus for training. This greatly reduces the practicality and portability of those algorithms.

(3) Another class of algorithms builds its model on a global corpus; such algorithms are computationally expensive and not generally applicable.

What is needed, therefore, is a recommendation method that provides personalized, keyword-based recommendation of domain information and helps users obtain domain-related information quickly and accurately. The method must resolve the conflict between precision and recall that enterprise users face when gathering domain-related information on a microblog platform. For practicality, the algorithm must also not depend on an external corpus and must compute from the user's local data. Providing a recommendation method with these properties is the focus of the invention.

Summary of the invention

The purpose of the invention is to overcome the shortcomings of the prior art and to provide a domain information recommendation system and method based on a microblog platform. Working from the user's historical microblogs, it models user interest by combining keyword extraction with keyword expansion, which ensures comprehensive identification of domain information while letting users adjust their domain interests dynamically as their needs change. It adopts the graph-based keyword extraction algorithm TextRank, which does not depend on any other corpus and keeps the extraction results from being affected by the Zipf's law phenomenon found in language models, and it proposes an optimized P-IOW algorithm for better keyword expansion. The method ensures that users' dynamic interest needs can be met in real time and greatly strengthens the expressive power of user-defined keywords. By linearly merging the results of keyword extraction and keyword expansion according to user-defined weights, it can recommend relevant domain microblogs and topics to each user and help users obtain domain-related information quickly and accurately.

Technical solution of the invention: a domain information recommendation system based on a microblog platform, comprising a data acquisition and preprocessing module, a domain keyword extraction module, a user-defined keyword expansion module, a linear merging module, a similarity calculation and personalized recommendation module, and a topic acquisition module, where:

Data acquisition and preprocessing module: obtains the user's relevant microblog data and preprocesses it. Preprocessing comprises stop-word filtering, word segmentation and part-of-speech tagging. The preprocessing result constitutes the user's historical microblog data and is passed to the domain keyword extraction module; if the user has defined domain interest keywords, the result is also passed to the user-defined keyword expansion module.

Domain keyword extraction module: working from the preprocessing result, keywords are extracted without supervision using TextRank for Weibo, a modification of the TextRank algorithm. The algorithm has two stages: construction of an undirected graph based on co-occurrence, and graph-based computation of node weights. In the construction stage, each segmented word that appears in the user's historical microblogs becomes a node. Whether two nodes are joined by an edge, and the weight of that edge, is determined by how often the two words co-occur in the same microblog: each time two words co-occur in one of the user's microblogs, the weight of the edge between their nodes is increased by 1, so the final edge weight equals the number of co-occurrences of the two words. In the weight computation stage, the weight of each node is computed iteratively until the change in node weights falls below a threshold. When the iteration ends, each node's weight is the importance of the word it represents; sorting all of the user's words by importance yields the keyword extraction result, which automatically identifies the characteristics of the user's domain.

User-defined keyword expansion module: computes the similarity between keywords from their co-occurrence, their distribution, and attribute information about the users they belong to, and takes highly relevant words as the expansion of the target keyword. The module accepts multiple user-defined keywords; the expansion word vectors produced for the individual keywords are summed linearly to give the final expansion vector. User-defined keyword expansion ensures that the user's dynamic interest needs can be met in real time and greatly strengthens the expressive power of user-defined keywords.

Linear merging module: once automatic domain keyword extraction and expansion of the user-defined keywords are both complete, the two result vectors are normalized by maximum-value normalization so that they map onto a common range. The two normalized vectors are then merged linearly, with user-defined weights for keyword extraction and keyword expansion. The module outputs a word vector that represents the user's final domain interest.

Similarity (relevance) calculation and personalized recommendation module: after the linear merging module has produced the keyword vector describing the user's domain interest, each candidate microblog is segmented and its word frequencies are counted to give a term-frequency vector. The user interest keyword vector, the candidate microblog's term-frequency vector and the IDF vector are then combined in a dot product, giving the relevance between the microblog and the user's interest, which is the microblog's domain relevance. Domain relevance is computed for each of the user's microblogs, and the microblogs are presented to the user in descending order of domain relevance, providing personalized domain microblog recommendation.

Topic acquisition module: trains an LDA model on the text of the domain microblogs recommended to the user and clusters terms into topics according to each topic's term distribution. The relevance between each topic's term set and the user's domain interest keywords from the linear merging module is computed to obtain topic importance, and topics are presented to the user in order of importance, completing topic discovery and recommendation.
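The graph construction stage of the domain keyword extraction module can be illustrated with a minimal sketch. It assumes each microblog has already been segmented into a list of words, and it counts a word pair at most once per microblog, which is one reading of "number of co-occurrences"; the function name `build_cooccurrence_graph` and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(posts):
    """Build an undirected weighted graph from segmented microblogs.

    posts: list of word lists, one list per microblog.
    Returns a Counter mapping frozenset({a, b}) to the number of
    microblogs in which words a and b both appear.
    """
    edges = Counter()
    for tokens in posts:
        # Assumption: a pair is counted once per microblog, however often
        # each word repeats inside that microblog.
        for a, b in combinations(sorted(set(tokens)), 2):
            edges[frozenset((a, b))] += 1
    return edges


posts = [
    ["phone", "camera", "battery"],
    ["phone", "battery"],
    ["camera", "lens"],
]
graph = build_cooccurrence_graph(posts)
```

Using `frozenset` keys keeps the graph undirected: the pair (phone, battery) and the pair (battery, phone) refer to the same edge.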

The data acquisition and preprocessing module is implemented as follows:

(1) After the user logs in to the microblog system, the user is first authenticated. Once authentication succeeds, the microblog platform credentials associated with that user are used automatically to interact with the microblog platform and confirm that the user's identity is valid on the platform;

(2) The user's relevant microblog data is retrieved and persisted in structured form in a local database so that it can be read at any time;

(3) The persisted microblog text is preprocessed in three parts: stop-word filtering, word segmentation and part-of-speech tagging. To suit the characteristics of microblog text, stop words are first filtered by pattern matching; Chinese word segmentation and part-of-speech tagging optimized for microblog text are then performed with the segmenter ICTCLAS5.0. Before both keyword extraction and keyword expansion, the segmented words are filtered by part of speech so that only nouns are kept. The preprocessing result is the user's historical microblog data.
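The preprocessing steps above can be sketched roughly as follows. The text does not list the patterns used for stop-word filtering, so the `NOISE` pattern (URLs, @mentions and hashtag markers) is purely an assumption, and segmentation itself is left to an external segmenter: `keep_nouns` simply takes (word, tag) pairs and treats tags beginning with `n` as nouns, which is also an assumption about the tag set.

```python
import re

# Hypothetical noise patterns for microblog text: URLs, @mentions, "#" markers.
NOISE = re.compile(r"https?://\S+|@\w+|#")


def strip_noise(text):
    """Remove pattern-matched noise and collapse the remaining whitespace."""
    return " ".join(NOISE.sub(" ", text).split())


def keep_nouns(tagged_tokens):
    """Keep only nouns from (word, pos_tag) pairs produced by a segmenter.

    Assumption: noun tags start with "n" (for example n, nr, ns, nz).
    """
    return [word for word, pos in tagged_tokens if pos.startswith("n")]


cleaned = strip_noise("check http://t.cn/abc @bob new phone")
nouns = keep_nouns([("手机", "n"), ("很", "d"), ("新浪", "nz"), ("买", "v")])
```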

In the domain keyword extraction module, the weight of each node is computed iteratively following the idea of the PageRank algorithm, using the following formula:

where $V_i$ is the i-th node, $TR(V_i)$ is the weight of node $V_i$, $w_{ij}$ is the weight of the edge between nodes $V_i$ and $V_j$, and $E(V_i)$ is the set of edges connected to $V_i$. $d$ is the damping factor of the iteration and is set to 0.85. The iteration may start from any initial values and continues until convergence; the convergence condition is that the absolute difference between the sum of the node weights in the current iteration and in the previous iteration is less than a specified value.
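As an illustration of the iteration, the sketch below uses the standard weighted TextRank update, $TR(V_i) = (1-d) + d \sum_{V_j} \frac{w_{ji}}{\sum_{V_k} w_{jk}} TR(V_j)$, which matches the symbols defined above but is assumed here rather than taken from the source. Under this update the total of all scores stays constant when every score starts at 1.0, so the sketch tests the summed per-node absolute change instead; that is one interpretation of the convergence condition. The names `textrank`, `tol` and `max_iter` are hypothetical.

```python
def textrank(edges, d=0.85, tol=1e-6, max_iter=200):
    """Weighted TextRank on an undirected graph.

    edges: dict mapping (a, b) word pairs to a positive edge weight.
    Returns a dict mapping each word to its score.
    """
    neighbors = {}
    for (a, b), w in edges.items():
        neighbors.setdefault(a, {})[b] = w
        neighbors.setdefault(b, {})[a] = w
    # Total edge weight attached to each node, used to share out its score.
    strength = {v: sum(nb.values()) for v, nb in neighbors.items()}
    scores = {v: 1.0 for v in neighbors}
    for _ in range(max_iter):
        new = {
            v: (1 - d) + d * sum(w / strength[u] * scores[u] for u, w in nb.items())
            for v, nb in neighbors.items()
        }
        # Summed absolute change across nodes (interpretation, see lead-in).
        change = sum(abs(new[v] - scores[v]) for v in new)
        scores = new
        if change < tol:
            break
    return scores


scores = textrank({("a", "b"): 2, ("b", "c"): 1})
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph the word `b` is connected to both other words, so it ends up ranked first.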

In the user-defined keyword expansion module, an improved P-IOW algorithm is used to compute the similarity between keywords, as follows:

For a given user-defined keyword s, the domain relevance of a word t with respect to s is computed as follows:

where:

where s is the user-defined keyword; wf(t) is the number of microblogs that contain the word t; $t_M$ is the word contained in the largest number of microblogs in the domain corpus; wf(t∧s) is the number of microblogs that contain both t and s; N is the total number of the user's microblogs; and $s_p$ is a smoothing coefficient in the range (0, 1) that prevents division by zero when a count is zero. The formulas also use the number of microblogs that contain $t_M$ and the number of microblogs that do not contain s. It follows from the formulas that, before formula (1) is multiplied by the down-weighting factor, P-IOLogW ranges from $\log(s_p)$ to $\log(1/s_p)$. The larger the resulting P-IOW value, the higher the domain similarity between the word t and the user-defined keyword s. The user-defined keyword itself is assigned the upper bound of the P-IOLogW range, $\log(1/s_p)$.

In the user-defined keyword expansion module, the expansion word vectors of the keywords are summed linearly to obtain the final expansion vector. The procedure is as follows:

(1) For each keyword in the set of user-defined keywords, first compute the weights of all of its expansion words using P-IOW;

(2) Map the weight vector of the keyword's expansion words onto the space of segmented words of the domain corpus to form an expansion word vector;

(3) Linearly accumulate all the expansion word vectors to obtain the final expansion vector, which is the output of the keyword expansion module.
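The three steps above amount to scoring every vocabulary word against each keyword and summing the resulting vectors. A minimal sketch follows; because the P-IOW formula itself is not reproduced in this text, the relevance function is passed in as a parameter, and the `toy_score` used in the example is a placeholder, not P-IOW. All names are hypothetical.

```python
def expand_keywords(keywords, vocabulary, score):
    """Sum per-keyword expansion vectors over a shared vocabulary.

    score(term, keyword) -> float stands in for the P-IOW relevance of
    `term` with respect to `keyword`.
    """
    total = {term: 0.0 for term in vocabulary}
    for keyword in keywords:
        for term in vocabulary:
            # Steps (1) and (2): weight of each vocabulary word for this keyword.
            # Step (3): accumulate into the final expansion vector.
            total[term] += score(term, keyword)
    return total


def toy_score(term, keyword):
    # Placeholder scorer for the example only.
    return 1.0 if term == keyword else 0.5


expansion = expand_keywords(["a", "b"], ["a", "b", "c"], toy_score)
```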

The linear merging module is implemented as follows:

(1) The vectors produced by keyword extraction and keyword expansion are each normalized by maximum-value normalization;

(2) Each component of a vector is normalized as follows:

$v_{normal} = v / v_{max}$

where $v$ is the initial value of a component of the vector and $v_{max}$ is the largest of all the vector's components. After maximum-value normalization all components lie in (0, 1] and are therefore non-zero. The vectors are then merged by linear weighting:

$V_{combine} = r \times V_{kw\text{-}extract} + (1 - r) \times V_{kw\text{-}expand}$

where $r$ is the user-defined merge weight, $V_{kw\text{-}extract}$ is the result vector of keyword extraction and $V_{kw\text{-}expand}$ is the result vector of keyword expansion. The merged result is the keyword vector $V_{combine}$, which describes the user's domain interest.
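The normalization and merge can be sketched in a few lines. The sketch represents each vector as a dict from word to weight and assumes all weights are positive; a word missing from one vector is treated as having weight 0 there, which is an assumption about how the two vectors are aligned. The function names are hypothetical.

```python
def max_normalize(vec):
    """Divide every component by the largest one (assumes positive weights)."""
    peak = max(vec.values())
    return {term: value / peak for term, value in vec.items()}


def linear_merge(extract, expand, r):
    """Compute r * extract + (1 - r) * expand on max-normalized vectors."""
    a, b = max_normalize(extract), max_normalize(expand)
    return {
        term: r * a.get(term, 0.0) + (1 - r) * b.get(term, 0.0)
        for term in set(a) | set(b)
    }


merged = linear_merge({"x": 2.0, "y": 1.0}, {"y": 4.0, "z": 2.0}, r=0.5)
```

With `r` close to 1 the result follows the automatically extracted keywords; with `r` close to 0 it follows the expansion of the user-defined keywords.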

In the similarity calculation and personalized recommendation module, domain relevance is computed with the following formula:

where $V_{combine}$ is the keyword vector of the user's domain interest; $W$ is the term-frequency vector of the candidate microblog T; $V_{IDF}$ is the IDF vector of the segmented words; $L$ is the number of dimensions of the vector space formed by the segmented words of the user's historical microblogs; the formula uses the components of $V_{combine}$ and $W$ that correspond to $t_i$; and $IDF(t_i)$ is the IDF value of $t_i$.
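One way to read the description is as a sum, over all words, of the product of the interest weight, the term frequency in the candidate microblog, and the IDF value. The sketch below implements that reading; it is an assumption, since the formula itself is not reproduced here, and the names `domain_relevance` and `rank_posts` are hypothetical.

```python
from collections import Counter


def domain_relevance(tokens, interest, idf):
    """Sum over words of interest weight * term frequency * IDF."""
    tf = Counter(tokens)
    return sum(
        interest.get(term, 0.0) * count * idf.get(term, 0.0)
        for term, count in tf.items()
    )


def rank_posts(posts, interest, idf):
    """Order segmented microblogs from most to least domain-relevant."""
    return sorted(
        posts,
        key=lambda tokens: domain_relevance(tokens, interest, idf),
        reverse=True,
    )


interest = {"phone": 1.0, "camera": 0.5}
idf = {"phone": 1.0, "camera": 2.0, "cat": 3.0}
posts = [["cat"], ["camera"], ["phone", "phone", "cat"]]
ranking = rank_posts(posts, interest, idf)
```

Words outside the interest vector, such as `cat` in the example, contribute nothing to the score regardless of their IDF.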

A domain information recommendation method based on a microblog platform comprises six steps: data acquisition and preprocessing, domain keyword extraction, user-defined keyword expansion, linear merging, similarity calculation and personalized recommendation, and topic acquisition. It is carried out as follows:

(1) Obtain the user's relevant microblog data and preprocess it. Preprocessing comprises filtering stop words by pattern matching and performing word segmentation and part-of-speech tagging with the segmentation system ICTCLAS5.0. The preprocessing result is the user's historical microblog data.

(2) Extract domain keywords from the preprocessing result. Extraction is unsupervised and uses the TextRank for Weibo algorithm, the invention's modification of TextRank. The process has two stages: construction of an undirected graph based on co-occurrence, and graph-based computation of node weights. In the construction stage, the segmented words that appear in the user's historical microblogs are first converted into nodes; then, for every pair of words that appears in the segmentation result of each microblog, an edge is constructed whose weight is the number of times that word pair co-occurs in the historical microblogs;

In the graph-based weight computation stage, the weight of each node is computed iteratively following the idea of the PageRank algorithm until the change in node weights falls below a threshold;

(3) If the user has defined domain interest keywords, the preprocessing result is also passed to the keyword expansion module. Keyword expansion uses the P-IOW algorithm, an improvement on the P-IOLog algorithm. It computes the similarity between keywords from their co-occurrence, their distribution and the attributes of the users they belong to, and takes highly relevant words as the expansion of the target keyword. Multiple user-defined keywords are supported: the expansion word vectors produced for the individual keywords are summed linearly to give the final expansion vector;

(4) Normalize the two result vectors by maximum-value normalization so that they map onto a common range, then merge the two normalized vectors linearly according to user-defined weights for keyword extraction and keyword expansion. The merged result is used by the recommendation module for relevance comparison to produce a relevance score for each candidate microblog;

(5) Segment the candidate microblogs from the user's subscriptions and vectorize them by term frequency. Then, using the identified user interest, take the dot product of the user interest keyword vector, the candidate microblog's term-frequency vector and the IDF vector to compute domain relevance;

(6) Take the text of the domain microblogs recommended to the user as the input set, cluster terms using LDA to discover topics, compute the relevance between each topic's term set and the user's domain interest keywords from the linear merging module to establish topic importance, and recommend topics to the user according to that importance.
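The topic ranking in step (6) can be illustrated once an LDA model has produced per-topic term distributions. The text does not say how the relevance between a topic's terms and the interest keywords is computed, so the sketch below assumes a simple weighted sum of term probability times interest weight. The topic distributions in the example are made-up inputs, not the output of a trained model, and `rank_topics` is a hypothetical name.

```python
def rank_topics(topics, interest):
    """Rank topics by overlap with the user's domain interest vector.

    topics: dict mapping topic id to a dict of term -> probability,
            for example taken from a trained LDA model.
    interest: dict mapping term to interest weight.
    Assumption: importance is the sum of probability * interest weight.
    """
    importance = {
        topic_id: sum(p * interest.get(term, 0.0) for term, p in terms.items())
        for topic_id, terms in topics.items()
    }
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


topics = {
    "t1": {"phone": 0.6, "cat": 0.4},
    "t2": {"cat": 0.9, "dog": 0.1},
}
ranked_topics = rank_topics(topics, {"phone": 1.0})
```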

Compared with the prior art, the invention has the following advantages:

(1) With respect to the source of the domain corpus: keyword extraction and keyword expansion each have their own problems, and neither strategy alone adequately handles the recommendation of domain-relevant information on a microblog platform. The invention models the user's domain interest from the domain corpus by combining graph-based keyword extraction with user-defined keyword expansion based on co-occurrence information. This ensures that the user's dynamic interest needs can be met in real time and greatly strengthens the expressive power of user-defined keywords. It thereby accommodates both a comprehensive description of domain information and dynamic changes in user interest, and resolves the conflict between precision and recall that domain users face when gathering domain-related information on a microblog platform.

(2) For practicality, the algorithms of the invention do not depend on an external corpus and do not require analysis of a large-scale corpus. They compute from the user's local data, that is, the user's historical microblog data, so they respond faster and are more practical and portable.

(3) Topic modeling is additionally performed on the domain microblog texts recommended to the user. This filters out interference from domain-irrelevant topics, achieves more precise topic recommendation, and further improves the accuracy of domain information recommendation.

Description of the drawings

Fig. 1 is the architecture diagram of the present invention;

Fig. 2 is the overall system flowchart of the present invention;

Fig. 3 is the system sequence diagram of the present invention;

Fig. 4 is the system framework diagram of the present invention;

Fig. 5 is an excerpt of the main part of the UML diagram of the graph-based keyword extraction submodule of the present invention;

Fig. 6 is an excerpt of the main part of the UML diagram of the co-occurrence-based keyword expansion submodule of the present invention.

Detailed description

The present invention is described in detail below with reference to specific embodiments and the accompanying drawings.

As shown in Fig. 1, a domain information recommendation system based on a microblog platform according to the present invention comprises: a data acquisition and preprocessing module, a domain keyword extraction module, a user-defined keyword expansion module, a linear merging module, a similarity calculation and personalized recommendation module, and a topic acquisition module.

Data acquisition and preprocessing module: For enterprise microblog users on the Sina Weibo platform, the module uses the Sina Weibo open platform to fetch the historical microblog texts of the related microblog accounts in the several domains the user designates, and preprocesses them. Stop words are filtered by pattern matching, and the ICTCLAS5.0 word segmentation system performs word segmentation and part-of-speech tagging. The preprocessing result is the user's historical microblog data, which is passed to the domain keyword extraction module; if the user has defined domain interest keywords, it is also passed to the user-defined keyword expansion module.

Domain keyword extraction module: Based on the preprocessing result, the module first constructs an undirected graph from co-occurrence relations. Each segmented word appearing in the user's historical microblogs becomes a node, and an edge is created for every pair of words that appear together in the segmentation result of a microblog; the edge weight is the number of times the pair co-occurs across the historical microblogs. Graph-based node weights are then computed: following the idea of PageRank, the weight of each node is updated iteratively until the change in node weights falls below a threshold. After the iteration ends, each node's weight is the importance of the word it represents. Sorting all of the user's segmented words by importance yields the keyword extraction result, which automatically identifies the characteristics of the user's domain.

User-defined keyword expansion module: The module uses the improved P-IOW (Probabilistic Inside-Outside Log for Weibo) method, which computes the similarity between keywords from their co-occurrence, their distribution, and the attributes of the users they belong to, and takes highly related words as the expansion of the target keyword. The module accepts multiple user-defined keywords. The expansion word vectors produced for the individual keywords are summed linearly to obtain the final expansion vector. User-defined keyword expansion allows the user's dynamic interest needs to be met in real time and greatly strengthens the expressive power of user-defined keywords.

Linear merging module: Once automatic domain keyword extraction and expansion of the user-defined keywords are both complete, the two result vectors are normalized by maximum-value normalization so that they map onto a common value range. The two normalized vectors are then merged linearly; the user can set the weights given to keyword extraction and keyword expansion in the merge. The module outputs a word vector representing the user's final domain interest.

Relevance calculation and personalized recommendation module: After the keyword vector characterizing the user's domain interest has been generated, this module segments each candidate microblog and counts word frequencies to produce a term-frequency vector. It then takes the dot product of the user interest keyword vector, the term-frequency vector of the candidate microblog, and the IDF vector to obtain the relevance of that microblog to the user's interest. By computing the domain relevance of each microblog and sorting from high to low, the module delivers personalized domain microblog recommendations to the user.

Topic acquisition module: An LDA model is trained on the domain microblog texts recommended to the user, and terms are clustered into topics according to each topic's term distribution. The relevance between each topic's term set and the user's domain interest keywords from the linear merging module is computed to obtain topic importance, and topics are presented to the user in order of importance, completing topic discovery and recommendation.

As shown in Fig. 2, the data acquisition and preprocessing module is implemented as follows:

(1) After the user logs in to the system, user verification is performed first. Once it passes, the system automatically uses the Sina Weibo open platform OAuth2.0 credential associated with the user to interact with the open platform and verify that the user's identity is valid on the Sina Weibo platform. If the credential has not expired, verification is complete. If the credential does not exist or has expired, the system redirects to the open platform's OAuth2.0 authorization page, which asks the user for their Sina Weibo username and password. After the user enters the correct information, the open platform returns the updated credential to this system, which persists it. Within the credential's validity period, the user can then log in to the open platform, fetch microblog data, and perform related operations using only this system's username and password.

(2) The fetched microblog text is persisted in structured form in a local database so that the upper-layer analysis modules can read it at any time. For data updates, this module supports incremental updating: each update transfers only the user's newly added related microblogs, which improves the system's response time and saves as much network bandwidth as possible.
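The incremental update described in step (2) can be sketched as follows. This is a minimal illustration, not the system's actual storage code: the table schema, the function names, and the `fetch_since` callable (a stand-in for the platform request) are all hypothetical, and the sketch assumes that post ids increase over time.

```python
import sqlite3

def open_store(path=":memory:"):
    """Create the local table used to persist fetched posts (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "post_id INTEGER PRIMARY KEY, account TEXT, text TEXT)"
    )
    return conn

def latest_post_id(conn, account):
    """Return the largest stored post id for an account, or 0 if none is stored."""
    row = conn.execute(
        "SELECT COALESCE(MAX(post_id), 0) FROM posts WHERE account = ?",
        (account,),
    ).fetchone()
    return row[0]

def incremental_update(conn, account, fetch_since):
    """Persist only posts newer than those already stored.

    fetch_since(account, since_id) is a hypothetical callable standing in for
    the platform request; it returns (post_id, text) pairs with id > since_id.
    """
    since_id = latest_post_id(conn, account)
    new_posts = list(fetch_since(account, since_id))
    conn.executemany(
        "INSERT OR IGNORE INTO posts (post_id, account, text) VALUES (?, ?, ?)",
        [(post_id, account, text) for post_id, text in new_posts],
    )
    conn.commit()
    return len(new_posts)
```

Only the posts returned for the stored high-water mark are transferred, so a second call with no new posts moves no data.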

(3) The persisted microblog texts are then preprocessed in three parts: stop-word filtering, word segmentation, and part-of-speech tagging. Microblog stop words mainly take the following forms: topic tags in the "#topic#" format, directed notifications to a user in the "@username" format, emoticons in the "[emoticon text]" format, and URL links contained in the microblog. Given these characteristics of microblog text, these stop words are filtered out first by pattern matching. Chinese word segmentation and part-of-speech tagging optimized for the microblog scenario are then applied. Word segmentation is a text-processing step specific to a few languages such as Chinese, because Chinese, unlike most other languages, has no explicit word delimiters. In natural language processing, segmentation is a necessary step in converting text into a form a computer can process. It is all the more indispensable here because the present invention uses a vector space model, and the quality of the segmentation directly affects the quality of the algorithm's results. Part-of-speech tagging assigns the correct lexical tag to each word in a given sentence and is a very useful preprocessing step for subsequent natural language processing. This module uses the ICTCLAS5.0 segmenter for both word segmentation and part-of-speech tagging; because the imported new-word lexicon also carries part-of-speech information, importing it does not reduce tagging accuracy. A survey of domain-related microblogs on the Sina Weibo platform found that nouns characterize a user's domain interest fairly accurately, whereas words of other parts of speech, used as keywords, tend to introduce ambiguity and lower recommendation precision. The segmented words of the user's microblogs are therefore filtered by part of speech before both keyword extraction and keyword expansion, keeping only nouns. The preprocessing result is the user's historical microblog data.
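The pattern-matching filter and the noun filter in step (3) might look like the sketch below. The regular expressions are illustrative guesses at the four formats named in the text, not the patterns used by the actual system, and the noun filter assumes a tag set in which noun tags begin with "n"; segmentation and tagging themselves are left to an external tool.

```python
import re

# One illustrative pattern per kind of microblog noise named in the text.
NOISE_PATTERNS = [
    re.compile(r"https?://[A-Za-z0-9./?=&%_:~+\-]+"),  # URL links
    re.compile(r"#[^#]+#"),                             # "#topic#" tags
    re.compile(r"@[\w\-]+"),                            # "@username" mentions
    re.compile(r"\[[^\[\]\s]+\]"),                      # "[emoticon text]"
]

def strip_noise(text):
    """Remove URLs, topic tags, mentions, and emoticons, then tidy whitespace."""
    for pattern in NOISE_PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())

def keep_nouns(tagged_words):
    """Keep words whose part-of-speech tag starts with "n" (assumed noun tags).

    tagged_words is a list of (word, tag) pairs produced by any segmenter
    and tagger; producing them is outside this sketch.
    """
    return [word for word, tag in tagged_words if tag.startswith("n")]
```

URLs are removed first so that the later patterns never see fragments of a link.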

As shown in Fig. 2, the domain keyword extraction module is implemented as follows:

(1) For keyword extraction, the present invention selected the relatively advanced TextRank algorithm after experimental comparison, for the following reasons. TextRank overcomes the shortcomings of the TF-IDF method: it does not need TF information, so microblogs need not be merged, and it does not rely on external IDF information. TextRank also builds in the assumption that words co-occurring with keywords of high domain relevance themselves have high domain relevance, so keyword weights are no longer computed linearly, which to some extent overcomes the power-law distribution problem of the TF-IDF algorithm. It is therefore better suited to this application scenario than other algorithms.

(2) This module draws on the idea of the TextRank keyword extraction algorithm and, taking the characteristics of the microblog scenario into account, proposes the TextRank for Weibo algorithm. TextRank for Weibo is a graph-based keyword extraction algorithm inspired by PageRank. For graph construction, traditional TextRank defines the weight of the edge between two words by the number of times they co-occur within a fixed-length sliding window in a document. Given that microblogs are short texts, the present invention instead uses the number of times two words co-occur within a single microblog as the edge weight. PageRank is then run on this undirected graph to compute each word's weight as a keyword. Each node in the graph represents one segmented word. If two words co-occur in one of the user's microblogs, the weight of the edge between their nodes is increased by 1, so the final edge weight is the number of microblogs in which the two words co-occur.

Once the undirected graph is determined, node weights are generated iteratively using an idea similar to PageRank. The weight of node V_i is updated according to the following formula:

where: V_i is the i-th node; TR(V_i) is the weight of node V_i; w_ij is the weight of the edge between nodes V_i and V_j; E(V_i) is the set of edges connected to V_i; and d is the damping factor of the iteration, set to 0.85. The iteration may start from arbitrary initial values and continues until convergence. Convergence is reached when the total absolute change in node weights between the current iteration and the previous one is less than a specified value.
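The update formula itself is not reproduced in the text here. For reference, the standard weighted TextRank update, written with the symbols defined above, is given below. It is offered as a reconstruction under that assumption and may differ in detail from the formula in the filing.

```latex
TR(V_i) = (1 - d) + d \sum_{V_j \in E(V_i)} \frac{w_{ji}}{\sum_{V_k \in E(V_j)} w_{jk}} \, TR(V_j)
```

In this form the sum runs over the nodes V_j joined to V_i by an edge, and because the graph is undirected, w_ji = w_ij.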

An excerpt of the main part of this module's UML class diagram is shown in Fig. 5. The module works in two stages: construction of the undirected graph from co-occurrence relations, and graph-based computation of node weights.

In the graph construction stage, each segmented word appearing in the user's historical microblogs is first converted into a node. Then, for the segmentation result of each microblog, an edge is created for every pair of words that appear in it; the edge weight is the number of times the pair co-occurs across the historical microblogs.
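The two stages can be sketched as follows. This is an illustrative implementation, not the code behind Fig. 5: the function names are hypothetical, the update rule is the standard weighted TextRank step, and convergence is read as the sum of per-node absolute changes falling below a tolerance.

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence_graph(posts):
    """Build {word: {neighbor: weight}} from a list of token lists.

    The weight of an edge is the number of posts containing both words.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in posts:
        words = sorted(set(tokens))
        for word in words:
            graph[word]  # make sure isolated words still become nodes
        for a, b in combinations(words, 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph

def textrank(graph, d=0.85, tol=1e-6, max_iter=200):
    """Iterate the weighted TextRank update until node weights stop changing."""
    scores = {node: 1.0 for node in graph}
    # Total edge weight attached to each node (the denominator of the update).
    strength = {node: sum(nbrs.values()) for node, nbrs in graph.items()}
    for _ in range(max_iter):
        new_scores = {}
        for node, nbrs in graph.items():
            rank = sum(w / strength[nbr] * scores[nbr] for nbr, w in nbrs.items())
            new_scores[node] = (1 - d) + d * rank
        delta = sum(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if delta < tol:
            break
    return scores
```

Sorting the returned dictionary by value, highest first, gives the ranked keyword list. Comparing the plain sums of all weights between iterations would not work as a stopping test here, because that sum stays constant when every node starts at 1.0.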

As shown in Fig. 2, the user-defined keyword expansion module is implemented as follows:

(1) Because keyword extraction alone is not dynamic enough, the present invention introduces user-defined keywords to make interest modeling more dynamic. To address the limited expressive power of user-defined keywords, the present invention models user interest by combining keyword extraction with expansion of the user-defined keywords.

(2) The application scenario of the present invention requires that, against a domain-related corpus, several domain-related keywords defined by the user be expanded according to domain relevance. The present invention therefore computes the similarity between words from their co-occurrence in the domain-related corpus. The approach rests on the assumption that words co-occurring with a user-defined keyword in the same microblog have strong domain similarity to that keyword.

(3) The present invention adopts P-IOLog, an expansion method designed for topic tags. However, the confidence of the expansion word weights that P-IOLog produces is proportional to the size of the sample space, that is, to the frequency of the user-defined keyword in the corpus under analysis. When few microblogs contain the user-defined keyword, the expansion results produced by P-IOLog have a large error. The present invention therefore improves P-IOLog for this application scenario by introducing a down-weighting factor whose value increases as the user-defined keyword occurs more often, and by removing the original method's consideration of the topic layer. The resulting method is called Probabilistic Inside-Outside Log for Weibo (P-IOW).

(4) Specifically, for a given user-defined keyword s, the domain relevance of a word t with respect to s is computed as follows:

where: s is the user-defined keyword; wf(t) is the number of microblogs containing the word t; t_M is the word contained in the largest number of microblogs in the domain-related corpus, and the number of microblogs containing t_M also enters the formula, as does the number of microblogs that do not contain the word s; wf(t∧s) is the number of microblogs containing both t and s; N is the total number of the user's microblogs; and s_p is a smoothing coefficient in the range (0, 1) that prevents a zero count from causing division by zero. It follows from the formula that, before expression (1) is multiplied by the down-weighting factor, the value of P-IOLogW lies between log(s_p) and log(1/s_p). The larger the computed P-IOW value, the higher the domain similarity between the word t and the user-defined keyword s. The user-defined keyword itself is assigned the upper bound of the P-IOLogW range, log(1/s_p).

(5) The present invention also allows the user to enter multiple user-defined keywords. The expansion word vectors produced for the individual keywords are summed linearly to obtain the final expansion vector. The expansion procedure is as follows:

1) For each keyword in the user-defined keyword set, first compute the weights of all of its expansion words using P-IOW;

2) Map the expansion word weights for that keyword onto the space of segmented words of the domain-related corpus, forming an expansion word vector;

3) Sum all expansion word vectors linearly to obtain the final expansion vector, which is the output of the keyword expansion module.
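The three steps above can be sketched as follows. The P-IOW formula is not available in the text here, so the sketch plugs in a deliberately simple placeholder score, the share of the keyword's posts that also contain the candidate word. That placeholder and all function names are assumptions made for illustration and should not be read as the P-IOW method.

```python
def doc_freq(posts, *words):
    """Number of posts containing every word in `words` (wf in the text)."""
    return sum(1 for tokens in posts if all(w in tokens for w in words))

def cooccurrence_ratio(posts, t, s):
    """Placeholder relatedness score, NOT P-IOW: wf(t and s) / wf(s)."""
    wf_s = doc_freq(posts, s)
    return doc_freq(posts, t, s) / wf_s if wf_s else 0.0

def expand_keywords(posts, keywords, score=cooccurrence_ratio):
    """Return one expansion vector over the corpus vocabulary.

    Step 1: score every vocabulary word against each keyword.
    Step 2: place the scores in the shared vocabulary space.
    Step 3: add the per-keyword vectors component by component.
    """
    vocabulary = sorted({w for tokens in posts for w in tokens})
    expanded = {w: 0.0 for w in vocabulary}
    for s in keywords:
        for t in vocabulary:
            expanded[t] += score(posts, t, s)
    return expanded
```

Passing a different `score` callable swaps the relatedness measure without touching the accumulation logic. With no keywords the result is an all-zero vector.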

An excerpt of the main part of this module's UML class diagram is shown in Fig. 6. The module expands the user-defined keywords using the co-occurrence of other segmented words with them. The expansion result is expressed as a vector, so it can be merged seamlessly with the keyword extraction result.

As shown in Fig. 2, the linear merging module is implemented as follows:

(1) Linear merging by user-defined weights is the process of combining the two result vectors once automatic keyword extraction and expansion of the user-defined keywords are both complete.

The traditional approach of modeling user interest through keyword extraction and the approach of representing user interest through user-defined keywords each have their own defects. For keyword extraction, the user cannot manually adjust the algorithm's results, so a temporary, dynamic need for a particular topic in the domain cannot be met. In addition, the algorithm cannot model domain interest when the user has few historical microblogs, the so-called cold-start problem. For user-defined keywords, the defects are twofold. First, user-defined keywords can hardly be guaranteed to cover all the information in the domain. Second, in short microblog texts, judging similarity only by whether a keyword appears means that much information with the same meaning but different wording cannot be retrieved.

Given the defects of both approaches, the present invention merges automatically extracted domain keywords with user-defined keywords, and expands the user-defined keywords using word co-occurrence information. This method provides comprehensive identification of domain information while allowing keywords to be adjusted dynamically at any time as the user's interests change. User-defined keywords also alleviate the cold-start problem to some extent.

(2) Because the result vectors of keyword extraction and keyword expansion do not share a value range, the two vectors must be mapped onto the same range. This module normalizes the two result vectors by maximum-value normalization so that they map onto a common value range. The two normalized vectors are then merged linearly; the user can set the weights given to keyword extraction and keyword expansion in the merge. The module outputs a word vector representing the user's final domain interest.

For each component of a vector, normalization is performed as follows:

v_normal = v / v_max

where: v is the initial value of a component of the vector, and v_max is the maximum over all components of the vector. After maximum-value normalization, all components of the vector lie in (0, 1] and are nonzero. The vectors are then merged by linear weighting:

V_combine = r × V_kw-extract + (1 - r) × V_kw-expand

where: r is the user-defined merge weight, V_kw-extract is the result vector of keyword extraction, V_kw-expand is the result vector of keyword expansion, and the merged result V_combine is the keyword vector characterizing the user's domain interest.
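The normalization and weighted merge can be sketched in a few lines. Vectors are represented here as word-to-weight dictionaries, which is an implementation choice of this sketch; a word missing from one vector is treated as having weight 0 in it.

```python
def max_normalize(vector):
    """Divide every component by the largest one: v_normal = v / v_max."""
    peak = max(vector.values(), default=0.0)
    if peak <= 0:
        # Empty or all-zero vector: nothing to scale.
        return dict.fromkeys(vector, 0.0)
    return {word: value / peak for word, value in vector.items()}

def linear_merge(extracted, expanded, r):
    """V_combine = r * V_kw-extract + (1 - r) * V_kw-expand after normalization."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    a = max_normalize(extracted)
    b = max_normalize(expanded)
    return {
        word: r * a.get(word, 0.0) + (1 - r) * b.get(word, 0.0)
        for word in set(a) | set(b)
    }
```

Setting r = 1.0 ignores the expansion result entirely, which is one way to handle the case where the user has defined no keywords and the expansion vector is empty.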

As shown in Fig. 2, the similarity calculation and personalized recommendation module is implemented as follows:

(1) After the keyword vector V_combine characterizing the user's domain interest has been generated, the relevance comparison module segments each candidate microblog and counts word frequencies to produce a term-frequency vector. It then takes the dot product of the user interest keyword vector, the term-frequency vector of the candidate microblog, and the IDF vector to obtain the relevance of that microblog to the user's interest. The relevance is computed as follows:

where: V_combine is the keyword vector of the user's domain interest; W is the term-frequency vector of the candidate microblog T; V_IDF is the IDF vector of the segmented words; L is the total number of dimensions of the vector space formed by the segmented words of the user's historical microblogs; the components of V_combine and W corresponding to the word t_i enter the formula term by term; and IDF(t_i) is the IDF value of t_i.
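The relevance formula is not reproduced in the text here. Read literally, the description (a dot product of the three vectors over L dimensions) corresponds to the expression below. The name Rel(T) and the component notation V_combine(t_i) and W(t_i) are assumptions of this sketch, and the filing's formula may include additional normalization.

```latex
\mathrm{Rel}(T) = \sum_{i=1}^{L} V_{combine}(t_i) \cdot W(t_i) \cdot IDF(t_i)
```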

(2) Finally, the model computes the domain relevance of each microblog, and the system recommends microblogs on that basis. The form of recommendation is not restricted: microblogs can be sorted by domain relevance from high to low, or those with domain relevance above a specified threshold can be filtered out, among other options. The present invention uses the first form of presentation.
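The scoring and ranking described in steps (1) and (2) can be sketched as follows. The sketch assumes candidate microblogs arrive already segmented and noun-filtered, and that an IDF table is supplied by the caller; how that table is obtained is not specified here, and the function names are hypothetical.

```python
from collections import Counter

def relevance(interest, tokens, idf):
    """Sum of interest weight x term frequency x IDF over the post's words."""
    term_freq = Counter(tokens)
    return sum(
        interest.get(word, 0.0) * count * idf.get(word, 0.0)
        for word, count in term_freq.items()
    )

def rank_posts(interest, posts, idf):
    """Return (score, tokens) pairs sorted from most to least relevant."""
    scored = [(relevance(interest, tokens, idf), tokens) for tokens in posts]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Words absent from the interest vector contribute nothing, so a post with no interest keywords scores 0 and falls to the bottom of the ranking.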

As shown in Fig. 2, the topic acquisition module is implemented as follows:

(1) Preprocess the text of the recommended domain microblogs. Preprocessing is the same as in module 1 and consists of stop-word filtering, word segmentation, and part-of-speech tagging.

(2) Train an LDA model on the processed data, and cluster terms into topics according to each topic's term distribution;

(3) Compute the relevance between each topic's term set and the user's domain interest keywords obtained in the linear merging module. Relevance is measured using co-occurrence between terms: given the user's domain interest keywords, every topic term set that contains such a keyword is regarded as relevant, and a topic containing none of the domain interest keywords is regarded as weakly relevant or irrelevant. The computation uses the relevance calculation method of module 5.

(4) Relevance represents the importance of a topic. Topics are sorted by importance and presented to the user.
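Steps (3) and (4) can be sketched as below, taking the topic term sets as given (for example, the top terms of each topic from any LDA implementation; training the model is outside this sketch). As a simplification, each topic term is counted once and IDF defaults to 1.0 when no table is supplied; both choices and the function names are assumptions.

```python
def topic_importance(interest, topic_terms, idf=None):
    """Score a topic by the interest weights of the terms it contains."""
    idf = idf or {}
    return sum(
        interest.get(term, 0.0) * idf.get(term, 1.0)
        for term in set(topic_terms)
    )

def rank_topics(interest, topics, idf=None):
    """Return (importance, terms) pairs, most important topic first."""
    scored = [(topic_importance(interest, terms, idf), terms) for terms in topics]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

A topic that shares no term with the interest vector scores 0, matching the rule that topics without any domain interest keyword are treated as irrelevant.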

A prototype system was built on the Sina Weibo open platform to validate the results. The system sequence diagram of the present invention is shown in Fig. 3, which presents the overall usage flow:

1) The user first logs in to the system with a username and password. The system then calls Sina Weibo's OAuth verification module to verify authorization of the microblog account. If the user has not bound a microblog account, OAuth is used to bind one.

2) After login and authorization are complete, the system updates the user data and persists the user's related microblogs locally, so that subsequent analysis works on current data.

3) Once the data has been persisted locally, the segmentation and part-of-speech filter processes the user's domain-related microblogs in the database, in preparation for keyword extraction and keyword expansion. Here, the user's domain-related microblogs are mainly those posted by the bound enterprise microblog account. They may also be microblogs posted by several domain-related users designated by the user, or even an external domain-related corpus. The present invention does not restrict the choice of segmenter and part-of-speech tagger; in principle any Chinese segmenter and tagger can be used. The system corresponding to the present invention uses the ICTCLAS5.0 segmentation and part-of-speech tagging system developed by the Institute of Computing Technology of the Chinese Academy of Sciences. Because that segmenter's lexicon is dated, external lexicon data from the jieba segmenter is imported as well. The preprocessing result is the user's historical microblog data.

4) The domain-related microblogs, after segmentation and part-of-speech filtering, are passed to the keyword extraction module in the form of word frequencies. The system extracts keywords with the TextRank for Weibo method, and the extraction result represents the interest characteristics of the domain.

5) If the user has defined interest keywords, the system expands them with the P-IOW keyword expansion method using domain information, to further improve the recall of user-defined keywords. The expansion result is expressed as a word vector. If the user has not defined any interest keywords, this step outputs an empty result.

6) After keyword extraction with TextRank for Weibo and keyword expansion with P-IOW are complete, their results are merged by linear weighting with user-defined weights. The user can set the proportion contributed by keyword extraction and by keyword expansion. The output is the merged word vector, which represents the user's final domain interest.

7) This step computes the similarity between the user's interest and each candidate microblog. The candidate microblog is first segmented and filtered by part of speech (as in step 3) to convert it into a term-frequency vector. The system then computes the dot product of the user's domain interest vector, the candidate's term-frequency vector, and the IDF values of the terms, as discussed above. The result is the similarity between the candidate microblog and the user's domain interest.
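The scoring in step 7 can be sketched as a three-way product summed over terms, following the relevance formula in claim 7 (interest weight × term frequency × IDF). The IDF values are taken as a precomputed dictionary because the text does not specify how they are derived; the function names and sample data are hypothetical.

```python
from collections import Counter

def relevance(interest, tokens, idf):
    """Sum over terms of interest weight * term frequency * IDF."""
    tf = Counter(tokens)
    return sum(interest.get(term, 0.0) * count * idf.get(term, 0.0)
               for term, count in tf.items())

def rank_posts(interest, posts, idf):
    """Return post indices ordered from most to least relevant."""
    scores = [relevance(interest, tokens, idf) for tokens in posts]
    return sorted(range(len(posts)), key=lambda i: scores[i], reverse=True)

# Hypothetical interest vector, IDF table, and preprocessed candidate posts.
interest = {"phone": 1.0, "camera": 0.3}
idf = {"phone": 1.0, "camera": 2.0, "weather": 3.0}
candidates = [["weather", "rain"], ["phone", "camera", "phone"], ["camera"]]
print(rank_posts(interest, candidates, idf))
```

Terms missing from the interest vector or the IDF table contribute zero, so posts with no domain vocabulary sink to the bottom of the ranking.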

8) Taking the recommended domain microblog text as input, the system uses an LDA model to cluster terms into sets of topic terms, then computes the importance of each topic and presents the topics to the user in order of importance.
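The final ranking in step 8 can be sketched once topics are available. The sketch does not train LDA; it assumes a trained model has already yielded, for each topic, a mapping from term to probability. The importance measure used here (probability-weighted overlap with the user's interest vector) is an assumption, since the text only says that a relevance calculation between topic terms and interest keywords is performed. All names and values are hypothetical.

```python
def topic_importance(topic_terms, interest):
    """Assumed measure: sum of term probability * interest weight."""
    return sum(prob * interest.get(term, 0.0)
               for term, prob in topic_terms.items())

def rank_topics(topics, interest):
    """Return topic names ordered from most to least important."""
    return sorted(topics,
                  key=lambda name: topic_importance(topics[name], interest),
                  reverse=True)

# Hypothetical LDA output: topic name -> {term: probability}.
topics = {
    "weather": {"rain": 0.6, "phone": 0.1},
    "hardware": {"phone": 0.5, "camera": 0.3},
}
interest_vector = {"phone": 1.0, "camera": 0.3}
print(rank_topics(topics, interest_vector))
```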

The architecture of the system implementing the invention is shown in Figure 4. The system is built on a MySQL + JSP + Servlet + Twitter Bootstrap technology stack and consists of six components: a data acquisition and preprocessing module, a domain keyword extraction module, a user-defined keyword expansion module, a linear merging module, a similarity calculation and personalized recommendation module, and a topic acquisition module. The target users of the invention are enterprise microblog users on the Sina Weibo platform. By analyzing the historical microblog text of several domain-related microblog accounts that the user designates, the system can automatically identify the characteristics of the user's domain. The system also lets users dynamically define keywords to express their changing needs within the domain; it expands these keywords according to their co-occurrence information in historical microblogs, strengthening their semantic expressiveness. The user can then set the keyword weights to model a fully personalized domain interest. Finally, based on this domain interest, the system filters and recommends the microblogs the user subscribes to and recommends topics. This spares users from reviewing large numbers of domain-irrelevant microblogs one by one and improves the efficiency of enterprise marketing staff.

Claims (8)

1. A domain information recommendation system based on a microblog platform, characterized by comprising a data acquisition and preprocessing module, a domain keyword extraction module, a user-defined keyword expansion module, a linear merging module, a similarity calculation and personalized recommendation module, and a topic acquisition module, wherein:

Data acquisition and preprocessing module: obtains the user's related microblog data and preprocesses it; preprocessing comprises stop-word filtering, word segmentation, and part-of-speech tagging; the preprocessing result is the user's historical microblog data, which is passed to the domain keyword extraction module; if the user has defined domain interest keywords, the preprocessing result is also passed to the user-defined keyword expansion module;

Domain keyword extraction module: based on the preprocessing result, keywords are extracted in an unsupervised manner using the TextRank for Weibo algorithm, a modification of the TextRank algorithm, which comprises two stages: construction of an undirected graph based on co-occurrence relations, and graph-based node weight calculation; in the graph construction stage, the segmented words appearing in the user's historical microblogs are first converted into corresponding nodes; when the edges between nodes are constructed, both the existence of an edge between two nodes and the weight of that edge are determined by the number of times the two words co-occur in the same microblog: if two words co-occur in one of the user's microblogs, the weight of the edge between their nodes is increased by 1, so that the final weight of an edge is the number of microblogs in which its two words co-occur; then, in the graph-based node weight calculation stage, the node weights are computed iteratively until the change in node weights converges below a threshold; after the iteration ends, the weight of each node is the importance of the word it represents, and sorting all of the user's words by importance yields the keyword extraction result, thereby automatically identifying the characteristics of the user's domain;

User-defined keyword expansion module: computes the similarity between keywords from their co-occurrence, their distribution, and the attribute information of the users they belong to, and takes highly related words as the expansion result for a target keyword; the module supports multiple user-defined keywords, and the expansion word vectors produced for each user-defined keyword are linearly summed to obtain the final expansion vector; the user-defined keyword expansion function ensures that the user's dynamic interest needs can be met in real time and greatly enhances the expressive power of user-defined keywords;

Linear merging module: after both the automatic extraction of domain keywords and the expansion based on user-defined keywords are complete, the two result vectors are normalized by maximum-value normalization so that the keyword extraction and keyword expansion results map to a common range; the two normalized vectors are then linearly merged, with user-defined weights for keyword extraction and keyword expansion; the module outputs a word vector representing the user's final domain interest;

Similarity calculation and personalized recommendation module: after the linear merging module has produced the keyword vector describing the user's domain interest, each microblog to be filtered is segmented and its term frequencies are counted to generate a term-frequency vector; the user interest keyword vector, the term-frequency vector of the candidate microblog, and the IDF information vector are then multiplied as a dot product to obtain the relevance between that microblog and the user's interest, which is the domain relevance of that microblog; the domain relevance of each of the user's microblogs is computed, the microblogs are sorted by domain relevance from high to low and presented to the user, thereby providing personalized domain microblog recommendation;

Topic acquisition module: trains an LDA model with the domain microblog text recommended to the user as input and clusters terms into topics according to the term distribution of each topic; the relevance between each set of topic terms and the user's domain interest keywords obtained in the linear merging module is computed to obtain the importance of each topic, and the topics are presented to the user in order of importance, thereby completing topic discovery and recommendation.

2. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that the data acquisition and preprocessing module is implemented as follows:

(1) after the user logs in to the microblog system, the user is first authenticated; once authentication succeeds, the microblog platform credentials associated with that user are automatically used to interact with the microblog platform, in order to verify the legitimacy of the user's identity on the microblog platform;

(2) the microblog text of the accounts the user follows and subscribes to is obtained, and the obtained data is persisted in structured form in a local database so that it can be read at any time;

(3) the persisted microblog text is preprocessed in three parts: stop-word filtering, word segmentation, and part-of-speech tagging; in view of the characteristics of microblog text, stop words are first filtered out by pattern matching, then Chinese word segmentation and part-of-speech tagging optimized for the microblog scenario are performed using the segmenter ICTCLAS5.0; before both keyword extraction and keyword expansion, the segmented results are filtered by part of speech so that only nouns are retained. The preprocessing result is the user's historical microblog data.

3. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that, in the domain keyword extraction module, the node weights are computed iteratively from the historical microblog data following the idea of the PageRank algorithm, using the formula:

$$TR(V_i) = (1 - d) + d \times \sum_{V_j \in E(V_i)} \frac{w_{ij}}{\sum_{V_k \in E(V_j)} w_{jk}} \, TR(V_j)$$

where V_i is the i-th node; TR(V_i) is the weight of node V_i; w_ij is the weight of the edge between nodes V_i and V_j; E(V_i) is the set of edges connected to V_i; and d is the damping coefficient of the iteration, set to 0.85. The iteration may start from arbitrary initial values and continues until convergence; the convergence condition is that the absolute difference between the sums of the node weights in the current iteration and the previous iteration is less than a specified value.

4. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that, in the user-defined keyword expansion module, the improved P-IOW algorithm is used to compute the similarity between keywords, implemented as follows:

for a given user-defined keyword s, the domain relevance of a word t with respect to s is computed as:

$$P\text{-}IOW(t, s) = \begin{cases} \dfrac{\log(wf(s) + 1)}{\log(WF_{t_M} + 1)} \times P\text{-}IOLogW(t, s), & t \neq s \\[2ex] P\text{-}IOLogW(t, s), & t = s \end{cases} \qquad (1)$$

where: [the defining formula for P-IOLogW is missing from this text]

where: s is the user-defined keyword; wf(t) is the number of microblogs containing the word t; t_M is the word contained in the largest number of microblogs in the domain-related corpus, and WF_{t_M} is the number of microblogs containing t_M; [symbol missing] is the number of microblogs that do not contain the word s; wf(t∧s) is the number of microblogs containing both the word t and the word s; N is the total number of the user's microblogs; s_p is a smoothing coefficient in the range (0, 1), which prevents division by zero when [symbol missing] is zero. It follows from the formula that, before multiplication by the down-weighting factor in (1), P-IOLogW ranges between log(s_p) and log(1/s_p). The larger the value of P-IOW, the higher the domain similarity between the word t and the user-defined keyword s. The user-defined keyword itself is assigned the upper bound of the range of P-IOLogW, namely log(1/s_p).

5. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that, in the user-defined keyword expansion module, when the expansion word vectors produced for the keywords are linearly summed to obtain the final expansion vector, the expansion procedure is as follows:

(1) for each keyword in the set of user-defined keywords, the weights of all expansion words of that keyword are first computed using P-IOW;

(2) the vector of expansion word weights for the keyword is mapped onto the space of segmented words of the domain-related corpus, forming an expansion word vector;

(3) all expansion word vectors are linearly accumulated to obtain the final expansion vector, which is the output of the keyword expansion module.

6. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that the linear merging module is implemented as follows:

(1) the vectors generated by keyword extraction and by keyword expansion are each normalized using maximum-value normalization;

(2) for each component of a vector, the normalization is:

$$v_{normal} = \frac{v}{v_{max}}$$

where v is the initial value of a component of the vector and v_max is the maximum over all components of the vector; after maximum-value normalization, all components of the vector lie in (0, 1] and are non-zero; the vectors are then merged by linear weighting:

$$V_{combine} = r \times V_{kw\text{-}extract} + (1 - r) \times V_{kw\text{-}expand}$$

where r is the user-defined merge weight ratio, V_kw-extract is the result vector of keyword extraction, V_kw-expand is the result vector generated by keyword expansion, and the merged result V_combine is the keyword vector describing the user's domain interest.

7. The domain information recommendation system based on a microblog platform according to claim 1, characterized in that, in the similarity calculation and personalized recommendation module, the domain relevance is computed as:

$$Relevance_T = V_{combine} \cdot W \cdot V_{IDF} = \sum_{t_i \in L} v_{t_i} \, w_{t_i} \, IDF(t_i)$$

where V_combine is the keyword vector of the user's domain interest; W is the term-frequency vector of the candidate microblog T; V_IDF is the IDF vector of the segmented words; L is the total number of dimensions of the vector space obtained by segmenting the user's historical microblogs; v_{t_i} and w_{t_i} are the components of V_combine and W corresponding to t_i, respectively; and IDF(t_i) is the IDF value of t_i.

8. A domain information recommendation method based on a microblog platform, characterized by comprising six steps, namely data acquisition and preprocessing, domain keyword extraction, user-defined keyword expansion, linear merging, similarity calculation and personalized recommendation, and topic acquisition, implemented as follows:

(1) the user's related microblog information is obtained and preprocessed; preprocessing comprises filtering stop words by pattern matching and performing word segmentation and part-of-speech tagging with the segmentation system ICTCLAS5.0; the preprocessing result is the user's historical microblog data;

(2) domain keywords are extracted from the preprocessing result in an unsupervised manner using the TextRank for Weibo algorithm of the invention, a modification of the TextRank algorithm, which comprises two stages: construction of an undirected graph based on co-occurrence relations, and graph-based node weight calculation; the graph construction stage first converts the segmented words appearing in the user's historical microblogs into corresponding nodes; then, for the segmentation result of each microblog, an edge is constructed for every pair of words appearing in it, the weight of an edge being the number of times the corresponding word pair co-occurs in the historical microblogs; the graph-based node weight calculation stage computes the node weights iteratively, following the idea of the PageRank algorithm, until the change in node weights converges below a threshold;

(3) if the user has defined domain interest keywords, the preprocessing result is additionally passed to the keyword expansion module; keyword expansion uses the P-IOW algorithm, an improvement on the P-IOLog algorithm, which computes the similarity between keywords from information such as keyword co-occurrence, distribution, and the attributes of the users they belong to, and takes highly related words as the expansion result for a target keyword; multiple user-defined keywords are supported; for each user-defined keyword, the keyword expansion module linearly sums the expansion word vectors it produces to obtain the final expansion vector;

(4) the two results are combined according to user-defined weights: the two result vectors are first normalized by maximum-value normalization so that they map to a common range; the two normalized vectors are then linearly merged, with user-defined weights for keyword extraction and keyword expansion; the merged result is used by the recommendation module for relevance comparison, to generate a relevance score for each candidate microblog;

(5) each candidate microblog the user subscribes to is segmented and vectorized by term frequency; then, based on the identified user interest, the user interest keyword vector, the term-frequency vector of the candidate microblog, and the IDF information vector are multiplied as a dot product in the vector space to compute the domain relevance;

(6) the domain microblog text recommended to the user is taken as the input set, terms are clustered using LDA to discover topics, the relevance between each set of topic terms and the user's domain interest keywords obtained in the linear merging module is computed to establish the importance of each topic, and topics are recommended to the user according to their importance.
CN201611075431.XA 2016-11-28 2016-11-28 A kind of realm information commending system and method based on microblog Pending CN106776881A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611075431.XA CN106776881A (en) 2016-11-28 2016-11-28 A kind of realm information commending system and method based on microblog

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611075431.XA CN106776881A (en) 2016-11-28 2016-11-28 A kind of realm information commending system and method based on microblog

Publications (1)

Publication Number Publication Date
CN106776881A true CN106776881A (en) 2017-05-31

Family

ID=58900759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611075431.XA Pending CN106776881A (en) 2016-11-28 2016-11-28 A kind of realm information commending system and method based on microblog

Country Status (1)

Country Link
CN (1) CN106776881A (en)

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229871A (en) * 2017-07-17 2017-10-03 梧州井儿铺贸易有限公司 A kind of safe information acquisition device
CN107370664A (en) * 2017-07-17 2017-11-21 陈剑桃 A kind of effective microblogging junk user finds system
CN107436934A (en) * 2017-07-21 2017-12-05 上海斐讯数据通信技术有限公司 It is a kind of to orient the system and method for subscribing to the story of a play or opera
CN107704512A (en) * 2017-08-31 2018-02-16 平安科技(深圳)有限公司 Financial product based on social data recommends method, electronic installation and medium
CN107766482A (en) * 2017-10-13 2018-03-06 北京猎户星空科技有限公司 Information pushes and sending method, device, electronic equipment, storage medium
CN108255957A (en) * 2017-12-21 2018-07-06 杭州传送门网络科技有限公司 One kind recommends matching process based on Venture Capital field precision dataization
CN108287916A (en) * 2018-02-11 2018-07-17 北京方正阿帕比技术有限公司 A kind of resource recommendation method
CN108304371A (en) * 2017-07-14 2018-07-20 腾讯科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium that Hot Contents excavate
CN108319677A (en) * 2018-01-30 2018-07-24 中南大学 The alignment schemes of the cyberrelationship figure of dynamic change
CN108388597A (en) * 2018-02-01 2018-08-10 深圳市鹰硕技术有限公司 Conference summary generation method and device
CN108763205A (en) * 2018-05-21 2018-11-06 阿里巴巴集团控股有限公司 A kind of brand alias recognition methods, device and electronic equipment
CN108846023A (en) * 2018-05-24 2018-11-20 普强信息技术(北京)有限公司 The unconventional characteristic method for digging and device of text
CN108932318A (en) * 2018-06-26 2018-12-04 四川政资汇智能科技有限公司 A kind of intellectual analysis and accurate method for pushing based on Policy resources big data
CN109034389A (en) * 2018-08-02 2018-12-18 黄晓鸣 Man-machine interactive modification method, device, equipment and the medium of information recommendation system
CN109241238A (en) * 2018-06-27 2019-01-18 广州优视网络科技有限公司 Article search method, apparatus and electronic equipment
CN109255126A (en) * 2018-09-10 2019-01-22 百度在线网络技术(北京)有限公司 Article recommended method and device
CN109299280A (en) * 2018-12-12 2019-02-01 河北工程大学 Short text cluster analysis method, device and terminal device
CN109376309A (en) * 2018-12-28 2019-02-22 北京百度网讯科技有限公司 Document recommendation method and device based on semantic label
CN109635081A (en) * 2018-11-23 2019-04-16 上海大学 A kind of text key word weighing computation method based on word frequency power-law distribution characteristic
CN109685085A (en) * 2017-10-18 2019-04-26 阿里巴巴集团控股有限公司 A kind of master map extracting method and device
CN110019702A (en) * 2017-09-18 2019-07-16 阿里巴巴集团控股有限公司 Data digging method, device and equipment
CN110110207A (en) * 2018-01-18 2019-08-09 北京搜狗科技发展有限公司 A kind of information recommendation method, device and electronic equipment
CN110222160A (en) * 2019-05-06 2019-09-10 平安科技(深圳)有限公司 Intelligent semantic document recommendation method, device and computer readable storage medium
CN110427547A (en) * 2018-04-26 2019-11-08 观相科技(上海)有限公司 A kind of search system and searching method based on industrial characteristic
CN110427480A (en) * 2019-06-28 2019-11-08 平安科技(深圳)有限公司 Personalized text intelligent recommendation method, apparatus and computer readable storage medium
CN110489665A (en) * 2019-08-16 2019-11-22 北京信息科技大学 A kind of microblogging personalized recommendation method based on scene modeling and convolutional neural networks
CN110633408A (en) * 2018-06-20 2019-12-31 北京正和岛信息科技有限公司 Recommendation method and system for intelligent business information
CN111831802A (en) * 2020-06-04 2020-10-27 北京航空航天大学 A city domain knowledge detection system and method based on LDA topic model
CN112068712A (en) * 2020-09-02 2020-12-11 北京搜狗科技发展有限公司 A recommended method, apparatus and electronic device
CN112364947A (en) * 2021-01-14 2021-02-12 北京崔玉涛儿童健康管理中心有限公司 Text similarity calculation method and device
CN112749284A (en) * 2020-12-31 2021-05-04 平安科技(深圳)有限公司 Knowledge graph construction method, device, equipment and storage medium
CN112784142A (en) * 2019-10-24 2021-05-11 北京搜狗科技发展有限公司 Information recommendation method and device
CN112861004A (en) * 2021-02-20 2021-05-28 中国联合网络通信集团有限公司 Rich media determination method and device
CN113220994A (en) * 2021-05-08 2021-08-06 中国科学院自动化研究所 User personalized information recommendation method based on target object enhanced representation
CN114048374A (en) * 2021-10-28 2022-02-15 盐城金堤科技有限公司 Method and device for determining object to be recommended
CN114819507A (en) * 2022-03-28 2022-07-29 浙江运动家体育发展有限公司 Multi-party interactive community motion management method and system
CN116228282A (en) * 2023-05-09 2023-06-06 湖南惟客科技集团有限公司 Intelligent commodity distribution method for user data tendency
CN116244496A (en) * 2022-12-06 2023-06-09 山东紫菜云数字科技有限公司 Resource recommendation method based on industrial chain

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831234A (en) * 2012-08-31 2012-12-19 北京邮电大学 Personalized news recommendation device and method based on news content and theme feature
CN105677769A (en) * 2015-12-29 2016-06-15 广州神马移动信息科技有限公司 Keyword recommending method and system based on latent Dirichlet allocation (LDA) model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
吴雨龙等: "一种面向企业的行业微博信息推荐方法", 《计算机应用与软件》 *
唐晓波等: "基于文本聚类与LDA相融合的微博主题检索模型研究", 《情报理论与实践》 *

Cited By (60)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304371A (en) * 2017-07-14 2018-07-20 腾讯科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium that Hot Contents excavate
CN108304371B (en) * 2017-07-14 2021-07-13 腾讯科技(深圳)有限公司 Method and device for mining hot content, computer equipment and storage medium
CN107370664A (en) * 2017-07-17 2017-11-21 陈剑桃 A kind of effective microblogging junk user finds system
CN107229871A (en) * 2017-07-17 2017-10-03 梧州井儿铺贸易有限公司 A kind of safe information acquisition device
CN107436934A (en) * 2017-07-21 2017-12-05 上海斐讯数据通信技术有限公司 It is a kind of to orient the system and method for subscribing to the story of a play or opera
CN107436934B (en) * 2017-07-21 2023-09-08 杭州吉吉知识产权运营有限公司 System and method for directionally subscribing to scenario
CN107704512B (en) * 2017-08-31 2021-08-24 平安科技(深圳)有限公司 Financial product recommendation method based on social data, electronic device and medium
CN107704512A (en) * 2017-08-31 2018-02-16 平安科技(深圳)有限公司 Financial product based on social data recommends method, electronic installation and medium
CN110019702B (en) * 2017-09-18 2023-04-07 阿里巴巴集团控股有限公司 Data mining method, device and equipment
CN110019702A (en) * 2017-09-18 2019-07-16 阿里巴巴集团控股有限公司 Data digging method, device and equipment
CN107766482A (en) * 2017-10-13 2018-03-06 北京猎户星空科技有限公司 Information pushes and sending method, device, electronic equipment, storage medium
CN109685085B (en) * 2017-10-18 2023-09-26 阿里巴巴集团控股有限公司 Main graph extraction method and device
CN109685085A (en) * 2017-10-18 2019-04-26 阿里巴巴集团控股有限公司 A kind of master map extracting method and device
CN108255957A (en) * 2017-12-21 2018-07-06 杭州传送门网络科技有限公司 One kind recommends matching process based on Venture Capital field precision dataization
CN110110207B (en) * 2018-01-18 2023-11-03 北京搜狗科技发展有限公司 Information recommendation method and device and electronic equipment
CN110110207A (en) * 2018-01-18 2019-08-09 北京搜狗科技发展有限公司 A kind of information recommendation method, device and electronic equipment
CN108319677A (en) * 2018-01-30 2018-07-24 中南大学 The alignment schemes of the cyberrelationship figure of dynamic change
CN108388597A (en) * 2018-02-01 2018-08-10 深圳市鹰硕技术有限公司 Conference summary generation method and device
CN108287916A (en) * 2018-02-11 2018-07-17 北京方正阿帕比技术有限公司 A kind of resource recommendation method
CN110427547A (en) * 2018-04-26 2019-11-08 观相科技(上海)有限公司 A kind of search system and searching method based on industrial characteristic
CN108763205B (en) * 2018-05-21 2022-05-03 创新先进技术有限公司 Brand alias identification method and device and electronic equipment
CN108763205A (en) * 2018-05-21 2018-11-06 阿里巴巴集团控股有限公司 A kind of brand alias recognition methods, device and electronic equipment
CN108846023A (en) * 2018-05-24 2018-11-20 普强信息技术(北京)有限公司 The unconventional characteristic method for digging and device of text
CN110633408A (en) * 2018-06-20 2019-12-31 北京正和岛信息科技有限公司 Recommendation method and system for intelligent business information
CN110633408B (en) * 2018-06-20 2024-03-15 北京正和岛信息科技有限公司 Intelligent business information recommendation method and system
CN108932318A (en) * 2018-06-26 2018-12-04 四川政资汇智能科技有限公司 Intelligent analysis and accurate pushing method based on policy resource big data
CN108932318B (en) * 2018-06-26 2022-03-04 四川政资汇智能科技有限公司 Intelligent analysis and accurate pushing method based on policy resource big data
CN109241238A (en) * 2018-06-27 2019-01-18 广州优视网络科技有限公司 Article search method, apparatus and electronic equipment
CN109034389A (en) * 2018-08-02 2018-12-18 黄晓鸣 Man-machine interactive modification method, device, equipment and the medium of information recommendation system
CN109255126A (en) * 2018-09-10 2019-01-22 百度在线网络技术(北京)有限公司 Article recommendation method and device
CN109635081A (en) * 2018-11-23 2019-04-16 上海大学 A text keyword weight calculation method based on the power-law distribution of word frequency
CN109635081B (en) * 2018-11-23 2023-06-13 上海大学 A Text Keyword Weight Calculation Method Based on the Power-law Distribution of Word Frequency
CN109299280A (en) * 2018-12-12 2019-02-01 河北工程大学 Short text cluster analysis method, device and terminal device
CN109376309B (en) * 2018-12-28 2022-05-17 北京百度网讯科技有限公司 Document recommendation method and device based on semantic tags
US11216504B2 (en) 2018-12-28 2022-01-04 Beijing Baidu Netcom Science And Technology Co., Ltd. Document recommendation method and device based on semantic tag
CN109376309A (en) * 2018-12-28 2019-02-22 北京百度网讯科技有限公司 Document recommendation method and device based on semantic label
CN110222160A (en) * 2019-05-06 2019-09-10 平安科技(深圳)有限公司 Intelligent semantic document recommendation method, device and computer readable storage medium
CN110222160B (en) * 2019-05-06 2023-09-15 平安科技(深圳)有限公司 Intelligent semantic document recommendation method and device and computer readable storage medium
WO2020258481A1 (en) * 2019-06-28 2020-12-30 平安科技(深圳)有限公司 Method and apparatus for intelligently recommending personalized text, and computer-readable storage medium
CN110427480A (en) * 2019-06-28 2019-11-08 平安科技(深圳)有限公司 Personalized text intelligent recommendation method, apparatus and computer readable storage medium
CN110489665B (en) * 2019-08-16 2023-11-14 北京信息科技大学 Microblog personalized recommendation method based on scene modeling and convolutional neural network
CN110489665A (en) * 2019-08-16 2019-11-22 北京信息科技大学 Microblog personalized recommendation method based on scene modeling and convolutional neural network
CN112784142A (en) * 2019-10-24 2021-05-11 北京搜狗科技发展有限公司 Information recommendation method and device
CN111831802A (en) * 2020-06-04 2020-10-27 北京航空航天大学 A city domain knowledge detection system and method based on LDA topic model
CN111831802B (en) * 2020-06-04 2023-05-26 北京航空航天大学 A system and method for detecting urban domain knowledge based on LDA topic model
CN112068712A (en) * 2020-09-02 2020-12-11 北京搜狗科技发展有限公司 A recommendation method, apparatus and electronic device
CN112749284A (en) * 2020-12-31 2021-05-04 平安科技(深圳)有限公司 Knowledge graph construction method, device, equipment and storage medium
CN112749284B (en) * 2020-12-31 2021-12-17 平安科技(深圳)有限公司 Knowledge graph construction method, device, equipment and storage medium
CN112364947A (en) * 2021-01-14 2021-02-12 北京崔玉涛儿童健康管理中心有限公司 Text similarity calculation method and device
CN112364947B (en) * 2021-01-14 2021-06-29 北京育学园健康管理中心有限公司 Text similarity calculation method and device
CN112861004A (en) * 2021-02-20 2021-05-28 中国联合网络通信集团有限公司 Rich media determination method and device
CN112861004B (en) * 2021-02-20 2024-02-06 中国联合网络通信集团有限公司 Method and device for determining rich media
CN113220994A (en) * 2021-05-08 2021-08-06 中国科学院自动化研究所 User personalized information recommendation method based on target object enhanced representation
CN113220994B (en) * 2021-05-08 2022-10-28 中国科学院自动化研究所 User personalized information recommendation method based on target object enhanced representation
CN114048374A (en) * 2021-10-28 2022-02-15 盐城金堤科技有限公司 Method and device for determining object to be recommended
CN114819507A (en) * 2022-03-28 2022-07-29 浙江运动家体育发展有限公司 Multi-party interactive community motion management method and system
CN116244496A (en) * 2022-12-06 2023-06-09 山东紫菜云数字科技有限公司 Resource recommendation method based on industrial chain
CN116244496B (en) * 2022-12-06 2023-12-01 山东紫菜云数字科技有限公司 Resource recommendation method based on industrial chain
CN116228282B (en) * 2023-05-09 2023-08-11 湖南惟客科技集团有限公司 Intelligent commodity distribution method for user data tendency
CN116228282A (en) * 2023-05-09 2023-06-06 湖南惟客科技集团有限公司 Intelligent commodity distribution method for user data tendency

Similar Documents

Publication Publication Date Title
CN106776881A (en) A kind of realm information commending system and method based on microblog
CN110390103B (en) Automatic short text summarization method and system based on double encoders
WO2021083239A1 (en) Graph data query method and apparatus, and device and storage medium
Li et al. Filtering out the noise in short text topic modeling
CN105893609B (en) A mobile APP recommendation method based on weighted mixture
CN105095433B (en) Entity recommendation method and device
CN104699766B (en) An implicit attribute mining method combining word association relations and contextual inference
US20210406473A1 (en) System and method for building chatbot providing intelligent conversational service
CN104951518B (en) A recommendation method based on dynamically and incrementally updated context
WO2017000402A1 (en) Page generation method and device
CN107239512B (en) A microblog spam comment identification method combined with comment relationship network graph
CN103279515B (en) Recommendation method based on micro-group and micro-group recommendation apparatus
CN113672693B (en) Tag recommendation method for online question answering platform based on knowledge graph and tag association
CN105912524B (en) Method and device for extracting article topic keywords based on low-rank matrix decomposition
CN109949174B (en) Heterogeneous social network user entity anchor link identification method
CN108710611A (en) A short text topic model generation method based on word networks and word vectors
Wang et al. Multi-source knowledge integration based on machine learning algorithms for domain ontology
Kwapong et al. A knowledge graph based framework for web API recommendation
CN106126605B (en) Short text classification method based on user portrait
CN109308315A (en) A Collaborative Recommendation Method Based on Expert Domain Similarity and Association
CN110069713A (en) A personalized recommendation method based on user context awareness
CN105404693A (en) Service clustering method based on demand semantics
Devika et al. A semantic graph-based keyword extraction model using ranking method on big social data
CN104281565A (en) Semantic dictionary constructing method and device
CN104298732A (en) Personalized text sequencing and recommending method for network users

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170531