CN103927398A - Microblog hype group discovering method based on maximum frequent item set mining

Info

Publication number: CN103927398A
Application number: CN201410188004.7A
Authority: CN
Inventors: 刘琰; 张进; 罗军勇; 罗向阳; 董雨辰; 陈静; 常斌
Original assignee: PLA Information Engineering University
Current assignee: PLA Information Engineering University
Priority date: 2014-05-07
Filing date: 2014-05-07
Publication date: 2014-07-16
Anticipated expiration: 2034-05-07
Also published as: CN103927398B

Abstract

The invention relates to a method for discovering microblog hype groups based on maximum frequent itemset mining, which effectively addresses the problem of discovering microblog hype groups and preventing false and malicious hype. A set of accounts that participated in spreading hyped microblogs is obtained through the microblog public open platform. Taking each microblog as a transaction and each participating account as an item, a hype-microblog transaction database is built. For the transactions in the database corresponding to the microblog group under inspection, the maximum frequent itemsets contained in the transactions are found, the overlap rate between the maximum frequent itemsets is calculated, and smaller itemsets are merged into larger ones. The number of intersection operations is reduced, and when transactions are intersected, binary search is used to test whether a transaction contains a given item. This improves the efficiency of mining maximum frequent itemsets and thereby reveals microblog hype groups.

Description

Microblog Hype Group Discovery Method Based on Maximum Frequent Itemset Mining

Technical field

The invention relates to the field of microblog public opinion monitoring, and in particular to a method for discovering microblog hype groups based on maximum frequent itemset mining.

Background

As an emerging form of social media, the microblog combines the functions of blogging, media, and instant messaging. Its immediacy, grassroots character, mobility, and interactivity make it a natural carrier for the spread of online public opinion. In online public opinion, microblogs are not only the center and channel of dissemination but also take part in forming, developing, and guiding opinion.

Microblog dissemination is a double-edged sword. On the one hand, microblogs provide a rapid-response platform for disclosing information about social events, which to some extent makes up for the shortcomings of traditional media and other network tools. On the other hand, unlike traditional news media, news posted on microblogs is repetitive and its authenticity cannot be guaranteed; it may be exploited as a carrier for rumors or a trigger for discontent, and may even have severe consequences for national security and social stability. False information online originates with its creators and spreads through its disseminators.

In its latest report, Hewlett-Packard's social computing research team stated that Sina Weibo has an unusually serious topic-hype problem: half of the microblogs reposted around trending topics are sent by hype users. The study found that the number of artificially manipulated fake reposts in the spread of trending topics is very large, with 1% of spammers generating 49% of the reposts. Since August 2013, government departments have stepped up their efforts to guide online public opinion. Investigations of the online promotion companies behind "Qin Huohuo" and "Li Er Chai Si" indicate that the Internet hosts a large number of organized promotion teams. Working with a few "opinion leaders", they organize paid posters (the online "water army") and have long fabricated false news, deliberately distorted facts, stirred up incidents, and confused right and wrong, seriously disrupting the order of online public opinion. Their behavior has drawn close attention from national public opinion regulators, and the persons involved have been placed under criminal detention in accordance with the law on suspicion of crimes.

Therefore, for emerging media and the various hidden forms of opinion incitement, identifying hyped microblogs, analyzing the characteristics of the groups that spread them, collecting evidence of fake promotion behavior, and screening out artificially manufactured hype hot spots have important theoretical value and practical significance for discovering, predicting, and guiding online public opinion, improving the government's capacity to supervise public opinion, and maintaining social harmony and stability.

With the explosive growth of microblogs, research on microblog accounts has attracted wide interest from scholars in China and abroad, and some results have been published in recent years at major conferences such as WWW and KDD. Current research on microblog accounts falls roughly into three categories: 1) feature analysis, including account attribute features and behavioral features; 2) influence analysis, including the construction of influence evaluation systems and measurement methods; 3) analysis of relationship networks between accounts, including the basic properties, generation, and evolution of account relationship networks.

However, there is relatively little literature in China or abroad on hype groups. The most relevant work concerns the identification of spam accounts (spammers), sockpuppet accounts, and zombie accounts. A spam account is one that frequently posts spam. Z. Yi et al. analyzed the characteristics of spam accounts from several angles and used machine learning to identify them automatically. Chao Yang et al. analyzed the social relationships among spam accounts in depth and proposed a method for finding spam accounts based on the closeness between accounts. Sockpuppet accounts are fake accounts created by registering multiple accounts to post, repost, and comment; Xueling Zheng et al. proposed a method that identifies them using text content and similarity matching. Zombie accounts are accounts registered maliciously for buying and selling followers; Fang Ming et al. proposed an intelligent classification method based on features extracted from registered microblog account names, which achieves relatively high accuracy. None of these methods, however, solves the problem of discovering microblog hype groups and preventing false hype. The biggest difference between hype accounts and the account types above is that hype accounts are defined by their "hype" behavior: the participating accounts are scattered, the direct relationships among them are not obvious, and they are more covert, better organized, and harder to find.

Group hype resembles ordinary microblog activity: the posting, reposting, and commenting of the people spreading a message appear isolated on the surface. Abnormal malicious dissemination, however, is often not the act of a single person but organized group behavior, and that group behavior is covert and hard to detect. How to discover microblog hype groups, and so prevent the adverse social effects and unnecessary economic losses caused by false and malicious hype, is therefore a technical problem that must be seriously addressed.

Summary of the invention

In view of the above, and to overcome the defects of the prior art, the object of the present invention is to provide a method for discovering microblog hype groups based on maximum frequent itemset mining, which can effectively solve the problem of discovering microblog hype groups and preventing false and malicious hype.

The technical solution of the present invention is a method for discovering microblog hype accounts based on maximum frequent itemset mining, comprising the following steps:

(1) Hype microblog sample collection: using the relatedness of hyped microblogs as a clue, obtain the set of accounts that participated in spreading the hyped microblogs by means of crawler technology or the microblog public open platform;

(2) Transaction database construction: taking each microblog as a transaction and each account that participated in spreading it as an item, build a hype-microblog transaction database;

(3) Maximum frequent itemset mining: for each transaction in the transaction database corresponding to the microblog group under inspection, use the iterative intersection method to find the maximum frequent itemsets contained in the transactions, obtaining a collection of maximum frequent itemsets;

Because most transactions in the hype-microblog transaction database contain tens of thousands of items, mining maximum frequent itemsets directly in the original transaction database would hurt the efficiency of the algorithm. Binary search is therefore used to quickly remove non-frequent items from the transactions, find the candidate set for the maximum frequent itemsets, and reduce the size of the transaction database;

(4) Maximum frequent itemset merging: for each maximum frequent itemset, compute the overlap rate between itemsets and merge the maximum frequent itemsets, merging smaller itemsets into larger ones where possible while ensuring that the accounts in a merged itemset remain related to one another. Reducing the size of the transaction database, reducing the number of intersection operations, and using binary search to test whether a transaction contains a given item when transactions are intersected together improve the efficiency of mining maximum frequent itemsets, and the microblog hype groups are thereby discovered.

The method of the invention is simple and easy to operate, can accurately discover malicious microblog hype groups and so prevent adverse social effects and unnecessary economic losses, and has practical application value.

Description of drawings

Fig. 1 is a flow diagram of the present invention.

Fig. 2 is a schematic diagram of the hype-microblog transaction database of the present invention.

Fig. 3 is a screenshot of the hype-microblog transaction database of the present invention.

Fig. 4 compares the execution time of the algorithm of the present invention on the Mushroom dataset.

Fig. 5 compares the execution time of the algorithm of the present invention on the hype-microblog dataset.

Fig. 6 shows how the number of itemsets in the MFS of the present invention varies.

Fig. 7 shows how the maximum length of the itemsets in the MFS of the present invention varies.

Detailed description

Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, the present invention comprises three parts: hype-microblog transaction database construction, maximum frequent itemset mining, and maximum frequent itemset merging. The transaction database construction module collects and preprocesses the data and builds the transaction database D. The maximum frequent itemset mining module first screens candidate maximum frequent itemsets using binary search and then mines the maximum frequent itemsets MFS from D using iterative intersection. The maximum frequent itemset merging module merges the itemsets in MFS so as to recover the real hype groups as far as possible. The specific steps are:

1) Collect hype microblog samples

Collecting hype microblog samples is the first step in implementing the invention. The selected samples should be related to one another, for example several microblogs in which a given hype account has participated, or several microblogs related to a given topic. Whether a microblog qualifies as a sample should be decided with reference to existing mature discrimination methods or expert systems. There are two ways to collect the samples. One is to use crawler technology: download pages from the microblog website, parse the page structure, and extract information about the accounts that spread each microblog. The other is to use the microblog public open platform, calling the API functions officially provided by the microblog service to obtain that information. To facilitate the discovery of hype groups, the following principles should also be observed when selecting samples:

a. Select popular microblogs with a relatively high repost count;

b. The publication times of the microblogs span less than 180 days;

In line with the analysis requirements of the hype-account mining algorithm, the collected sample data should include the microblog identifier, the microblog account identifier, and the basic information of the microblog account;

2) Build the transaction database

The hype group discovery problem is converted into the maximum frequent itemset mining problem of data mining. On the basis of the collected hype microblog samples, each hyped microblog is mapped to a transaction and each account that reposted it is mapped to an item in that transaction, yielding the transaction database shown in Fig. 2;
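The mapping from reposts to transactions can be illustrated with a minimal Python sketch. The record layout (pairs of microblog identifier and account identifier) and the function name `build_transaction_db` are assumptions made for illustration; the patent does not prescribe a storage format beyond Fig. 2 and Fig. 3.

```python
from collections import defaultdict

def build_transaction_db(repost_records):
    """Group (microblog_id, account_id) pairs into one transaction per microblog.

    Hypothetical helper: each transaction is returned as a sorted tuple of
    account ids so that later steps can test membership by binary search.
    """
    accounts_by_post = defaultdict(set)
    for microblog_id, account_id in repost_records:
        # A set keeps each account once even if it reposted several times.
        accounts_by_post[microblog_id].add(account_id)
    return {mid: tuple(sorted(accts)) for mid, accts in accounts_by_post.items()}

# Toy input: account 101 reposted microblog "m1" twice.
records = [("m1", 103), ("m1", 101), ("m2", 101), ("m2", 102), ("m1", 101)]
db = build_transaction_db(records)
```

Keeping the items of each transaction sorted is a design choice of this sketch; it is what makes the binary-search membership test of the next step applicable.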

3) Candidate maximum frequent itemset screening based on binary search

Because most transactions in the hype-microblog transaction database contain tens of thousands of items, mining maximum frequent itemsets directly in the original database would hurt the efficiency of the algorithm. A method based on binary search can quickly remove non-frequent items from the transactions, find the candidate set for the maximum frequent itemsets, and reduce the size of the transaction database. Given the transaction database D and the minimum support count S, candidate screening proceeds as follows:

(1) Sort the transactions in D by number of items in descending order;

(2) Let FI be the set of frequent items and NFI the set of non-frequent items, both initially empty. Starting from i = 1, traverse each transaction T_i (1≤i≤|D|) in D in order, and for each item u in T_i:

a) If u∈FI, keep u;

b) If u∈NFI, remove u from T_i;

c) If u∉FI and u∉NFI, go to the next step to determine whether u is a frequent item;

(3) Traverse the remaining transactions starting from j = i+1, using binary search to test whether T_j (i<j≤|D|) contains u. The termination conditions are:

a) When the number of transactions containing u reaches S, u is a frequent item; add u to FI;

b) When the sum of the number of remaining transactions and the number of transactions containing u is less than S, u is a non-frequent item; remove u from T_i. If at that point the number of transactions containing u is greater than 1, u also occurs in transactions other than T_i, so add u to NFI;

(4) After the non-frequent items have been removed from all transactions in D, the reduced transaction database D₁ is obtained;
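The screening steps above can be sketched in Python as follows. This is one reading of the procedure, assuming each transaction is a sorted list of comparable item ids; the names `contains` and `prune_infrequent` are illustrative and do not come from the patent.

```python
from bisect import bisect_left

def contains(sorted_items, u):
    """Binary-search membership test on a sorted sequence."""
    pos = bisect_left(sorted_items, u)
    return pos < len(sorted_items) and sorted_items[pos] == u

def prune_infrequent(transactions, min_support):
    """Return the reduced database D1 with non-frequent items removed."""
    db = sorted(transactions, key=len, reverse=True)   # step (1)
    frequent, infrequent = set(), set()                # FI and NFI, step (2)
    reduced = []
    for i, t in enumerate(db):
        kept = []
        for u in t:
            if u in frequent:
                kept.append(u)
            elif u in infrequent:
                continue
            else:
                count, j = 1, i + 1                    # T_i itself contains u
                # Step (3): stop once u is frequent, or once the remaining
                # transactions can no longer lift the count up to S.
                while count < min_support and count + (len(db) - j) >= min_support:
                    if contains(db[j], u):
                        count += 1
                    j += 1
                if count >= min_support:
                    frequent.add(u)
                    kept.append(u)
                elif count > 1:
                    infrequent.add(u)                  # u reappears in a later T_j
        reduced.append(kept)
    return reduced

reduced = prune_infrequent([[1, 2, 3, 4], [1, 2, 3], [1, 2, 5], [2, 6]], min_support=3)
```

In the toy input, items 1 and 2 reach the support count of 3 and are kept, item 3 occurs only twice and is recorded in NFI so that it is dropped from the second transaction without a new search, and items 4, 5, and 6 occur once and are simply discarded.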

4) Maximum frequent itemset mining based on iterative intersection:

Maximum frequent itemsets are mined by iteratively intersecting transactions. Given the reduced transaction database D₁ and the minimum support count S, the mining method is as follows:

(1) Sort the transactions in D₁ by number of items in descending order, so that maximum frequent itemsets are found as early as possible. To reduce the size of the database, merge duplicate transactions and keep a count of the number of transactions;

(2) To reduce the number of intersection operations, for each transaction T_i, 1≤i≤|D₁|-S+1, starting from i = 1, first find the set of transactions that share at least one item with T_i, {T_j | T_j contains at least one item of T_i, j>i}. Intersect T_i with each such T_j in turn and move the intersection into a new transaction database D₂, while removing T_j;

(3) For a transaction T in the new database D₂, if T was obtained by intersecting no fewer than S transactions, move T into the maximum frequent candidate itemset collection MFCS and remove the sub-transactions of T from D₂;

(4) If the number of transactions remaining in D₂ is less than S, finish processing D₂ and return to the parent transaction database; otherwise, repeat the procedure on D₂ starting from step (1);

(5) When the number of transactions remaining in D₁ is less than S, that is, i>|D₁|-S+1, finish processing the current database D₁;

(6) Merge the itemsets in MFCS and remove those that are not maximum frequent itemsets; the result is the required collection of maximum frequent itemsets, MFS;
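The idea that maximum frequent itemsets arise as intersections of at least S transactions can be checked on small inputs with a brute-force sketch. The code below is not the iterative procedure with D₂ and MFCS described above: it simply intersects every group of S transactions and keeps the candidates that are not contained in a larger candidate. It runs in exponential time and is only meant as a reference for testing on toy data; the function name is an assumption.

```python
from itertools import combinations

def maximal_frequent_itemsets(transactions, min_support):
    """Brute-force reference: maximal non-empty intersections of S transactions."""
    sets = [frozenset(t) for t in transactions]
    candidates = set()
    for group in combinations(sets, min_support):
        common = frozenset.intersection(*group)
        if common:
            # Every item of `common` occurs in at least S transactions.
            candidates.add(common)
    # Keep only candidates that are not a proper subset of another candidate.
    return [c for c in candidates if not any(c < other for other in candidates)]

toy = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 5}, {3, 4, 6}]
mfs = sorted(sorted(x) for x in maximal_frequent_itemsets(toy, min_support=2))
```

The approach is sound because any itemset with support at least S is contained in the intersection of some S transactions that contain it, and every such intersection is itself frequent, so the maximal intersections coincide with the maximum frequent itemsets.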

5) Maximum frequent itemset merging:

Because of the minimum support constraint, the maximum frequent itemsets in MFS are small, and some itemsets overlap heavily; the account groups these itemsets represent most likely belong to the same hype group. To address this, the overlap rate is used to reflect the similarity between two itemsets. Let X₁, X₂ ∈ MFS; the overlap rate of X₁ and X₂ is defined as:

$\mathrm{ORate}(X_1, X_2) = \frac{|X_1 \cap X_2|}{\mathrm{Min}(|X_1|, |X_2|)}$

In the formula above, |X₁∩X₂| is the number of items shared by X₁ and X₂, and Min(|X₁|,|X₂|) is the number of items in the smaller of the two itemsets. Itemsets are merged as follows:

(1) Sort the maximum frequent itemsets in MFS by number of items in descending order;

(2) Traverse the maximum frequent itemsets in MFS starting from i = 1. For each X_j with i<j≤|MFS|, if ORate(X_i,X_j)≥minOR, add the union of X_i and X_j to a new collection MMFS and remove X_j;

(3) Repeat the two steps above on the itemsets in MMFS;

(4) Stop when the overlap rate of every pair of itemsets in MMFS is less than minOR.
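The overlap rate and the merging loop can be sketched in Python as below. This is one interpretation of the steps above: the union is accumulated into the current itemset X_i as later itemsets are absorbed, which the text leaves open. The function names are illustrative, and the itemsets are assumed to be non-empty so that the denominator is never zero.

```python
def overlap_rate(x1, x2):
    """ORate: shared items divided by the size of the smaller itemset."""
    return len(x1 & x2) / min(len(x1), len(x2))

def merge_itemsets(mfs, min_or=0.5):
    """Repeatedly merge itemsets whose overlap rate is at least min_or."""
    groups = sorted((set(x) for x in mfs), key=len, reverse=True)   # step (1)
    merged_any = True
    while merged_any:                         # steps (3) and (4)
        merged_any = False
        result = []
        while groups:
            base = groups.pop(0)              # current X_i
            rest = []
            for other in groups:              # each later X_j, step (2)
                if overlap_rate(base, other) >= min_or:
                    base = base | other       # absorb X_j into the union
                    merged_any = True
                else:
                    rest.append(other)
            groups = rest
            result.append(base)
        groups = sorted(result, key=len, reverse=True)
    return groups

merged = merge_itemsets([{1, 2, 3, 4}, {3, 4, 5}, {7, 8}, {8, 9, 10}], min_or=0.5)
```

In the toy input, {3, 4, 5} shares two of its three items with {1, 2, 3, 4} and is absorbed, and {7, 8} shares one of its two items with {8, 9, 10} and is absorbed as well, leaving two groups whose overlap rate is zero.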

The method of the invention is simple and easy to operate, and trial use has shown it to be stable and reliable with practical application value. The supporting data are as follows:

1) Datasets

Sina Weibo was used as the research platform and 81 microblogs suspected of hype as the research objects. The number of accounts that actually took part in reposting them was 380,726 (not counting accounts that took part in reposting more than once), and each transaction contained 6,286 items on average. Most of these microblogs are advertising and marketing posts, and several hype groups may have taken part in spreading them. A crawler was used to collect the account identifiers (UIDs) of all accounts that reposted these microblogs and store them in the transaction database; the format of part of the data is shown in Fig. 3.

To verify the efficiency of the algorithm of the invention (hereinafter IIA) for maximum frequent itemset mining, its performance was tested on the classic Mushroom dataset and compared with known methods. The dataset contains 8,124 records; each record has 23 items, recording 23 attributes of a mushroom.

2) Performance evaluation

The performance of the method was evaluated first. The experimental environment was 4 GB of memory, a 2.0 GHz dual-core Duo T5800 CPU, and the 32-bit Windows 7 operating system. The algorithm was implemented in Java and compared with the classic MAFIA and DFMFI algorithms.

Fig. 4 shows how the three algorithms perform on the Mushroom dataset at different support levels. The method is clearly more efficient than the other two algorithms and retains its advantage even when the minimum support is very low. Fig. 5 shows the three algorithms on the hype-microblog dataset, where the method is again the most efficient.

3) Parameter threshold selection

Figs. 6 and 7 show the maximum frequent itemsets found in the hype-microblog dataset under different minimum support counts: Fig. 6 shows how the number of itemsets in the maximum frequent itemset collection varies with the minimum support count, and Fig. 7 shows how the maximum itemset length varies. In the context of this invention, the larger minSup (the minimum support count) is set, the more strongly the discovered account groups are suspected of hype, but the size and number of groups decrease; conversely, the smaller minSup is set, the weaker the suspicion, but the size and number of groups increase. A reasonable threshold for minSup is therefore needed in order to find groups of meaningful size with a high suspicion of hype.

In addition, when the itemsets in the maximum frequent itemset collection are merged, the setting of minOR directly affects the size of the merged itemsets. After repeated analysis of the data, minOR was set to 50%; that is, two itemsets are merged when more than half of their items are the same.

To further determine the value of minSup, Table 1 lists the results of merging the maximum frequent itemsets for minSup = 3, 4, and 5, sorted by merged itemset length; only the first 8 itemsets (suspected hype groups) are listed. The table shows that for minSup = 3 and 5 the first itemset is very large while all the others are small, whereas for minSup = 4 the itemset sizes do not change abruptly and are of suitable scale, indicating that this value is relatively reasonable.

Table 1. Merging results for maximum frequent itemsets under different support counts

| No. | minSup = 3 | minSup = 4 | minSup = 5 |
|---|---|---|---|
| 1 | 14,863 | 2,623 | 963 |
| 2 | 311 | 1,755 | 65 |
| 3 | 156 | 688 | 29 |
| 4 | 77 | 410 | 19 |
| 5 | 59 | 129 | 9 |
| 6 | 56 | 98 | 9 |
| 7 | 55 | 82 | 7 |
| 8 | 55 | 54 | 5 |

4) Accuracy analysis

To verify the precision of the hype group discovery algorithm, that is, the proportion of actual hype accounts in a discovered hype group, the results were checked by combining an existing hype account identification method based on multi-feature analysis with manual labeling. Let H be the hype group to be verified. First, the existing multi-feature method is used to judge each account, and the resulting set of hype accounts is denoted H₁. Then the remaining accounts are judged by manual labeling, and the resulting set of hype accounts is denoted H₂. The precision of hype group H is computed as:

$\mathrm{Precision} = \frac{|H_1| + |H_2|}{|H|} \times 100\% \qquad (1)$

In the formula above, |H| is the total number of accounts in H and |H₁|+|H₂| is the number of actual hype accounts in H. The groups in Table 1 with minSup = 4 and a group size (itemset length) greater than 100 were verified; the results are shown in Table 2.

Table 2. Precision of hype group discovery (minSup = 4)

| No. | \|H₁\| | \|H₂\| | \|H\| | Precision |
|---|---|---|---|---|
| 1 | 2,016 | 451 | 2,623 | 94.1% |
| 2 | 1,465 | 163 | 1,755 | 92.8% |
| 3 | 571 | 78 | 688 | 94.3% |
| 4 | 354 | 33 | 410 | 94.4% |
| 5 | 109 | 10 | 129 | 92.2% |
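As a quick arithmetic check, the precision values in Table 2 can be recomputed from the reported counts with formula (1). The short Python sketch below uses only the numbers from the table; the function and variable names are illustrative.

```python
def precision(h1, h2, h_total):
    """Formula (1): share of verified hype accounts in a group, in percent."""
    return (h1 + h2) / h_total * 100

# (|H1|, |H2|, |H|) for the five groups of Table 2.
rows = [(2016, 451, 2623), (1465, 163, 1755), (571, 78, 688), (354, 33, 410), (109, 10, 129)]
values = [round(precision(*row), 1) for row in rows]
```

Rounded to one decimal place, the recomputed values agree with the Precision column of the table.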

Table 2 shows that in every hype group found by the method, actual hype accounts make up more than 90% of the group, indicating that the method can identify more covert hype accounts (namely H₂), which are often high-profile accounts that take part in hype only occasionally but have great influence. The invention therefore has practical application value and considerable economic and social benefit.

Claims

1. A microblog hype group discovery method based on maximum frequent itemset mining, is characterized in that, comprises the steps:

(1) Sample collection of hype microblogs: take the relevance of hype microblogs as clues, and obtain a collection of accounts participating in the spread of hype microblogs based on crawler technology or the public open platform of microblogs;

(2) Construction of transactional database: taking a single microblog as a transaction and accounts participating in the spread of microblogs as items, construct a hype microblog transactional database;

(3) Maximum frequent itemset mining: For each transaction in the transaction database corresponding to the microblog group to be detected, use the iterative intersection method to find out the maximum frequent itemsets contained in all transactions, and obtain several maximum frequent itemsets;

Due to the hype that each transaction in the microblog transaction database contains tens of thousands of items, directly mining the largest frequent item set in the original transaction database will affect the efficiency of algorithm execution, and use the binary search method to quickly eliminate non-frequent items in transactions project, find out the candidate set of the largest frequent item set, and reduce the size of the transaction database;

(4) Maximum frequent itemset merging: For each maximum frequent itemset, calculate the overlap rate between itemsets, merge the largest frequent itemsets, try to merge smaller itemsets into larger itemsets, and ensure After merging, the accounts in the item set still have certain relevance; by reducing the size of the transaction database, reducing the number of intersections, and when taking the intersection between transactions, use the binary search method to determine whether a certain item is included in the transaction, so as to improve the efficiency of mining the largest frequent itemset , so as to discover Weibo hype groups.

2. The microblog hype group discovery method based on maximum frequent itemset mining according to claim 1, characterized in that it comprises a hype microblog transaction database construction part, a maximum frequent itemset mining part, and a maximum frequent itemset merging part; the hype microblog transaction database construction module is mainly responsible for collecting and preprocessing data to construct the transaction database D; the maximum frequent itemset mining module first screens candidate maximum frequent itemsets using the binary search method, and then mines the set of maximum frequent itemsets MFS from the transaction database D using the iterative intersection method; the maximum frequent itemset merging module merges the itemsets in MFS to restore the real hype groups. The specific steps are:

1) Collect hype Weibo samples

Collecting hype microblog samples is the first step in implementing the invention. The selected microblog samples should be related to one another, and existing mature discrimination methods or expert systems should be drawn upon when selecting them. There are two ways to collect hype microblog samples: one is to use crawler technology to download microblog web pages, parse the page structure, and extract the information of the accounts that spread each microblog; the other is to use the microblog open platform, calling the API functions officially provided by the microblog service to obtain that account information;

As required by the hype account mining algorithm, the collected sample data should include the microblog identification number, the microblog account identification number, and the basic information of the microblog account;

2) Build a transactional database

The discovery of hype groups is transformed into the maximum frequent itemset mining problem of data mining. Based on the collected hype microblog samples, each hype microblog corresponds to a transaction and each account participating in forwarding that microblog corresponds to an item in the transaction; the transaction database is constructed accordingly, as shown in Figure 2;
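The construction can be sketched as follows, assuming the collected samples are available as (microblog ID, account ID) pairs. The function name `build_transaction_db` and the sample IDs are hypothetical and do not come from the patent.

```python
from collections import defaultdict

def build_transaction_db(records):
    """Group forwarding records into transactions.

    records: iterable of (microblog_id, account_id) pairs (assumed input format).
    Returns {microblog_id: sorted list of distinct account_ids}.
    """
    grouped = defaultdict(set)
    for microblog_id, account_id in records:
        grouped[microblog_id].add(account_id)
    # Items are stored sorted so that later steps can use binary search.
    return {mid: sorted(accounts) for mid, accounts in grouped.items()}

records = [
    ("m1", "u3"), ("m1", "u1"), ("m1", "u2"),
    ("m2", "u1"), ("m2", "u2"), ("m2", "u1"),  # u1 forwarded m2 twice
]
db = build_transaction_db(records)
```

Each microblog becomes one transaction and repeated forwards by the same account collapse into a single item.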

3) Screening of candidate maximum frequent itemsets based on binary search

Because each transaction in the hype microblog transaction database contains tens of thousands of items, mining the maximum frequent itemsets directly in the original transaction database degrades the efficiency of the algorithm. The binary search method makes it possible to quickly eliminate the infrequent items in the transactions, find the candidate set of maximum frequent itemsets, and reduce the size of the transaction database. Given the transaction database D and the minimum support number S, the candidate maximum frequent itemsets are screened as follows:

(1) Sort the transactions in the transaction database D by number of items, from largest to smallest;

(2) Denote the set of frequent items as FI and the set of infrequent items as NFI, both initially empty; starting from i=1, traverse each transaction T_i in D in order (1≤i≤|D|), and for each item u in transaction T_i:

a) If u∈FI, keep u;

b) If u∈NFI, remove u from T_i;

c) If u∉FI and u∉NFI, go to the next step to determine whether u is a frequent item;

(3) Starting from j=i+1, traverse the remaining transactions T_j (i<j≤|D|), using the binary search method to determine whether u is contained in T_j; the termination conditions are:

a) When the number of transactions containing u reaches S, it means that u is a frequent item, and u is added to FI;

b) When the sum of the number of remaining transactions and the number of transactions containing u is less than S, u is an infrequent item and is removed from T_i; if the number of transactions containing u is greater than 1, u also appears in a transaction other than T_i, so u is added to NFI;

(4) After the infrequent items have been eliminated from all transactions in D, the reduced transaction database D₁ is obtained;
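The screening steps above can be sketched as follows. This is a minimal illustration, assuming transactions are lists of comparable account IDs; the function names `contains`, `support_reaches`, and `screen_candidates` are hypothetical.

```python
from bisect import bisect_left

def contains(sorted_items, u):
    """Binary search: is u present in the sorted item list?"""
    k = bisect_left(sorted_items, u)
    return k < len(sorted_items) and sorted_items[k] == u

def support_reaches(D, i, u, S):
    """Scan T_j for j > i, stopping once u is known frequent or infrequent.

    Returns (is_frequent, number of transactions found to contain u).
    """
    count, j = 1, i + 1            # T_i itself contains u
    while count < S:
        remaining = len(D) - j     # transactions not yet examined
        if count + remaining < S:  # S can no longer be reached
            return False, count
        if contains(D[j], u):
            count += 1
        j += 1
    return True, count

def screen_candidates(transactions, S):
    """Return the reduced database D1 with infrequent items removed."""
    # Step (1): largest transactions first; items sorted for binary search.
    D = sorted((sorted(set(t)) for t in transactions), key=len, reverse=True)
    FI, NFI = set(), set()
    D1 = []
    for i, T in enumerate(D):
        kept = []
        for u in T:
            if u in FI:
                kept.append(u)
            elif u not in NFI:
                frequent, count = support_reaches(D, i, u, S)
                if frequent:
                    FI.add(u)
                    kept.append(u)
                elif count > 1:    # u occurs in a transaction other than T_i
                    NFI.add(u)
        D1.append(kept)
    return D1

transactions = [["a", "b", "c", "d"], ["a", "b", "c"], ["a", "b", "e"], ["e", "f"]]
D1 = screen_candidates(transactions, 3)
```

With S=3 only `a` and `b` survive. Whether emptied transactions are dropped from D₁ is not stated in the text, so this sketch keeps them.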

4) Mining of maximum frequent itemsets based on iterative intersection:

The maximum frequent itemsets are mined by iteratively intersecting transactions. Given the reduced transaction database D₁ and the minimum support number S, the method of mining the maximum frequent itemsets is as follows:

(1) Sort the transactions in the transaction database D₁ by number of items, from largest to smallest, so that the maximum frequent itemsets are found as early as possible; to reduce the size of the transaction database, merge the duplicate transactions and record how many times each transaction occurs;

(2) To reduce the number of intersections, for a transaction T_i (1≤i≤|D₁|-S+1), starting from i=1, first find the set of transactions that share an item with T_i, {T_j | T_j contains at least one item of T_i, j>i}; T_i is then intersected with each T_j in turn, each intersection is moved into the new transaction database D₂, and T_j is removed at the same time;

(3) For a transaction T in the new transaction database D₂, if T is obtained by intersecting no fewer than S transactions, move T into the set of maximum frequent candidate itemsets MFCS, and remove the sub-transactions of T from D₂;

(4) If the number of remaining transactions in the new transaction database D₂ is less than S, end the processing of D₂ and return to the upper-level transaction database; otherwise, apply the procedure to D₂ starting from step (1);

(5) When the number of remaining transactions in the transaction database D₁ is less than S, that is, i>|D₁|-S+1, end the processing of the current transaction database D₁;

(6) Merge the itemsets in MFCS, eliminating the non-maximal frequent itemsets at the same time; the final result is the desired set of maximum frequent itemsets MFS;
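The idea behind intersection-based mining can be illustrated with a simplified sketch. It is not the recursive D₁/D₂ procedure described above: it exhaustively closes the transactions under intersection, keeps the intersections contained in at least S transactions, and discards the non-maximal ones. This relies on the fact that every maximum frequent itemset equals the intersection of the transactions that contain it. The function name is hypothetical, and the exhaustive closure is only practical for small inputs.

```python
def maximal_frequent_itemsets(transactions, S):
    """Maximal frequent itemsets via intersections of transactions (simplified)."""
    tx = [frozenset(t) for t in transactions]
    # Close the set of transactions under intersection, one transaction at a time.
    closed = set(tx)
    frontier = set(tx)
    while frontier:
        new = set()
        for a in frontier:
            for b in tx:
                c = a & b
                if c and c not in closed:
                    new.add(c)
        closed |= new
        frontier = new
    # Keep itemsets contained in at least S transactions.
    frequent = [c for c in closed if sum(c <= t for t in tx) >= S]
    # Drop itemsets that are proper subsets of another frequent itemset.
    return [c for c in frequent if not any(c < d for d in frequent)]

transactions = [
    ["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"], ["b", "d"], ["b", "d"],
]
mfs = maximal_frequent_itemsets(transactions, 2)
```

For this input with S=2 the maximal frequent itemsets are {a, b, c} and {b, d}. The sorting, duplicate merging, and early termination in the steps above serve to avoid the exhaustive enumeration done here.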

5) Maximum frequent itemsets merge:

Because of the minimum support constraint, the maximum frequent itemsets in MFS are small, and some itemsets share a large number of items. The account groups represented by such itemsets are likely to belong to the same hype group. To address this, the overlap rate is used to reflect the similarity between two itemsets. Let the itemsets X₁, X₂ ∈ MFS; the overlap rate of X₁ and X₂ is defined as:

\mathrm{ORate}(X_1, X_2) = \frac{|X_1 \cap X_2|}{\mathrm{Min}(|X_1|, |X_2|)}

In the above formula, |X₁∩X₂| is the number of items shared by X₁ and X₂, and Min(|X₁|,|X₂|) is the number of items in the smaller of the two itemsets. The itemsets are merged as follows:

(1) Sort the maximum frequent itemsets in the MFS from the largest to the smallest according to the number of items;

(2) Traverse the maximum frequent itemsets in MFS starting from i=1; for each X_i, if ORate(X_i,X_j)≥minOR for some i<j≤|MFS|, add the union of X_i and X_j to the new set MMFS and remove X_j at the same time;

(3) Repeat the above two steps for the itemsets in MMFS;

(4) When the overlapping rate of any two itemsets in MMFS is less than minOR, end.
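The merging procedure can be sketched as follows. This is one possible reading of the steps: within a pass, each X_j is compared against the original X_i rather than the growing union, and itemsets that merge with nothing are carried into MMFS unchanged. The function names and the threshold value 0.6 are illustrative assumptions.

```python
def orate(x1, x2):
    """Overlap rate: |X1 ∩ X2| / Min(|X1|, |X2|) for non-empty sets."""
    return len(x1 & x2) / min(len(x1), len(x2))

def merge_itemsets(mfs, min_or):
    """Merge itemsets until every remaining pair has overlap rate < min_or."""
    sets = [set(x) for x in mfs if x]  # empty itemsets are skipped
    while True:
        # Step (1): largest itemsets first.
        sets.sort(key=len, reverse=True)
        mmfs, removed, merged_any = [], set(), False
        # Step (2): absorb every later itemset that overlaps X_i enough.
        for i, xi in enumerate(sets):
            if i in removed:
                continue
            union = set(xi)
            for j in range(i + 1, len(sets)):
                if j not in removed and orate(xi, sets[j]) >= min_or:
                    union |= sets[j]
                    removed.add(j)
                    merged_any = True
            mmfs.append(union)
        sets = mmfs
        # Steps (3) and (4): repeat on MMFS until a pass merges nothing.
        if not merged_any:
            return sets

mfs = [{"u1", "u2", "u3", "u4"}, {"u3", "u4", "u5"}, {"u8", "u9"}]
groups = merge_itemsets(mfs, 0.6)
```

Here the first two itemsets share 2 items and the smaller has 3, so their overlap rate is 2/3 and they merge into one group of five accounts, while {u8, u9} stays separate.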

3. The microblog hype group discovery method based on maximum frequent itemset mining according to claim 2, characterized in that, in said step 1), the collected hype microblog samples should meet the following conditions:

a. Select popular microblogs with a relatively high number of retweets;

b. The publication times of the microblogs span less than 180 days, to facilitate the discovery of hype groups.
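A check of these two conditions might look like the sketch below. The claim gives no numeric retweet threshold, so `min_retweets` is a caller-supplied assumption, and the (retweet count, publication time) tuple format and the function name are hypothetical.

```python
from datetime import datetime, timedelta

def samples_meet_conditions(samples, min_retweets, max_span=timedelta(days=180)):
    """samples: list of (retweet_count, published_at) tuples (assumed format)."""
    if not samples:
        return False
    # Condition a: every sample is a popular microblog.
    popular = all(count >= min_retweets for count, _ in samples)
    # Condition b: earliest and latest publication times are < 180 days apart.
    times = [published for _, published in samples]
    return popular and (max(times) - min(times)) < max_span

samples = [(5000, datetime(2014, 1, 1)), (8000, datetime(2014, 3, 1))]
ok = samples_meet_conditions(samples, min_retweets=1000)
too_wide = samples_meet_conditions(
    samples + [(6000, datetime(2014, 8, 1))], min_retweets=1000
)
```

The first sample set spans 59 days and passes; adding a microblog published on 1 August 2014 widens the span to 212 days, so the second check fails.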