CN103927398A - Microblog hype group discovering method based on maximum frequent item set mining - Google Patents
Microblog hype group discovering method based on maximum frequent item set mining Download PDFInfo
- Publication number
- CN103927398A CN103927398A CN201410188004.7A CN201410188004A CN103927398A CN 103927398 A CN103927398 A CN 103927398A CN 201410188004 A CN201410188004 A CN 201410188004A CN 103927398 A CN103927398 A CN 103927398A
- Authority
- CN
- China
- Prior art keywords
- affairs
- microblogging
- collection
- maximum frequent
- item
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 54
- 238000005065 mining Methods 0.000 title claims abstract description 32
- 238000005516 engineering process Methods 0.000 claims abstract description 6
- 230000009467 reduction Effects 0.000 claims description 12
- 238000004458 analytical method Methods 0.000 claims description 9
- 238000012545 processing Methods 0.000 claims description 6
- 238000012216 screening Methods 0.000 claims description 6
- 230000002776 aggregation Effects 0.000 claims description 4
- 238000004220 aggregation Methods 0.000 claims description 4
- 230000000644 propagated effect Effects 0.000 claims description 4
- 230000008569 process Effects 0.000 claims description 3
- 230000001902 propagating effect Effects 0.000 claims description 3
- 238000007418 data mining Methods 0.000 claims description 2
- 230000009286 beneficial effect Effects 0.000 claims 1
- 230000005540 biological transmission Effects 0.000 abstract description 3
- 238000011160 research Methods 0.000 description 9
- 230000006399 behavior Effects 0.000 description 8
- 238000010586 diagram Methods 0.000 description 6
- 235000001674 Agaricus brunnescens Nutrition 0.000 description 4
- 244000097202 Rathbunia alamosensis Species 0.000 description 2
- 235000009776 Rathbunia alamosensis Nutrition 0.000 description 2
- 230000006854 communication Effects 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 230000009931 harmful effect Effects 0.000 description 2
- 230000008520 organization Effects 0.000 description 2
- 244000025254 Cannabis sativa Species 0.000 description 1
- 241000270322 Lepidosauria Species 0.000 description 1
- 244000046052 Phaseolus vulgaris Species 0.000 description 1
- 235000010627 Phaseolus vulgaris Nutrition 0.000 description 1
- 230000002159 abnormal effect Effects 0.000 description 1
- NUFNQYOELLVIPL-UHFFFAOYSA-N acifluorfen Chemical compound C1=C([N+]([O-])=O)C(C(=O)O)=CC(OC=2C(=CC(=CC=2)C(F)(F)F)Cl)=C1 NUFNQYOELLVIPL-UHFFFAOYSA-N 0.000 description 1
- 230000003542 behavioural effect Effects 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 230000015572 biosynthetic process Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000007812 deficiency Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 235000013399 edible fruits Nutrition 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 239000002360 explosive Substances 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 230000002452 interceptive effect Effects 0.000 description 1
- 238000011835 investigation Methods 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 238000012544 monitoring process Methods 0.000 description 1
- 230000036651 mood Effects 0.000 description 1
- 238000003012 network analysis Methods 0.000 description 1
- 238000011056 performance test Methods 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to a microblog hype group discovering method based on maximum frequent item set mining. Microblog hype groups are effectively discovered, and false and malicious hype is avoided. The method comprises the steps that the correlation of a hype microblog is used as a clue, and an account set participating in hype microblog transmission is obtained based on the crawler technology or a microblog public open platform; single microblogs are used as transactions, accounts participating in microblog transmission are used as items, and a hype microblog transaction database is built; for transactions, corresponding to a microblog set to be detected, in the transaction database, the maximum frequent item sets contained in the transactions are found, the overlapping ratio of the maximum frequent item sets is calculated, small-scale item sets are merged into large item sets, the intersection frequency is reduced, whether the transactions comprise a certain item or not is judged through the binary search method when intersection is obtained between the transactions, the maximum frequent item set mining efficiency is improved, and microblog hype groups are discovered. The microblog hype group discovering method based on maximum frequent item set mining is simple, and can accurately discover malicious microblog hype groups and avoid negative influences on the society.
Description
Technical field
The present invention relates to microblogging public sentiment monitoring field, particularly a kind of microblogging based on Maximum Frequent item set mining is propagandized colony's discover method.
Background technology
Microblogging, as a kind of emerging Social Media form, has blog, media, instant communication function concurrently.The instantaneity of microblogging self, grass roots, movability, the feature such as interactive become the natural carrier that network public-opinion is propagated.In network public-opinion, microblogging not only becomes center and the channel of Public Opinion Transmission, also participates in formation, development and the bootup process of public opinion simultaneously.
It is a double-edged sword that microblogging is propagated: on the one hand, microblogging provides a platform for response fast for the information disclosure in some social events, and it has made up the deficiency of traditional media and other network tools to a certain extent; On the other hand, microblogging is different from traditional news media, and the issue of its news exists repeatability, and authenticity cannot ensure, may be utilized the carrier that becomes rumour and propagate, the safety fuse of discontented mood, cause extremely bad consequence even to national security and social stability.The unreal information of network starts from its fabricator, is spread in its blazer.
Social activity under Hewlett-Packard is calculated research team and is claimed in up-to-date report, and Sina's microblogging exists abnormal serious topic to propagandize problem, has half all to send by propagandizing user around hot issue in the microblogging forwarding.Research discovery, it is very big that the falseness that hot issue is artificially handled in propagating forwards quantity, and 1% rubbish message sender has created 49% transfer amount.Since in August, 2013, government department has strengthened the dynamics to network public opinion guiding, according to the investigation result to the place network pushing hands company such as " Qin Huohuo ", " vertical two tear four open ", in network, exist a large amount of organized pushing hands team, they are in league with minority " leader of opinion " organization network " waterborne troops ", concocting false news on the net, distorting the facts deliberately for a long time, strike trouble, confuse right and wrong, seriously upset network public opinion order, its behavior has been subject to showing great attention to of national public sentiment management and control, and relevant people etc. are also detained for criminal act because being suspected of committing a crime in accordance with the law.
Therefore, towards New Media, for various hiding public opinion demagogueries, carry out the identification to propagandizing microblogging, analyze it and propagate population characteristic, collect the identification evidence of false propelling movement behavior, screen the artificial propagation focus of manufacturing, for discovery, prediction, which directs network public opinion, improve government public opinion ability to supervise, safeguard that social harmony is stable and there is important theory value and realistic meaning.
Along with the explosive growth of microblogging, attracted the broad interest of Chinese scholars for the research of microblogging account, some achievements in research are delivered in recent years in the momentous conference such as WWW, KDD.At present can roughly be divided into following three classes to the research of microblogging account: 1) signature analysis, comprises account attributes feature and behavioural characteristic etc.; 2) influence power analysis, comprises influence power appraisement system structure and measure etc.; 3) relational network analysis between account, comprises base attribute, generation and the evolution etc. of account relational network.
But relatively less to the document of propagation colony research both at home and abroad at present, main pertinent literature has the identification of pair rubbish account (spammer), vest account (sockpuppet), corpse account.Rubbish account refers to the account of frequent issue junk information, the people such as Z.Yi from multiple angle analysis the feature of rubbish account, and adopt the mode of machine learning automatically to identify rubbish account.The people such as Chao Yang have analysed in depth the social relationships between rubbish account, have proposed a kind of method of finding rubbish account according to cohesion between account.The dummy account of the behaviors such as vest account refers to by registering multiple accounts and posts, forwards, comment, the people such as Xueling Zheng have proposed a kind of method of utilizing content of text, similarity to mate to identify vest account.Corpse account refers to the account of malicious registration in order to carry out bean vermicelli dealing, and Fang Ming etc. have proposed a kind of intelligent method for classifying based on the feature extraction of microblogging login account name, have higher accuracy rate.But these methods the unresolved microblogging propagation colony that how to find, prevent false propagation, propagandizing difference maximum between account and above a few class account is, propagandize account and lay particular emphasis on its " propagation " behavior, participate in propagandize account comparatively disperse and direct relation not obvious, disguise and the sense of organization are stronger, are also more difficult to find.
Colony propagandizes with common microblogging similar, the posting of propagation crowd, forward, comment etc. isolates on behavior surface, but to propagate be not often single people's behavior to unconventional malice, but organized group behavior, but this group behavior is hidden, be difficult to discover.Therefore, how finding microblogging propagation colony, prevent the false maliciously harmful effect that propagation causes to society and unnecessary economic loss, is the technical matters that must conscientiously solve.
Summary of the invention
For above-mentioned situation, for overcoming the defect of prior art, the present invention's object is just to provide a kind of microblogging based on Maximum Frequent item set mining and propagandizes colony's discover method, can effectively solve the discovery of microblogging propagation colony, prevents the problem that false malice is propagandized.
The technical scheme that the present invention solves is that the microblogging based on Maximum Frequent item set mining is propagandized account discover method and comprised the steps:
(1) propagandizing microblogging sample collects: taking the correlativity of propagandizing microblogging as clue, obtain and participate in propagandizing the account aggregation that microblogging is propagated based on crawler technology or the public open platform of microblogging;
(2) transaction database builds: taking single microblogging as affairs, the account that participates in microblogging propagation is item, builds and propagandizes microblogging transaction database;
(3) Maximum Frequent item set mining: to the each affairs in the corresponding transaction database of microblogging group to be detected, utilize iteration common factor method to find out the maximum frequent set collection comprising in all affairs, obtain the set of some maximum frequent set collection;
Owing to propagandizing, the project that in microblogging affairs storehouse, each transaction packet contains is mostly ten hundreds of, directly in original transaction database, Mining Maximum Frequent Itemsets will affect the efficiency that algorithm is carried out, utilize binary chop, reject fast the non-frequent item in affairs, find out the candidate collection of maximum frequent set collection, reduction transaction database scale;
(4) maximum frequent set collection merger: to each maximum frequent set collection, Duplication between computational item collection, maximum frequent set collection is merged, item collection less scale is integrated into compared with sport and is concentrated as far as possible, and ensure that the consequent concentrated account of merger still has certain relevance; By reduction transaction database scale, reduce common factor number of times, between affairs, get while common factor, adopt binary chop to judge in affairs whether comprise certain project, to improve the efficiency of Mining Maximum Frequent Itemsets, thereby discovery microblogging is propagandized colony.
The inventive method is simple, easy to operate, can accurately find that malice microblogging propagandizes colony, prevents the harmful effect that causes to society and unnecessary economic loss, has actual using value.
Brief description of the drawings
Fig. 1 is flow chart element diagram of the present invention.
Fig. 2 is propagation microblogging transaction database schematic diagram of the present invention.
Fig. 3 is that the present invention propagandizes microblogging transaction database sectional drawing.
Fig. 4 is algorithm of the present invention execution time comparison diagram on Mushroom data set.
Fig. 5 is that algorithm of the present invention is being propagandized execution time comparison diagram on microblogging data set.
Fig. 6 is MFS middle term collection number variation diagram of the present invention.
Fig. 7 is the maximum length variation diagram of MFS middle term collection of the present invention.
Embodiment
Below in conjunction with accompanying drawing, the specific embodiment of the present invention is elaborated.
Provided by Fig. 1, the present invention includes and propagandize microblogging affairs storehouse, Maximum Frequent item set mining and maximum frequent set collection merger part, propagandize microblogging affairs storehouse structure module and be mainly responsible for image data and carry out pre-service, build transaction database D; First Maximum Frequent item set mining module based on binary chop method screening candidate maximum frequent set collection, then excavates maximum frequent set collection MFS from affairs database D based on iteration Intersection set method; Maximum frequent set collection merger module is mainly carried out merger processing to MFS, and with the propagation colony of rediscover as far as possible, concrete steps are:
1), collect and propagandize microblogging sample
Propagandize the collection of microblogging sample and realize initial step of the present invention, the selection of microblogging sample should have correlativity, if certain propagandizes some microbloggings that account once participated in, or with some microbloggings of certain Topic relative, the judgement of microblogging sample should be used for reference existing ripe method of discrimination or expert system, propagandizing microblogging sample collects and has two kinds of methods: a kind of method is to select crawler technology, from microblogging page download webpage, resolve page structure and extract microblogging the information of propagating account; Another kind method is to call the public open platform of microblogging, and the api function that calling microblogging official externally provides obtains the information of microblogging propagation account, in order to be conducive to the discovery to propagandizing colony, also should follow following principle in the time choosing propagation microblogging sample:
A, choose and forward number relatively high popular microblogging;
B, microblogging issuing time span L EssT.LTssT.LT180 days;
According to the Algorithm Analysis condition of propagation account to be excavated, the content that sample is collected should comprise the essential information of microblogging identification number, microblogging account identification number, microblogging account;
2) build transaction database
Propagation colony is pinpointed the problems and is converted into the Maximum Frequent item set mining in data mining, propagandizing on the basis of microblogging sample collection, will propagandize the corresponding affairs of microblogging, the item in the corresponding affairs of account of participation microblogging forwarding, structure transaction database, as shown in Figure 2;
3) screening of the candidate's maximum frequent set collection based on binary chop
Owing to propagandizing, the project that in microblogging affairs storehouse, each transaction packet contains is mostly ten hundreds of, directly in original transaction storehouse, Mining Maximum Frequent Itemsets will affect the efficiency that algorithm is carried out, based on the method for binary chop, can reject fast the non-frequent item in affairs, find out the candidate collection of maximum frequent set collection, reduction affairs storehouse scale, given transaction database D, minimum number of support S, carries out the screening of candidate's maximum frequent set collection, and method is:
(1) affairs in the D of affairs storehouse are sorted from big to small by project number
(2) note frequent item set
, Infrequent item-set closes
; From i=1, travel through in order the each affairs T in D
i(1≤i≤| D|), to affairs T
iin each project u:
If a) u ∈ FI, retains u;
If b) u ∈ NFI, from T
imiddle rejecting u;
If c)
, forward next step to and judge whether u is frequent item;
(3), start to travel through remaining affairs from j=i+1, and utilize binary chop to judge T
j, i<j≤| in D|, whether comprise u, end condition is:
A), in the time that the affairs number that comprises u reaches S, illustrate that u is frequent item, joins u in FI;
B) in the time that remaining affairs number is less than S with the affairs number sum that has comprised u, illustrate that u is non-frequent item, from T
imiddle rejecting u.Be greater than 1 if now comprised the affairs number of u, illustrate that u also appears at T
ioutside affairs in, u is joined in NFI;
(4) rejected in D after the non-frequent item in all affairs, can obtain the affairs storehouse D after reduction
1;
4) the Maximum Frequent item set mining occuring simultaneously based on iteration:
By affairs iteration being got to the mode Mining Maximum Frequent Itemsets of common factor, the affairs storehouse D after given reduction
1, minimum number of support S, the method for Maximum Frequent item set mining is as follows:
(1) by affairs storehouse D
1in affairs sort from big to small by the number of item, to find as early as possible maximum frequent set collection, be reduction affairs storehouse scale, merge the affairs that repeat in affairs storehouse, and to an affairs counting number;
(2) for reducing the number of times of getting common factor, for affairs T
i, 1≤i≤| D
1|-S+1, from i=1, first finds out and has comprised T
ithe affairs set of middle Arbitrary Term
, T
j| T
jat least comprise a project in Ti; J>i), T
isuccessively with T
jget common factor, both common factors are moved into new affairs storehouse D
2, reject T simultaneously
j,
;
(3) for new affairs storehouse D
2in affairs T, if T be by be not less than S affairs get occur simultaneously and obtain, T is moved in Maximum Frequent candidate set MFCS, reject T at D simultaneously
2in subtransaction;
(4) if new affairs storehouse D
2in residue affairs number be less than S, finish affairs storehouse D
2processing, turn back to affairs storehouse, upper strata; Otherwise, to D
2carry out again this process since the 1st step;
(5) as affairs storehouse D
1in remaining number of transactions while being less than S, i.e. i>|D
1|-S+1, finishes current affairs storehouse D
1processing;
(6) the item collection in MFCS is merged and rejects non-maximum frequent set collection simultaneously, last result is required maximum frequent set collection set MFS;
5) maximum frequent set collection merger:
Due to the restriction of minimum number of support, make in MFS maximum frequent set collection scale less, and between some collection, there is a large amount of crowded items, account groups of these collection representatives are probably subordinated to same propagation colony, for addressing this problem, reflect two similaritys between item collection by Duplication, establish a collection X
1, X
2∈ MFS, by X
1and X
2duplication be designated as:
In above formula, | X
1∩ X
2| represent X
1with X
2crowded item object number, Min (| X
1|, | X
2|) expression scale less item concentrated project number, collection merger method be:
(1) the maximum frequent set collection in MFS is sorted from big to small by the number of project;
(2) the each maximum frequent set collection in traversal MFS, from i=1, right
if, ORate (X
i, X
j)>=minOR, i<j≤| MFS|, by X
iand X
junion add in new set MMFS, reject X simultaneously
j;
(3) the item collection in MMFS is repeated to above two steps;
(4), in the time that the Duplication of any two item collection in MMFS is less than minOR, finish.
The inventive method is simple, easy to operate, and through practical probation, shows that method is reliable and stable, has actual using value, and interrelated data is as follows:
1) data set
Using Sina's microblogging as research platform, taking 81 microbloggings with propagation suspicion as research object, the account quantity of its forwarding of actual participation is 380,726 (not containing the accounts that repeatedly participate in forwarding), the project number of average every affairs is 6,286, these microbloggings belong to advertisement marketing class mostly, likely exist multiple propagation colony to participate in its communication process.Utilize reptile program to crawl and participate in all account identification (UID) that these microbloggings forward, and store in transaction database, the form of partial data as shown in Figure 3.
In order to verify that algorithm of the present invention (hereinafter to be referred as IIA) is applied to the efficiency of Maximum Frequent item set mining, carries out performance test to classical Mushroom data set, and compares with known method.This data set has comprised 8,124 records, and every records 23 items, has recorded 23 attributes of mushroom.
2) Performance Evaluation
First the performance of the method for the invention is assessed, experimental situation is 4G internal memory, 2.0GHz double-core Duo T5800CPU, Windows732 bit manipulation system, realizes this algorithm with Java, and compares with classical MAFIA algorithm and DFMFI algorithm respectively.
Fig. 4 is the implementation status of three kinds of algorithms under the different supports of Mushroom data centralization, can find out that the efficiency of this method is apparently higher than other two kinds of algorithms, even execution efficiency also has superiority in the situation that minimum support is very low.Fig. 5 is three kinds of algorithms implementation status on propagation microblogging data set, can find out that the execution efficiency of this method is the highest.
3) parameter threshold is selected
Fig. 6, Fig. 7 assemble fruit from the maximum frequent set of propagandizing the discovery of microblogging data centralization under the minimum number of support of difference, and Fig. 6 and Fig. 7 represent that respectively concentrated of maximum frequent set collects number and maximum frequent set is concentrated the variation of a maximum length collecting with minimum number of support.Can find in conjunction with research background of the present invention, it is larger that minSup (minimum number of support) sets, and it is larger that the account colony of discovery propagandizes suspicion, but population size and quantity also can reduce thereupon; Otherwise it is less that minSup sets, it is less that the account colony of discovery propagandizes suspicion, but population size and quantity can increase.For this reason, need to set a rational threshold value to minSup, to find colony of certain scale and that propagation suspicion is higher.
On the other hand, in the time that the item collection that maximum frequent set is concentrated carries out merger, the setting of minOR also will directly affect the scale that merges consequent collection.By the continuous analysis to data, minOR is set as to 50%, when the project that exceedes half when two item collection is identical, merged.
In order further to determine the value of minSup, table 1 has been listed respectively minSup=3, and 4,5 o'clock to the result after the merger of maximum frequent set collection, sorts by the consequent collection length of merger, has only listed front 8 item collection (doubtful propagation colony) here.As can be seen from the table, in the time of minSup=3 and 5, except first collection is on a grand scale, other collection scale is all very little; And in the time of minSup=4, a collection scale does not sharply change, and suitable scale, illustrate that value is relatively reasonable.。
Maximum frequent set collection merger result under the different numbers of support of table 1
Sequence number | minSup=3 | minSup=4 | minSup=5 |
1 | 14,863 | 2,623 | 963 |
2 | 311 | 1,755 | 65 |
3 | 156 | 688 | 29 |
4 | 77 | 410 | 19 |
5 | 59 | 129 | 9 |
6 | 56 | 98 | 9 |
7 | 55 | 82 | 7 |
8 | 55 | 54 | 5 |
4) accuracy rate analysis
In order to verify the accuracy rate of propagandizing colony's discovery algorithm of the present invention, the actual account proportion of propagandizing in the propagation colony finding, in conjunction with the accuracy rate of the existing propagation account recognition methods based on many signature analysises and artificial mask method comprehensive verification result.Suppose that propagation colony to be verified is H, first utilize the existing propagation account recognition methods based on many signature analysises to differentiate each account, the propagation account aggregation obtaining is designated as H
1; Then, adopt the method for artificial mark to differentiate remaining account, the propagation account aggregation obtaining is designated as H
2, the accuracy rate computing formula of propagandizing the H of colony is:
In above formula, | H| represents the account base in H, | H
1|+| H
2| represent propagation account number actual in H.MinSup=4 and population size in his-and-hers watches 1 (collection length) are greater than 100 partial mass to be verified, concrete outcome is as shown in table 2.
Table 2 is propagandized the accuracy rate (minSup=4) that colony finds
Sequence number | |H 1| | |H 2| | |H| | Precision |
1 | 2,016 | 451 | 2,623 | 94.1% |
2 | 1,465 | 163 | 1,755 | 92.8% |
3 | 571 | 78 | 688 | 94.3% |
4 | 354 | 33 | 410 | 94.4% |
5 | 109 | 10 | 129 | 92.2% |
From table 2, can see, each that find for this method is propagandized colony, and the actual shared ratio of account of propagandizing, all higher than 90%, shows that it (is H that this method can identify more hidden propagation account
2), and these accounts participate in propagandizing but the huge propagation large size of influence power more once in a while.As can be seen here, the present invention has actual using value, and economic and social benefit is huge.
Claims (3)
1. the microblogging based on Maximum Frequent item set mining is propagandized colony's discover method, it is characterized in that, comprises the steps:
(1) propagandizing microblogging sample collects: taking the correlativity of propagandizing microblogging as clue, obtain and participate in propagandizing the account aggregation that microblogging is propagated based on crawler technology or the public open platform of microblogging;
(2) transaction database builds: taking single microblogging as affairs, the account that participates in microblogging propagation is item, builds and propagandizes microblogging transaction database;
(3) Maximum Frequent item set mining: to the each affairs in the corresponding transaction database of microblogging group to be detected, utilize iteration common factor method to find out the maximum frequent set collection comprising in all affairs, obtain the set of some maximum frequent set collection;
Owing to propagandizing, the project that in microblogging affairs storehouse, each transaction packet contains is mostly ten hundreds of, directly in original transaction database, Mining Maximum Frequent Itemsets will affect the efficiency that algorithm is carried out, utilize binary chop, reject fast the non-frequent item in affairs, find out the candidate collection of maximum frequent set collection, reduction transaction database scale;
(4) maximum frequent set collection merger: to each maximum frequent set collection, Duplication between computational item collection, maximum frequent set collection is merged, item collection less scale is integrated into compared with sport and is concentrated as far as possible, and ensure that the consequent concentrated account of merger still has certain relevance; By reduction transaction database scale, reduce common factor number of times, between affairs, get while common factor, adopt binary chop to judge in affairs whether comprise certain project, to improve the efficiency of Mining Maximum Frequent Itemsets, thereby discovery microblogging is propagandized colony.
2. the microblogging based on Maximum Frequent item set mining according to claim 1 is propagandized colony's discover method, it is characterized in that, comprise and propagandize microblogging affairs storehouse, Maximum Frequent item set mining and maximum frequent set collection merger part, propagandize microblogging affairs storehouse structure module and be mainly responsible for image data and carry out pre-service, build transaction database D; First Maximum Frequent item set mining module based on binary chop method screening candidate maximum frequent set collection, then excavates maximum frequent set collection MFS from affairs database D based on iteration Intersection set method; Maximum frequent set collection merger module is mainly carried out merger processing to MFS, the propagation colony of rediscover, and concrete steps are:
1), collect and propagandize microblogging sample
Propagandize the collection of microblogging sample and realize initial step of the present invention, the selection of microblogging sample should have correlativity, if certain propagandizes some microbloggings that account once participated in, or with some microbloggings of certain Topic relative, the judgement of microblogging sample should be used for reference existing ripe method of discrimination or expert system, propagandizing microblogging sample collects and has two kinds of methods: a kind of method is to select crawler technology, from microblogging page download webpage, resolve page structure and extract microblogging the information of propagating account; Another kind method is to call the public open platform of microblogging, and the api function that calling microblogging official externally provides obtains the information of microblogging propagation account;
According to the Algorithm Analysis condition of propagation account to be excavated, the content that sample is collected should comprise the essential information of microblogging identification number, microblogging account identification number, microblogging account;
2) build transaction database
Propagation colony is pinpointed the problems and is converted into the Maximum Frequent item set mining in data mining, propagandizing on the basis of microblogging sample collection, will propagandize the corresponding affairs of microblogging, the item in the corresponding affairs of account of participation microblogging forwarding, structure transaction database, as shown in Figure 2;
3) screening of the candidate's maximum frequent set collection based on binary chop
Owing to propagandizing, the project that in microblogging affairs storehouse, each transaction packet contains is mostly ten hundreds of, directly in original transaction storehouse, Mining Maximum Frequent Itemsets will affect the efficiency that algorithm is carried out, based on the method for binary chop, can reject fast the non-frequent item in affairs, find out the candidate collection of maximum frequent set collection, reduction affairs storehouse scale, given transaction database D, minimum number of support S, carries out the screening of candidate's maximum frequent set collection, and method is:
(1) affairs in the D of affairs storehouse are sorted from big to small by project number
(2) note frequent item set
, Infrequent item-set closes
; From i=1, travel through in order the each affairs T in D
i(1≤i≤| D|), to affairs T
iin each project u:
If a) u ∈ FI, retains u;
If b) u ∈ NFI, from T
imiddle rejecting u;
If c)
, forward next step to and judge whether u is frequent item;
(3), start to travel through remaining affairs from j=i+1, and utilize binary chop to judge T
j, i<j≤| in D|, whether comprise u, end condition is:
A), in the time that the affairs number that comprises u reaches S, illustrate that u is frequent item, joins u in FI;
B) in the time that remaining affairs number is less than S with the affairs number sum that has comprised u, illustrate that u is non-frequent item, from T
imiddle rejecting u, is greater than 1 if now comprised the affairs number of u, illustrates that u also appears at T
ioutside affairs in, u is joined in NFI;
(4) rejected in D after the non-frequent item in all affairs, can obtain the affairs storehouse D after reduction
1;
4) the Maximum Frequent item set mining occuring simultaneously based on iteration:
By affairs iteration being got to the mode Mining Maximum Frequent Itemsets of common factor, the affairs storehouse D after given reduction
1, minimum number of support S, the method for Maximum Frequent item set mining is as follows:
(1) by affairs storehouse D
1in affairs sort from big to small by the number of item, to find as early as possible maximum frequent set collection, be reduction affairs storehouse scale, merge the affairs that repeat in affairs storehouse, and to an affairs counting number;
(2) for reducing the number of times of getting common factor, for affairs T
i, 1≤i≤| D
1|-S+1, from i=1, first finds out and has comprised T
ithe affairs set of middle Arbitrary Term
, T
j| T
jat least comprise T
iin a project; J>i), T
isuccessively with T
jget common factor, both common factors are moved into new affairs storehouse D
2, reject T simultaneously
j,
;
(3) for new affairs storehouse D
2in affairs T, if T be by be not less than S affairs get occur simultaneously and obtain, T is moved in Maximum Frequent candidate set MFCS, reject T at D simultaneously
2in subtransaction;
(4) if new affairs storehouse D
2in residue affairs number be less than S, finish affairs storehouse D
2processing, turn back to affairs storehouse, upper strata; Otherwise, to D
2carry out again this process since the 1st step;
(5) as affairs storehouse D
1in remaining number of transactions while being less than S, i.e. i>|D
1|-S+1, finishes current affairs storehouse D
1processing;
(6) the item collection in MFCS is merged and rejects non-maximum frequent set collection simultaneously, last result is required maximum frequent set collection set MFS;
5) maximum frequent set collection merger:
Due to the restriction of minimum number of support, make in MFS maximum frequent set collection scale less, and between some collection, there is a large amount of crowded items, account groups of these collection representatives are probably subordinated to same propagation colony, for addressing this problem, reflect two similaritys between item collection by Duplication, establish a collection X
1, X
2∈ MFS, by X
1and X
2duplication be designated as:
In above formula, | X
1∩ X
2| represent X
1with X
2crowded item object number, Min (| X
1|, | X
2|) expression scale less item concentrated project number, collection merger method be:
(1) the maximum frequent set collection in MFS is sorted from big to small by the number of project;
(2) the each maximum frequent set collection in traversal MFS, from i=1, right
if, ORate (X
i, X
j)>=minOR, i<j≤| MFS|, by X
iand X
junion add in new set MMFS, reject X simultaneously
j;
(3) the item collection in MMFS is repeated to above two steps;
(4), in the time that the Duplication of any two item collection in MMFS is less than minOR, finish.
3. the microblogging based on Maximum Frequent item set mining according to claim 2 is propagandized colony's discover method, it is characterized in that described step 1) in, collect propagation microblogging sample and should meet following condition:
A, choose and forward number relatively high popular microblogging;
B, microblogging issuing time span L EssT.LTssT.LT180 days; Be beneficial to the discovery to propagandizing colony.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410188004.7A CN103927398B (en) | 2014-05-07 | 2014-05-07 | The microblogging excavated based on maximum frequent itemsets propagandizes colony's discovery method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410188004.7A CN103927398B (en) | 2014-05-07 | 2014-05-07 | The microblogging excavated based on maximum frequent itemsets propagandizes colony's discovery method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103927398A true CN103927398A (en) | 2014-07-16 |
CN103927398B CN103927398B (en) | 2016-12-28 |
Family
ID=51145617
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410188004.7A Expired - Fee Related CN103927398B (en) | 2014-05-07 | 2014-05-07 | The microblogging excavated based on maximum frequent itemsets propagandizes colony's discovery method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103927398B (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104516978A (en) * | 2014-12-31 | 2015-04-15 | 天津南大通用数据技术股份有限公司 | Algorithm for compressing middle candidate frequent item sets in field of database intrusion detection |
CN104778475A (en) * | 2015-03-30 | 2015-07-15 | 南京邮电大学 | Image classification method based on maximum frequent visual word of annular region |
CN104954360A (en) * | 2015-04-17 | 2015-09-30 | 腾讯科技(深圳)有限公司 | Method and device for blocking shared content |
CN104991956A (en) * | 2015-07-21 | 2015-10-21 | 中国人民解放军信息工程大学 | Microblog transmission group division and account activeness evaluation method based on theme possibility model |
CN105224593A (en) * | 2015-08-25 | 2016-01-06 | 中国人民解放军信息工程大学 | Frequent co-occurrence account method for digging in a kind of of short duration online affairs |
CN105530265A (en) * | 2016-01-28 | 2016-04-27 | 李青山 | Mobile Internet malicious application detection method based on frequent itemset description |
CN105681312A (en) * | 2016-01-28 | 2016-06-15 | 李青山 | Mobile internet exceptional user detection method based on frequent itemset mining |
CN105808988A (en) * | 2014-12-31 | 2016-07-27 | 阿里巴巴集团控股有限公司 | Method and device for identifying exceptional account |
CN106484679A (en) * | 2016-10-20 | 2017-03-08 | 北京邮电大学 | A kind of false review information recognition methodss being applied on consumption platform and device |
CN106533893A (en) * | 2015-09-09 | 2017-03-22 | 腾讯科技(深圳)有限公司 | Message processing method and system |
CN106650273A (en) * | 2016-12-28 | 2017-05-10 | 东方网力科技股份有限公司 | Behavior prediction method and device |
CN106921565A (en) * | 2017-03-30 | 2017-07-04 | 北京奇艺世纪科技有限公司 | A kind of method and device of junk information identification |
CN107870956A (en) * | 2016-09-28 | 2018-04-03 | 腾讯科技(深圳)有限公司 | A kind of effective item set mining method, apparatus and data processing equipment |
CN109783531A (en) * | 2018-12-07 | 2019-05-21 | 北京明略软件系统有限公司 | A kind of relationship discovery method and apparatus, computer readable storage medium |
CN110033302A (en) * | 2014-10-28 | 2019-07-19 | 阿里巴巴集团控股有限公司 | The recognition methods of malice account and device |
CN110874786A (en) * | 2019-10-11 | 2020-03-10 | 支付宝(杭州)信息技术有限公司 | False transaction group identification method, equipment and computer readable medium |
CN112115305A (en) * | 2019-06-21 | 2020-12-22 | 杭州海康威视数字技术股份有限公司 | Group identification method and device and computer readable storage medium |
TWI718643B (en) * | 2019-01-17 | 2021-02-11 | 開曼群島商創新先進技術有限公司 | Method and device for identifying abnormal groups |
CN112948864A (en) * | 2021-03-19 | 2021-06-11 | 西安电子科技大学 | Verifiable PPFIM method based on vertical partition database |
CN113254755A (en) * | 2021-07-19 | 2021-08-13 | 南京烽火星空通信发展有限公司 | Public opinion parallel association mining method based on distributed framework |
US11620344B2 (en) | 2020-03-04 | 2023-04-04 | Honeywell International Inc. | Frequent item set tracking |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102111296A (en) * | 2011-01-10 | 2011-06-29 | 浪潮通信信息系统有限公司 | Mining method for communication alarm association rule based on maximal frequent item set |
US20130332432A1 (en) * | 2012-06-12 | 2013-12-12 | International Business Machines Corporation | Closed itemset mining using difference update |
-
2014
- 2014-05-07 CN CN201410188004.7A patent/CN103927398B/en not_active Expired - Fee Related
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102111296A (en) * | 2011-01-10 | 2011-06-29 | 浪潮通信信息系统有限公司 | Mining method for communication alarm association rule based on maximal frequent item set |
US20130332432A1 (en) * | 2012-06-12 | 2013-12-12 | International Business Machines Corporation | Closed itemset mining using difference update |
Non-Patent Citations (2)
Title |
---|
丁兆云等: "微博中基于统计特征与双向投票的垃圾用户发现", 《计算机研究与发展》 * |
陈波等: "挖掘最大频繁项集的事务集迭代算法", 《计算机工程与应用》 * |
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110033302A (en) * | 2014-10-28 | 2019-07-19 | 阿里巴巴集团控股有限公司 | The recognition methods of malice account and device |
CN104516978B (en) * | 2014-12-31 | 2018-11-27 | 天津南大通用数据技术股份有限公司 | The method of compression intermediate candidate frequent item set for Database Intrusion Detection field |
CN104516978A (en) * | 2014-12-31 | 2015-04-15 | 天津南大通用数据技术股份有限公司 | Algorithm for compressing middle candidate frequent item sets in field of database intrusion detection |
CN105808988B (en) * | 2014-12-31 | 2020-07-03 | 阿里巴巴集团控股有限公司 | Method and device for identifying abnormal account |
CN105808988A (en) * | 2014-12-31 | 2016-07-27 | 阿里巴巴集团控股有限公司 | Method and device for identifying exceptional account |
CN104778475A (en) * | 2015-03-30 | 2015-07-15 | 南京邮电大学 | Image classification method based on maximum frequent visual word of annular region |
CN104778475B (en) * | 2015-03-30 | 2018-01-19 | 南京邮电大学 | A kind of image classification method based on annular region Maximum Frequent vision word |
CN104954360A (en) * | 2015-04-17 | 2015-09-30 | 腾讯科技(深圳)有限公司 | Method and device for blocking shared content |
CN104991956A (en) * | 2015-07-21 | 2015-10-21 | 中国人民解放军信息工程大学 | Microblog transmission group division and account activeness evaluation method based on theme possibility model |
CN104991956B (en) * | 2015-07-21 | 2018-07-31 | 中国人民解放军信息工程大学 | Microblogging based on theme probabilistic model is propagated group and is divided and account liveness appraisal procedure |
CN105224593B (en) * | 2015-08-25 | 2019-08-16 | 中国人民解放军信息工程大学 | Frequent co-occurrence account method for digging in the of short duration online affairs of one kind |
CN105224593A (en) * | 2015-08-25 | 2016-01-06 | 中国人民解放军信息工程大学 | Frequent co-occurrence account method for digging in a kind of of short duration online affairs |
CN106533893A (en) * | 2015-09-09 | 2017-03-22 | 腾讯科技(深圳)有限公司 | Message processing method and system |
CN106533893B (en) * | 2015-09-09 | 2020-11-27 | 腾讯科技(深圳)有限公司 | Message processing method and system |
CN105681312B (en) * | 2016-01-28 | 2019-03-05 | 李青山 | A kind of mobile Internet abnormal user detection method based on frequent item set mining |
CN105530265A (en) * | 2016-01-28 | 2016-04-27 | 李青山 | Mobile Internet malicious application detection method based on frequent itemset description |
CN105681312A (en) * | 2016-01-28 | 2016-06-15 | 李青山 | Mobile internet exceptional user detection method based on frequent itemset mining |
CN107870956B (en) * | 2016-09-28 | 2021-04-27 | 腾讯科技(深圳)有限公司 | High-utility item set mining method and device and data processing equipment |
CN107870956A (en) * | 2016-09-28 | 2018-04-03 | 腾讯科技(深圳)有限公司 | A kind of effective item set mining method, apparatus and data processing equipment |
CN106484679B (en) * | 2016-10-20 | 2020-02-11 | 北京邮电大学 | False comment information identification method and device applied to consumption platform |
CN106484679A (en) * | 2016-10-20 | 2017-03-08 | 北京邮电大学 | A kind of false review information recognition methodss being applied on consumption platform and device |
CN106650273B (en) * | 2016-12-28 | 2019-08-23 | 东方网力科技股份有限公司 | A kind of behavior prediction method and apparatus |
CN106650273A (en) * | 2016-12-28 | 2017-05-10 | 东方网力科技股份有限公司 | Behavior prediction method and device |
CN106921565B (en) * | 2017-03-30 | 2019-12-13 | 北京奇艺世纪科技有限公司 | Junk information identification method and device |
CN106921565A (en) * | 2017-03-30 | 2017-07-04 | 北京奇艺世纪科技有限公司 | A kind of method and device of junk information identification |
CN109783531A (en) * | 2018-12-07 | 2019-05-21 | 北京明略软件系统有限公司 | A kind of relationship discovery method and apparatus, computer readable storage medium |
TWI718643B (en) * | 2019-01-17 | 2021-02-11 | 開曼群島商創新先進技術有限公司 | Method and device for identifying abnormal groups |
CN112115305A (en) * | 2019-06-21 | 2020-12-22 | 杭州海康威视数字技术股份有限公司 | Group identification method and device and computer readable storage medium |
CN112115305B (en) * | 2019-06-21 | 2024-04-09 | 杭州海康威视数字技术股份有限公司 | Group identification method apparatus and computer-readable storage medium |
CN110874786A (en) * | 2019-10-11 | 2020-03-10 | 支付宝(杭州)信息技术有限公司 | False transaction group identification method, equipment and computer readable medium |
CN110874786B (en) * | 2019-10-11 | 2022-10-18 | 支付宝(杭州)信息技术有限公司 | False transaction group identification method, device and computer readable medium |
US11620344B2 (en) | 2020-03-04 | 2023-04-04 | Honeywell International Inc. | Frequent item set tracking |
CN112948864A (en) * | 2021-03-19 | 2021-06-11 | 西安电子科技大学 | Verifiable PPFIM method based on vertical partition database |
CN112948864B (en) * | 2021-03-19 | 2022-12-06 | 西安电子科技大学 | Verifiable PPFIM method based on vertical partition database |
CN113254755A (en) * | 2021-07-19 | 2021-08-13 | 南京烽火星空通信发展有限公司 | Public opinion parallel association mining method based on distributed framework |
Also Published As
Publication number | Publication date |
---|---|
CN103927398B (en) | 2016-12-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103927398B (en) | The microblogging excavated based on maximum frequent itemsets propagandizes colony's discovery method | |
CN103176983B (en) | A kind of event method for early warning based on internet information | |
CN103116605B (en) | A kind of microblog hot event real-time detection method based on monitoring subnet and system | |
Xia et al. | Phishing detection on ethereum via attributed ego-graph embedding | |
CN110457404A (en) | Social media account-classification method based on complex heterogeneous network | |
Qu et al. | Efficient online summarization of large-scale dynamic networks | |
Li et al. | Mining blackhole and volcano patterns in directed graphs: A general approach | |
CN106980651B (en) | Crawling seed list updating method and device based on knowledge graph | |
Guo et al. | GroupMe: Supporting group formation with mobile sensing and social graph mining | |
CN107239512A (en) | The microblogging comment spam recognition methods of relational network figure is commented in a kind of combination | |
CN103488683B (en) | Microblog data management system and implementation method thereof | |
CN113422761A (en) | Malicious social user detection method based on counterstudy | |
Liu | Study on application of apriori algorithm in data mining | |
CN109597926A (en) | A kind of information acquisition method and system based on social media emergency event | |
Xu et al. | FaNDS: Fake news detection system using energy flow | |
Paraschiv et al. | A unified graph-based approach to disinformation detection using contextual and semantic relations | |
CN106681980A (en) | Method and device for analyzing junk short messages | |
CN105589916B (en) | Method for extracting explicit and implicit interest knowledge | |
CN106411704A (en) | Distributed junk short message recognition method | |
CN110287237A (en) | One kind analyzing efficient corporations' data digging method based on social network structure | |
Zhang et al. | Efficient top-k edge structural diversity search | |
CN110851684B (en) | Social topic influence recognition method and device based on ternary association graph | |
CN105989176A (en) | Data processing method and device | |
Tyagi et al. | Twitter bot detection using machine learning models | |
CN116723005A (en) | Method and system for tracking malicious code implicit information under polymorphic hiding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20161228 |