Summary of the invention
To overcome the inability of the prior art to analyze the sentiment of Chinese microblogs, the invention provides a Chinese microblog sentiment analysis method that fuses explicit and implicit features and offers higher flexibility and better reliability.
The technical solution adopted by the present invention to solve this technical problem is:
A Chinese microblog sentiment analysis method fusing explicit and implicit features, the method comprising the following steps:
1) microblog explicit feature processing, specifically comprising the following processes:
1.1) emoticon processing: build an emotion symbol library from the emoticons provided by the microblog platform, and divide emotions into seven categories according to a 7-class emotion classification method: happiness, liking, anger, sadness, fear, disgust and surprise; unify the 150 most frequently occurring emoticons, that is, first set up an emotion symbol table and put the 150 emoticons into it; determine by table lookup whether an emotion symbol belongs to the emotion symbol table, and if so, extract the emotion symbol, convert it to its emotion category, and write it into the emotion feature table;
1.2) emotion word processing: first set up an emotion vocabulary based on a sentiment dictionary and put the emotion words into it; after word segmentation of the text, determine by table lookup whether each word is an emotion word, and if so, extract it and write it into the emotion feature table;
then set up an emotion vocabulary based on network words and put the network words found in microblogs into it, so that the emotion category of part of the microblog content can be determined by table lookup;
2) microblog implicit feature processing: create initial emotion clusters based on frequent itemsets, the texts of each initial emotion cluster containing the corresponding frequent itemset; adopt the HowNet Chinese semantic similarity model and separate the initial emotion clusters according to the maximum semantic membership degree principle; finally, by defining an inter-cluster semantic similarity matrix, complete the agglomerative hierarchical clustering of the microblog emotion clusters and optimize them to obtain the final emotion clusters, thereby realizing microblog sentiment analysis.
Further, step 2) comprises the following processes:
2.1) mine frequent word sets using Apriori, the association rule algorithm for frequent itemset mining;
the frequent itemsets are used to construct the initial emotion clusters: microblogs containing the same frequent sentiment-tendency word set are placed in one cluster, giving the initial emotion clusters based on frequent itemsets; at the same time, the frequent itemset describing an initial emotion cluster serves as the temporary label of that cluster, so that the frequent itemset extracted for each initial emotion cluster represents its emotional semantics;
2.2) initial cluster overlap reduction based on microblog semantic membership degree:
every microblog is to belong to exactly one emotion cluster; the emotional semantic membership degree of the texts in the overlap between clusters to each initial emotion cluster is computed, and the texts are finally assigned to clusters by the maximum semantic membership degree principle; the initial clusters whose size is 0 after separation are then deleted as empty clusters, and the initial clusters remaining after overlap reduction are called candidate emotion clusters;
2.3) agglomerative emotion clustering based on semantic similarity: agglomerative hierarchical clustering is performed on the candidate emotion clusters to merge emotion clusters.
Further, in step 2.1),
Definition 1: for an itemset X in database E, if the frequency with which X occurs in E is greater than a preset ratio, X is called a frequent itemset of E, and this preset ratio is called the minimum support;
if a text is regarded as a transaction and the words of the text as the items of that transaction, a text d can be expressed as $d = \langle t_1, t_2, \ldots, t_n \rangle$, where n is the number of feature words contained in d;
Definition 2: for a word set W of text set D, if the support of W in D satisfies $s(W) \ge \mathit{min\_s}$, W is called a frequent word set of D, and min_s is the global minimum support;
scan the text set D, count the occurrences of each candidate itemset using the word-frequency tendency degree, and collect the itemsets that meet the minimum support min_s as frequent itemsets; use the frequent k-itemsets produced to construct strong association rules and to construct candidate (k+1)-itemsets, and iterate until the set of candidate (k+1)-itemsets is empty.
Further, in step 2.2),
Definition 3: if microblog $doc_j$ is assigned to initial emotion cluster $C_i$, then $doc_j$ is said to support cluster $C_i$;
Definition 4: let $D_i$ and $D_j$ be the sets of microblogs supporting clusters $C_i$ and $C_j$ respectively; if $D_i \cap D_j \neq \emptyset$, then clusters $C_i$ and $C_j$ are said to overlap;
Definition 5: microblog emotional semantic membership degree. The present invention defines the emotional semantic membership function of microblog $doc_j$ to cluster $C_i$ over the following quantities: the cluster frequent 1-itemset $\{f_{i1}, f_{i2}, \ldots, f_{im}\}$ denotes the emotion feature items of initial cluster $C_i$; $\{t_{j1}, t_{j2}, \ldots, t_{jn}\}$ denotes the feature items of microblog text $doc_j$ in initial cluster $C_i$; $Sim(f_{ik}, t_{jl})$ is the semantic similarity between cluster feature item $f_{ik}$ and text feature item $t_{jl}$ as defined in HowNet; n is the number of feature items of microblog text $doc_j$, and m is the number of cluster feature items.
Further, in step 2.3),
Definition 6: cluster feature vector. For a candidate emotion cluster $CT_i$, the frequent 1-itemsets of $CT_i$ are mined; they form the cluster feature vector of that cluster;
Definition 7: cluster similarity matrix. Let the cluster feature vectors of two different candidate emotion clusters $CT_i$ and $CT_j$ contain n and m feature words respectively; the cluster semantic similarity matrix formed by the feature items of $CT_i$ and $CT_j$ is then defined in the manner of Table 1;
Table 1
Definition 8: emotion cluster semantic similarity. The k feature item pairs with the largest semantic similarity in the similarity matrix are selected to compute the similarity between candidate emotion clusters; they are denoted $\{sim(t_i, t_j)_1, sim(t_i, t_j)_2, \ldots, sim(t_i, t_j)_k\}$, and the semantic similarity of the candidate emotion clusters is defined from these k values.
The agglomerative emotion clustering process based on semantic similarity is as follows:
Step 1: extract the feature vector of each candidate emotion cluster and compute the semantic similarity of the candidate emotion clusters;
Step 2: build the semantic similarity matrix of the candidate emotion clusters; by the definition of cluster similarity, $sim(CT_i, CT_j) = sim(CT_j, CT_i)$, so this similarity matrix is symmetric;
Step 3: select the maximum inter-cluster similarity from the similarity matrix, denoted $\max\{sim(CT_i, CT_j)\}$; if $\max\{sim(CT_i, CT_j)\} \le \lambda$, go to Step 6; otherwise, go to Step 4;
Step 4: if $\max\{sim(CT_i, CT_j)\} > \lambda$, the similarity between $CT_i$ and $CT_j$ is large, so the two clusters $CT_i$ and $CT_j$ are merged into a new cluster $CT_i'$, the original $CT_i$ is deleted, the cluster feature vector is recomputed, and the semantic similarity matrix is updated;
Step 5: if the number of rows or columns of the inter-cluster semantic similarity matrix is less than or equal to the preset minimum number of clusters μ, go to Step 6; otherwise clustering has not finished, so return to Step 3;
Step 6: agglomerative hierarchical clustering ends, yielding the emotion clusters CT'.
In step 1.2), a set of negation words is also collected; it is checked whether an emotion word is preceded by a negation word, and if so, the negation word and the emotion word are written into the emotion feature table together.
The technical concept of the present invention is as follows. Based on emoticons, and combined with the sentiment analysis vocabularies (both open resources) provided by the Chinese emotion vocabulary ontology of the Information Retrieval Laboratory of Dalian University of Technology and by HowNet, the invention builds an emoticon library, an emotion word dictionary and a network word dictionary. Explicit emotion features are extracted from these resources and fused with implicit semantic features, and sentiment analysis is performed by clustering, on the premise that microblog texts with similar emotions are highly similar while texts with different emotions are less similar. Clustering requires no training process and no manual pre-labeling of documents; it relies directly on frequent itemsets and a semantic clustering algorithm, and therefore offers good flexibility and automatic processing capability.
The beneficial effects of the present invention are mainly higher flexibility and better reliability.
Embodiment
The invention is further described below with reference to the accompanying drawings.
With reference to Fig. 1 to Fig. 3, a Chinese microblog sentiment analysis method fusing explicit and implicit features is provided. Microblog emoticons are an intuitive, explicit emotion feature, whereas content semantics are implicit yet decisive for judging emotion; the present invention therefore proposes a microblog sentiment analysis method that fuses both kinds of features. First, a sentiment analysis dictionary, a network word dictionary and an emoticon library are built and the frequent feature word sets of the microblogs are defined; from the frequent feature word sets, the initial emotion clusters of the microblogs are obtained using maximal frequent itemsets. For the case where texts overlap between initial clusters, an inter-cluster overlap reduction algorithm based on the semantic membership degree of expanded short texts is proposed, yielding completely separated initial clusters. Finally, an agglomerative emotion clustering method is given based on the cluster semantic similarity matrix.
The Chinese microblog sentiment analysis method of the present invention comprises the following steps:
1) Microblog explicit feature processing
1.1) Emoticon processing
On English-language microblogs, emoticons are usually typed in by the users themselves, e.g. ":)". The emoticons provided by the Sina Weibo platform are represented as text enclosed in brackets; for example, the laughing emoticon corresponds to the text "[laughing heartily]". Emoticons are widely used in microblogs: in a random sample of 5000 Sina Weibo posts, 1071 contained emoticons, roughly 21%. A single microblog may contain several emoticons. Fig. 1 gives sampled statistics of the number of microblogs by number of emoticons contained; the results show that the proportion of Sina Weibo users using 1 emoticon is about 62% and the proportion using 2 to 5 emoticons is about 30%, indicating that microblog users prefer to use a single emoticon.
The present invention builds the emotion symbol library from the emoticons provided by Sina Weibo and, according to a 7-class emotion classification method, divides emotions into seven categories: happiness, liking, anger, sadness, fear, disgust and surprise. The 150 most frequently occurring emoticons are unified, that is, an emotion symbol table is first set up and the 150 emoticons are put into it, as shown in Table 2. Table lookup determines whether an emotion symbol belongs to the emotion symbol table; if so, the emotion symbol is extracted, converted to its emotion category, and written into the emotion feature table. Experiments show that unifying emoticons in this way helps produce a better clustering effect and thus more accurate sentiment analysis.
Table 2 Emotion categories and typical emoticons of each category
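The table lookup described in step 1.1) can be sketched as follows. This is only a minimal illustration: the entries of `EMOTICON_TABLE` and the function name `extract_emoticon_features` are hypothetical stand-ins, since the actual table of 150 emoticons is not reproduced here.

```python
import re

# Hypothetical excerpt of the emotion symbol table. The real table would hold
# the 150 most frequent emoticons, each mapped to one of the seven categories.
EMOTICON_TABLE = {
    "[haha]": "happiness",
    "[love]": "liking",
    "[angry]": "anger",
    "[tears]": "sadness",
    "[shocked]": "surprise",
}

# Platform emoticons appear in the text as bracketed tokens.
BRACKETED = re.compile(r"\[[^\[\]]+\]")


def extract_emoticon_features(text):
    """Return the emotion categories of all known emoticons in one microblog."""
    features = []
    for token in BRACKETED.findall(text):
        category = EMOTICON_TABLE.get(token)  # table lookup
        if category is not None:  # unknown bracketed text is ignored
            features.append(category)
    return features
```

Each returned category would then be written into the emotion feature table of the microblog.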
1.2) Emotion word processing
Emotion words best reflect the textual emotion of a microblog, so building the sentiment dictionary and the network word dictionary is the groundwork for judging microblog emotion tendency.
Chinese emotion vocabulary classification: emotion vocabulary is complex and spans many parts of speech, including adjectives, nouns and adverbs. Selecting emotion words by part of speech alone is unsound: nouns such as "rubbish" and "wooden club" carry negative emotion, but most nouns carry no emotional color, and selecting them would reduce classification performance. The present invention adopts the Chinese emotion vocabulary ontology provided by the Information Retrieval Laboratory of Dalian University of Technology, which contains 27467 Chinese emotion words. As shown in Table 3, an emotion vocabulary is first set up from the sentiment dictionary and these emotion words are put into it. After word segmentation of the text, table lookup determines whether each word is an emotion word; if so, the emotion word is extracted and written into the emotion feature table.
Table 3 Emotion classification of the Chinese emotion vocabulary ontology selected by the present invention
In addition, a set of negation words occurring in microblogs, such as "no", "not have", "impossible" and "hard to", is collected; it is checked whether an emotion word is preceded by a negation word, and if so, the negation word and the emotion word are written into the emotion feature table together.
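The emotion word lookup with negation handling can be sketched as follows. The vocabularies are hypothetical English stand-ins for the Chinese resources, word segmentation is assumed to have been done already, and treating only the immediately preceding token as a possible negation word is an assumption of this sketch.

```python
# Hypothetical excerpts; the real vocabulary is the 27467-word ontology resource.
EMOTION_WORDS = {"happy": "happiness", "afraid": "fear", "sad": "sadness"}
NEGATION_WORDS = {"not", "never", "hardly"}


def extract_emotion_word_features(tokens):
    """tokens: one microblog after word segmentation (segmentation not shown)."""
    features = []
    for idx, token in enumerate(tokens):
        category = EMOTION_WORDS.get(token)  # table lookup
        if category is None:
            continue
        # Assumption: only the directly preceding token is checked for negation.
        negated = idx > 0 and tokens[idx - 1] in NEGATION_WORDS
        features.append({
            "word": token,
            "category": category,
            "negation": tokens[idx - 1] if negated else None,
        })
    return features
```

Recording the negation word next to the emotion word leaves it to later steps to decide how a negated emotion is interpreted.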
Network word dictionary creation: microblog language is often inventive, and as the Internet develops new words keep appearing, including homophonic words, abbreviations and Internet slang; the present invention therefore builds a network word dictionary for judging the emotion tendency of microblogs. A total of 141 network words were collected from social networks and organized; each was annotated with its emotion and unified, that is, an emotion vocabulary of network words is first set up and these network words are put into it. Without context, many network words are ambiguous in emotion tendency, so only network words with a clear emotion are retained; some network words and their emotion tendency annotations are shown in Table 4. Likewise, the emotion category of part of the microblog content can be determined directly by table lookup in the network word dictionary.
Table 4 Examples of network words and their emotion tendency annotations
2) Microblog implicit feature processing
FIHC (Frequent Itemset-based Hierarchical Clustering) is a text clustering algorithm that is currently widely used in industry. The algorithm is cluster-centered and uses frequent itemsets directly to measure the degree of cohesion between clusters. It assumes that documents belonging to the same topic share more frequent itemsets and documents belonging to different topics share fewer, and it uses the concept of frequent itemsets to partition the texts.
Both the parts of speech and the semantics of microblog content can be regarded as implicit emotion features of a microblog. The present invention adopts the idea of the FIHC algorithm, "first build clusters, then remove overlap, then merge", and proposes a new method combining frequent itemsets with semantic clustering; the main clustering process is shown in Fig. 2.
The main flow of emotion classification is as follows. First, initial emotion clusters are created based on frequent itemsets; the texts of each initial emotion cluster contain the corresponding frequent itemset, which causes texts to overlap between initial emotion clusters. To eliminate this overlap more precisely, the HowNet Chinese semantic similarity model is adopted and the initial emotion clusters are separated according to the maximum semantic membership degree principle. Finally, by defining an inter-cluster semantic similarity matrix, the agglomerative hierarchical clustering of the microblog emotion clusters is completed and the clusters are optimized to obtain the final emotion clusters, thereby realizing microblog sentiment analysis.
2.1) Obtaining frequent itemsets
Definition 1: for an itemset X in database E, if the frequency with which X occurs in E is greater than a preset ratio, X is called a frequent itemset of E, and this preset ratio is called the minimum support.
If a text is regarded as a transaction and the words of the text as the items of that transaction, a text d can be expressed as $d = \langle t_1, t_2, \ldots, t_n \rangle$.
Definition 2: for a word set W of text set D, if the support of W in D satisfies $s(W) \ge \mathit{min\_s}$, W is called a frequent word set of D, and min_s is the global minimum support.
The present invention mines frequent word sets using Apriori, the association rule algorithm for frequent itemset mining.
Algorithm: Apriori algorithm
Input: microblog data, minimum cluster support min_s
Output: the frequent itemsets in the microblog data
Method:
Step one: scan the text set D, count the occurrences of each candidate itemset using the word-frequency tendency degree, and collect the itemsets that meet the minimum support min_s as frequent itemsets;
Step two: use the frequent k-itemsets produced to construct strong association rules and to construct candidate (k+1)-itemsets, and iterate until the set of candidate (k+1)-itemsets is empty.
Frequent itemsets describe the emotion information in microblogs. The present invention uses the frequent itemsets to construct the initial emotion clusters: microblogs containing the same frequent sentiment-tendency word set are placed in one cluster, giving the initial emotion clusters based on frequent itemsets; at the same time, the frequent itemset describing an initial emotion cluster serves as the temporary label of that cluster, so that the frequent itemset extracted for each initial emotion cluster represents its emotional semantics.
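The frequent word set mining and the construction of initial emotion clusters can be sketched as follows. This is a plain textbook Apriori over microblogs represented as sets of words, with support taken as the fraction of microblogs containing the word set; the function names `apriori` and `initial_clusters` are hypothetical, and the word-frequency tendency weighting mentioned above is not modeled.

```python
from itertools import combinations


def apriori(docs, min_s):
    """docs: list of sets of words; min_s: minimum support as a fraction."""
    n = len(docs)

    def support(itemset):
        return sum(1 for d in docs if itemset <= d) / n

    items = {w for d in docs for w in d}
    level = {frozenset([w]) for w in items if support(frozenset([w])) >= min_s}
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s)
        # Join frequent k-itemsets into candidate (k+1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune candidates with an infrequent k-subset, then check support.
        level = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k))
            and support(c) >= min_s
        }
        k += 1
    return frequent


def initial_clusters(docs, frequent):
    """Map each frequent word set (the cluster label) to the ids of docs containing it."""
    return {s: {i for i, d in enumerate(docs) if s <= d} for s in frequent}
```

Because a microblog may contain several frequent word sets, the clusters returned by `initial_clusters` generally overlap, which is the situation the overlap reduction step addresses.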
2.2) Initial cluster overlap reduction based on microblog semantic membership degree: microblog wording is terse and casual, the same emotion is expressed differently in different microblogs, and one microblog may contain several different emotions, so a large amount of text overlaps between the initial emotion clusters. Sentiment analysis should follow the main emotion of the blogger, so every microblog needs to belong to exactly one emotion cluster.
At the semantic level, the present invention introduces the HowNet semantic base to expand the semantic information, computes the emotional semantic membership degree of the texts in the overlap between clusters to each initial emotion cluster, and finally assigns the texts to clusters by the maximum semantic membership degree principle.
Definition 3: if microblog $doc_j$ is assigned to initial emotion cluster $C_i$, then $doc_j$ is said to support cluster $C_i$.
Definition 4: let $D_i$ and $D_j$ be the sets of microblogs supporting clusters $C_i$ and $C_j$ respectively; if $D_i \cap D_j \neq \emptyset$, then clusters $C_i$ and $C_j$ are said to overlap.
Definition 5: microblog emotional semantic membership degree. The present invention defines the emotional semantic membership function of microblog $doc_j$ to cluster $C_i$ over the following quantities: the cluster frequent 1-itemset $\{f_{i1}, f_{i2}, \ldots, f_{im}\}$ denotes the emotion feature items of initial cluster $C_i$; $\{t_{j1}, t_{j2}, \ldots, t_{jn}\}$ denotes the feature items of microblog text $doc_j$ in initial cluster $C_i$; $Sim(f_{ik}, t_{jl})$ is the semantic similarity between cluster feature item $f_{ik}$ and text feature item $t_{jl}$ as defined in HowNet; n is the number of feature items of microblog text $doc_j$, and m is the number of cluster feature items.
Algorithm: initial cluster overlap reduction based on microblog semantic membership degree
Input: initial clusters with overlap $C_1, C_2, \ldots, C_n$
Output: initial clusters after overlap reduction $C'_1, C'_2, \ldots, C'_n$
Method:
After initial cluster overlap reduction has been performed for every $doc_j$, the initial clusters whose size is 0 after separation are deleted as empty clusters, which finally yields the candidate emotion clusters.
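The overlap reduction can be sketched as follows. Two points are assumptions of this sketch and are not taken from the text: the membership degree is computed as the mean pairwise similarity between the m cluster feature items and the n text feature items, and the HowNet similarity is passed in as a hypothetical callable `sim(f, t)` returning a number.

```python
def membership(cluster_features, doc_features, sim):
    """Assumed form: mean pairwise similarity of cluster and text feature items."""
    if not cluster_features or not doc_features:
        return 0.0
    total = sum(sim(f, t) for f in cluster_features for t in doc_features)
    return total / (len(cluster_features) * len(doc_features))


def remove_overlap(clusters, cluster_features, doc_features, sim):
    """clusters: {cluster_id: set of doc ids}. Returns disjoint, non-empty clusters."""
    # Collect, for every microblog, the clusters it currently supports.
    owners = {}
    for cid, members in clusters.items():
        for doc in members:
            owners.setdefault(doc, []).append(cid)
    result = {cid: set() for cid in clusters}
    for doc, cids in owners.items():
        # Maximum membership principle: keep the doc only in its best cluster.
        best = max(
            cids,
            key=lambda c: membership(cluster_features[c], doc_features[doc], sim),
        )
        result[best].add(doc)
    # Drop clusters left empty after separation.
    return {cid: docs for cid, docs in result.items() if docs}
```

With a toy similarity such as `lambda a, b: 1.0 if a == b else 0.0`, a microblog shared by two clusters ends up only in the cluster whose feature items it matches best; ties go to the first cluster encountered.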
2.3) Agglomerative emotion clustering based on semantic similarity
Removing the overlap between initial emotion clusters yields the candidate emotion clusters for microblog emotion detection, but these emotion clusters can each be subsumed under a few broad emotions, so it is necessary to perform agglomerative hierarchical clustering on the candidate emotion clusters and merge them.
Definition 6: cluster feature vector. For a candidate emotion cluster $CT_i$, the frequent 1-itemsets of $CT_i$ are mined; they form the cluster feature vector of that cluster.
Definition 7: cluster similarity matrix. For the cluster feature vectors of two different candidate emotion clusters $CT_i$ and $CT_j$, the cluster semantic similarity matrix formed by the feature items of $CT_i$ and $CT_j$ is defined in the manner of Table 5.
Table 5 Definition of the cluster semantic similarity matrix
Definition 8: emotion cluster semantic similarity. To keep too many non-key words from adding noise to the inter-cluster semantic similarity, only the k feature item pairs with the largest semantic similarity in the similarity matrix are selected to compute the similarity between candidate emotion clusters; they are denoted $\{sim(t_i, t_j)_1, sim(t_i, t_j)_2, \ldots, sim(t_i, t_j)_k\}$, and the semantic similarity of the candidate emotion clusters is defined from these k values.
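The top-k selection of Definition 8 can be sketched as follows. How the k values are combined is an assumption here: the sketch takes their arithmetic mean. The word-level similarity `sim` is again a hypothetical callable standing in for the HowNet similarity, and `cluster_similarity` is a hypothetical name.

```python
def cluster_similarity(features_i, features_j, sim, k):
    """Assumed form: mean of the k largest entries of the pairwise similarity matrix."""
    # Flatten the similarity matrix between the two cluster feature vectors.
    entries = [sim(a, b) for a in features_i for b in features_j]
    top = sorted(entries, reverse=True)[:k]
    return sum(top) / len(top) if top else 0.0
```

Restricting the computation to the k strongest feature item pairs keeps weakly related, non-key words from diluting the similarity between two clusters.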
Algorithm: hierarchical clustering of candidate emotion clusters
Input: candidate emotion clusters $CT = \{CT_1, CT_2, \ldots, CT_i\}$, λ (minimum threshold for merging two clusters), μ (minimum number of clusters)
Output: emotion clusters CT'
Step 1: extract the feature vector of each candidate emotion cluster and compute the semantic similarity of the candidate emotion clusters.
Step 2: build the semantic similarity matrix of the candidate emotion clusters; by the definition of cluster similarity, $sim(CT_i, CT_j) = sim(CT_j, CT_i)$, so this similarity matrix is symmetric.
Step 3: select the maximum inter-cluster similarity from the similarity matrix, denoted $\max\{sim(CT_i, CT_j)\}$. If $\max\{sim(CT_i, CT_j)\} \le \lambda$, go to Step 6; otherwise, go to Step 4.
Step 4: if $\max\{sim(CT_i, CT_j)\} > \lambda$, the similarity between $CT_i$ and $CT_j$ is large, so the two clusters $CT_i$ and $CT_j$ are merged into a new cluster $CT_i'$, the original $CT_i$ is deleted, the cluster feature vector is recomputed, and the semantic similarity matrix is updated.
Step 5: if the number of rows or columns of the inter-cluster semantic similarity matrix is less than or equal to the preset minimum number of clusters μ, go to Step 6; otherwise clustering has not finished, so return to Step 3.
Step 6: agglomerative hierarchical clustering ends, yielding the emotion clusters CT'.
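Steps 1 to 6 can be sketched as the following loop. The cluster-level similarity is supplied as a hypothetical callable `cluster_sim(a, b)` over two sets of microblog ids; in a fuller implementation it would recompute the cluster feature vectors after each merge, which is not shown. For brevity the sketch recomputes pairwise similarities on every pass instead of maintaining the matrix incrementally.

```python
def agglomerate(clusters, cluster_sim, lam, mu):
    """clusters: list of sets of doc ids; lam: merge threshold; mu: minimum cluster count."""
    clusters = [set(c) for c in clusters]
    while len(clusters) >= 2:
        # Step 3: find the most similar pair of clusters.
        best, pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cluster_sim(clusters[i], clusters[j])
                if best is None or s > best:
                    best, pair = s, (i, j)
        if best <= lam:  # no pair is similar enough: go to Step 6
            break
        # Step 4: merge the pair into one new cluster and drop the originals.
        i, j = pair
        merged = clusters[i] | clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        # Step 5: stop once the minimum number of clusters is reached.
        if len(clusters) <= mu:
            break
    return clusters  # Step 6
```

Raising λ makes merging more conservative, while μ bounds how far the clusters can be collapsed; both are the adjustable clustering parameters of the method.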
In this embodiment, the collected microblog emoticons and emotion words undergo unified feature processing; the emotion word set selected in this way not only effectively reduces the text feature dimension but also retains the explicit emotion information of the original microblog set.
Maximal frequent itemset clustering is used to obtain the initial emotion clusters dominated by explicit emotion; microblog semantic similarity is then computed after the implicit semantic information of the short texts has been expanded using the HowNet semantic base, and an initial cluster overlap reduction method based on semantic membership degree is proposed.
By defining the semantic similarity between initial clusters, an agglomerative hierarchical clustering method for microblog emotion is given; the adjustable clustering parameters are used to obtain the best microblog emotion classification, and accurate sentiment analysis is finally achieved from the emotion classification results.
All algorithms and implementation steps involved in the microblog sentiment analysis method disclosed by the invention have a sound theoretical basis, the implementation steps are detailed and the analysis results are accurate; the method can be widely applied to, for example, public opinion monitoring on social networks.
Example: to verify how well the proposed method detects the sentiment of microblogs about a particular event, 44524 microblogs about the "Malaysia Airlines event", posted between March 8, 2014 and May 12, 2014, were obtained by keyword search from Sina Weibo Square; the change in emotion over the course of the event is shown in Fig. 3.
Combining the emotion trend of the "Malaysia Airlines" event in Fig. 3 with the actual development of the event, several key time points are analyzed below:
March 8: the Malaysia Airlines official website issued its first statement, confirming that flight MH370 lost contact with air traffic control at 2:40 Beijing time on the 8th. Microblog emotion was dominated by "sadness", "surprise" and "fear", reflecting public worry for the passengers, shock at the aviation safety incident and a sense of dread; "happiness" and "liking" were at low levels.
March 9: Malaysia's transport minister confirmed that two people holding forged passports had boarded the flight. As there had been no news of the missing aircraft for more than 40 hours, public "sadness" rose markedly, and with the forged passport incident, "fear" and "disgust" rose at the same time.
March 10: Malaysian officials admitted that the missing flight might have been hijacked. Public "sadness" persisted, and because of the possible hijacking and suspected terrorist attack, public "fear" continued to rise.
March 12: the Malaysian side was questioned over whether it had deliberately concealed information or delayed the search and rescue process. As a result, "anger" rose sharply, accounting for 58% of that day's microblogs.
March 24: the Malaysian prime minister held a press conference announcing that Malaysia Airlines flight MH370, missing for many days, had crashed into the southern Indian Ocean with no survivors. "Sadness" reached its peak, and the public was deeply grieved by the news.
As time went on, the event entered its later stage of reflection and handling, and public attention gradually began to shift.