CN104516947A

CN104516947A - Chinese microblog emotion analysis method fused with dominant and recessive characters

Info

Publication number: CN104516947A
Application number: CN201410723617.6A
Authority: CN
Inventors: 陈铁明; 缪茹一
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Hangzhou Zero Seven Technology Co ltd
Priority date: 2014-12-03
Filing date: 2014-12-03
Publication date: 2015-04-15
Anticipated expiration: 2034-12-03
Also published as: CN104516947B

Abstract

A Chinese microblog sentiment analysis method combining explicit and implicit features, comprising the following steps: 1) explicit feature processing of microblogs, including 1.1) emoticon processing and 1.2) sentiment word processing; 2) implicit feature processing of microblogs: initial sentiment clusters are created from frequent itemsets, so that every text in an initial sentiment cluster contains the cluster's frequent itemset; using HowNet's Chinese semantic similarity model, the initial sentiment clusters are separated according to the principle of maximum semantic membership; finally, an inter-cluster semantic similarity matrix is defined, agglomerative hierarchical clustering of the microblog sentiment clusters is carried out, and the final sentiment clusters are obtained after optimization, thereby realizing microblog sentiment analysis. The invention provides a Chinese microblog sentiment analysis method with high flexibility and good reliability that combines explicit and implicit features.

Description

A Chinese microblog sentiment analysis method combining explicit and implicit features

Technical field

The invention relates to the technical field of Internet public opinion content analysis, and in particular to a Chinese microblog sentiment analysis method.

Background

Sentiment analysis is the process of analyzing, processing, summarizing, and reasoning over subjective, emotionally colored text. Its purpose is to extract users' opinions from the subjective text they publish and to determine the sentiment polarity of those opinions.

Because human emotion is complex, there is no uniform standard for dividing emotions into categories. One common approach splits the task in two: binary classification of subjective versus objective information, and sentiment classification of the subjective information, which includes the most common binary classification into positive and negative as well as finer-grained multi-class classification. For multi-class classification, some studies propose four emotions (angry, disgusting, happy, sad), and others propose seven (anger, disgust, fear, happiness, like, sadness, surprise).

Among sentiment detection methods, work abroad has proposed distant supervised learning to classify the sentiment of Twitter messages: given a query term, messages are automatically classified as positive or negative, Twitter messages containing emoticons are extracted as the training set, and classification is then performed with algorithms such as naive Bayes, maximum entropy, and support vector machines. For Chinese microblogs, domestic work has proposed a hierarchy-based multi-strategy method for sentiment detection on Sina Weibo data that uses topic-related features during feature extraction. The experimental results show that topic-related features raise the best accuracy from 66.467% to 67.283%, but the analysis procedure of that method is cumbersome.

Microblogs are original and unpredictable. A single microblog is limited to 140 characters and mixes explicit features, such as Internet slang and emoticons, with implicit features, such as the semantic sentiment of the text, which poses new challenges for microblog sentiment analysis. Homophones and abbreviations are widespread in microblogs: for example, "稀饭" (literally "porridge") stands for "喜欢" ("like"), and "杯具" (literally "cups") stands for "悲剧" ("tragedy"). Such vocabulary changes over time and new words keep appearing, so a dedicated dictionary of Internet slang is needed. Microblog emoticons usually express emotion directly, but they are highly varied, so a dedicated emotion classification of emoticons is needed. In addition, one microblog may contain several different emotions, and sentiment analysis generally follows the blogger's main emotion. The prior art cannot analyze the sentiment of Chinese microblogs.

Summary of the invention

To overcome the inability of the prior art to analyze Chinese microblog sentiment, the present invention provides a Chinese microblog sentiment analysis method with high flexibility and good reliability that combines explicit and implicit features.

The technical solution adopted by the present invention to solve its technical problem is:

A Chinese microblog sentiment analysis method combining explicit and implicit features, the method comprising the following steps:

1) Explicit feature processing of microblogs, which specifically includes the following processes:

1.1) Emoticon processing: a sentiment symbol library is built from the emoticons built into the microblog platform. Following the seven-category emotion classification, emotions are divided into happiness, like, anger, sadness, fear, disgust, and surprise. The 150 most frequent emoticons are normalized: a sentiment symbol table is first created and the 150 emoticons are placed in it; a table lookup then decides whether a given symbol belongs to the sentiment symbol table, and if so the symbol is extracted, converted to its emotion category, and written into the sentiment feature table;

1.2) Sentiment word processing: a sentiment word table based on a sentiment dictionary is first created and the sentiment words are placed in it; after the text is segmented, a table lookup decides whether each word is a sentiment word, and if so the word is extracted and written into the sentiment feature table;

A second sentiment word table based on Internet slang is then created and the Internet slang terms found in microblogs are placed in it, so that the emotion category of some microblog content can be determined by table lookup;

2) Implicit feature processing of microblogs: initial sentiment clusters are created from frequent itemsets, so that every text in an initial sentiment cluster contains the cluster's frequent itemset; using HowNet's Chinese semantic similarity model, the initial sentiment clusters are separated according to the principle of maximum semantic membership; finally, an inter-cluster semantic similarity matrix is defined, agglomerative hierarchical clustering of the microblog sentiment clusters is carried out, and the final sentiment clusters are obtained after optimization, thereby realizing microblog sentiment analysis.

Further, step 2) comprises the following processes:

2.1) Frequent word sets are mined with the Apriori frequent itemset mining algorithm.

Initial sentiment clusters are constructed by partitioning on frequent itemsets: the microblogs that contain a given frequent trend word set are placed in one cluster, which yields the initial sentiment clusters based on frequent itemsets. The frequent itemset that describes an initial sentiment cluster also serves as the temporary label of that cluster, and the frequent itemsets extracted from each initial sentiment cluster represent its sentiment semantics;

2.2) Initial cluster overlap reduction based on microblog semantic membership

Each microblog is assigned to exactly one sentiment cluster: the sentiment semantic membership of the overlapping microblogs with respect to each initial sentiment cluster is computed, and clusters are assigned by the principle of maximum semantic membership. Empty clusters whose size is 0 after the separation are then deleted. The initial clusters that remain after overlap reduction are called candidate sentiment clusters;

2.3) Agglomerative sentiment clustering based on semantic similarity: agglomerative hierarchical clustering is performed on the candidate sentiment clusters, merging sentiment clusters.

Further, in step 2.1),

Definition 1: For an itemset X in a database E, if the frequency with which X occurs in E exceeds a preset ratio, X is called a frequent itemset of E, and the preset ratio is called the minimum support;

If a text is regarded as a transaction and the words of the text as the items of the transaction, a text d can be expressed as d = <t_1, t_2, ..., t_n>, where n is the number of feature words contained in d;

Definition 2: For a word set W over a text set D, if the support of W in D satisfies s(W) ≥ min_s, then W is called a frequent word set of D, where min_s is the global minimum support;

The text set D is scanned, the occurrences of the candidate itemsets are counted using the word-frequency trend measure, and the itemsets that satisfy the minimum support min_s are collected as frequent itemsets; the resulting frequent k-itemsets are used to construct strong association rules and to construct the candidate (k+1)-itemsets, and the process iterates until the set of candidate (k+1)-itemsets is empty.

Further, in step 2.2),

Definition 3: If microblog doc_j is assigned to initial sentiment cluster C_i, then doc_j is said to support cluster C_i;

Definition 4: Let D_i and D_j be the sets of microblogs supporting clusters C_i and C_j. If D_i ∩ D_j ≠ ∅, clusters C_i and C_j are said to overlap;

Definition 5: Microblog sentiment semantic membership. The invention defines the sentiment semantic membership function of microblog doc_j with respect to cluster C_i as follows:

$Score(C_{i} \Leftarrow doc_{j}) = \frac{\sum_{l=1}^{n} \max_{k = 1, 2, \dots, m} sim(f_{ik}, t_{jl})}{n};$

where the cluster-frequent 1-itemsets {f_i1, f_i2, ..., f_im} are the sentiment feature items of initial cluster C_i, and {t_j1, t_j2, ..., t_jn} are the feature items of microblog text doc_j in initial cluster C_i; sim(f_ik, t_jl) is the semantic similarity, as defined in HowNet, between cluster feature item f_ik and text feature item t_jl; n is the number of feature items of doc_j, and m is the number of cluster feature items.

Still further, in step 2.3),

Definition 6: Cluster feature vector. For a candidate sentiment cluster CT_i, the cluster-frequent 1-itemsets of CT_i are mined, and together they constitute the cluster feature vector of that cluster.

Definition 7: Cluster similarity matrix. Let the cluster feature vectors of two different candidate sentiment clusters CT_i and CT_j contain n and m feature words respectively; the cluster semantic similarity matrix formed by the feature items of CT_i and CT_j is then defined as in Table 1;

Table 1

Definition 8: Semantic similarity of sentiment clusters. The k feature-item pairs with the largest semantic similarity in the similarity matrix are selected to compute the similarity between candidate sentiment clusters, denoted {sim(t_i t_j)_1, sim(t_i t_j)_2, ..., sim(t_i t_j)_k}. The semantic similarity of the candidate sentiment clusters is defined as:

$sim(CT_{i}, CT_{j}) = \frac{\sum_{l=1}^{k} sim(t_{i} t_{j})_{l}}{k}$
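Definition 8 can be illustrated with a short Python sketch. It is not the patented implementation: the function name `cluster_similarity` is hypothetical, the HowNet word similarity is passed in as a plain callable `sim` (here replaced by a toy exact-match function), and averaging over fewer than k pairs when fewer exist is an assumption the source does not address.

```python
def cluster_similarity(features_i, features_j, sim, k):
    """Average of the k largest pairwise word similarities between two clusters.

    features_i, features_j: feature words of the two candidate clusters.
    sim: callable(word_a, word_b) -> float in [0, 1]; stands in for HowNet similarity.
    k: number of top-scoring feature-item pairs to keep.
    """
    # Every entry of the cluster semantic similarity matrix, flattened and sorted.
    pair_sims = sorted(
        (sim(a, b) for a in features_i for b in features_j), reverse=True
    )
    top = pair_sims[:k]
    # Assumption: if fewer than k pairs exist, average over the pairs available.
    return sum(top) / len(top) if top else 0.0


def toy_sim(a, b):
    # Toy stand-in for HowNet similarity: 1.0 for identical words, else 0.0.
    return 1.0 if a == b else 0.0


example = cluster_similarity(["开心", "高兴"], ["开心", "难过"], toy_sim, k=2)
```

Keeping only the top k pairs limits the influence of the many low-similarity, non-key word pairs in the matrix.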

The agglomerative sentiment clustering process based on semantic similarity is as follows:

Step 1: Extract the feature vector of each candidate sentiment cluster and compute the semantic similarity between candidate sentiment clusters;

Step 2: Build the semantic similarity matrix of the candidate sentiment clusters. By the definition of cluster similarity,

sim(CT_i, CT_j) = sim(CT_j, CT_i), that is, the similarity matrix is symmetric;

Step 3: Select the largest inter-cluster similarity in the similarity matrix, denoted

max{sim(CT_i, CT_j)},

If max{sim(CT_i, CT_j)} ≤ λ, go to Step 6; otherwise, go to Step 4;

Step 4: If max{sim(CT_i, CT_j)} > λ, the similarity between CT_i and CT_j is high, so the two clusters CT_i and CT_j are merged into a new cluster CT_i′, the original CT_i is deleted, the cluster feature vector is recomputed, and the semantic similarity matrix is updated;

Step 5: If the number of rows (or columns) of the inter-cluster semantic similarity matrix is less than or equal to the preset minimum number of clusters μ, go to Step 6; otherwise, clustering is not yet finished, so return to Step 3;

Step 6: Agglomerative hierarchical clustering ends, yielding the sentiment clusters CT′.
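Steps 1 to 6 can be sketched as the following Python loop. This is an illustration under stated assumptions, not the patent's implementation: the names `agglomerate`, `cluster_similarity`, and `toy_sim` are hypothetical, HowNet similarity is replaced by a caller-supplied `sim` function, and the merged cluster's feature vector is approximated by the union of the two feature lists rather than by re-mining cluster-frequent 1-itemsets.

```python
from itertools import combinations


def cluster_similarity(features_i, features_j, sim, k):
    # Mean of the k largest pairwise similarities (Definition 8).
    top = sorted((sim(a, b) for a in features_i for b in features_j), reverse=True)[:k]
    return sum(top) / len(top) if top else 0.0


def agglomerate(clusters, sim, k, lam, mu):
    """clusters: list of {"docs": set of doc ids, "features": list of words}.

    lam: similarity threshold (lambda); mu: preset minimum number of clusters.
    """
    clusters = [{"docs": set(c["docs"]), "features": list(c["features"])} for c in clusters]
    while len(clusters) > 1:
        # Steps 1-3: score every pair and pick the most similar one.
        best, (i, j) = max(
            (cluster_similarity(clusters[a]["features"], clusters[b]["features"], sim, k), (a, b))
            for a, b in combinations(range(len(clusters)), 2)
        )
        if best <= lam:
            break  # Step 3: nothing is similar enough, go to Step 6.
        # Step 4: merge the pair. Assumption: features are unioned, not re-mined.
        merged = {
            "docs": clusters[i]["docs"] | clusters[j]["docs"],
            "features": list(dict.fromkeys(clusters[i]["features"] + clusters[j]["features"])),
        }
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
        if len(clusters) <= mu:
            break  # Step 5: the minimum number of clusters has been reached.
    return clusters  # Step 6


def toy_sim(a, b):
    # Toy stand-in for HowNet similarity.
    return 1.0 if a == b else 0.0


candidates = [
    {"docs": {0}, "features": ["开心"]},
    {"docs": {1}, "features": ["开心", "高兴"]},
    {"docs": {2}, "features": ["难过"]},
]
final = agglomerate(candidates, toy_sim, k=1, lam=0.5, mu=1)
```

Because the similarity is symmetric, only the pairs above the diagonal of the matrix need to be scored, which is what `combinations` enumerates.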

In step 1.2), a set of negation words is collected, and each sentiment word is checked for a preceding negation word; if one is present, the negation word and the sentiment word are written into the sentiment feature table together.

The technical concept of the invention is as follows. Starting from emoticons, and combining the Chinese ontology resource annotated by the Information Retrieval Laboratory of Dalian University of Technology with the sentiment analysis vocabulary provided by HowNet (both are public resources), the invention builds an emoticon library, a sentiment word dictionary, and an Internet slang dictionary. Explicit sentiment features are extracted from these and combined with implicit semantic features, and sentiment analysis follows the clustering idea that microblog texts with the same sentiment are highly similar while texts with different sentiments are not. Clustering requires neither a training process nor manual labeling of document categories in advance; it relies directly on frequent itemsets and a semantic clustering algorithm, which gives it good flexibility and a high degree of automation.

The beneficial effects of the invention are mainly high flexibility and good reliability.

Description of the drawings

Figure 1 shows the proportion of sampled microblogs containing different numbers of emoticons.

Figure 2 is a flowchart of the microblog sentiment classification method combining frequent itemsets and semantic clustering.

Figure 3 is a schematic diagram of the sentiment trend for the "Malaysia Airlines" event.

Detailed description

The invention is further described below with reference to the accompanying drawings.

Referring to Figures 1 to 3, a Chinese microblog sentiment analysis method combining explicit and implicit features. Microblog emoticons are an intuitive, explicit sentiment feature, whereas the semantics of the content are implicit and decisive for determining sentiment, so the invention proposes a microblog sentiment analysis method that combines the two kinds of features. First, a sentiment analysis dictionary, an Internet slang dictionary, and an emoticon library are built, and the frequent feature word sets of the microblogs are defined; from these, the initial sentiment clusters are obtained using maximal frequent itemsets. Because texts overlap between the initial clusters, an inter-cluster overlap reduction algorithm based on extended semantic membership of short texts is proposed to obtain fully separated initial clusters. Finally, an agglomerative sentiment clustering method is given, based on the cluster semantic similarity matrix.

The Chinese microblog sentiment analysis method of the invention comprises the following steps:

1) Explicit feature processing of microblogs

1.1) Emoticon processing

Emoticons on English-language microblogs are usually typed by the users themselves, such as ":)". The emoticons provided by the Sina Weibo platform are represented as text enclosed in square brackets, such as "[呵呵]". Emoticons are widely used in microblogs: in a random sample of 5000 Sina Weibo posts, 1071 contained emoticons, about 21%. A single microblog may contain several emoticons. Figure 1 gives sample statistics for microblogs containing different numbers of emoticons. The results show that about 62% of Sina Weibo users use 1 emoticon and about 30% use 2 to 5, which indicates that microblog users prefer to use a single emoticon.

The invention builds its sentiment symbol library from the emoticons built into Sina Weibo. Following the seven-category emotion classification, emotions are divided into happiness, like, anger, sadness, fear, disgust, and surprise. The 150 most frequent emoticons are normalized: a sentiment symbol table is first created and the 150 emoticons are placed in it, as shown in Table 2. A table lookup decides whether a given symbol belongs to the sentiment symbol table; if so, the symbol is extracted, converted to its emotion category, and written into the sentiment feature table. Experiments show that normalizing emoticons in this way leads to better clustering and therefore more accurate sentiment analysis.
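The table lookup described above might look like the following Python sketch. The contents of Table 2 are not reproduced in the text, so the four entries of `EMOTICON_TABLE` are illustrative assumptions, as are the names `EMOTICON_TABLE`, `BRACKETED`, and `extract_emoticon_features`.

```python
import re

# Hypothetical excerpt of the sentiment symbol table (the real table holds 150 entries).
EMOTICON_TABLE = {
    "[哈哈]": "happiness",
    "[爱你]": "like",
    "[怒]": "anger",
    "[泪]": "sadness",
}

# Sina Weibo emoticons appear in the text as a label inside square brackets.
BRACKETED = re.compile(r"\[[^\[\]]+\]")


def extract_emoticon_features(text):
    """Return the emotion categories of the known emoticons found in a microblog."""
    features = []
    for token in BRACKETED.findall(text):
        category = EMOTICON_TABLE.get(token)
        if category is not None:  # ignore bracketed text that is not in the table
            features.append(category)
    return features


features = extract_emoticon_features("今天真开心[哈哈][未知]")
```

Mapping each emoticon to one of the seven categories, rather than keeping the raw symbol, is what the text calls normalization: different emoticons with the same emotion become the same feature.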

Table 2 Emotion categories and typical emoticons of each category

1.2) Sentiment word processing

Sentiment words reflect the textual sentiment of a microblog most directly, so building the sentiment dictionary and the Internet slang dictionary is the groundwork for determining microblog sentiment orientation.

Classification of Chinese sentiment vocabulary: sentiment vocabulary is complex and spans many parts of speech, including adjectives, nouns, and adverbs. Selecting sentiment words by part of speech alone is unsound: some nouns, such as "垃圾" ("garbage") and "棒槌" ("blockhead"), carry negative sentiment, but most nouns carry none, so including all of them would reduce classification performance. The invention uses the Chinese ontology resource provided by the Information Retrieval Laboratory of Dalian University of Technology, which contains 27467 Chinese sentiment words. As shown in Table 3, a sentiment word table is first created from the sentiment dictionary and these sentiment words are placed in it. After the text is segmented, a table lookup decides whether each word is a sentiment word; if so, the word is extracted and written into the sentiment feature table.

Table 3 Sentiment classification of the Chinese ontology resource used by the invention

In addition, a set of negation words used in microblogs, such as "不", "没有", "不可能", and "很难", is collected. Each sentiment word is checked for a preceding negation word; if one is present, the negation word and the sentiment word are written into the sentiment feature table together.
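A minimal Python sketch of the sentiment word lookup with negation handling follows. It assumes the microblog has already been segmented into tokens by some word segmenter, which is outside the sketch. The four negation words are those listed in the text; the three-entry `SENTIMENT_WORDS` lexicon and the function name are hypothetical stand-ins for the 27467-word resource.

```python
# Negation words quoted in the description.
NEGATIONS = {"不", "没有", "不可能", "很难"}

# Hypothetical excerpt of the sentiment word table.
SENTIMENT_WORDS = {"喜欢", "开心", "垃圾"}


def extract_sentiment_features(tokens):
    """Collect sentiment words, keeping a directly preceding negation word with them."""
    features = []
    for i, token in enumerate(tokens):
        if token not in SENTIMENT_WORDS:
            continue
        if i > 0 and tokens[i - 1] in NEGATIONS:
            # Negation and sentiment word are written to the feature table together.
            features.append(tokens[i - 1] + token)
        else:
            features.append(token)
    return features


features = extract_sentiment_features(["我", "不", "喜欢", "这个", "垃圾"])
```

The sketch only looks at the token immediately before the sentiment word; how far back the original method searches for a negation word is not stated in the source.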

Construction of the Internet slang dictionary: microblog sentiment is often expressed in original ways, and new words, including homophones, abbreviations, and Internet jargon, keep appearing as the Internet develops, so the invention builds an Internet slang dictionary for determining the sentiment orientation of microblogs. A total of 141 Internet slang terms were collected and organized from social networks, annotated with sentiment, and normalized: a sentiment word table of Internet slang is first created and the terms are placed in it. Without context, the sentiment orientation of many Internet slang terms is ambiguous, so only terms with clear sentiment are retained. Some of the terms and their sentiment annotations are shown in Table 4. As with the sentiment dictionary, the Internet slang dictionary allows the emotion category of some microblog content to be determined directly by table lookup.

Table 4 Examples of Internet slang terms and their sentiment orientation annotations

2) Implicit feature processing of microblogs

FIHC (Frequent Itemset-based Hierarchical Clustering) is a text clustering algorithm that is widely used in industry. The algorithm is cluster-centered and uses frequent itemsets directly to measure the cohesion between clusters. It assumes that documents belonging to the same category share many frequent itemsets while documents belonging to different categories share few, and it uses frequent itemsets to partition the texts.

Both the part of speech and the semantics of microblog content can be regarded as implicit sentiment features. The invention adopts the FIHC idea of "build clusters first, then remove overlap, then agglomerate" and proposes a new method that combines frequent itemsets with semantic clustering. The main clustering process is shown in Figure 2.

The main flow of sentiment classification is as follows. First, initial sentiment clusters are created from frequent itemsets; every text in an initial sentiment cluster contains the cluster's frequent itemset, which causes texts to overlap between initial sentiment clusters. To remove this overlap more accurately, HowNet's Chinese semantic similarity model is used and the initial sentiment clusters are separated according to the principle of maximum semantic membership. Finally, an inter-cluster semantic similarity matrix is defined, agglomerative hierarchical clustering of the microblog sentiment clusters is carried out, and the final sentiment clusters are obtained after optimization, thereby realizing microblog sentiment analysis.

2.1) Method for obtaining frequent itemsets

Definition 1: For an itemset X in a database E, if the frequency with which X occurs in E exceeds a preset ratio, X is called a frequent itemset of E, and the preset ratio is called the minimum support.

If a text is regarded as a transaction and the words of the text as the items of the transaction, a text d can be expressed as d = <t_1, t_2, ..., t_n>.

Definition 2: For a word set W over a text set D, if the support of W in D satisfies s(W) ≥ min_s, then W is called a frequent word set of D, where min_s is the global minimum support.

The invention uses the Apriori frequent itemset mining algorithm to mine frequent word sets.

Algorithm: Apriori

Input: microblog data, minimum cluster support min_s

Output: the frequent itemsets in the microblog data

Method:

Step one: scan the text set D, count the occurrences of the candidate itemsets using the word-frequency trend measure, and collect the itemsets that satisfy the minimum support min_s as frequent itemsets;

Step two: use the resulting frequent k-itemsets to construct strong association rules and to construct the candidate (k+1)-itemsets, and iterate until the set of candidate (k+1)-itemsets is empty.
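The two steps can be sketched as a textbook Apriori loop in Python. This is a generic version of the algorithm, not the patent's code: each microblog is assumed to be already reduced to a set of feature words, support is computed as the fraction of texts containing the itemset, and the construction of strong association rules is left out. The function name `apriori` and the sample data are illustrative.

```python
from itertools import combinations


def apriori(transactions, min_s):
    """Return {frequent itemset: support} for a list of word sets.

    min_s is the global minimum support, expressed as a fraction of the texts.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Step one: frequent 1-itemsets.
    current = {frozenset([w]) for t in transactions for w in t}
    current = {c for c in current if support(c) >= min_s}

    frequent = {}
    k = 1
    while current:
        for c in current:
            frequent[c] = support(c)
        # Step two: join frequent k-itemsets into candidate (k+1)-itemsets ...
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # ... prune candidates that have an infrequent k-subset ...
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k))
        }
        # ... and keep those that meet the minimum support.
        current = {c for c in candidates if support(c) >= min_s}
        k += 1
    return frequent


docs = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}]
result = apriori(docs, 0.5)
```

With this sample, {a, b, c} is never counted: it is pruned because its subset {b, c} is not frequent, which is the property that lets Apriori stop once no candidate (k+1)-itemsets remain.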

Frequent itemsets describe the sentiment information in microblogs. The invention constructs initial sentiment clusters by partitioning on frequent itemsets: the microblogs that contain a given frequent trend word set are placed in one cluster, which yields the initial sentiment clusters based on frequent itemsets. The frequent itemset that describes an initial sentiment cluster also serves as the temporary label of that cluster, and the frequent itemsets extracted from each initial sentiment cluster represent its sentiment semantics.
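The partitioning can be pictured with a few lines of Python. This is a sketch under the assumption that one initial cluster is created per frequent itemset and that a microblog joins every cluster whose itemset it contains; the function name `build_initial_clusters` and the sample data are illustrative.

```python
def build_initial_clusters(docs, frequent_itemsets):
    """Map each frequent itemset (the cluster's temporary label) to its doc ids."""
    clusters = {}
    for itemset in frequent_itemsets:
        label = frozenset(itemset)
        # A microblog supports the cluster if it contains every word of the itemset.
        clusters[label] = {j for j, words in enumerate(docs) if label <= set(words)}
    return clusters


docs = [{"开心", "旅行"}, {"开心", "难过"}, {"难过", "下雨"}]
clusters = build_initial_clusters(docs, [{"开心"}, {"难过"}])
```

In this sample the second microblog contains both labels and therefore lands in both clusters, which is the kind of inter-cluster overlap that the next step removes.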

2.2) Initial cluster overlap reduction based on microblog semantic membership: microblog text is concise and informal, the same sentiment can be expressed in different ways, and one microblog may contain several different emotions. As a result, a large number of texts overlap between the initial sentiment clusters. Sentiment analysis should follow the blogger's main emotion, so each microblog must be assigned to exactly one sentiment cluster.

At the semantic level, the invention introduces the HowNet semantic library to extend the semantic information, computes the sentiment semantic membership of the overlapping microblogs with respect to each initial sentiment cluster, and finally assigns clusters by the principle of maximum semantic membership.

Definition 3: If microblog doc_j is assigned to initial sentiment cluster C_i, then doc_j is said to support cluster C_i.

Definition 4: Let D_i and D_j be the sets of microblogs supporting clusters C_i and C_j. If D_i ∩ D_j ≠ ∅, clusters C_i and C_j are said to overlap.

Definition 5: Microblog sentiment semantic membership. The invention defines the sentiment semantic membership function of microblog doc_j with respect to cluster C_i as follows:

$Score(C_{i} \Leftarrow doc_{j}) = \frac{\sum_{l=1}^{n} \max_{k = 1, 2, \dots, m} sim(f_{ik}, t_{jl})}{n}.$

Here the cluster-frequent 1-itemsets {f_i1, f_i2, ..., f_im} are the sentiment feature items of initial cluster C_i, and {t_j1, t_j2, ..., t_jn} are the feature items of microblog text doc_j in initial cluster C_i; sim(f_ik, t_jl) is the semantic similarity, as defined in HowNet, between cluster feature item f_ik and text feature item t_jl; n is the number of feature items of doc_j, and m is the number of cluster feature items.
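The membership function of Definition 5 translates almost directly into Python. In this sketch the HowNet similarity is not implemented; it is supplied by the caller as a function `sim`, and `toy_sim` with its single hand-picked similarity value is purely illustrative. Returning 0.0 for empty inputs is an assumption, since the source does not cover that case.

```python
def membership_score(cluster_features, doc_features, sim):
    """Score(C_i <= doc_j): mean over the doc's words of the best match in the cluster.

    cluster_features: the m cluster feature items f_i1 .. f_im.
    doc_features: the n text feature items t_j1 .. t_jn.
    sim: callable(f, t) -> float, standing in for HowNet semantic similarity.
    """
    if not cluster_features or not doc_features:
        return 0.0  # assumption: undefined cases score zero
    total = sum(max(sim(f, t) for f in cluster_features) for t in doc_features)
    return total / len(doc_features)


def toy_sim(a, b):
    # Illustrative similarity: identical words score 1.0, one hand-picked pair 0.2.
    if a == b:
        return 1.0
    return 0.2 if {a, b} == {"高兴", "下雨"} else 0.0


score = membership_score(["开心", "高兴"], ["开心", "下雨"], toy_sim)
```

For this input the first text word matches a cluster word exactly (1.0), the second has a best match of 0.2, and the score is their mean over n = 2 text words.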

Algorithm: Initial cluster overlap reduction based on microblog semantic membership

Input: overlapping initial clusters C_1, C_2, ..., C_n

Output: initial clusters C′_1, C′_2, ..., C′_n after overlap reduction

Method:

After initial cluster overlap reduction has been performed for each microblog doc_j, the empty clusters whose size is 0 after separation are deleted, which yields the final candidate emotion clusters.
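The method steps of this algorithm are not reproduced in the text above, so the following Python sketch is only one plausible reading of Definitions 3 to 5 and the maximum semantic membership principle: each microblog that supports several clusters is kept only in the cluster with the highest membership score, and empty clusters are dropped. All names (`reduce_overlap`, `cluster_items`, `support`, `doc_terms`) are hypothetical, and ties are broken by cluster order, which is an assumption.

```python
from typing import Callable, Dict, List

Similarity = Callable[[str, str], float]


def membership_score(cluster_items: List[str], doc_items: List[str],
                     sim: Similarity) -> float:
    """Definition 5: mean over document terms of the best cluster-term match."""
    if not cluster_items or not doc_items:
        return 0.0
    return sum(max(sim(f, t) for f in cluster_items)
               for t in doc_items) / len(doc_items)


def reduce_overlap(cluster_items: Dict[str, List[str]],
                   support: Dict[str, List[str]],
                   doc_terms: Dict[str, List[str]],
                   sim: Similarity) -> Dict[str, List[str]]:
    """Assign every microblog to exactly one cluster and drop empty clusters."""
    # Collect, for each microblog, the clusters it currently supports.
    owners: Dict[str, List[str]] = {}
    for cid, doc_ids in support.items():
        for doc_id in doc_ids:
            owners.setdefault(doc_id, []).append(cid)

    reduced: Dict[str, List[str]] = {cid: [] for cid in support}
    for doc_id, cids in owners.items():
        # Maximum semantic membership principle (first cluster wins a tie).
        best = max(cids, key=lambda c: membership_score(
            cluster_items[c], doc_terms[doc_id], sim))
        reduced[best].append(doc_id)

    # Delete clusters whose size is 0 after separation.
    return {cid: docs for cid, docs in reduced.items() if docs}
```

The returned dictionary corresponds to the candidate emotion clusters used in step 2.3).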

2.3) Agglomerative emotion clustering based on semantic similarity

Overlap reduction among the initial emotion clusters yields the candidate emotion clusters for clustering-based microblog emotion detection. Several of these candidate clusters may, however, belong to the same broad emotion, so it is necessary to apply agglomerative hierarchical clustering to the candidate emotion clusters and merge them.

Definition 6: Cluster feature vector. For a candidate emotion cluster CT_i, the cluster frequent 1-itemset of CT_i is mined; this itemset constitutes the cluster feature vector of the cluster.

Definition 7: Cluster similarity matrix. Given the cluster feature vectors of two different candidate emotion clusters CT_i and CT_j, the cluster semantic similarity matrix formed by the feature items of CT_i and CT_j is defined as in Table 5.

Table 5: Definition of the cluster semantic similarity matrix

Definition 8: Emotion cluster semantic similarity. To keep the many non-key feature word pairs from adding noise to the inter-cluster semantic similarity, only the k feature item pairs with the largest semantic similarity in the similarity matrix are used to compute the similarity between candidate emotion clusters. These are denoted {sim(t_i, t_j)_1, sim(t_i, t_j)_2, ..., sim(t_i, t_j)_k}, and the semantic similarity of the candidate emotion clusters is defined as: $sim(CT_{i}, CT_{j}) = \frac{\sum_{l=1}^{k} sim(t_{i}, t_{j})_{l}}{k}$
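Definition 8 can be sketched in a few lines of Python. The function name `cluster_similarity` and the `sim` argument are hypothetical; `sim` stands in for the HowNet word similarity. When fewer than k pairs exist, the sketch averages over the pairs that are available, which is an assumption not stated in the text.

```python
import heapq
from typing import Callable, Sequence

Similarity = Callable[[str, str], float]


def cluster_similarity(items_i: Sequence[str],
                       items_j: Sequence[str],
                       sim: Similarity,
                       k: int) -> float:
    """Average of the k largest entries of the cluster similarity matrix."""
    # Flatten the similarity matrix between the two cluster feature vectors.
    pair_sims = [sim(a, b) for a in items_i for b in items_j]
    if not pair_sims or k <= 0:
        return 0.0
    top = heapq.nlargest(k, pair_sims)
    return sum(top) / len(top)
```

Because only the top k pairs enter the average, a small k emphasizes the strongest shared emotion words, while a large k approaches the plain mean of the whole matrix.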

Algorithm: Hierarchical clustering of candidate emotion clusters

Input: candidate emotion clusters CT = {CT_1, CT_2, ..., CT_i},

λ (minimum similarity threshold for merging two clusters), μ (minimum number of clusters)

Output: emotion clusters CT′

Step 1: Extract the feature vector of each candidate emotion cluster and compute the semantic similarity of the candidate emotion clusters.

Step 2: Build the semantic similarity matrix of the candidate emotion clusters. By the definition of cluster similarity, sim(CT_i, CT_j) = sim(CT_j, CT_i), so the similarity matrix is symmetric.

Step 3: Select the maximum inter-cluster similarity from the similarity matrix, denoted max{sim(CT_i, CT_j)}. If max{sim(CT_i, CT_j)} ≤ λ, go to Step 6; otherwise, go to Step 4.

Step 4: If max{sim(CT_i, CT_j)} > λ, the similarity between CT_i and CT_j is high, so the two clusters CT_i and CT_j are merged into a new cluster CT_i′, the original CT_i is deleted, the cluster feature vector is recomputed, and the semantic similarity matrix is updated.

Step 5: If the number of rows or columns of the inter-cluster semantic similarity matrix is less than or equal to the preset minimum number of clusters μ, go to Step 6; otherwise, clustering has not finished, so return to Step 3.

Step 6: Agglomerative hierarchical clustering ends, yielding the emotion clusters CT′.
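The clustering loop can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the patented implementation: each cluster is a list of microblogs (lists of words), the cluster feature vector is approximated by the words appearing in at least a `min_support` fraction of the cluster's microblogs, and `sim` is a placeholder for the HowNet similarity. The names `feature_vector`, `cluster_similarity`, `agglomerate`, `k` and `min_support` are hypothetical.

```python
import heapq
from collections import Counter
from itertools import combinations
from typing import Callable, List

Similarity = Callable[[str, str], float]
Cluster = List[List[str]]  # a cluster is a list of tokenized microblogs


def feature_vector(cluster: Cluster, min_support: float) -> List[str]:
    """Frequent 1-itemset of a cluster (assumed support = document fraction)."""
    counts = Counter(term for doc in cluster for term in set(doc))
    return sorted(t for t, c in counts.items() if c / len(cluster) >= min_support)


def cluster_similarity(fi: List[str], fj: List[str], sim: Similarity, k: int) -> float:
    """Definition 8: mean of the k largest pairwise similarities."""
    top = heapq.nlargest(k, (sim(a, b) for a in fi for b in fj))
    return sum(top) / len(top) if top else 0.0


def agglomerate(clusters: List[Cluster], sim: Similarity, lam: float, mu: int,
                k: int = 3, min_support: float = 0.5) -> List[Cluster]:
    clusters = [list(c) for c in clusters]
    while len(clusters) >= 2:
        # Steps 1-2: feature vectors and the (symmetric) similarity matrix.
        feats = [feature_vector(c, min_support) for c in clusters]
        # Step 3: most similar pair of clusters.
        best, (i, j) = max(
            (cluster_similarity(feats[a], feats[b], sim, k), (a, b))
            for a, b in combinations(range(len(clusters)), 2))
        if best <= lam:
            break  # Step 6: no pair is similar enough to merge
        # Step 4: merge the pair; features are recomputed on the next pass.
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
        # Step 5: stop once the minimum number of clusters is reached.
        if len(clusters) <= mu:
            break
    return clusters
```

Recomputing every feature vector on each pass keeps the sketch short; caching the unchanged rows of the similarity matrix would be the obvious optimization for larger inputs.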

In this embodiment, the microblog emoticon set, the emotion vocabulary set and similar resources undergo unified feature processing. The emotion word set selected in this way not only reduces the text feature dimension effectively but also retains the explicit emotional information in the original microblog collection.

Maximal frequent itemset clustering is used to obtain the initial clusters of explicit emotion. The semantic information implicit in the short texts is extended through the HowNet semantic database before the microblog semantic similarity is computed, and an initial cluster overlap reduction method based on semantic membership partitioning is proposed.

By defining the semantic similarity between initial clusters, an agglomerative hierarchical clustering method for microblog emotion is given. The clustering parameters can be tuned to obtain the best microblog emotion classification, and accurate sentiment analysis is ultimately achieved on the basis of the classification results.

All algorithms and implementation steps involved in the microblog sentiment analysis method disclosed by the present invention have a sound theoretical basis, detailed implementation steps and accurate analysis results, and can be widely applied to public opinion monitoring on social networks and similar tasks.

Example: To verify how well the proposed method detects microblog sentiment about a particular event, 44,524 microblog posts about the "Malaysia Airlines incident" published between March 8, 2014 and May 12, 2014 were collected by keyword search on Sina Weibo Square. The emotional changes around the "Malaysia Airlines" incident are shown in Figure 3.

Combining the emotional trend of the "Malaysia Airlines" incident in Figure 3 with the actual development of the incident, several key time points are analyzed below:

On March 8, the official Malaysia Airlines website released its first statement, confirming that flight MH370 lost contact with the control tower at 2:40 on the 8th, Beijing time. The microblog emotions were "sadness", "surprise" and "fear", reflecting the public's worry for the passengers and their shock and fear regarding the airline's safety, while "happiness" and "like" remained at a low level.

On March 9, the Malaysian Minister of Transport confirmed that the two passengers travelling on false passports held consecutive ticket numbers. Because there had been no news of the missing plane for more than 40 hours, public "sadness" rose markedly, and with the false passport incident, "fear" and "disgust" rose at the same time.

On March 10, Malaysian officials acknowledged that the missing flight might have been hijacked. Public "sadness" continued, and because of the possible hijacking and suspected terrorist attack, public "fear" continued to rise.

On March 12, the Malaysian side was questioned over whether it had deliberately concealed information or delayed the search and rescue process. As a result, "anger" rose sharply and accounted for 58% of that day's microblog volume.

On March 24, the Malaysian Prime Minister held a press conference announcing that Malaysia Airlines flight MH370, missing for many days, had crashed into the southern Indian Ocean with no survivors. "Sadness" reached its peak as the public expressed deep grief at the news.

As time passed, the Malaysia Airlines incident entered its later stage of reflection and handling, and public attention gradually shifted elsewhere.

Claims

1. A Chinese microblog sentiment analysis method fusing dominant and recessive features, characterized in that the method comprises the following steps:

1) microblog dominant feature processing, specifically comprising:

1.1) emoticon processing: an emoticon library is built from the emoticons provided by the microblog platform; following a seven-class emotion classification, emotions are divided into seven categories: happiness, like, anger, sadness, fear, disgust and surprise; the 150 most frequently occurring emoticons are unified, namely an emoticon table is first established and the 150 emoticons are placed in it; whether an emoticon belongs to the emoticon table is determined by table lookup, and if so the emoticon is extracted, converted to its emotion category and written to the emotion feature table;

1.2) emotion word processing: an emotion vocabulary is established from a sentiment dictionary and the emotion words are placed in it; after word segmentation of the text, whether each word is an emotion word is determined by table lookup, and if so the emotion word is extracted and written to the emotion feature table;

an emotion vocabulary of internet slang is first established, the internet slang terms are placed in it, and the emotion category of part of the microblog content is determined by table lookup;

2) microblog recessive feature processing: initial emotion clusters are created based on frequent itemsets, the texts of each initial emotion cluster containing the frequent itemset; using the Chinese semantic similarity model of HowNet, the initial emotion clusters are separated according to the maximum semantic membership principle; finally, by defining an inter-cluster semantic similarity matrix, agglomerative hierarchical clustering of the microblog emotion clusters is completed and the final emotion clusters are obtained by optimization, thereby realizing microblog sentiment analysis.

2. The Chinese microblog sentiment analysis method fusing dominant and recessive features according to claim 1, characterized in that step 2) comprises:

2.1) mining frequent word sets using the Apriori association-rule frequent itemset mining algorithm

Frequent itemsets are used to partition the microblogs and construct the initial emotion clusters: microblogs containing a frequent tendency word set are grouped into one cluster, giving the initial emotion clusters based on frequent itemsets; at the same time, the frequent itemset describing an initial emotion cluster serves as the temporary label of that cluster, and the emotional semantics of each initial emotion cluster are represented by its extracted frequent itemset;

2.2) initial cluster overlap reduction based on microblog semantic membership

Each microblog is assigned to a single emotion cluster: the emotional semantic membership of the overlapping part between clusters to each initial emotion cluster is computed, and clusters are finally assigned according to the maximum semantic membership principle; empty clusters whose size is 0 after the separation are then deleted, and the initial clusters after overlap reduction are called candidate emotion clusters;

2.3) agglomerative emotion clustering based on semantic similarity: agglomerative hierarchical clustering is applied to the candidate emotion clusters, and emotion clusters are merged.

3. The Chinese microblog sentiment analysis method fusing dominant and recessive features according to claim 2, characterized in that, in step 2.1),

Definition 1: for an itemset X in database E, if the number of times itemset X occurs in database E exceeds a preset ratio, X is called a frequent itemset of database E, and the preset ratio is called the minimum support;

If a text is regarded as a transaction and the words of the text as the items of the transaction, the text d can be expressed as d = <t_1, t_2, ..., t_n>, where n is the number of feature words contained in text d;

Definition 2: for a word set W of the text set D, if the support s(W) of W in D satisfies s(W) ≥ min_s, W is called a frequent word set of the text set D, and min_s is the global minimum support;

The text set D is scanned, the occurrences of candidate items are counted using word-frequency tendency, and the itemsets meeting the minimum support min_s are collected and denoted frequent itemsets; the resulting frequent k-itemsets are used to construct strong association rules and to construct the candidate (k+1)-itemsets, iterating until the candidate (k+1)-itemset is empty.

4. The Chinese microblog sentiment analysis method fusing dominant and recessive features according to claim 2, characterized in that, in step 2.2),

Definition 3: if microblog doc_j is assigned to the initial emotion cluster C_i, microblog doc_j is said to support cluster C_i;

Definition 4: let D_i and D_j be the sets of microblogs supporting clusters C_i and C_j respectively; if D_i ∩ D_j ≠ ∅, clusters C_i and C_j are said to overlap;

Definition 5: microblog emotional semantic membership; the emotional semantic membership function of microblog doc_j to cluster C_i is defined as follows:

$Score(C_{i} \leftarrow doc_{j}) = \frac{\sum_{l=1}^{n} \max_{k=1,2,\ldots,m} sim(f_{ik}, t_{jl})}{n};$

wherein the cluster frequent 1-itemset {f_i1, f_i2, ..., f_im} represents the emotion feature items of the initial cluster C_i, and {t_j1, t_j2, ..., t_jn} represents the feature items of microblog text doc_j in the initial cluster C_i; sim(f_ik, t_jl) is the semantic similarity between cluster feature item f_ik and text feature item t_jl as defined in HowNet, n is the number of feature items of microblog text doc_j, and m is the number of cluster feature items.

5. The Chinese microblog sentiment analysis method fusing dominant and recessive features according to claim 2, characterized in that, in step 2.3),

Definition 6: cluster feature vector; for a candidate emotion cluster CT_i, the cluster frequent 1-itemset of CT_i is mined, and this itemset constitutes the cluster feature vector of the cluster;

Definition 7: cluster similarity matrix; given the cluster feature vectors of two different candidate emotion clusters CT_i and CT_j, where n and m respectively denote their numbers of feature words, the cluster semantic similarity matrix formed by the feature items of CT_i and CT_j is defined as in Table 1;

Table 1

Definition 8: emotion cluster semantic similarity; the k feature item pairs with the largest semantic similarity in the similarity matrix are selected to compute the similarity between candidate emotion clusters, denoted {sim(t_i, t_j)_1, sim(t_i, t_j)_2, ..., sim(t_i, t_j)_k}, and the semantic similarity of the candidate emotion clusters is defined as:

$sim(CT_{i}, CT_{j}) = \frac{\sum_{l=1}^{k} sim(t_{i}, t_{j})_{l}}{k}$

The agglomerative emotion clustering process based on semantic similarity is as follows:

Step 1: extract the feature vector of each candidate emotion cluster and compute the semantic similarity of the candidate emotion clusters;

Step 2: build the semantic similarity matrix of the candidate emotion clusters; by the definition of cluster similarity,

sim(CT_i, CT_j) = sim(CT_j, CT_i), so the similarity matrix is symmetric;

Step 3: select the maximum inter-cluster similarity from the similarity matrix, denoted max{sim(CT_i, CT_j)};

if max{sim(CT_i, CT_j)} ≤ λ, go to Step 6; otherwise, go to Step 4;

Step 4: if max{sim(CT_i, CT_j)} > λ, the similarity between CT_i and CT_j is high, so the two clusters CT_i and CT_j are merged into a new cluster CT_i′, the original CT_i is deleted, the cluster feature vector is recomputed, and the semantic similarity matrix is updated;

Step 5: if the number of rows or columns of the inter-cluster semantic similarity matrix is less than or equal to the preset minimum number of clusters μ, go to Step 6; otherwise, clustering has not finished, so return to Step 3;

Step 6: agglomerative hierarchical clustering ends, yielding the emotion clusters CT′.

6. The Chinese microblog sentiment analysis method fusing dominant and recessive features according to any one of claims 1 to 5, characterized in that, in step 1.2), a set of negation words is collected and it is checked whether a negation word precedes the emotion word; if so, the negation word and the emotion word are written to the emotion feature table together.