
CN104516947A - Chinese microblog emotion analysis method fused with dominant and recessive characters - Google Patents


Info

Publication number
CN104516947A
Authority
CN
China
Prior art keywords
bunch
emotion
emotional
cluster
frequent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410723617.6A
Other languages
Chinese (zh)
Other versions
CN104516947B (en)
Inventor
陈铁明
缪茹一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Zero Seven Technology Co ltd
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201410723617.6A priority Critical patent/CN104516947B/en
Publication of CN104516947A publication Critical patent/CN104516947A/en
Application granted granted Critical
Publication of CN104516947B publication Critical patent/CN104516947B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374Thesaurus
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A Chinese microblog sentiment analysis method that fuses explicit and implicit features, comprising the following steps: 1) explicit feature processing of microblogs, including 1.1) emoticon processing and 1.2) sentiment word processing; 2) implicit feature processing of microblogs: initial sentiment clusters are created from frequent itemsets, so that the texts of each initial sentiment cluster contain a frequent itemset; the initial sentiment clusters are separated by the maximum semantic membership principle using the HowNet Chinese semantic similarity model; finally, an inter-cluster semantic similarity matrix is defined, agglomerative hierarchical clustering of the microblog sentiment clusters is performed, and the final sentiment clusters are obtained by optimization, thereby realizing microblog sentiment analysis. The invention provides a flexible and reliable Chinese microblog sentiment analysis method that fuses explicit and implicit features.

Description

A Chinese Microblog Sentiment Analysis Method Fusing Explicit and Implicit Features

Technical field

The invention relates to the technical field of Internet public opinion content analysis, and in particular to a Chinese microblog sentiment analysis method.

Background

Sentiment analysis is the process of analyzing, processing, summarizing, and reasoning about subjective text that carries emotional coloring. Its purpose is to extract users' opinions from the subjective text they publish and to determine its sentiment polarity.

Because human emotions are complex, there is no unified standard for dividing sentiment categories. A common approach splits the task in two: binary classification of subjective versus objective information, and sentiment classification of subjective information, which includes the most common positive/negative binary classification as well as finer-grained multi-class classification. For multi-class classification, some studies propose four categories (anger, disgust, happiness, sadness) and others seven (anger, disgust, fear, happiness, like, sadness, surprise).

As for sentiment monitoring methods, work outside China has proposed distant supervision for classifying the sentiment of Twitter messages: given a query term, messages are automatically classified as positive or negative, Twitter messages containing emoticons are extracted as the training set, and classifiers such as naive Bayes, maximum entropy, and support vector machines are applied. Within China, a hierarchical multi-strategy method has been proposed for sentiment monitoring of Sina Weibo data, using topic-related features during feature extraction. Experimental results show that topic-related features raised the highest accuracy from 66.467% to 67.283%, but the analysis procedure of that method is cumbersome.

Microblogs are original and unpredictable. A single microblog is limited to 140 characters and mixes explicit features, such as Internet slang and emoticons, with implicit features, such as the semantic sentiment of the text, which poses new challenges for microblog sentiment analysis. Homophones and abbreviations are widespread in microblogs: for example, "稀饭" (literally "porridge") stands for "喜欢" ("like"), and "杯具" (literally "cups") stands for "悲剧" ("tragedy"). Such words change over time and new ones keep appearing, so a dedicated dictionary of Internet terms is needed. Microblog emoticons usually express sentiment directly, but they are highly varied, so a dedicated sentiment classification of emoticons is needed. In addition, one microblog may contain several different sentiments, and sentiment analysis generally takes the blogger's main sentiment as the result. The prior art cannot analyze the sentiment of Chinese microblogs.

Summary of the invention

To overcome the inability of the prior art to analyze Chinese microblog sentiment, the present invention provides a flexible and reliable Chinese microblog sentiment analysis method that fuses explicit and implicit features.

The technical solution adopted by the invention to solve this technical problem is as follows:

A Chinese microblog sentiment analysis method fusing explicit and implicit features, comprising the following steps:

1) Explicit feature processing of microblogs, which comprises:

1.1) Emoticon processing: a sentiment symbol library is built from the emoticons provided by the microblog platform. Following a seven-category sentiment scheme, sentiments are divided into happiness, like, anger, sadness, fear, disgust, and surprise. The 150 most frequent emoticons are normalized: a sentiment symbol table is first created and the 150 emoticons are placed in it; a table lookup then determines whether a symbol belongs to the table, and if so the symbol is extracted, converted to its sentiment category, and written to the sentiment feature table;

1.2) Sentiment word processing: first, a sentiment word table based on a sentiment dictionary is created and the sentiment words are placed in it; after word segmentation, a table lookup determines whether each word is a sentiment word, and if so the word is extracted and written to the sentiment feature table;

then, a sentiment word table based on Internet vocabulary is created and the Internet terms found in microblogs are placed in it, so that the sentiment category of some microblog content can be determined by table lookup;

2) Implicit feature processing of microblogs: initial sentiment clusters are created from frequent itemsets, so that the texts of each initial sentiment cluster contain a frequent itemset; the initial sentiment clusters are separated by the maximum semantic membership principle using the HowNet Chinese semantic similarity model; finally, an inter-cluster semantic similarity matrix is defined, agglomerative hierarchical clustering of the microblog sentiment clusters is performed, and the final sentiment clusters are obtained by optimization, thereby realizing microblog sentiment analysis.

Further, step 2) comprises the following process:

2.1) Mining frequent word sets with the Apriori frequent itemset mining algorithm

Initial sentiment clusters are constructed by partitioning on frequent itemsets: the microblogs that contain a given frequent word set are grouped into one cluster, which yields the initial sentiment clusters based on frequent itemsets. The frequent itemset that describes an initial sentiment cluster also serves as the temporary identifier of that cluster, and the frequent itemsets extracted from each initial sentiment cluster represent its sentiment semantics;

2.2) Reducing overlap among initial clusters by microblog semantic membership

Each microblog is assigned to exactly one sentiment cluster: the sentiment semantic membership of the overlapping microblogs with respect to each initial sentiment cluster is computed, and clusters are then assigned by the maximum semantic membership principle. Empty clusters, whose size is 0 after the initial clusters are separated, are deleted. The initial clusters that remain after overlap reduction are called candidate sentiment clusters;

2.3) Agglomerative sentiment clustering based on semantic similarity: agglomerative hierarchical clustering is performed on the candidate sentiment clusters, merging sentiment clusters.

Further, in step 2.1):

Definition 1: for an itemset X in a database E, if the proportion of transactions in E that contain X exceeds a preset ratio, X is called a frequent itemset of E, and this preset ratio is called the minimum support;

If a text is regarded as a transaction and its words as the items of that transaction, a text d can be written as $d = \langle t_1, t_2, \ldots, t_n \rangle$, where n is the number of feature words contained in d;

Definition 2: for a word set W of a text set D, if the support of W in D satisfies $s(W) \ge min\_s$, W is called a frequent word set of D, where min_s is the global minimum support;

The text set D is scanned, the occurrences of the candidate itemsets are counted from word frequencies, and the itemsets that satisfy the minimum support min_s are collected as frequent itemsets; the frequent k-itemsets are used to construct strong association rules and to generate the candidate (k+1)-itemsets, and this is iterated until the set of candidate (k+1)-itemsets is empty.

Further, in step 2.2):

Definition 3: if microblog $doc_j$ is assigned to the initial sentiment cluster $C_i$, then $doc_j$ is said to support cluster $C_i$;

Definition 4: let $D_i$ and $D_j$ be the sets of microblogs that support clusters $C_i$ and $C_j$; if $D_i \cap D_j \neq \emptyset$, clusters $C_i$ and $C_j$ are said to overlap;

Definition 5 (microblog sentiment semantic membership): the invention defines the sentiment semantic membership of microblog $doc_j$ with respect to cluster $C_i$ as

$$\mathrm{Score}(C_i \leftarrow doc_j) = \frac{\sum_{l=1}^{n} \max_{k=1,2,\ldots,m} \{\mathrm{sim}(f_{ik}, t_{jl})\}}{n}$$

where the cluster frequent 1-itemsets $\{f_{i1}, f_{i2}, \ldots, f_{im}\}$ are the sentiment feature items of the initial cluster $C_i$; $\{t_{j1}, t_{j2}, \ldots, t_{jn}\}$ are the feature items of microblog text $doc_j$ in the initial cluster $C_i$; $\mathrm{sim}(f_{ik}, t_{jl})$ is the semantic similarity between cluster feature item $f_{ik}$ and text feature item $t_{jl}$ as defined in HowNet; n is the number of feature items of $doc_j$; and m is the number of cluster feature items.

Further, in step 2.3):

Definition 6 (cluster feature vector): for a candidate sentiment cluster $CT_i$, the cluster frequent 1-itemsets mined from $CT_i$ constitute the feature vector of that cluster;

Definition 7 (cluster similarity matrix): let the feature vectors of two different candidate sentiment clusters $CT_i$ and $CT_j$ contain n and m feature words respectively; the cluster semantic similarity matrix formed by the feature items of $CT_i$ and $CT_j$ is defined as in Table 1;

Table 1

Definition 8 (semantic similarity of sentiment clusters): the k feature item pairs with the largest semantic similarity in the similarity matrix are selected to compute the similarity between candidate sentiment clusters, denoted $\{\mathrm{sim}(t_i t_j)_1, \mathrm{sim}(t_i t_j)_2, \ldots, \mathrm{sim}(t_i t_j)_k\}$, and the semantic similarity of the candidate sentiment clusters is defined as:

$$\mathrm{sim}(CT_i, CT_j) = \frac{\sum_{l=1}^{k} \mathrm{sim}(t_i t_j)_l}{k}$$

The agglomerative sentiment clustering procedure based on semantic similarity is as follows:

Step 1: extract the feature vector of each candidate sentiment cluster and compute the semantic similarity between candidate sentiment clusters;

Step 2: build the semantic similarity matrix of the candidate sentiment clusters; by the definition of cluster similarity, $\mathrm{sim}(CT_i, CT_j) = \mathrm{sim}(CT_j, CT_i)$, so the similarity matrix is symmetric;

Step 3: select the largest inter-cluster similarity in the similarity matrix, denoted $\max\{\mathrm{sim}(CT_i, CT_j)\}$; if $\max\{\mathrm{sim}(CT_i, CT_j)\} \le \lambda$, go to Step 6; otherwise, go to Step 4;

Step 4: if $\max\{\mathrm{sim}(CT_i, CT_j)\} > \lambda$, $CT_i$ and $CT_j$ are highly similar, so the two clusters are merged into a new cluster $CT_i'$, the original $CT_i$ is deleted, the cluster feature vectors are recomputed, and the semantic similarity matrix is updated;

Step 5: if the number of rows (or columns) of the inter-cluster semantic similarity matrix is less than or equal to the preset minimum number of clusters $\mu$, go to Step 6; otherwise, clustering has not finished and the procedure returns to Step 3;

Step 6: agglomerative hierarchical clustering ends, yielding the sentiment clusters $CT'$.
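Steps 1 to 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `sim` is a caller-supplied word similarity function standing in for the HowNet similarity, each cluster is represented only by its list of feature words, the merged cluster's feature vector is simply the union of the two merged vectors (the patent instead re-mines the frequent 1-itemsets), both merged clusters are replaced by the new one, and when fewer than k word pairs exist the mean is taken over the pairs available. The names `cluster_similarity` and `agglomerate` are hypothetical.

```python
from itertools import combinations


def cluster_similarity(features_a, features_b, sim, k):
    """Mean of the k largest pairwise word similarities (Definition 8)."""
    scores = sorted((sim(a, b) for a in features_a for b in features_b),
                    reverse=True)
    top = scores[:k]  # assumption: use fewer pairs if fewer than k exist
    return sum(top) / len(top) if top else 0.0


def agglomerate(clusters, sim, k, lam, mu):
    """Merge the most similar pair until similarity <= lam or count <= mu."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        # Steps 1-3: find the most similar pair of clusters.
        best_pair, best_sim = None, -1.0
        for i, j in combinations(range(len(clusters)), 2):
            s = cluster_similarity(clusters[i], clusters[j], sim, k)
            if s > best_sim:
                best_pair, best_sim = (i, j), s
        if best_sim <= lam:  # Step 3: nothing similar enough, stop
            break
        # Step 4: merge; assumption: new feature vector is the union.
        i, j = best_pair
        merged = sorted(set(clusters[i]) | set(clusters[j]))
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (i, j)] + [merged]
        if len(clusters) <= mu:  # Step 5: minimum cluster count reached
            break
    return clusters  # Step 6
```

Recomputing every pairwise similarity in each round keeps the sketch short; a fuller implementation would update only the rows and columns of the similarity matrix affected by the merge.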

In step 1.2), a set of negation words is collected, and each sentiment word is checked for a preceding negation word; if one is present, the negation word is written to the sentiment feature table together with the sentiment word.

The technical concept of the invention is as follows. Starting from emoticons, and combining the Chinese ontology resource annotated by the Information Retrieval Laboratory of Dalian University of Technology with the sentiment analysis word set provided by HowNet (both publicly available resources), the invention builds an emoticon library, a sentiment word dictionary, and an Internet term dictionary. Explicit sentiment features are extracted from these and fused with implicit semantic features, and sentiment analysis follows a clustering idea: microblog texts with the same sentiment are highly similar, while texts with different sentiments are less similar. Clustering requires neither a training process nor manual category labels for documents in advance; it relies directly on frequent itemsets and a semantic clustering algorithm, which gives good flexibility and automated processing capability.

The beneficial effects of the invention are mainly higher flexibility and better reliability.

Description of drawings

Figure 1 shows the proportion of sampled microblogs by the number of emoticons they contain.

Figure 2 is a flowchart of the microblog sentiment classification method combining frequent itemsets and semantic clustering.

Figure 3 is a schematic diagram of the sentiment trend for the "Malaysia Airlines" incident.

Detailed description

The invention is further described below with reference to the drawings.

Referring to Figures 1 to 3, a Chinese microblog sentiment analysis method fusing explicit and implicit features is described. Microblog emoticons are an intuitive, explicit sentiment feature, whereas the semantics of the content are implicit and decisive for determining sentiment, so the invention proposes a microblog sentiment analysis method that fuses the two kinds of features. First, a sentiment analysis dictionary, an Internet term dictionary, and an emoticon library are built, and the frequent feature word sets of microblogs are defined; from these, the initial microblog sentiment clusters are obtained using the maximal frequent itemsets. To handle text overlap among the initial clusters, an inter-cluster overlap reduction algorithm based on semantic membership extended for short texts is proposed, which yields fully separated initial clusters. Finally, an agglomerative sentiment clustering method is given based on the cluster semantic similarity matrix.

The Chinese microblog sentiment analysis method of the invention comprises the following three steps:

1) Explicit feature processing of microblogs

1.1) Emoticon processing

Emoticons on English-language microblogs are usually typed by the users themselves, such as ":)". The emoticons provided by the Sina Weibo platform are represented as text enclosed in square brackets, for example "[呵呵]". Emoticons are widely used in microblogs: in a random sample of 5000 Sina Weibo posts, 1071 contained emoticons, about 21%. A single microblog may contain several emoticons. Figure 1 gives sampling statistics of microblogs by the number of emoticons they contain. The results show that the proportion of Sina Weibo users using one emoticon is about 62% and the proportion using two to five is about 30%, which indicates that microblog users prefer to use a single emoticon.

The invention builds the sentiment symbol library from the emoticons provided by Sina Weibo and, following a seven-category sentiment scheme, divides sentiments into happiness, like, anger, sadness, fear, disgust, and surprise. The 150 most frequent emoticons are normalized: a sentiment symbol table is created first and the 150 emoticons are placed in it, as shown in Table 2. A table lookup determines whether a symbol belongs to the sentiment symbol table; if so, the symbol is extracted, converted to its sentiment category, and written to the sentiment feature table. Experiments show that normalizing emoticons helps produce better clustering and thus more accurate sentiment analysis.
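The table lookup for emoticons can be illustrated with a short Python sketch. The three table entries and their categories are assumptions made for illustration only (the patent's table holds 150 emoticons and its contents are not reproduced here), and the names `EMOTICON_TABLE` and `extract_emoticon_features` are hypothetical.

```python
import re

# Hypothetical excerpt of the sentiment symbol table (assumed mappings).
EMOTICON_TABLE = {
    "[呵呵]": "happiness",
    "[怒]": "anger",
    "[泪]": "sadness",
}

# Sina Weibo emoticons appear as text enclosed in square brackets.
BRACKET_PATTERN = re.compile(r"\[[^\[\]]+\]")


def extract_emoticon_features(text):
    """Return the sentiment categories of all known emoticons in text."""
    features = []
    for token in BRACKET_PATTERN.findall(text):
        category = EMOTICON_TABLE.get(token)
        if category is not None:  # ignore bracketed text not in the table
            features.append(category)
    return features
```

Bracketed tokens that are not in the table are skipped, which mirrors the check of whether a symbol belongs to the sentiment symbol table before it is written to the sentiment feature table.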

Table 2: Sentiment categories and typical emoticons of each category

1.2) Sentiment word processing

Sentiment words best reflect the textual sentiment of a microblog, so building the sentiment dictionary and the Internet vocabulary dictionary is the groundwork for determining the sentiment orientation of microblogs.

Chinese sentiment vocabulary classification: sentiment vocabulary is complex and spans many parts of speech, including adjectives, nouns, and adverbs. Selecting sentiment words by part of speech alone is unsound: some nouns, such as "垃圾" ("garbage") and "棒槌" (literally "wooden club", used as an insult), carry negative sentiment, whereas most nouns carry no sentiment, and including them would reduce classification performance. The invention uses the Chinese ontology resource provided by the Information Retrieval Laboratory of Dalian University of Technology, which contains 27467 Chinese sentiment words. As shown in Table 3, a sentiment word table for the sentiment dictionary is created first and these sentiment words are placed in it. After word segmentation, a table lookup determines whether each word is a sentiment word; if so, the word is extracted and written to the sentiment feature table.

Table 3: Sentiment classification of the Chinese ontology resource used by the invention

In addition, a set of negation words found in microblogs, such as "不" ("not"), "没有" ("no"), "不可能" ("impossible"), and "很难" ("hardly"), is collected. Each sentiment word is checked for a preceding negation word; if one is present, the negation word is written to the sentiment feature table together with the sentiment word.
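A minimal Python sketch of the sentiment word lookup with negation handling is given here. It assumes the microblog has already been segmented into tokens by some external word segmenter; the three lexicon entries and their categories are illustrative assumptions rather than entries quoted from the ontology resource, and `extract_word_features` is a hypothetical name.

```python
# Negation words listed in the description.
NEGATION_WORDS = {"不", "没有", "不可能", "很难"}

# Hypothetical excerpt of the sentiment word table (assumed categories).
SENTIMENT_LEXICON = {"高兴": "happiness", "喜欢": "like", "难过": "sadness"}


def extract_word_features(tokens):
    """Return (negation, word, category) for each sentiment word found.

    tokens is a list of already-segmented words; negation is the
    immediately preceding negation word, or None if there is none.
    """
    features = []
    for i, token in enumerate(tokens):
        if token in SENTIMENT_LEXICON:
            previous = tokens[i - 1] if i > 0 else None
            negation = previous if previous in NEGATION_WORDS else None
            features.append((negation, token, SENTIMENT_LEXICON[token]))
    return features
```

Only the token directly before the sentiment word is inspected, which is one simple reading of "preceded by a negation word"; a wider window would be a different design choice.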

Building the Internet vocabulary dictionary: sentiment expression in microblogs is often original, and new words keep appearing as the Internet develops, including homophones, abbreviations, and Internet slang, so the invention builds an Internet vocabulary dictionary for determining the sentiment orientation of microblogs. A total of 141 Internet terms were collected and organized from social networks, then individually annotated with sentiment and normalized: a sentiment word table of Internet vocabulary is created first and these terms are placed in it. Many Internet terms have an ambiguous sentiment orientation without context, so only terms with clear sentiment are retained; some Internet terms and their sentiment orientation annotations are shown in Table 4. As with the sentiment dictionary, the sentiment category of some microblog content can be determined directly by table lookup in the Internet vocabulary dictionary.

Table 4: Examples of Internet terms and their sentiment orientation annotations

2) Implicit feature processing of microblogs

FIHC (Frequent Itemset-based Hierarchical Clustering) is a text clustering algorithm that is widely used in industry. It is cluster-centered and uses frequent itemsets directly to measure the degree of aggregation between clusters, on the premise that documents belonging to the same category share many frequent itemsets while documents belonging to different categories share few; texts are therefore partitioned using the concept of frequent itemsets.

Both the parts of speech and the semantics of microblog content can be regarded as implicit sentiment features. The invention adopts the FIHC idea of "build clusters first, then remove overlap, then agglomerate" and proposes a new method that combines frequent itemsets with semantic clustering; the main clustering process is shown in Figure 2.

The main flow of sentiment classification is as follows. First, initial sentiment clusters are created from frequent itemsets; the texts of each initial sentiment cluster contain a frequent itemset, which leads to texts being shared among initial sentiment clusters. To remove this overlap more precisely, the HowNet Chinese semantic similarity model is used and the initial sentiment clusters are separated by the maximum semantic membership principle. Finally, an inter-cluster semantic similarity matrix is defined, agglomerative hierarchical clustering of the microblog sentiment clusters is performed, and the final sentiment clusters are obtained by optimization, realizing microblog sentiment analysis.

2.1) Obtaining frequent itemsets

Definition 1: for an itemset X in a database E, if the proportion of transactions in E that contain X exceeds a preset ratio, X is called a frequent itemset of E, and this preset ratio is called the minimum support.

If a text is regarded as a transaction and its words as the items of that transaction, a text d can be written as $d = \langle t_1, t_2, \ldots, t_n \rangle$.

Definition 2: for a word set W of a text set D, if the support of W in D satisfies $s(W) \ge min\_s$, W is called a frequent word set of D, where min_s is the global minimum support.

The invention uses the Apriori frequent itemset mining algorithm to mine frequent word sets.

Algorithm: Apriori

Input: microblog data, minimum cluster support min_s

Output: the frequent itemsets in the microblog data

Method:

Step 1: scan the text set D, count the occurrences of the candidate itemsets from word frequencies, and collect the itemsets that satisfy the minimum support min_s as frequent itemsets;

Step 2: use the frequent k-itemsets to construct strong association rules and to generate the candidate (k+1)-itemsets, iterating until the set of candidate (k+1)-itemsets is empty.
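The two steps can be sketched as a compact, unoptimized Apriori in Python. The sketch assumes each microblog is already reduced to a set of feature words and that min_s is given as a fraction of the documents; it covers only frequent itemset generation, not the association rule construction, and the function name `apriori` is chosen here for illustration.

```python
from itertools import combinations


def apriori(docs, min_s):
    """Return {frequent itemset: support} for docs, a list of word sets."""
    n = len(docs)

    def support(itemset):
        # Fraction of documents that contain every word of the itemset.
        return sum(1 for d in docs if itemset <= d) / n

    # Step 1: frequent 1-itemsets.
    words = {w for d in docs for w in d}
    current = {frozenset([w]) for w in words
               if support(frozenset([w])) >= min_s}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        # Step 2: join frequent k-itemsets into candidate (k+1)-itemsets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune candidates that have an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k))}
        current = {c for c in candidates if support(c) >= min_s}
        k += 1
    return frequent
```

For example, with the documents `{"a","b"}`, `{"a","b","c"}`, `{"a","c"}`, `{"b"}` and `min_s = 0.5`, the itemset `{"b","c"}` appears in only one of four documents and is dropped, so the candidate `{"a","b","c"}` is pruned without being counted.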

Frequent itemsets describe the sentiment information in microblogs. The invention constructs the initial sentiment clusters by partitioning on frequent itemsets: the microblogs that contain a given frequent word set are grouped into one cluster, which yields the initial sentiment clusters based on frequent itemsets. The frequent itemset that describes an initial sentiment cluster also serves as the temporary identifier of that cluster, and the frequent itemsets extracted from each initial sentiment cluster represent its sentiment semantics.
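As a small illustration of this construction, the following Python sketch groups document indices by the frequent itemsets they contain. It assumes the same representation as before (each microblog is a set of feature words) and uses the frequent itemset itself as the temporary cluster identifier; `build_initial_clusters` is a hypothetical name.

```python
def build_initial_clusters(docs, frequent_itemsets):
    """Map each frequent itemset to the indices of the docs containing it."""
    clusters = {}
    for itemset in frequent_itemsets:
        # A document joins every cluster whose itemset it contains,
        # so the initial clusters may overlap.
        clusters[itemset] = {i for i, d in enumerate(docs) if itemset <= d}
    return clusters
```

Because a microblog can contain several frequent itemsets, the same index can land in several clusters, which is the overlap that the next step reduces.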

2.2) Overlap reduction of initial clusters by microblog semantic membership degree: microblog text is concise and informal, the same emotion can be expressed in different ways, and a single microblog may contain several different emotions, so the initial emotion clusters overlap heavily in the texts they contain. Emotion analysis should follow the blogger's main emotion, so each microblog needs to be assigned to a single emotion cluster.

At the semantic level, the invention uses the HowNet (《知网》) semantic database to extend the semantic information, computes the emotional semantic membership degree of the overlapping texts with respect to each initial emotion cluster, and finally assigns them to clusters by the maximum semantic membership principle.

Definition 3: If microblog doc_j is assigned to the initial emotion cluster C_i, then doc_j is said to support cluster C_i.

Definition 4: Let D_i and D_j be the sets of microblogs that support clusters C_i and C_j. If D_i ∩ D_j ≠ ∅, then clusters C_i and C_j are said to overlap.

Definition 5: Microblog emotional semantic membership degree. The invention defines the emotional semantic membership function of microblog doc_j with respect to cluster C_i as:

$$\mathrm{Score}(C_i \leftarrow doc_j) = \frac{\sum_{l=1}^{n} \max_{k=1,2,\ldots,m} \{\mathrm{sim}(f_{ik}, t_{jl})\}}{n}$$

Here the cluster frequent 1-itemset {f_i1, f_i2, ..., f_im} gives the emotional feature items of the initial cluster C_i, and {t_j1, t_j2, ..., t_jn} gives the feature items of the microblog text doc_j in C_i; sim(f_ik, t_jl) is the semantic similarity between the cluster feature item f_ik and the text feature item t_jl as defined in HowNet; n is the number of feature items of doc_j, and m is the number of cluster feature items.
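The membership function of Definition 5 can be illustrated with a short Python sketch. The HowNet similarity is not reproduced here; `toy_sim` is a hypothetical stand-in backed by a small hand-written table, and `membership_score` is an illustrative name.

```python
def membership_score(cluster_features, doc_terms, sim):
    """Score(C_i <- doc_j): for each document term take its best match among
    the cluster feature items, then average over the n document terms."""
    if not doc_terms or not cluster_features:
        return 0.0
    total = sum(max(sim(f, t) for f in cluster_features) for t in doc_terms)
    return total / len(doc_terms)


# Hypothetical similarity table standing in for the HowNet word similarity.
_TABLE = {frozenset({"cry", "plane"}): 0.2}


def toy_sim(a, b):
    if a == b:
        return 1.0
    return _TABLE.get(frozenset({a, b}), 0.0)


score = membership_score(["sad", "cry"], ["sad", "plane"], toy_sim)
```

With this toy table, "sad" matches the cluster feature "sad" with similarity 1.0 and "plane" matches "cry" with 0.2, so the score is (1.0 + 0.2) / 2 = 0.6.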

Algorithm: initial-cluster overlap reduction by microblog semantic membership degree

Input: the overlapping initial clusters C_1, C_2, ..., C_n

Output: the initial clusters C′_1, C′_2, ..., C′_n after overlap reduction

Method:

After the initial-cluster overlap reduction has been carried out for the microblogs doc_j, the clusters left empty (size 0) by the separation are deleted, which gives the final candidate emotion clusters.
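The body of this algorithm is not reproduced in the text, so the following Python sketch is only one plausible reading of it, based on the maximum semantic membership principle stated earlier: each microblog that supports several clusters is kept only in the cluster where its membership score is highest, and empty clusters are dropped afterwards. The names `reduce_overlap` and `score` are hypothetical, and ties are broken by cluster order as an arbitrary choice.

```python
def reduce_overlap(clusters, docs, score):
    """Assign every microblog to exactly one cluster.

    clusters: {cluster_id: set of document indices}, possibly overlapping.
    docs: list of word lists.
    score(cluster_id, words) -> float, e.g. a semantic membership degree.
    Returns a new dict with overlaps removed and empty clusters deleted.
    """
    owners = {}
    for cid, members in clusters.items():
        for j in members:
            owners.setdefault(j, []).append(cid)

    reduced = {cid: set() for cid in clusters}
    for j, cids in owners.items():
        # Keep the document only in its highest-scoring cluster.
        best = max(cids, key=lambda cid: score(cid, docs[j]))
        reduced[best].add(j)

    # Delete clusters whose size dropped to 0.
    return {cid: members for cid, members in reduced.items() if members}


def overlap_fraction(cid, words):
    # Toy score: share of the document's words that appear in the cluster id.
    return sum(1 for w in words if w in cid) / len(words)


docs = [["sad", "cry"], ["fear", "fear", "sad"]]
clusters = {("sad",): {0, 1}, ("fear",): {1}}
candidates = reduce_overlap(clusters, docs, overlap_fraction)
```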

2.3) Agglomerative emotion clustering based on semantic similarity

Overlap reduction among the initial emotion clusters yields the candidate emotion clusters for cluster-based microblog emotion detection. Each of these clusters can still be attributed to some broader emotion, so it is necessary to apply agglomerative hierarchical clustering to the candidate emotion clusters and merge them.

Definition 6: Cluster feature vector. For a candidate emotion cluster CT_i, the cluster frequent 1-itemset mined from CT_i forms the cluster feature vector of that cluster, denoted as [formula not reproduced in the source text].

Definition 7: Cluster similarity matrix. Let the cluster feature vectors of two different candidate emotion clusters CT_i and CT_j be [formulas not reproduced in the source text]; the cluster semantic similarity matrix formed by the feature items of CT_i and CT_j is then defined as in Table 5.

Table 5. Definition of the cluster semantic similarity matrix

Definition 8: Semantic similarity of emotion clusters. To keep the many non-key feature words from adding noise to the inter-cluster semantic similarity, only the k feature-item pairs with the highest semantic similarity in the similarity matrix are used to compute the similarity between candidate emotion clusters. These are written {sim(t_i t_j)_1, sim(t_i t_j)_2, ..., sim(t_i t_j)_k}, and the semantic similarity of the candidate emotion clusters is defined as:

$$\mathrm{sim}(CT_i, CT_j) = \frac{\sum_{l=1}^{k} \mathrm{sim}(t_i t_j)_l}{k}$$
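Definition 8 can be sketched in Python as below. The word-level similarity is again a hypothetical stand-in (`toy_sim`) for the HowNet measure, and `cluster_similarity` is an illustrative name. One assumption beyond the source: if fewer than k feature pairs exist, the average is taken over the pairs available.

```python
import heapq


def cluster_similarity(features_i, features_j, sim, k):
    """Average of the k largest entries of the pairwise similarity matrix
    between the feature items of two candidate emotion clusters."""
    scores = [sim(a, b) for a in features_i for b in features_j]
    if not scores:
        return 0.0
    top = heapq.nlargest(k, scores)
    return sum(top) / len(top)


# Hypothetical symmetric similarity table standing in for HowNet.
_TABLE = {frozenset({"cry", "tears"}): 0.8}


def toy_sim(a, b):
    if a == b:
        return 1.0
    return _TABLE.get(frozenset({a, b}), 0.0)


value = cluster_similarity(["sad", "cry"], ["sad", "tears"], toy_sim, 2)
```

Here the four pairwise similarities are 1.0, 0.0, 0.0 and 0.8, so with k = 2 the result is (1.0 + 0.8) / 2 = 0.9. Because the word-level measure is symmetric, swapping the two clusters gives the same value.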

Algorithm: hierarchical clustering of candidate emotion clusters

Input: the candidate emotion clusters CT = {CT_1, CT_2, ..., CT_i},

λ (minimum threshold for merging two clusters), μ (minimum number of clusters)

Output: the emotion clusters CT′

Step 1: Extract the feature vector of each candidate emotion cluster and compute the semantic similarity of the candidate emotion clusters.

Step 2: Build the semantic similarity matrix of the candidate emotion clusters. By the definition of cluster similarity, sim(CT_i, CT_j) = sim(CT_j, CT_i), so the similarity matrix is symmetric.

Step 3: Select the largest inter-cluster similarity in the similarity matrix, written max{sim(CT_i, CT_j)}. If max{sim(CT_i, CT_j)} ≤ λ, go to Step 6; otherwise, go to Step 4.

Step 4: If max{sim(CT_i, CT_j)} > λ, CT_i and CT_j are highly similar, so the two clusters are merged into a new cluster CT_i′, the original CT_i is deleted, the cluster feature vector is recomputed, and the semantic similarity matrix is updated.

Step 5: If the number of rows (or columns) of the inter-cluster semantic similarity matrix is less than or equal to the preset minimum number of clusters μ, go to Step 6; otherwise clustering has not finished, so return to Step 3.

Step 6: Agglomerative hierarchical clustering ends, and the emotion clusters CT′ are obtained.
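Steps 3 to 6 can be sketched as the loop below. This is an illustration under simplifying assumptions, not the patent's implementation: each cluster is represented only by its set of feature words, merging two clusters takes the union of their feature sets (the source instead recomputes the feature vector from the merged cluster), and the similarity function is passed in by the caller. The names `agglomerate` and `jaccard` are hypothetical.

```python
def agglomerate(clusters, similarity, lam, mu):
    """Merge the most similar pair of clusters until the best similarity
    is <= lam (Step 3) or at most mu clusters remain (Step 5)."""
    clusters = [set(c) for c in clusters]
    while len(clusters) >= 2:
        # Step 3: find the most similar pair.
        best, pair = float("-inf"), None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = similarity(clusters[a], clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        if best <= lam:
            break
        # Step 4: merge the pair and drop the two originals.
        a, b = pair
        merged = clusters[a] | clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
        # Step 5: stop once the minimum number of clusters is reached.
        if len(clusters) <= mu:
            break
    return clusters


def jaccard(x, y):
    # Toy cluster similarity used only for this example.
    return len(x & y) / len(x | y)


merged = agglomerate(
    [{"sad", "cry"}, {"sad", "tears"}, {"happy"}], jaccard, lam=0.2, mu=1
)
```

In this example the first two clusters have similarity 1/3 > λ and are merged; the merged cluster shares no word with {"happy"}, so the loop stops with two clusters even though μ = 1.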

In this embodiment, the microblog emoticon set, the emotion vocabulary set and the like undergo unified feature processing. The emotion word set selected in this way both reduces the dimensionality of the text features and preserves the explicit emotional information of the original microblog collection.

The initial explicit-emotion clusters are obtained by maximal frequent itemset clustering. The semantic information implicit in the short texts is extended through the HowNet semantic database before the microblog semantic similarity is computed, and an initial-cluster overlap reduction method based on partitioning by semantic membership degree is proposed.

By defining the semantic similarity between initial clusters, an agglomerative hierarchical clustering method for microblog emotions is given. The clustering parameters can be tuned to obtain the best microblog emotion classification, and accurate emotion analysis is finally achieved on the basis of the classification results.

All algorithms and implementation steps of the microblog emotion analysis method disclosed by the invention are theoretically well founded and described in detail, give accurate analysis results, and can be widely applied to tasks such as public opinion monitoring on social networks.

Example: To verify how well the proposed method detects the emotions that microblogs express about a given event, 44,524 microblogs about the "Malaysia Airlines incident" posted between March 8, 2014 and May 12, 2014 were collected from Sina Weibo Square by keyword search. The emotional changes over the course of the "Malaysia Airlines" incident are shown in Figure 3.

Combining the emotional trend in Figure 3 with the actual development of the incident, several key dates are analyzed below:

March 8: The Malaysia Airlines official website released its first statement, confirming that flight MH370 had lost contact with the control tower at 2:40 on March 8, Beijing time. The microblog emotions were "sadness", "surprise" and "fear", reflecting the public's worry for the passengers and its shock and fear about the airline's safety; "happiness" and "liking" were at a low level.

March 9: The Malaysian Minister of Transport confirmed that the two passengers traveling on false passports held consecutively numbered tickets. With no news of the missing aircraft for more than 40 hours, "sadness" rose markedly, and with the false-passport incident, "fear" and "disgust" rose as well.

March 10: Malaysian officials acknowledged that the missing flight might have been hijacked. "Sadness" persisted, and because of the possible hijacking and the suspicion of a terrorist attack, "fear" continued to rise.

March 12: Malaysia was questioned over whether it had deliberately withheld information or delayed the search and rescue. "Anger" therefore rose sharply, accounting for 58% of that day's microblogs.

March 24: The Malaysian Prime Minister held a press conference announcing that Malaysia Airlines flight MH370, missing for many days, had crashed into the southern Indian Ocean with no survivors. "Sadness" reached its peak as the public expressed deep grief at the news.

As time passed, the incident entered its later stage of reflection and handling, and public attention gradually shifted elsewhere.

Claims (6)

1. A Chinese microblog emotion analysis method fusing explicit and implicit features, characterized in that the method comprises the following steps:
1) explicit feature processing of microblogs, comprising:
1.1) emoticon processing: an emoticon library is built from the emoticons provided by the microblog platform; following a seven-class emotion scheme, emotions are divided into happiness, liking, anger, sadness, fear, disgust and surprise; the 150 most frequently occurring emoticons are given unified processing, that is, an emoticon table is first created and the 150 emoticons are placed in it; a table lookup determines whether an emoticon belongs to the emoticon table, and if so the emoticon is extracted, converted to its emotion class, and written to the emotion feature table;
1.2) emotion word processing: an emotion vocabulary is built from a sentiment dictionary and the emotion words are placed in the vocabulary; after word segmentation of the text, a table lookup determines whether each word is an emotion word, and if so the emotion word is extracted and written to the emotion feature table;
an emotion vocabulary of internet slang is first built and the slang terms are placed in the vocabulary, and the emotion class of part of the microblog content is determined by table lookup;
2) implicit feature processing of microblogs: initial emotion clusters are created from frequent itemsets, the texts of each initial emotion cluster containing the frequent itemset; using the HowNet Chinese semantic similarity model, the initial emotion clusters are separated by the maximum semantic membership principle; finally, by defining an inter-cluster semantic similarity matrix, agglomerative hierarchical clustering of the microblog emotion clusters is completed and the final emotion clusters are obtained through optimization, realizing microblog emotion analysis.
2. The Chinese microblog emotion analysis method fusing explicit and implicit features as claimed in claim 1, characterized in that step 2) comprises:
2.1) the Apriori frequent-itemset mining algorithm is used to mine the frequent word sets;
the initial emotion clusters are built by partitioning on frequent itemsets: the microblogs that contain a given frequent trend word set are placed in one cluster, yielding the frequent-itemset-based initial emotion clusters; the frequent itemset describing an initial emotion cluster also serves as the temporary identifier of that cluster, and the frequent itemset extracted from each initial emotion cluster represents the emotional semantics of that cluster;
2.2) overlap reduction of initial clusters by microblog semantic membership degree:
each microblog is assigned to a single emotion cluster; the emotional semantic membership degree of the inter-cluster overlapping texts with respect to each initial emotion cluster is computed, and clusters are finally assigned by the maximum semantic membership principle; the clusters left empty (size 0) after separation are then deleted, and the initial clusters after overlap reduction are called candidate emotion clusters;
2.3) agglomerative emotion clustering based on semantic similarity: agglomerative hierarchical clustering is applied to the candidate emotion clusters to merge emotion clusters.
3. The Chinese microblog emotion analysis method fusing explicit and implicit features as claimed in claim 2, characterized in that, in step 2.1),
Definition 1: for an itemset X in a database E, if the frequency with which X occurs in E is greater than a preset ratio, X is called a frequent itemset of E, and the preset ratio is called the minimum support;
if a text is regarded as a transaction and the words of the text as the items of that transaction, a text d can be written as d = <t_1, t_2, ..., t_n>, where n is the number of feature words in d;
Definition 2: for a word set W over the text set D, if the support of W in D satisfies s(W) ≥ min_s, W is called a frequent word set of D, and min_s is the global minimum support;
the text set D is scanned, the occurrences of each candidate itemset are counted using the word-frequency trend degree, and the itemsets that meet the minimum support min_s are collected as frequent itemsets; strong association rules are constructed from the frequent k-itemsets, the candidate (k+1)-itemsets are generated from the frequent k-itemsets, and the process iterates until the set of candidate (k+1)-itemsets is empty.
4. The Chinese microblog emotion analysis method fusing explicit and implicit features as claimed in claim 2, characterized in that, in step 2.2),
Definition 3: if microblog doc_j is assigned to the initial emotion cluster C_i, doc_j is said to support cluster C_i;
Definition 4: let D_i and D_j be the sets of microblogs that support clusters C_i and C_j; if D_i ∩ D_j ≠ ∅, clusters C_i and C_j are said to overlap;
Definition 5: microblog emotional semantic membership degree; the emotional semantic membership function of microblog doc_j with respect to cluster C_i is defined as:

$$\mathrm{Score}(C_i \leftarrow doc_j) = \frac{\sum_{l=1}^{n} \max_{k=1,2,\ldots,m} \{\mathrm{sim}(f_{ik}, t_{jl})\}}{n}$$

where the cluster frequent 1-itemset {f_i1, f_i2, ..., f_im} gives the emotional feature items of the initial cluster C_i, and {t_j1, t_j2, ..., t_jn} gives the feature items of the microblog text doc_j in C_i; sim(f_ik, t_jl) is the semantic similarity between the cluster feature item f_ik and the text feature item t_jl as defined in HowNet; n is the number of feature items of doc_j, and m is the number of cluster feature items.
5. The Chinese microblog emotion analysis method fusing explicit and implicit features as claimed in claim 2, characterized in that, in step 2.3),
Definition 6: cluster feature vector; for a candidate emotion cluster CT_i, the cluster frequent 1-itemset mined from CT_i forms the cluster feature vector of that cluster, denoted as [formula not reproduced in the source text];
Definition 7: cluster similarity matrix; let the cluster feature vectors of two different candidate emotion clusters CT_i and CT_j be [formulas not reproduced in the source text], where n and m respectively denote the numbers of feature words; the cluster semantic similarity matrix formed by the feature items of CT_i and CT_j is then defined as in Table 1;
Table 1
Definition 8: semantic similarity of emotion clusters; the k feature-item pairs with the highest semantic similarity in the similarity matrix are selected to compute the similarity between candidate emotion clusters, written {sim(t_i t_j)_1, sim(t_i t_j)_2, ..., sim(t_i t_j)_k}, and the semantic similarity of the candidate emotion clusters is defined as:

$$\mathrm{sim}(CT_i, CT_j) = \frac{\sum_{l=1}^{k} \mathrm{sim}(t_i t_j)_l}{k}$$

The agglomerative emotion clustering based on semantic similarity proceeds as follows:
Step 1: extract the feature vector of each candidate emotion cluster and compute the semantic similarity of the candidate emotion clusters;
Step 2: build the semantic similarity matrix of the candidate emotion clusters; by the definition of cluster similarity, sim(CT_i, CT_j) = sim(CT_j, CT_i), so the similarity matrix is symmetric;
Step 3: select the largest inter-cluster similarity in the similarity matrix, written max{sim(CT_i, CT_j)}; if max{sim(CT_i, CT_j)} ≤ λ, go to Step 6; otherwise, go to Step 4;
Step 4: if max{sim(CT_i, CT_j)} > λ, CT_i and CT_j are highly similar, so the two clusters are merged into a new cluster CT_i′, the original CT_i is deleted, the cluster feature vector is recomputed, and the semantic similarity matrix is updated;
Step 5: if the number of rows (or columns) of the inter-cluster semantic similarity matrix is less than or equal to the preset minimum number of clusters μ, go to Step 6; otherwise clustering has not finished, so return to Step 3;
Step 6: agglomerative hierarchical clustering ends, and the emotion clusters CT′ are obtained.
6. The Chinese microblog emotion analysis method fusing explicit and implicit features as claimed in any one of claims 1 to 5, characterized in that, in step 1.2), a set of negation words is collected and each emotion word is checked for a preceding negation word; if one is present, the negation word and the emotion word are written to the emotion feature table together.
CN201410723617.6A 2014-12-03 2014-12-03 A kind of Chinese microblog emotional analysis method for merging dominant and recessive character Expired - Fee Related CN104516947B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410723617.6A CN104516947B (en) 2014-12-03 2014-12-03 A kind of Chinese microblog emotional analysis method for merging dominant and recessive character

Publications (2)

Publication Number Publication Date
CN104516947A true CN104516947A (en) 2015-04-15
CN104516947B CN104516947B (en) 2017-08-22

Family

ID=52792246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410723617.6A Expired - Fee Related CN104516947B (en) 2014-12-03 2014-12-03 A kind of Chinese microblog emotional analysis method for merging dominant and recessive character

Country Status (1)

Country Link
CN (1) CN104516947B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110225043A1 (en) * 2010-03-12 2011-09-15 Yahoo! Inc. Emotional targeting
CN102163191A (en) * 2011-05-11 2011-08-24 北京航空航天大学 Short text emotion recognition method based on HowNet
CN102663046A (en) * 2012-03-29 2012-09-12 中国科学院自动化研究所 Sentiment analysis method oriented to micro-blog short text
CN103077207A (en) * 2012-12-28 2013-05-01 深圳先进技术研究院 Method and system for analyzing microblog happiness index
CN103544242A (en) * 2013-09-29 2014-01-29 广东工业大学 Microblog-oriented emotion entity searching system

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354216A (en) * 2015-09-28 2016-02-24 哈尔滨工业大学 Chinese microblog topic information processing method
CN105354216B (en) * 2015-09-28 2018-09-07 哈尔滨工业大学 A kind of Chinese microblog topic information processing method
CN106354856B (en) * 2016-09-05 2020-02-21 北京百度网讯科技有限公司 Artificial intelligence-based deep neural network enhanced search method and device
CN106354856A (en) * 2016-09-05 2017-01-25 北京百度网讯科技有限公司 Enhanced deep neural network search method and device based on artificial intelligence
CN106445914A (en) * 2016-09-13 2017-02-22 清华大学 Microblog emotion classifier establishing method and device
CN106445914B (en) * 2016-09-13 2020-06-19 清华大学 Construction method and construction device of microblog emotion classifier
CN107943800A (en) * 2016-10-09 2018-04-20 郑州大学 A kind of microblog topic public sentiment calculates the method with analysis
CN106570109A (en) * 2016-11-01 2017-04-19 深圳市前海点通数据有限公司 Method for automatically generating knowledge points of question bank through text analysis
CN106570109B (en) * 2016-11-01 2020-07-24 深圳市点通数据有限公司 Method for automatically generating question bank knowledge points through text analysis
CN106598942A (en) * 2016-11-17 2017-04-26 天津大学 Expression analysis and deep learning-based social network sentiment analysis method
CN107608962A (en) * 2017-09-12 2018-01-19 电子科技大学 Pushing away based on complex network especially big selects data analysing method
CN108334573A (en) * 2018-01-22 2018-07-27 北京工业大学 High relevant microblog search method based on clustering information
CN108334573B (en) * 2018-01-22 2021-02-26 北京工业大学 High-correlation microblog retrieval method based on clustering information
CN108595472A (en) * 2018-03-07 2018-09-28 合肥工业大学 A kind of government website public sentiment monitoring system based on semantic analysis
CN109829634A (en) * 2019-01-18 2019-05-31 北京工业大学 A kind of adaptive patent Research Team, colleges and universities recognition methods
CN110060772B (en) * 2019-01-24 2022-07-01 暨南大学 Occupational psychological character analysis method based on social network
CN110060772A (en) * 2019-01-24 2019-07-26 暨南大学 A kind of job psychograph character analysis method based on social networks
CN109933664A (en) * 2019-03-12 2019-06-25 中南大学 An Improved Method for Fine-Grained Sentiment Analysis Based on Sentiment Word Embedding
CN109933664B (en) * 2019-03-12 2021-09-07 中南大学 An Improved Method for Fine-Grained Sentiment Analysis Based on Sentiment Word Embedding
CN110619073A (en) * 2019-08-30 2019-12-27 北京影谱科技股份有限公司 Method and device for constructing video subtitle network expression dictionary based on Apriori algorithm
CN110619073B (en) * 2019-08-30 2022-04-22 北京影谱科技股份有限公司 Method and device for constructing video subtitle network expression dictionary based on Apriori algorithm
CN111221962A (en) * 2019-11-18 2020-06-02 重庆邮电大学 A Text Sentiment Analysis Method Based on New Word Expansion and Complex Sentence Expansion
CN111221962B (en) * 2019-11-18 2023-05-26 重庆邮电大学 Text emotion analysis method based on new word expansion and complex sentence pattern expansion
CN111581355A (en) * 2020-05-13 2020-08-25 杭州安恒信息技术股份有限公司 Subject detection method, apparatus and computer storage medium for threat intelligence
CN111581355B (en) * 2020-05-13 2023-07-25 杭州安恒信息技术股份有限公司 Threat information topic detection method, device and computer storage medium
CN112905736A (en) * 2021-01-27 2021-06-04 郑州轻工业大学 Unsupervised text emotion analysis method based on quantum theory
CN112905736B (en) * 2021-01-27 2023-09-19 郑州轻工业大学 An unsupervised text sentiment analysis method based on quantum theory
CN113609865A (en) * 2021-08-09 2021-11-05 上海明略人工智能(集团)有限公司 Text emotion recognition method and device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN104516947B (en) 2017-08-22

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20181212

Address after: 320000 Zhejiang Hangzhou Xihu District Xixi Xintiandi commercial center (AD) 11 buildings 5 floors 501 rooms

Patentee after: HANGZHOU YUNPINLYU INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 310014 Zhejiang University of Technology, 18 Zhaowang Road, Zhaohui six District, Hangzhou, Zhejiang

Patentee before: Zhejiang University of Technology

CP03 Change of name, title or address

Address after: 310000 Zhejiang Hangzhou Xihu District Xixi Xintiandi commercial center (AD) 11 buildings 5 floors 501 rooms

Patentee after: Hangzhou zero seven Technology Co.,Ltd.

Address before: 320000 Zhejiang Hangzhou Xihu District Xixi Xintiandi commercial center (AD) 11 buildings 5 floors 501 rooms

Patentee before: HANGZHOU YUNPINLYU INFORMATION TECHNOLOGY Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170822