CN106446147A - Emotion analysis method based on structuring features

Info

Publication number: CN106446147A
Application number: CN201610839375.6A
Authority: CN
Inventors: 苏育挺; 王慧晶; 张静
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2016-09-20
Filing date: 2016-09-20
Publication date: 2017-02-22

Abstract

The invention discloses a sentiment analysis method based on structured features. The method comprises: collecting Twitter text data and building a Twitter text database; collecting existing sentiment polarity value dictionaries; manually building auxiliary dictionaries; preprocessing the Twitter text database; defining a sentiment score influence factor, extracting linguistic features from the preprocessed data, and updating the value of the influence factor each time a linguistic feature is extracted; and calculating the sentiment polarity value of each Twitter text from the sentiment polarity value dictionaries and the sentiment score influence factor. The method avoids the drawbacks of supervised methods, which need a large amount of labeled data to train a classifier and are difficult to analyze and generalize, and it reduces the computational overhead in CPU processing, memory, and training time.

Description

A Sentiment Analysis Method Based on Structural Features

Technical Field

The invention relates to a sentiment analysis method, and in particular to an unsupervised sentiment analysis method based on structured features.

Background

With the emergence and popularity of social media, more and more users share their opinions or simply express their feelings and emotions on social media platforms. Twitter has become one of the most popular of these platforms: according to 2016 statistics, it has more than 645,000,000 registered users, who post more than 190,000,000 tweets per day on average. Twitter's API gives access to a large amount of rich data that can be explored and mined, which makes it a good opportunity for sentiment analysis. Such analysis helps infer the public's views on all kinds of topics, and its conclusions support better-informed predictions and choices. Sentiment analysis of Twitter text data has therefore become an active research topic.

Sentiment analysis of Twitter text data mainly involves natural language processing, opinion mining, and sentiment classification. There are currently two main approaches. The first is the lexicon-based unsupervised approach, which relies on sentiment dictionaries that hold a large amount of sentiment polarity information, such as LIWC [1], ANEW [2], AFINN [3], VADER [4], and SentiWordNet [5]. The second is the supervised approach, which uses machine learning algorithms such as SVM (Support Vector Machine), Bayes, and DecisionTree to train a classifier on features extracted from a large amount of labeled data. The most commonly used features are the presence or frequency of n-grams (sequences of 1, 2, 3, or more consecutive independent text units in a text). However, the supervised approach needs a large amount of labeled data in the training phase, so it is computationally expensive in terms of CPU processing, memory, and training time. Moreover, for a considerable portion of the data, the decision score predicted by a supervised classifier lies very close to the decision boundary, which suggests that the classifier is very uncertain about which class the text belongs to; the labels assigned to such data are therefore either completely wrong or correct only by chance [6]. For these reasons, the lexicon-based unsupervised approach is preferred here.

The main challenges in sentiment analysis of Twitter text data stem from the characteristics of Twitter text itself. A tweet is limited to 140 characters, so it carries relatively little information. Besides its irregular language structure and grammar, a tweet may contain many abbreviations, emoticons, hashtags, slang terms, URLs, and so on, which makes sentiment extraction and opinion mining difficult. Conventional natural language processing (NLP) techniques such as word segmentation, normalization, and part-of-speech tagging work well on conventionally written, well-formed text but do not carry over to Twitter data.

Summary of the Invention

The technical problem to be solved by the invention is to provide a sentiment analysis method based on structured features that avoids the need of supervised methods for a large amount of labeled data to train a classifier.

The technical solution adopted by the invention is a sentiment analysis method based on structured features, comprising the following steps:

1) Collect Twitter text data and build a Twitter text database;

2) Collect existing sentiment polarity value dictionaries, giving preference to manually created sentiment dictionaries;

3) Manually build auxiliary dictionaries, including a standard word dictionary, a negation word dictionary, an enhancing-modifier dictionary, a weakening-modifier dictionary, and an Internet slang dictionary;

4) Preprocess the Twitter text database, including:

(1) first performing word segmentation on the data in the Twitter text database;

(2) performing normalization;

(3) performing part-of-speech tagging (POS tagging) on the text;

5) Define a sentiment score influence factor and extract linguistic features from the information obtained by the preprocessing in step 4), where the linguistic features include word-level, phrase-level, and sentence-level features, and the value of the influence factor is updated each time a linguistic feature is extracted;

6) Calculate a sentiment polarity value for each Twitter text from the sentiment polarity value dictionaries obtained in step 2) and the sentiment score influence factor obtained in step 5).

The sentiment polarity value dictionaries in step 2) include three manually created sentiment dictionaries, AFINN, SentiStrength, and VADER, and one automatically generated sentiment dictionary, Opinion Observer.

The word segmentation in sub-step (1) of step 4) splits the Twitter text data into the smallest meaningful independent text units and labels each independent text unit with its type.
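A minimal sketch of such a segmentation step in Python, assuming regular expressions are used to recognize each type of independent text unit. The type names (URL, HASHTAG, EMOTICON, WORD, NUMBER, PUNCT) and the patterns are illustrative assumptions, not rules taken from the patent:

```python
import re

# One named group per unit type; earlier alternatives win, so URLs and
# hashtags are recognized before plain words. All patterns are assumptions.
TOKEN_RE = re.compile(r"""
    (?P<URL>https?://\S+)
  | (?P<HASHTAG>\#\w+)
  | (?P<EMOTICON>[:;=][-o^']?[)(\]\[DPp/\\|])
  | (?P<WORD>[A-Za-z]+(?:'[A-Za-z]+)*)
  | (?P<NUMBER>\d+)
  | (?P<PUNCT>[^\w\s])
""", re.VERBOSE)

def tokenize(text):
    """Split text into (unit, type) pairs; unmatched characters are skipped."""
    return [(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(text)]
```

For example, `tokenize("LOVE it :) #happy")` yields the four units `LOVE`, `it`, `:)`, and `#happy`, labeled WORD, WORD, EMOTICON, and HASHTAG. The order of the alternatives matters: placing WORD first would split a URL into fragments.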

The normalization in sub-step (2) of step 4) uses a standard English dictionary to restore independent text units that use repeated letters to their standard form, recognizes the emoticons, emoji, and emotext in the Twitter text data, and determines and labels their sentiment polarity.

The part-of-speech tagging in sub-step (3) of step 4) labels each independent text unit with its part-of-speech category.

Defining the sentiment score influence factor in step 5) means introducing, for each independent text unit t, a sentiment score influence factor $IF_t$, where $IF_t \geq 0$ and the initial value is 1. It reflects the degree to which the linguistic features strengthen or weaken the sentiment intensity of the independent text unit. The influence factor is updated as follows:

$$IF_t^{new} = IF_t^{old} \cdot 2^{\log_{10}\left(\sum_{p \in P} p\right)} \qquad (1)$$

where $IF_t^{new}$ is the sentiment score influence factor of the independent text unit t after the update, $IF_t^{old}$ is the influence factor of t before the update, p is a feature, and P is the set of all features that can affect the influence factor.
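As a reading aid for update rules of the form $IF_t^{new} = IF_t^{old} \cdot 2^{\log_{10} x}$, the multiplier can be rewritten as a power of x. Here x stands for whatever feature value appears inside the logarithm; the symbol is introduced only for this illustration:

```latex
2^{\log_{10} x} = 10^{\log_{10} 2 \,\cdot\, \log_{10} x} = x^{\log_{10} 2} \approx x^{0.301},
\qquad
x = 1 \mapsto 1, \quad x = 2 \mapsto \approx 1.23, \quad x = 10 \mapsto 2 .
```

A feature value of 1 therefore leaves the factor unchanged, values above 1 amplify it sublinearly, and values between 0 and 1 damp it.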

The word-level linguistic feature extraction in step 5) comprises:

If all letters of an independent text unit are capitalized, the all-caps flag $s_t^{AllCaps} = 1$; otherwise $s_t^{AllCaps} = 0$. The sentiment score influence factor $IF_t$ is then updated:

$$IF_t^{new} = IF_t^{old} \cdot 2^{\log_{10}\left(1 + s_t^{AllCaps}\right)} \qquad (2)$$

where $IF_t^{new}$ is the sentiment score influence factor of the independent text unit t after the update and $IF_t^{old}$ is the influence factor of t before the update;

If an independent text unit uses repeated letters, it is assigned an elongation factor $s_t^{ExLen}$, defined in terms of $t_{orig}$, the original independent text unit, and $t_{norm}$, the normalized independent text unit, and the sentiment score influence factor $IF_t$ is updated:

$$IF_t^{new} = IF_t^{old} \cdot 2^{\log_{10}\left(s_t^{ExLen}\right)} \qquad (3)$$
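The all-caps update, $IF_t^{new} = IF_t^{old} \cdot 2^{\log_{10}(1 + s_t^{AllCaps})}$, can be sketched in Python as follows. The function names are hypothetical, and ignoring non-letter characters when testing for capitals is an assumption of this sketch:

```python
import math

def all_caps_flag(token: str) -> int:
    """Return 1 if the token contains letters and all of them are uppercase."""
    letters = [ch for ch in token if ch.isalpha()]
    return int(bool(letters) and all(ch.isupper() for ch in letters))

def apply_all_caps(if_old: float, token: str) -> float:
    """Update the influence factor with the all-caps rule."""
    return if_old * 2 ** math.log10(1 + all_caps_flag(token))
```

A fully capitalized unit such as `GREAT` multiplies its factor by about 1.23, while `great` leaves it unchanged. Note that one-letter words such as `I` also count as all caps under this sketch; a real implementation might exclude them.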

The phrase-level linguistic feature extraction in step 5) comprises:

Using the negation word dictionary built manually in step 3), determine where a negated phrase begins; treat periods, question marks, exclamation marks, and non-standard independent text units as markers of where the negated phrase ends; and update the sentiment score influence factor $IF_t$:

$$IF_t^{new} = IF_t^{old} \cdot (-1) \qquad (4)$$

where t is an independent text unit within the negation scope.
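The negation rule can be sketched as a single pass over the independent text units. This is a minimal illustration under stated assumptions: the `NEGATIONS` set is a tiny hypothetical stand-in for the negation word dictionary, the negation word itself is treated as outside the scope, and `is_standard` is a caller-supplied predicate for deciding which units count as standard:

```python
NEGATIONS = {"not", "no", "never", "cannot"}  # hypothetical sample entries
TERMINATORS = {".", "?", "!"}

def apply_negation(tokens, factors, is_standard=lambda tok: True):
    """Multiply by -1 the factor of every unit inside a negation scope."""
    out = list(factors)
    in_scope = False
    for i, tok in enumerate(tokens):
        if tok in TERMINATORS or not is_standard(tok):
            in_scope = False          # scope ends at punctuation or a non-standard unit
        elif in_scope:
            out[i] = out[i] * -1      # unit lies within the negated phrase
        elif tok.lower() in NEGATIONS:
            in_scope = True           # scope starts after the negation word
    return out
```

For the units `i do not like it . great` with all factors equal to 1, only `like` and `it` are flipped to -1; the period closes the scope, so `great` keeps its factor.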

Using the enhancing-modifier dictionary and weakening-modifier dictionary built manually in step 3), find all modifiers in the Twitter text data and compute the modifier stretch factor according to:

$$s_t^{DM} = \sum_{m \in M_t^{DM}} \left(1 + s_m^{AllCaps} + s_m^{ExLen}\right) \qquad (5)$$

where m is a modifier and $M_t^{DM}$ is the set of modifiers; the modifier all-caps flag $s_m^{AllCaps}$ indicates whether the letters of m are all capitalized, and the repeated-letter flag $s_m^{ExLen}$ indicates whether m uses repeated letters.

The sentiment score influence factor $IF_t$ is then updated:

$$IF_t^{new} = IF_t^{old} \cdot 2^{\log_{10}\left(s_t^{DM}\right)} \qquad (6)$$
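A rough Python sketch of the modifier stretch factor $s_t^{DM} = \sum_{m}(1 + s_m^{AllCaps} + s_m^{ExLen})$ and the update $IF_t^{new} = IF_t^{old} \cdot 2^{\log_{10}(s_t^{DM})}$. It assumes both flags take the values 0 or 1, that the caller already knows whether each modifier was elongated, and that a unit with no modifiers is left unchanged (the logarithm of 0 is undefined); none of these details is confirmed by the text:

```python
import math

def stretch_factor(modifiers):
    """modifiers: list of (word, uses_repeated_letters) pairs for one text unit."""
    total = 0
    for word, elongated in modifiers:
        all_caps = int(word.isalpha() and word.isupper())
        total += 1 + all_caps + int(elongated)
    return total

def apply_modifiers(if_old, modifiers):
    """Update the influence factor from the modifiers attached to a unit."""
    s_dm = stretch_factor(modifiers)
    if s_dm == 0:  # assumption: no modifiers means no update
        return if_old
    return if_old * 2 ** math.log10(s_dm)
```

Under these assumptions a single plain modifier gives a stretch factor of 1 and leaves the influence factor unchanged, while `VERY` together with an elongated `sooo` gives a stretch factor of 4 and a multiplier of about 1.52.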

The sentence-level linguistic feature extraction in step 5) comprises:

Using the sentence structure "X but Y", identify Twitter text data that uses an adversative conjunction (such as but or yet), mark the part X before the conjunction and the part Y after it, and update the sentiment score influence factor $IF_t$:

$$IF_t^{new} = IF_t^{old} \cdot 2^{\log_{10}\left(sgn_t^{CONJ}\right)} \qquad (7)$$

where the value of $sgn_t^{CONJ}$ depends on whether the independent text unit t lies in X or in Y.
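The marking of X and Y in an "X but Y" sentence can be sketched as below. The sketch only performs the split; it does not assign values to $sgn_t^{CONJ}$, which are not given in the text. The conjunction list holds just the two examples named above, and splitting at the first conjunction is an assumption:

```python
CONJUNCTIONS = {"but", "yet"}  # the two examples named in the text

def split_on_conjunction(tokens):
    """Return (X, Y) around the first adversative conjunction, or None."""
    for i, tok in enumerate(tokens):
        # Require units on both sides so a trailing "yet" is not a split point.
        if tok.lower() in CONJUNCTIONS and 0 < i < len(tokens) - 1:
            return tokens[:i], tokens[i + 1:]
    return None
```

A purely lexical match like this will also fire on adverbial uses such as "not yet done"; the part-of-speech tags from preprocessing could be used to filter those cases.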

Using the following three sentence structures, "If X, Y", "If X(,) then Y", and "Y, if X.", identify Twitter text data that uses a conditional sentence, where X is the conditional clause and Y is the result clause, and update the sentiment score influence factor $IF_t$:

$$IF_t^{new} = 0 \qquad (8)$$

where the independent text unit t is in X.
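One way to locate the conditional clause X for the three patterns and zero its influence factors is sketched below. It works on an already segmented list of units in which commas are separate units; the function names and the exact matching rules are assumptions of this sketch:

```python
def condition_span(tokens):
    """Return the index range of the conditional clause X (empty if none)."""
    low = [t.lower() for t in tokens]
    if low and low[0] == "if":
        # "If X, Y" and "If X(,) then Y": X runs up to the first comma or "then".
        for i in range(1, len(low)):
            if low[i] in (",", "then"):
                return range(1, i)
        return range(0)
    for i in range(1, len(low)):
        # "Y, if X.": X runs from after "if" to the end, minus final punctuation.
        if low[i] == "if" and low[i - 1] == ",":
            end = len(low)
            while end > i + 1 and low[end - 1] in {".", "!", "?"}:
                end -= 1
            return range(i + 1, end)
    return range(0)

def zero_condition(tokens, factors):
    """Set the influence factor of every unit in X to 0."""
    out = list(factors)
    for i in condition_span(tokens):
        out[i] = 0.0
    return out
```

For `if it rains , i stay home` the units `it` and `rains` receive a factor of 0, so any sentiment words inside the condition no longer contribute to the score.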

Calculating the sentiment polarity value in step 6) comprises:

(1) Calculate the basic sentiment polarity value of each independent text unit t. Let L be the set of sentiment polarity value dictionaries in use, and let $L_t = \{l \in L \mid t \in l\}$ be the subset of dictionaries that contain the independent text unit t. The basic sentiment polarity value $s_t$ of each independent text unit t is:

$$s_t = \begin{cases} \dfrac{\sum_{l \in L_t} score(l, t)}{|L_t|}, & L_t \neq \emptyset \\ 0, & L_t = \emptyset \end{cases} \qquad (9)$$

where score(l, t) is the basic sentiment polarity value that dictionary l gives for the independent text unit t, and $|L_t|$ is the number of sentiment polarity value dictionaries that contain t;

(2) For each independent text unit t, update its basic sentiment polarity value $s_t$ with the sentiment score influence factor $IF_t$ obtained in step 5):

$$s_t^{new} = s_t^{old} \cdot IF_t \qquad (10)$$

(3) Calculate the overall sentiment score $S_T$ of each Twitter text T:

$$S_T = \sum_{t \in T} s_t \qquad (11)$$
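The three sub-steps above can be sketched in a few lines of Python. Each dictionary is modeled as a plain `dict` from unit to polarity value; the function names and the toy values in the usage note are made up for illustration and do not come from AFINN, SentiStrength, VADER, or Opinion Observer:

```python
def base_score(token, lexicons):
    """Average the polarity of a unit over the dictionaries that contain it."""
    scores = [lex[token] for lex in lexicons if token in lex]
    return sum(scores) / len(scores) if scores else 0.0

def tweet_score(tokens, factors, lexicons):
    """Scale each base score by its influence factor and sum over the tweet."""
    return sum(base_score(t, lexicons) * f for t, f in zip(tokens, factors))
```

With two toy dictionaries `{"good": 3.0}` and `{"good": 1.0, "bad": -2.0}`, the unit `good` averages to 2.0 and an unknown unit scores 0. A negated `bad` with influence factor -1 contributes +2.0, which is how the influence factors carry the structural features into the final score.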

The sentiment analysis method based on structured features of the invention avoids the drawbacks of supervised methods, which need a large amount of labeled data to train a classifier and are difficult to analyze and generalize, and it reduces the computational overhead in CPU processing, memory, and training time. Specifically, the beneficial effects of the invention are:

1. It avoids supervised, machine-learning-based methods, so sentiment analysis does not depend on a large amount of labeled data for training a classifier;

2. It uses a fine-grained, sentiment-aware preprocessor that handles informal social media text effectively, improving the efficiency of subsequent processing and the classification accuracy;

3. It proposes a structured feature extraction scheme in which the defined sentiment score influence factor is easy to update, which refines the calculation of the sentiment score.

Brief Description of the Drawings

Fig. 1 is a flowchart of the sentiment analysis method based on structured features of the invention.

Detailed Description

The sentiment analysis method based on structured features of the invention is described in detail below with reference to the embodiment and the drawing.

As shown in Fig. 1, the sentiment analysis method based on structured features of the invention comprises the following steps:

2) Collect existing sentiment polarity value dictionaries, giving preference to manually created sentiment dictionaries. The sentiment polarity value dictionaries include three manually created sentiment dictionaries, AFINN, SentiStrength, and VADER, and one automatically generated sentiment dictionary, Opinion Observer. Table 1 gives an overview of the sentiment polarity value dictionaries and their characteristics.

Table 1. Overview of the sentiment dictionaries

3) Manually build auxiliary dictionaries, including a standard word dictionary, a negation word dictionary, an enhancing-modifier dictionary, a weakening-modifier dictionary, and an Internet slang dictionary. Table 2 gives an overview of the dictionaries used.

Table 2. Overview of the auxiliary dictionaries

(1) First, perform word segmentation on the data in the Twitter text database. Word segmentation splits the Twitter text data into the smallest meaningful independent text units and labels each unit with its type, such as word, hashtag, emoticon, or URL. Regular expressions are used to match the different types of independent text units and assign the corresponding labels.

(2) Perform normalization. Normalization uses a standard English dictionary to restore independent text units that use repeated letters to their standard form, recognizes the emoticons, emoji, and emotext in the Twitter text data, and determines and labels their sentiment polarity. Specifically:

a. Letter elongation. Letter elongation is the use of repeated letters to strengthen the expressiveness of a word. Words in $D^{en}$ and $D^{slang}$ are first indexed by their phonetic code. When the normalizer encounters a word that is in neither dictionary, it uses the word's phonetic code to retrieve candidate matches, computes the Levenshtein distance between the input and each candidate to measure their similarity, and returns the best match.

b. Emoticons. Emoticons are graphical representations of facial expressions made up of punctuation marks or letters, such as :-), :), and :o). Positive and negative emoticons are normalized to [EMOTICON+] and [EMOTICON-], respectively.

c. Emoji. Since 2010, more and more emoji have been added to the Unicode standard (up to Unicode 8.0). As with emoticons, every emoji is normalized to a predefined independent text unit such as [EMOJI+], [EMOJI0], or [EMOJI-].

d. Emotext. Finally, emotext such as haha, hehe, and xixi is normalized. Emotext is found by regular-expression matching of units that contain at least k repeated letters (currently k = 2), and each match is then normalized to its core form; for example, hhahahah becomes haha.

(3) Perform part-of-speech tagging (POS tagging) on the text. Part-of-speech tagging labels each independent text unit with its part-of-speech category, such as noun, adjective, or verb.
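The letter-elongation handling in item a of the normalization step can be sketched as follows. To stay self-contained, this sketch swaps the phonetic code for a much simpler candidate key, the word with every run of repeated letters collapsed to one letter; that substitution, the function names, and the alphabetical tie-break are assumptions, while the Levenshtein ranking follows the description:

```python
import re

def collapse(word):
    """Candidate key: collapse each run of a repeated character to one."""
    return re.sub(r"(.)\1+", r"\1", word.lower())

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalize(word, dictionary):
    """Map an elongated word to its closest dictionary entry, if any."""
    w = word.lower()
    if w in dictionary:
        return w
    key = collapse(w)
    candidates = [d for d in dictionary if collapse(d) == key]
    if not candidates:
        return w
    # Smallest edit distance wins; ties are broken alphabetically.
    return min(candidates, key=lambda d: (levenshtein(w, d), d))
```

With a dictionary containing both `good` and `god`, the input `goooood` shares its collapsed key with both entries, and the edit distance (3 versus 4) selects `good`. This shows why a distance-based ranking is needed on top of the candidate lookup.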

5) Define a sentiment score influence factor and extract linguistic features from the information obtained by the preprocessing in step 4), where the linguistic features include word-level, phrase-level, and sentence-level features, and the value of the influence factor is updated each time a linguistic feature is extracted. Defining the sentiment score influence factor means introducing, for each independent text unit t, a sentiment score influence factor $IF_t$, where $IF_t \geq 0$ and the initial value is 1. It reflects the degree to which the linguistic features strengthen or weaken the sentiment intensity of the independent text unit. The influence factor is updated as follows:

$$IF_t^{new} = IF_t^{old} \cdot 2^{\log_{10}\left(\sum_{p \in P} p\right)} \qquad (1)$$

The word-level linguistic feature extraction comprises:

where $IF_t^{new}$ is the sentiment score influence factor of the independent text unit t after the update and $IF_t^{old}$ is the influence factor of t before the update.

The phrase-level linguistic feature extraction comprises:

where t is an independent text unit within the negation scope, $IF_t^{new}$ is the sentiment score influence factor of t after the update, and $IF_t^{old}$ is the influence factor of t before the update.

The sentence-level linguistic feature extraction comprises:

where the value of $sgn_t^{CONJ}$ depends on whether the independent text unit t lies in X or in Y, $IF_t^{new}$ is the sentiment score influence factor of t after the update, and $IF_t^{old}$ is the influence factor of t before the update.

6) Calculate a sentiment polarity value for each Twitter text from the sentiment polarity value dictionaries obtained in step 2) and the sentiment score influence factor obtained in step 5). Calculating the sentiment polarity value comprises:

where score(l, t) is the basic sentiment polarity value that dictionary l gives for the independent text unit t, and $|L_t|$ is the number of sentiment polarity value dictionaries that contain t.

Those skilled in the art will understand that the drawing is only a schematic diagram of a preferred embodiment, and that the numbering of the embodiments above is for description only and does not indicate their relative merit.

The above is only a preferred embodiment of the invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

The references cited in the Background are as follows:

[1] Pennebaker J W, Francis M E, Booth R J. Linguistic inquiry and word count: LIWC 2001[J]. Mahwah: Lawrence Erlbaum Associates, 2001, 71: 2001.

[2] Bradley M M, Lang P J. Affective norms for English words (ANEW): Instruction manual and affective ratings[R]. Technical report C-1, the Center for Research in Psychophysiology, University of Florida, 1999.

[3] Nielsen F Å. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs[J]. arXiv preprint arXiv:1103.2903, 2011.

[4] Hutto C J, Gilbert E. VADER: A parsimonious rule-based model for sentiment analysis of social media text[C]//Eighth International AAAI Conference on Weblogs and Social Media. 2014.

[5] Baccianella S, Esuli A, Sebastiani F. SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining[C]//LREC. 2010, 10: 2200-2204.

[6] Chikersal P, Poria S, Cambria E. SeNTU: Sentiment analysis of tweets by combining a rule-based classifier with supervised learning[J]. SemEval-2015, 2015: 647.

Claims

1. A sentiment analysis method based on structured features, characterized by comprising the following steps:

1) Collect Twitter text data and build a Twitter text database;

2) Collect existing sentiment polarity value dictionaries, giving preference to manually created sentiment dictionaries;

3) Manually build auxiliary dictionaries, including a standard word dictionary, a negation word dictionary, an enhancing-modifier dictionary, a weakening-modifier dictionary, and an Internet slang dictionary;

4) Preprocess the Twitter text database, including:

(1) first performing word segmentation on the data in the Twitter text database;

(2) performing normalization;

(3) performing part-of-speech tagging (POS tagging) on the text;

5) Define a sentiment score influence factor and extract linguistic features from the information obtained by the preprocessing in step 4), where the linguistic features include word-level, phrase-level, and sentence-level features, and the value of the influence factor is updated each time a linguistic feature is extracted;

6) Calculate a sentiment polarity value for each Twitter text from the sentiment polarity value dictionaries obtained in step 2) and the sentiment score influence factor obtained in step 5).

2. The sentiment analysis method based on structured features according to claim 1, characterized in that the sentiment polarity value dictionaries in step 2) include three manually created sentiment dictionaries, AFINN, SentiStrength, and VADER, and one automatically generated sentiment dictionary, Opinion Observer.

3. The sentiment analysis method based on structured features according to claim 1, characterized in that the word segmentation in sub-step (1) of step 4) splits the Twitter text data into the smallest meaningful independent text units and labels each independent text unit with its type.

4. The sentiment analysis method based on structured features according to claim 1, characterized in that the normalization in sub-step (2) of step 4) uses a standard English dictionary to restore independent text units that use repeated letters to their standard form, recognizes the emoticons, emoji, and emotext in the Twitter text data, and determines and labels their sentiment polarity.

5. The sentiment analysis method based on structured features according to claim 1, characterized in that the part-of-speech tagging in sub-step (3) of step 4) labels each independent text unit with its part-of-speech category.

6. The sentiment analysis method based on structured features according to claim 1, characterized in that defining the sentiment score influence factor in step 5) means introducing, for each independent text unit t, a sentiment score influence factor $IF_t$, where $IF_t \geq 0$ and the initial value is 1, which reflects the degree to which the linguistic features strengthen or weaken the sentiment intensity of the independent text unit, the influence factor being updated as follows:

$$IF_t^{new} = IF_t^{old} \cdot 2^{\log_{10}\left(\sum_{p \in P} p\right)} \qquad (1)$$

where $IF_t^{new}$ is the sentiment score influence factor of the independent text unit t after the update, $IF_t^{old}$ is the influence factor of t before the update, p is a feature, and P is the set of all features that can affect the influence factor.

7. The sentiment analysis method based on structured features according to claim 1, characterized in that the word-level linguistic feature extraction in step 5) comprises:

If all letters of an independent text unit are capitalized, the all-caps flag $s_t^{AllCaps} = 1$; otherwise $s_t^{AllCaps} = 0$. The sentiment score influence factor $IF_t$ is then updated:

$$IF_t^{new} = IF_t^{old} \cdot 2^{\log_{10}\left(1 + s_t^{AllCaps}\right)} \qquad (2)$$

where $IF_t^{new}$ is the sentiment score influence factor of the independent text unit t after the update and $IF_t^{old}$ is the influence factor of t before the update;

If an independent text unit uses repeated letters, it is assigned an elongation factor $s_t^{ExLen}$, defined in terms of $t_{orig}$, the original independent text unit, and $t_{norm}$, the normalized independent text unit, and the sentiment score influence factor $IF_t$ is updated:

$$IF_t^{new} = IF_t^{old} \cdot 2^{\log_{10}\left(s_t^{ExLen}\right)} \qquad (3)$$

8. The sentiment analysis method based on structured features according to claim 1, characterized in that the phrase-level linguistic feature extraction in step 5) comprises:

Using the negation word dictionary built manually in step 3), determine where a negated phrase begins; treat periods, question marks, exclamation marks, and non-standard independent text units as markers of where the negated phrase ends; and update the sentiment score influence factor $IF_t$:

$$IF_t^{new} = IF_t^{old} \cdot (-1) \qquad (4)$$

where t is an independent text unit within the negation scope;

Using the enhancing-modifier dictionary and weakening-modifier dictionary built manually in step 3), find all modifiers in the Twitter text data and compute the modifier stretch factor according to:

$$s_t^{DM} = \sum_{m \in M_t^{DM}} \left(1 + s_m^{AllCaps} + s_m^{ExLen}\right) \qquad (5)$$

where m is a modifier and $M_t^{DM}$ is the set of modifiers; the modifier all-caps flag $s_m^{AllCaps}$ indicates whether the letters of m are all capitalized, and the repeated-letter flag $s_m^{ExLen}$ indicates whether m uses repeated letters;

The sentiment score influence factor $IF_t$ is then updated:

$$IF_t^{new} = IF_t^{old} \cdot 2^{\log_{10}\left(s_t^{DM}\right)} \qquad (6)$$

9. The sentiment analysis method based on structured features according to claim 1, characterized in that the sentence-level linguistic feature extraction in step 5) comprises:

Using the sentence structure "X but Y", identify Twitter text data that uses an adversative conjunction (such as but or yet), mark the part X before the conjunction and the part Y after it, and update the sentiment score influence factor $IF_t$:

$$IF_t^{new} = IF_t^{old} \cdot 2^{\log_{10}\left(sgn_t^{CONJ}\right)} \qquad (7)$$

where the value of $sgn_t^{CONJ}$ depends on whether the independent text unit t lies in X or in Y;

Using the following three sentence structures, "If X, Y", "If X(,) then Y", and "Y, if X.", identify Twitter text data that uses a conditional sentence, where X is the conditional clause and Y is the result clause, and update the sentiment score influence factor $IF_t$:

$$IF_t^{new} = 0 \qquad (8)$$

where the independent text unit t is in X.

10. The sentiment analysis method based on structured features according to claim 1, characterized in that calculating the sentiment polarity value in step 6) comprises:

(1) Calculate the basic sentiment polarity value of each independent text unit t. Let L be the set of sentiment polarity value dictionaries in use, and let $L_t = \{l \in L \mid t \in l\}$ be the subset of dictionaries that contain the independent text unit t. The basic sentiment polarity value $s_t$ of each independent text unit t is:

$$s_t = \begin{cases} \dfrac{\sum_{l \in L_t} score(l, t)}{|L_t|}, & L_t \neq \emptyset \\ 0, & L_t = \emptyset \end{cases} \qquad (9)$$

where score(l, t) is the basic sentiment polarity value that dictionary l gives for the independent text unit t, and $|L_t|$ is the number of sentiment polarity value dictionaries that contain t;

(2) For each independent text unit t, update its basic sentiment polarity value $s_t$ with the sentiment score influence factor $IF_t$ obtained in step 5):

$$s_t^{new} = s_t^{old} \cdot IF_t \qquad (10)$$

(3) Calculate the overall sentiment score $S_T$ of each Twitter text T:

$$S_T = \sum_{t \in T} s_t \qquad (11)$$