CN106446147A - Emotion analysis method based on structuring features - Google Patents
- Publication number: CN106446147A
- Application number: CN201610839375.6A
- Authority: CN (China)
- Prior art keywords: text unit, sentiment, score, dictionary, emotional
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06F16/374: Thesaurus (under G06F16/36, Creation of semantic tools, e.g. ontology or thesauri)
- G06F16/35: Clustering; Classification (under G06F16/30, Information retrieval of unstructured textual data)
- G06F40/284: Lexical analysis, e.g. tokenisation or collocates (under G06F40/20, Natural language analysis)
Abstract
Description
Technical Field
The invention relates to a sentiment analysis method, and in particular to an unsupervised sentiment analysis method based on structured features.
Background
With the emergence and popularity of social media, more and more users share their views, or simply express their feelings and moods, on social media platforms. Twitter has become one of the most popular of these platforms: according to 2016 statistics, it has more than 645,000,000 registered users, who post more than 190,000,000 tweets per day on average. The Twitter API gives access to a large volume of rich data that can be explored and mined, which makes it a good opportunity for sentiment analysis. Such analysis helps infer public opinion on all kinds of subjects, and the conclusions support better-informed predictions and choices. Sentiment analysis of Twitter text has therefore become an active research topic.
Sentiment analysis of Twitter text mainly involves natural language processing, opinion mining and sentiment classification. There are currently two main approaches. The first is the lexicon-based unsupervised approach, which relies on sentiment lexicons that contain a large number of entries annotated with sentiment polarity, such as LIWC [1], ANEW [2], AFINN [3], VADER [4] and SentiWordNet [5]. The second is the supervised approach, which uses machine learning to extract features from large amounts of labeled data and train a classifier such as an SVM (Support Vector Machine), a Bayes classifier or a decision tree. The most commonly used features are the presence or frequency of n-grams (runs of 1, 2, 3 or more consecutive text units). However, this approach needs a large amount of labeled data in the training phase, so it is computationally expensive in terms of CPU processing, memory and training time. In addition, for a considerable portion of the data the decision score predicted by a supervised classifier lies very close to the decision boundary, which means the classifier is very uncertain about which class the text belongs to; the labels assigned to such data are either wrong or correct only by chance [6]. For these reasons a lexicon-based unsupervised approach is preferred here.
The main challenges in sentiment analysis of Twitter text come from the characteristics of the text itself. A tweet is limited to 140 characters, so the information it provides is relatively limited. Besides its irregular language structure and grammar, a tweet may contain many abbreviations, emoticons, hashtags, slang terms, links and so on, which makes sentiment extraction and opinion mining difficult. Common natural language processing (NLP) techniques such as tokenization, normalization and part-of-speech tagging work well on conventionally written text but do not carry over directly to Twitter data.
Summary of the Invention
The technical problem addressed by the invention is to provide a sentiment analysis method based on structured features that avoids the need, found in supervised methods, for a large amount of labeled data to train a classifier.
The technical solution adopted by the invention is a sentiment analysis method based on structured features, comprising the following steps:
1) Collect Twitter text data and build a Twitter text database;
2) Collect existing sentiment polarity lexicons, giving preference to manually generated lexicons;
3) Manually build the auxiliary dictionaries: a standard word dictionary, a negation word dictionary, an intensifying-modifier dictionary, a diminishing-modifier dictionary and an Internet slang dictionary;
4) Preprocess the Twitter text database, including:
(1) tokenizing the data in the Twitter text database;
(2) normalizing the resulting text units;
(3) tagging the text with parts of speech (Part-of-Speech Tagging, POS Tagging);
5) Define a sentiment score impact factor and extract language features from the information produced by the preprocessing in step 4). The language features comprise word-level, phrase-level and sentence-level features, and the value of the impact factor is updated each time a feature is extracted;
6) Compute a sentiment polarity value for each tweet from the sentiment polarity lexicons collected in step 2) and the sentiment score impact factors obtained in step 5).
The sentiment polarity lexicons of step 2) comprise three manually generated lexicons, AFINN, SentiStrength and VADER, and one automatically generated lexicon, Opinion Observer.
The tokenization in sub-step (1) of step 4) splits the Twitter text into the smallest meaningful independent text units and labels the type of each unit.
The normalization in sub-step (2) of step 4) uses a standard English dictionary to restore text units written with repeated letters to their standard form, recognizes the emoticons, emoji and emotext in the Twitter text, and determines and labels their sentiment polarity.
The part-of-speech tagging in sub-step (3) of step 4) labels the part-of-speech category of each independent text unit.
Defining the sentiment score impact factor in step 5) means introducing, for each independent text unit t, an impact factor IF_t with IF_t ≥ 0 and an initial value of 1. It reflects how much the language features strengthen or weaken the sentiment intensity of the text unit. The impact factor is updated by the following formula:
[formula not preserved in the source text]
The formula relates the impact factor of text unit t after the update to its value before the update; p denotes a feature, and P denotes the set of all features that can affect the impact factor.
The word-level language feature extraction in step 5) includes:
If all letters of an independent text unit are upper case, the all-caps flag s_AllCaps = 1, otherwise s_AllCaps = 0, and the impact factor IF_t is updated:
[formula not preserved in the source text]
The formula relates the impact factor of text unit t after the update to its value before the update;
If an independent text unit uses repeated letters, it is assigned an elongation factor (its formula is not preserved in the source text) in which t_orig denotes the original text unit and t_norm the normalized text unit, and the impact factor IF_t is updated:
[formula not preserved in the source text]
The phrase-level language feature extraction in step 5) includes:
Using the negation word dictionary built manually in step 3), determine where a negated phrase begins; treat a period, a question mark, an exclamation mark or a non-standard text unit as the end of the negated phrase; and update the impact factor IF_t:
[formula not preserved in the source text]
where t is a text unit inside the negation scope.
Using the intensifying-modifier and diminishing-modifier dictionaries built manually in step 3), find all modifiers in the tweet and compute the stretch factor of each modifier by the following formula:
[formula not preserved in the source text]
where m denotes a modifier and M_DM denotes the set of modifiers. If all letters of a modifier m are upper case, its all-caps flag is set, otherwise it is cleared; if a modifier m uses repeated letters, its repeated-letter flag is set, otherwise it is cleared.
The impact factor IF_t is then updated:
[formula not preserved in the source text]
The sentence-level language feature extraction in step 5) includes:
Using the sentence structure "X but Y", identify tweets that contain an adversative conjunction (such as but or yet), mark the part X before the conjunction and the part Y after it, and update the impact factor IF_t:
[formula not preserved in the source text]
where the value applied depends on whether text unit t lies in X or in Y (the two values are not preserved in the source text).
Using the three sentence structures "If X, Y", "If X(,) then Y" and "Y, if X.", identify tweets that contain a conditional sentence, where X is the condition clause and Y is the result clause, and update the impact factor IF_t:
[formula not preserved in the source text]
where text unit t is in X.
Computing the sentiment polarity value in step 6) includes:
(1) Compute the basic sentiment polarity value of each independent text unit t. Let L be the set of sentiment polarity lexicons in use, and let L_t = {l ∈ L | t ∈ l} be the subset of lexicons that contain text unit t. The basic sentiment polarity value s_t of each text unit t is obtained from the following formula:
[formula not preserved in the source text]
where score(l, t) is the basic sentiment polarity value that lexicon l gives for text unit t, and |L_t| is the number of sentiment polarity lexicons that contain t;
(2) For each text unit t, update the basic sentiment polarity value s_t with the impact factor IF_t obtained in step 5):
[formula not preserved in the source text]
(3) Compute the overall sentiment score S_T of each tweet T:
[formula not preserved in the source text]
The sentiment analysis method based on structured features avoids the need of supervised methods for a large amount of labeled data to train a classifier, a requirement that makes such methods hard to analyze and generalize, and it reduces the computational cost in CPU processing, memory and training time. Specifically, the beneficial effects of the invention are:
1. It does not use supervised machine learning, so sentiment analysis does not depend on a large amount of labeled data for training a classifier;
2. It uses a fine-grained, sentiment-aware preprocessor that handles informal social media text effectively, which improves the efficiency of the later processing and the classification accuracy;
3. It proposes a structured feature extraction scheme in which the sentiment score impact factor defined here is easy to update, which in turn refines the computation of the sentiment score.
Brief Description of the Drawings
Fig. 1 is a flowchart of the sentiment analysis method based on structured features according to the invention.
Detailed Description
The sentiment analysis method based on structured features is described in detail below with reference to an embodiment and the accompanying drawing.
As shown in Fig. 1, the sentiment analysis method based on structured features comprises the following steps:
1) Collect Twitter text data and build a Twitter text database;
2) Collect existing sentiment polarity lexicons, giving preference to manually generated lexicons. The lexicons comprise three manually generated lexicons, AFINN, SentiStrength and VADER, and one automatically generated lexicon, Opinion Observer. Table 1 gives an overview of the sentiment polarity lexicons and their characteristics.
Table 1. Overview of the sentiment lexicons
3) Manually build the auxiliary dictionaries: a standard word dictionary, a negation word dictionary, an intensifying-modifier dictionary, a diminishing-modifier dictionary and an Internet slang dictionary. Table 2 summarizes the dictionaries used.
Table 2. Overview of the auxiliary dictionaries
4) Preprocess the Twitter text database, including:
(1) Tokenize the data in the Twitter text database. Tokenization splits the Twitter text into the smallest meaningful independent text units and labels the type of each unit, such as word, hashtag, emoticon or link. Regular expressions are used to match the different types of text unit and attach the corresponding label.
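The regular-expression tokenization described in this sub-step can be sketched as follows. This is a minimal illustration, not the patent's tokenizer: the type names (URL, MENTION, HASHTAG, EMOTICON, WORD, NUMBER, PUNCT) and every pattern are assumptions chosen for the example, and the emoticon pattern covers only a few common western-style emoticons.

```python
import re

# Hypothetical token types and patterns. Order matters: at each position the
# first alternative that matches wins, so URLs are tried before words.
TOKEN_PATTERNS = [
    ("URL", r"https?://\S+"),
    ("MENTION", r"@\w+"),
    ("HASHTAG", r"#\w+"),
    ("EMOTICON", r"[:;=][-o^']?[)(\]\[DPp/\\|]"),
    ("WORD", r"[A-Za-z]+(?:'[A-Za-z]+)?"),
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("PUNCT", r"[^\w\s]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))


def tokenize(text):
    """Return (text unit, type label) pairs in order of appearance."""
    # m.lastgroup is the name of the alternative that produced the match.
    return [(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(text)]
```

For example, `tokenize("LOVE this #phone :) http://t.co/x")` yields two WORD units followed by a HASHTAG, an EMOTICON and a URL unit. A single alternation with named groups keeps labeling and splitting in one pass.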
(2) Normalize the text units. Normalization uses a standard English dictionary to restore text units written with repeated letters to their standard form, recognizes the emoticons, emoji and emotext in the Twitter text, and determines and labels their sentiment polarity. The details are as follows:
a. Letter elongation. Letter elongation is the use of repeated letters to strengthen the expressive force of a word. The words in the dictionaries D_en and D_slang are first indexed by phonetic code. When the normalizer encounters a word that is in neither dictionary, it uses the phonetic code to find candidate matches, computes the Levenshtein distance between the input and each candidate to measure similarity, and returns the best match.
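A rough sketch of this lookup is shown below. The source does not say which phonetic code is used, so the sketch swaps in a much cruder key, collapsing every run of a repeated letter to a single letter, purely as a stand-in; the function names and the toy vocabulary are likewise assumptions. Candidates that share the key are then ranked by Levenshtein distance to the input, as the paragraph describes.

```python
import re
from collections import defaultdict


def squeeze(word):
    """Collapse each run of a repeated character (stand-in for a phonetic code)."""
    return re.sub(r"(.)\1+", r"\1", word.lower())


def levenshtein(a, b):
    """Edit distance by dynamic programming over two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def build_index(vocabulary):
    """Map each key to the dictionary words that share it."""
    index = defaultdict(list)
    for word in vocabulary:
        index[squeeze(word)].append(word)
    return index


def normalize(word, vocabulary, index):
    """Return the dictionary word closest to `word`, or `word` itself if none matches."""
    low = word.lower()
    if low in vocabulary:
        return low
    candidates = index.get(squeeze(low), [])
    if not candidates:
        return low
    # Ties on distance are broken alphabetically to keep the result deterministic.
    return min(candidates, key=lambda c: (levenshtein(low, c), c))
```

With a vocabulary containing both "good" and "god", the input "goooood" shares a key with both, and the smaller edit distance selects "good". A real phonetic code would also catch misspellings that are not pure elongations, which this stand-in does not.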
b. Emoticons. An emoticon is a graphical representation of a facial expression built from punctuation marks or letters, such as :-), :) and :o). Positive and negative emoticons are normalized to [EMOTICON+] and [EMOTICON-] respectively.
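As an illustration of this mapping, the sketch below replaces emoticon tokens with the two placeholders named in the text. The two patterns are an illustrative subset chosen for the example, not the patent's emoticon inventory, and the choice of which mouths count as positive or negative is an assumption.

```python
import re

# Illustrative subsets only: eyes, an optional nose, then a mouth.
POSITIVE_EMOTICON = re.compile(r"[:;=][-o^']?[)\]Dp]")
NEGATIVE_EMOTICON = re.compile(r"[:;=][-o^']?[(\[/\\|]")


def normalize_emoticons(tokens):
    """Replace recognized emoticon tokens with [EMOTICON+] or [EMOTICON-]."""
    normalized = []
    for tok in tokens:
        if POSITIVE_EMOTICON.fullmatch(tok):
            normalized.append("[EMOTICON+]")
        elif NEGATIVE_EMOTICON.fullmatch(tok):
            normalized.append("[EMOTICON-]")
        else:
            normalized.append(tok)  # not an emoticon: leave unchanged
    return normalized
```

Using `fullmatch` on already-tokenized input avoids rewriting colon-plus-letter sequences that occur inside ordinary words or URLs.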
c. Emoji. Since 2010 more and more emoji have been added to the Unicode standard (up to Unicode 8.0 at the time of writing). As with emoticons, every emoji is normalized to a predefined text unit such as [EMOJI+], [EMOJI0] or [EMOJI-].
d. Emotext. Finally, text expressions such as haha, hehe and xixi are normalized. They are found with a regular expression that matches text units containing at least k repeated letters (currently k = 2), and each is then reduced to its core form, for example hhahahah to haha.
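One way to approximate this step is sketched below. It is not the patent's regular expression, which is not given: the sketch instead counts letters, and it reads "at least k repeated letters" as "each letter of the core occurs at least k times", which is an interpretation. The list of cores is hypothetical.

```python
from collections import Counter

CORES = ("ha", "he", "hi", "ho", "xi")  # hypothetical list of emotext cores
K = 2  # minimum number of occurrences of each core letter (k = 2 in the text)


def normalize_emotext(token):
    """Reduce a laughter-like token to its doubled core, e.g. 'hhahahah' -> 'haha'."""
    low = token.lower()
    counts = Counter(low)
    for core in CORES:
        uses_only_core_letters = set(low) == set(core)
        if uses_only_core_letters and all(counts[ch] >= K for ch in core):
            return core * 2
    return token  # not recognized as emotext: leave unchanged
```

Restricting matches to a fixed list of cores avoids rewriting ordinary words that happen to repeat two letters, at the cost of missing expressions that are not in the list.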
(3) Tag the text with parts of speech (Part-of-Speech Tagging, POS Tagging). Part-of-speech tagging labels the part-of-speech category of each independent text unit, such as noun, adjective or verb.
5) Define a sentiment score impact factor and extract language features from the information produced by the preprocessing in step 4). The language features comprise word-level, phrase-level and sentence-level features, and the value of the impact factor is updated each time a feature is extracted. Defining the impact factor means introducing, for each independent text unit t, an impact factor IF_t with IF_t ≥ 0 and an initial value of 1. It reflects how much the language features strengthen or weaken the sentiment intensity of the text unit. The impact factor is updated by the following formula:
[formula not preserved in the source text]
The formula relates the impact factor of text unit t after the update to its value before the update; p denotes a feature, and P denotes the set of all features that can affect the impact factor.
Word-level language feature extraction includes:
If all letters of an independent text unit are upper case, the all-caps flag s_AllCaps = 1, otherwise s_AllCaps = 0, and the impact factor IF_t is updated:
[formula not preserved in the source text]
The formula relates the impact factor of text unit t after the update to its value before the update.
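The all-caps flag itself is simple to compute; a small sketch follows. It covers only the flag, not the impact-factor update, whose formula is not available here. Ignoring non-letter characters and requiring at least two letters (so that "I" and "A" are not flagged) are assumptions made for the example.

```python
def all_caps_flag(token):
    """Return 1 if every letter of the token is upper case, otherwise 0."""
    letters = [ch for ch in token if ch.isalpha()]
    # Assumption: one-letter words such as "I" carry no emphasis and are not flagged.
    if len(letters) < 2:
        return 0
    return 1 if all(ch.isupper() for ch in letters) else 0
```

For instance, `all_caps_flag("LOVE")` gives 1 and `all_caps_flag("Love")` gives 0.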
If an independent text unit uses repeated letters, it is assigned an elongation factor (its formula is not preserved in the source text) in which t_orig denotes the original text unit and t_norm the normalized text unit, and the impact factor IF_t is updated:
[formula not preserved in the source text]
The formula relates the impact factor of text unit t after the update to its value before the update.
Phrase-level language feature extraction includes:
Using the negation word dictionary built manually in step 3), determine where a negated phrase begins; treat a period, a question mark, an exclamation mark or a non-standard text unit as the end of the negated phrase; and update the impact factor IF_t:
[formula not preserved in the source text]
where t is a text unit inside the negation scope, and the formula relates the impact factor of t after the update to its value before the update.
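The scope rule above can be sketched as a single pass over the text units. The negation word list and the test for "non-standard" text units below are hypothetical placeholders for the manually built dictionaries, and excluding the negation word itself from its own scope is an assumption. The sketch only marks which units fall inside a negation scope; how the impact factor changes for them is not reproduced here.

```python
# Hypothetical sample of a negation word dictionary.
NEGATION_WORDS = {"not", "no", "never", "cannot", "don't", "isn't", "won't"}
SCOPE_END = {".", "?", "!"}


def is_nonstandard(token):
    """Assumed test for non-standard units: hashtags, mentions, links, placeholders."""
    return token.startswith(("#", "@", "http", "["))


def negation_scope(tokens):
    """Return one boolean per token: True if the token lies inside a negation scope."""
    flags, in_scope = [], False
    for tok in tokens:
        if tok in SCOPE_END or is_nonstandard(tok):
            in_scope = False          # scope closes at the end marker
            flags.append(False)
        elif tok.lower() in NEGATION_WORDS:
            in_scope = True           # scope opens after the negation word
            flags.append(False)
        else:
            flags.append(in_scope)
    return flags
```

For the units of "I do not like this phone . Great camera", only "like", "this" and "phone" are marked as negated; the period closes the scope before "Great".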
Using the intensifying-modifier and diminishing-modifier dictionaries built manually in step 3), find all modifiers in the tweet and compute the stretch factor of each modifier by the following formula:
[formula not preserved in the source text]
where m denotes a modifier and M_DM denotes the set of modifiers. If all letters of a modifier m are upper case, its all-caps flag is set, otherwise it is cleared; if a modifier m uses repeated letters, its repeated-letter flag is set, otherwise it is cleared.
The impact factor IF_t is then updated:
[formula not preserved in the source text]
The formula relates the impact factor of text unit t after the update to its value before the update.
Sentence-level language feature extraction includes:
Using the sentence structure "X but Y", identify tweets that contain an adversative conjunction (such as but or yet), mark the part X before the conjunction and the part Y after it, and update the impact factor IF_t:
[formula not preserved in the source text]
where the value applied depends on whether text unit t lies in X or in Y (the two values are not preserved in the source text), and the formula relates the impact factor of t after the update to its value before the update.
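Marking the X and Y parts can be sketched as a split at the first adversative conjunction. The conjunction set below contains only the two examples named in the text, and splitting at the first occurrence is an assumption; note that "yet" is also used as an adverb ("not yet"), which a real implementation would need part-of-speech information to rule out.

```python
ADVERSATIVE_CONJUNCTIONS = {"but", "yet"}  # the two examples named in the text


def split_adversative(tokens):
    """Return (X, Y) around the first adversative conjunction, or None if there is none."""
    for i, tok in enumerate(tokens):
        if tok.lower() in ADVERSATIVE_CONJUNCTIONS:
            return tokens[:i], tokens[i + 1:]
    return None
```

For the units of "The screen is great but the battery is awful", X is "The screen is great" and Y is "the battery is awful".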
Using the three sentence structures "If X, Y", "If X(,) then Y" and "Y, if X.", identify tweets that contain a conditional sentence, where X is the condition clause and Y is the result clause, and update the impact factor IF_t:
[formula not preserved in the source text]
where text unit t is in X.
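The three conditional structures can be approximated with regular expressions, as in the sketch below. The patterns are assumptions written for this example and are deliberately crude: they work on a single sentence string, ignore nested clauses, and can misparse a sentence in which "then" appears inside the result clause.

```python
import re

# One pattern per structure, tried in this order.
CONDITIONAL_PATTERNS = [
    re.compile(r"^if\s+(?P<X>.+?),?\s+then\s+(?P<Y>.+)$", re.IGNORECASE),  # If X(,) then Y
    re.compile(r"^if\s+(?P<X>[^,]+),\s*(?P<Y>.+)$", re.IGNORECASE),        # If X, Y
    re.compile(r"^(?P<Y>.+?),\s*if\s+(?P<X>.+)$", re.IGNORECASE),          # Y, if X.
]


def split_conditional(sentence):
    """Return (condition X, result Y) for a conditional sentence, or None."""
    stripped = sentence.strip().rstrip(".!?")
    for pattern in CONDITIONAL_PATTERNS:
        match = pattern.match(stripped)
        if match:
            return match.group("X").strip(), match.group("Y").strip()
    return None
```

For example, `split_conditional("I will buy it, if it is cheap.")` returns the condition "it is cheap" and the result "I will buy it". The "then" pattern is tried first so that "If X, then Y" is not captured by the plain comma pattern with "then" left inside Y.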
6) Compute a sentiment polarity value for each tweet from the sentiment polarity lexicons collected in step 2) and the sentiment score impact factors obtained in step 5). The computation includes:
(1) Compute the basic sentiment polarity value of each independent text unit t. Let L be the set of sentiment polarity lexicons in use, and let L_t = {l ∈ L | t ∈ l} be the subset of lexicons that contain text unit t. The basic sentiment polarity value s_t of each text unit t is obtained from the following formula:
[formula not preserved in the source text]
where score(l, t) is the basic sentiment polarity value that lexicon l gives for text unit t, and |L_t| is the number of sentiment polarity lexicons that contain t.
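The formula for s_t is not available in this text. One reading that is consistent with the definitions of score(l, t) and |L_t| is the mean of the scores over the lexicons that contain t; the sketch below implements that reading as an assumption, not as the patent's confirmed formula. It also assumes that lexicon scores have already been rescaled to a common range, and that a text unit found in no lexicon gets 0.0. The toy lexicons are invented for the example.

```python
def base_polarity(token, lexicons):
    """Average the scores of `token` over the lexicons that contain it.

    `lexicons` is a list of dicts mapping a text unit to its polarity score.
    Assumed reading: s_t = (1 / |L_t|) * sum of score(l, t) for l in L_t.
    """
    scores = [lexicon[token] for lexicon in lexicons if token in lexicon]
    if not scores:
        return 0.0  # assumption: units absent from every lexicon are neutral
    return sum(scores) / len(scores)


# Toy lexicons, invented for illustration only.
example_lexicons = [
    {"good": 3.0, "bad": -3.0},
    {"good": 1.9},
    {"bad": -2.5, "ugly": -2.0},
]
```

With these toy lexicons, "good" appears in two lexicons and its base value is the average of 3.0 and 1.9; "ugly" appears in one lexicon and keeps that lexicon's score.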
(2) For each text unit t, update the basic sentiment polarity value s_t with the impact factor IF_t obtained in step 5):
[formula not preserved in the source text]
(3) Compute the overall sentiment score S_T of each tweet T:
[formula not preserved in the source text]
Those skilled in the art will understand that the drawing is only a schematic diagram of a preferred embodiment, and that the numbering of the embodiments above is for description only and does not indicate their relative merit.
The above describes only preferred embodiments of the invention and is not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
The references cited in the Background section are as follows:
[1] Pennebaker J W, Francis M E, Booth R J. Linguistic inquiry and word count: LIWC 2001[J]. Mahwah: Lawrence Erlbaum Associates, 2001, 71: 2001.
[2] Bradley M M, Lang P J. Affective norms for English words (ANEW): Instruction manual and affective ratings[R]. Technical report C-1, The Center for Research in Psychophysiology, University of Florida, 1999.
[3] Nielsen F Å. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs[J]. arXiv preprint arXiv:1103.2903, 2011.
[4] Hutto C J, Gilbert E. VADER: A parsimonious rule-based model for sentiment analysis of social media text[C]//Eighth International AAAI Conference on Weblogs and Social Media. 2014.
[5] Baccianella S, Esuli A, Sebastiani F. SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining[C]//LREC. 2010, 10: 2200-2204.
[6] Chikersal P, Poria S, Cambria E. SeNTU: sentiment analysis of tweets by combining a rule-based classifier with supervised learning[J]. SemEval-2015, 2015: 647.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610839375.6A CN106446147A (en) | 2016-09-20 | 2016-09-20 | Emotion analysis method based on structuring features |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106446147A (en) | 2017-02-22 |
Family
ID=58166213
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610839375.6A (Pending), published as CN106446147A (en) | Emotion analysis method based on structuring features | 2016-09-20 | 2016-09-20 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106446147A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102663046A (en) * | 2012-03-29 | 2012-09-12 | 中国科学院自动化研究所 | Sentiment analysis method oriented to micro-blog short text |
JP2012226747A (en) * | 2011-04-21 | 2012-11-15 | Palo Alto Research Center Inc | Incorporation of glossary knowledge in svm learning for improvement in feeling classification |
CN104008091A (en) * | 2014-05-26 | 2014-08-27 | 上海大学 | Sentiment value based web text sentiment analysis method |
Non-Patent Citations (1)
Title |
---|
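Wang Zhitao (王志涛) et al.: "Sentiment analysis of Chinese microblogs based on a dictionary and rule set", Computer Engineering and Applications (《计算机工程与应用》) * |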
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106980650A (en) * | 2017-03-01 | 2017-07-25 | 平顶山学院 | A kind of emotion enhancing word insertion learning method towards Twitter opinion classifications |
CN108681532A (en) * | 2018-04-08 | 2018-10-19 | 天津大学 | A kind of sentiment analysis method towards Chinese microblogging |
CN108681532B (en) * | 2018-04-08 | 2022-03-25 | 天津大学 | A sentiment analysis method for Chinese microblogs |
CN109697657A (en) * | 2018-12-27 | 2019-04-30 | 厦门快商通信息技术有限公司 | A kind of dining recommending method, server and storage medium |
CN111046137A (en) * | 2019-11-13 | 2020-04-21 | 天津大学 | A Multidimensional Sentiment Tendency Analysis Method |
CN111046136A (en) * | 2019-11-13 | 2020-04-21 | 天津大学 | Method for calculating multi-dimensional emotion intensity value by fusing emoticons and short text |
CN111143564A (en) * | 2019-12-27 | 2020-05-12 | 北京百度网讯科技有限公司 | Unsupervised multi-target chapter-level emotion classification model training method and unsupervised multi-target chapter-level emotion classification model training device |
CN111312394A (en) * | 2020-01-15 | 2020-06-19 | 东北电力大学 | Psychological health condition evaluation system based on combined emotion and processing method thereof |
CN111312394B (en) * | 2020-01-15 | 2023-09-29 | 东北电力大学 | Psychological health assessment system based on combined emotion and processing method thereof |
CN113094500A (en) * | 2021-03-09 | 2021-07-09 | 山西三友和智慧信息技术股份有限公司 | Word embedding-based twitter emotion classification text processing optimization system |
CN117521813A (en) * | 2023-11-20 | 2024-02-06 | 中诚华隆计算机技术有限公司 | Scenario generation method, device, equipment and chip based on knowledge graph |
CN117521813B (en) * | 2023-11-20 | 2024-05-28 | 中诚华隆计算机技术有限公司 | Scenario generation method, device, equipment and chip based on knowledge graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170222 |