KR101534141B1

KR101534141B1 - Rationale word extraction method and apparatus using genetic algorithm, and sentiment classification method and apparatus using said rationale word

Info

Publication number: KR101534141B1
Application number: KR1020140100538A
Authority: KR
Inventors: 이지형; 김경민; 김희라; 김누리; 이재동
Original assignee: 성균관대학교산학협력단
Priority date: 2014-08-05
Filing date: 2014-08-05
Publication date: 2015-07-07

Abstract

Disclosed is a method of extracting emotion rationale, comprising the steps of: randomly generating an initial chromosome set based on words in a document (where the attribute of each gene of a chromosome denotes a word appearing in a training data set of the document, and the attribute value indicates whether that word is selected); evaluating each chromosome of the initial chromosome set through an evaluation function (the evaluation function maximizes the accuracy of classifying document polarity); generating new chromosomes through genetic operations based on the evaluation score of each chromosome; and generating an emotion rationale set by determining whether the generated chromosomes satisfy a termination condition.

Description

The present invention relates to a method and apparatus for extracting emotion rationale words using a genetic algorithm, and to a method and apparatus for sentiment classification using those rationale words.

The present invention relates to a method and apparatus for extracting emotion rationale, and more particularly to a method and apparatus that performs sentiment analysis by extracting, from all the sentences or words contained in a document, the rationale that plays a key role.

Conventional sentiment analysis methods fall broadly into the lexicon-based approach and the machine-learning (text classification) approach. The lexicon-based approach uses a previously built sentiment lexicon and classifies a document as positive or negative according to whether it contains sentiment vocabulary. Such lexicons are usually built from the general meaning of commonly used words, without regard to domain. As a result, sentiment analysis of a specific domain can show low accuracy, and performance can vary widely from domain to domain. Building a sentiment lexicon manually for each domain, on the other hand, is very time-consuming and can produce domain-biased results. The machine-learning approach represents each word in a document as a feature and performs sentiment analysis with a machine learning algorithm. Because the result depends heavily on how features are selected, establishing the criteria for feature selection is very important, and that requires the prior knowledge of a domain expert. Both conventional approaches therefore demand considerable time and effort.
In particular, Korean Patent Laid-Open Publication No. KR10-2013-0092342 discloses a system that generates an emotion vocabulary dictionary and uses it to calculate the emotional intensity of a document, and Korean Patent Laid-Open Publication No. KR10-2013-0075124 discloses an emotion analysis apparatus that extracts emotion words from text and analyzes the dominant emotional state underlying the text. These also concern machine-learning sentiment analysis and share the problems described above.

An object of the present invention, devised to solve the above problems, is to provide an emotion rationale extraction method and apparatus that performs sentiment analysis of documents with high accuracy at low cost.

Another object of the present invention is to provide an emotion rationale extraction method and apparatus that uses a genetic algorithm to minimize the cost of sentiment analysis and to extract the emotion rationale that best represents the polarity of a document, thereby enabling more accurate sentiment analysis.

To achieve the above objects, a method of extracting the emotion rationale of at least one document according to the present invention may include: randomly generating an initial chromosome set based on words in the document, where the attribute of each gene of a chromosome denotes a word appearing in the training data set of the document and the attribute value indicates whether that word is selected; evaluating each chromosome of the initial chromosome set through an evaluation function, where the evaluation function maximizes the accuracy of classifying document polarity; generating new chromosomes through genetic operations based on the evaluation score of each chromosome; and generating an emotion rationale set by determining whether the generated chromosomes satisfy a termination condition.

The step of generating the emotion rationale set by determining whether the generated chromosomes satisfy the termination condition may include: determining whether the generated chromosomes satisfy the termination condition; and, if the termination condition is satisfied, including them in the emotion rationale set, or, if it is not satisfied, repeating the chromosome evaluation step and the new chromosome generation step and then determining again whether the termination condition is satisfied.

The words appearing in the training data set are words that appear in both the target document and the sentiment lexicon data, and each may have a positive or negative score according to the sentiment lexicon data.

Each chromosome in the initial chromosome set may be expressed as a binary string indicating whether each candidate emotion rationale word is selected.

The step of generating new chromosomes through genetic operations based on the evaluation function of each chromosome may include: selecting, from the plurality of chromosome binary strings in the chromosome set, certain adjacent string pairs for crossover based on tournament selection, where the tournament selection takes the evaluation scores into account; performing single-point crossover on the selected string pair by exchanging substrings at a given crossover point; and mutating individual attribute values in the crossed-over strings based on a mutation probability constant.

The termination condition may be determined by whether the change in the evaluation scores of the chromosomes generated through the genetic operations converges.

To identify the polarity of a document, the evaluation score may be calculated from the difference between a positive sentiment index, which is the sum of the positive scores of the training data set, and a negative sentiment index, which is the sum of the negative scores of the training data set.

The evaluation score may be calculated from at least one of: the ratio of the number of documents correctly classified using the generated emotion rationale set to the total number of documents; the ratio of truly positive documents among the documents classified as positive using the generated emotion rationale set; and the ratio of truly negative documents among the documents classified as negative using the generated emotion rationale set.
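The three ratios named above correspond to overall accuracy and to per-class precision. The following is a minimal sketch of how they could be computed, not the patented implementation; the function name `evaluation_metrics`, the label strings, and the choice to report 0.0 for a class with no assigned documents are assumptions made for illustration.

```python
def evaluation_metrics(predicted, actual):
    """Return (accuracy, positive precision, negative precision)."""
    pairs = list(zip(predicted, actual))
    # Share of all documents whose predicted polarity matches the true one.
    accuracy = sum(p == a for p, a in pairs) / len(pairs)

    def precision(label):
        # True labels of the documents the classifier assigned to `label`.
        assigned = [a for p, a in pairs if p == label]
        # Assumption: report 0.0 when nothing was assigned to this label.
        return sum(a == label for a in assigned) / len(assigned) if assigned else 0.0

    return accuracy, precision("positive"), precision("negative")


predicted = ["positive", "positive", "negative", "negative"]
actual = ["positive", "negative", "negative", "negative"]
accuracy, pos_precision, neg_precision = evaluation_metrics(predicted, actual)
```

Any one of the three values, or a combination of them, could then serve as the evaluation score of a chromosome.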

To achieve the above objects, an apparatus for extracting the emotion rationale of at least one document according to the present invention may include: a chromosome set generation unit that randomly generates an initial chromosome set based on words in the document, where the attribute of each gene of a chromosome denotes a word appearing in the training data set of the document and the attribute value indicates whether that word is selected; an evaluation unit that evaluates each chromosome of the initial chromosome set through an evaluation function, where the evaluation function maximizes the accuracy of classifying document polarity; a genetic operation unit that generates new chromosomes through genetic operations based on the evaluation score of each chromosome; and an emotion rationale set generation unit that generates an emotion rationale set by determining whether the generated chromosomes satisfy a termination condition.

To achieve the above objects, a method of performing sentiment analysis by extracting the emotion rationale of at least one document according to the present invention may include: randomly generating an initial chromosome set based on words in the document, where the attribute of each gene of a chromosome denotes a word appearing in the training data set of the document and the attribute value indicates whether that word is selected; evaluating each chromosome of the initial chromosome set through an evaluation function, where the evaluation function maximizes the accuracy of classifying document polarity; generating new chromosomes through genetic operations based on the evaluation score of each chromosome; generating an emotion rationale set by determining whether the generated chromosomes satisfy a termination condition; and performing sentiment classification of a test document using the generated emotion rationale set.

The sentiment classification step may include calculating, for evaluation, the sentiment analysis accuracy obtained with the generated emotion rationale set as the ratio of the number of correctly classified documents to the number of documents.

The sentiment classification step may include calculating a positive sentiment index and a negative sentiment index based on the emotion rationale set words contained in the test document, and classifying the test document under the sentiment with the higher value.

To achieve the above objects, an apparatus for performing sentiment analysis by extracting the emotion rationale of at least one document according to the present invention may include: an emotion rationale set generation unit that randomly generates an initial chromosome set based on words in the document, where the attribute of each gene of a chromosome denotes a word appearing in the training data set of the document and the attribute value indicates whether that word is selected, evaluates each chromosome of the initial chromosome set through an evaluation function, where the evaluation function maximizes the accuracy of classifying document polarity, generates new chromosomes through genetic operations based on the evaluation score of each chromosome, and generates an emotion rationale set by determining whether the generated chromosomes satisfy a termination condition; and a sentiment classification unit that performs sentiment classification of a test document using the generated emotion rationale set.

According to the emotion rationale extraction method and apparatus of the present invention, a genetic algorithm is used to minimize the cost of sentiment analysis and to extract the emotion rationale that best represents the polarity of a document.

In addition, because no expert prior knowledge of the domain is required and no sentiment lexicon has to be built by hand, low-cost, high-efficiency sentiment analysis becomes possible.

FIG. 1 is a block diagram showing a system to which an emotion rationale extraction apparatus according to an embodiment of the present invention is applied.
FIG. 2 is a flowchart schematically illustrating an emotion rationale extraction method according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the genetic algorithm of the emotion rationale extraction method according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating the process of generating the initial chromosome set in the emotion rationale extraction method according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating emotion evaluation and sentiment classification of a specific document in the emotion rationale extraction method according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating the different flows of the training set and the test set in the emotion rationale extraction method according to an embodiment of the present invention.
FIG. 7 is a block diagram schematically showing the configuration of an emotion rationale extraction apparatus according to an embodiment of the present invention.
FIG. 8A is a diagram showing the data set used to verify the performance of the emotion rationale extraction method according to an embodiment of the present invention.
FIG. 8B is a diagram showing the parameters used to verify the performance of the emotion rationale extraction method according to an embodiment of the present invention.
FIG. 9 is a table comparing the accuracy of three variants of the emotion rationale extraction method according to an embodiment of the present invention with that of a conventional method.
FIG. 10 is a graph comparing the accuracy of three variants of the emotion rationale extraction method according to an embodiment of the present invention with that of a conventional method.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail.

This is not intended to limit the present invention to specific embodiments, however, and the invention should be understood to include all modifications, equivalents, and alternatives falling within its spirit and technical scope.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be referred to as a second component, and similarly a second component may be referred to as a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of them.

When a component is described as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but intervening components may also be present. In contrast, when a component is described as being "directly connected" or "directly coupled" to another component, it should be understood that no intervening components are present.

The terminology used in this application is used only to describe specific embodiments and is not intended to limit the invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. To aid overall understanding, the same reference numerals are used for the same components throughout the drawings, and redundant descriptions of the same components are omitted.

Emotion rationale extraction system

FIG. 1 is a block diagram showing a system to which an emotion rationale extraction apparatus according to an embodiment of the present invention is applied. As shown in FIG. 1, the emotion rationale extraction system according to an embodiment of the present invention may include an emotion rationale extraction apparatus 100 and client devices 110-1, 110-2, ..., 110-N.

Referring to FIG. 1, the emotion rationale extraction apparatus 100 may receive electronic documents from at least one client device 110-1, 110-2, ..., 110-N over the Internet or a wired or wireless communication network. Here, electronic documents include not only document files but also the various postings exchanged online (for example, posts in a café or blog, Internet articles, and comments on specific articles) as well as posts and mentions delivered through social network services (SNS).

The emotion rationale extraction apparatus 100 receives documents from the client devices 110-1, 110-2, ..., 110-N and extracts their emotion rationale. It generates an emotion rationale set using a genetic algorithm, then uses that set to extract the emotion words in a document and determine the document's polarity. The emotion rationale extraction apparatus 100 may separate a training data set from a test data set for use in emotion rationale extraction. The training data set may include the documents needed to generate the emotion rationale set together with the polarity information of those documents. The test data set, in contrast, refers to the data whose emotional polarity is to be determined using the emotion rationale set generated from the training data set. That is, the emotional polarity of the test data set may be the information that the present invention ultimately seeks to obtain.

The client devices 110-1, 110-2, ..., 110-N may be personal terminals that transmit documents to the emotion rationale extraction apparatus 100. A personal terminal may be any terminal capable of exchanging document information with the emotion rationale extraction apparatus 100 through wired or wireless communication, such as a personal computer (PC), a tablet computer, a mobile terminal, or a PDA (Personal Digital Assistant).

Emotion rationale extraction method

FIG. 2 is a flowchart schematically illustrating an emotion rationale extraction method according to an embodiment of the present invention.

Referring to FIG. 2, the emotion rationale extraction apparatus (not shown) compares a received document 202 with a previously stored sentiment lexicon 204 and generates an initial chromosome set (S210). The initial chromosome set may be generated randomly. Each gene of a chromosome represents one attribute, and each word appearing in the training data set may be defined as one attribute. In other words, the words of the data set are encoded as a chromosome, and the initial chromosome set is generated by first selecting arbitrary words from the data set.

Once the initial chromosome set has been constructed, each chromosome, that is, each combination of words, is evaluated through an evaluation function (S220). The purpose of the evaluation function is to maximize the accuracy of classifying document polarity. To classify the polarity of a document, a positive score and a negative score for the document may be calculated using the attribute set, that is, the set of words, selected by each chromosome. Of the positive and negative scores obtained separately in this way, the polarity with the larger score may be taken as the polarity of the document. The accuracy of the resulting classification may be used as the evaluation criterion.
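The evaluation step described above can be sketched as follows. This is an illustrative reading of the text rather than the patented implementation: the toy lexicon `LEXICON`, the helper names `classify` and `fitness`, and the rule that ties count as negative are all assumptions, since the source does not specify lexicon values or tie handling.

```python
# Hypothetical toy lexicon: word -> (positive score, negative score).
LEXICON = {"bad": (0.0, 0.7), "good": (0.8, 0.0), "great": (0.9, 0.0)}
WORDS = sorted(LEXICON)  # gene i of a chromosome corresponds to WORDS[i]


def classify(document, selected):
    """Label a document by comparing its summed positive and negative scores."""
    tokens = [w for w in document.split() if w in selected]
    positive = sum(LEXICON[w][0] for w in tokens)
    negative = sum(LEXICON[w][1] for w in tokens)
    # Assumption: ties (including documents with no selected word) count as negative.
    return "positive" if positive > negative else "negative"


def fitness(chromosome, documents, labels):
    """Classification accuracy achieved by the words this chromosome selects."""
    selected = {w for w, bit in zip(WORDS, chromosome) if bit == 1}
    hits = sum(classify(d, selected) == y for d, y in zip(documents, labels))
    return hits / len(documents)


documents = ["good movie great cast", "bad plot", "great fun"]
labels = ["positive", "negative", "positive"]
score_all_selected = fitness([1, 1, 1], documents, labels)
score_without_great = fitness([1, 1, 0], documents, labels)
```

In this toy data, dropping the word "great" makes the third document unclassifiable by score, so the chromosome that keeps it earns the higher evaluation score.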

After the evaluation function has been applied, a genetic algorithm may be run on the basis of each chromosome's evaluation score to produce new chromosomes (a new generation) (S230). Here, the genetic algorithm includes a selection step (S232), a crossover step (S234), and a mutation step (S236), which are described in more detail below.

Once new chromosomes have been generated, they are checked against the termination condition (S240). If the termination condition is not satisfied, the process returns to calculating evaluation scores and repeating the genetic algorithm; if it is satisfied, the most recently generated chromosome set is extracted as the emotion rationale words (S250). Here, the termination condition may include the condition that the accuracy value set in the apparatus no longer increases by at least a threshold even when the iteration is repeated, or the condition that the number of iterations of the genetic operation has reached a preset count. The definition of accuracy is described in detail below with reference to FIG. 5.
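The two termination conditions can be expressed as a small check over the history of best evaluation scores. The sketch below is one possible reading, not the patented implementation; the function name `should_stop` and the parameters `min_gain` and `patience` (how many generations to look back when judging convergence) are hypothetical, since the source only states that a threshold and a preset iteration count exist.

```python
def should_stop(best_scores, max_generations, min_gain=1e-3, patience=5):
    """Decide whether the genetic algorithm should terminate.

    best_scores holds the best evaluation score of each generation so far.
    """
    # Condition 1: the preset number of generations has been reached.
    if len(best_scores) >= max_generations:
        return True
    # Not enough history yet to judge convergence.
    if len(best_scores) <= patience:
        return False
    # Condition 2: accuracy gained less than `min_gain` over the last
    # `patience` generations (assumed convergence test).
    return best_scores[-1] - best_scores[-1 - patience] < min_gain


still_improving = should_stop([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7], max_generations=100)
converged = should_stop([0.5] + [0.8] * 6, max_generations=100)
```

A driver loop would append the best score after each generation and exit as soon as `should_stop` returns `True`.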

FIG. 3 is a diagram illustrating the genetic algorithm of the emotion rationale extraction method according to an embodiment of the present invention.

Referring to FIG. 3, the emotion rationale extraction apparatus (not shown) generates the chromosome set based on those words in the training documents 302 that are included in the sentiment lexicon 304. Here, each chromosome 310 may represent the words that appear in both the training data set and the sentiment lexicon. Each word also has a positive or negative score according to the sentiment lexicon. Each chromosome 310 in the set may be expressed as a binary string indicating whether each word is selected as a candidate emotion rationale word: 1 denotes a selected word and 0 denotes a word that is not selected. The selected words and their sentiment scores may be used in the classifier learning stage (for example, the emotion rationale extraction stage) and in the sentiment classification stage.

FIG. 4 is a diagram illustrating the process of generating the initial chromosome set in the emotion rationale extraction method according to an embodiment of the present invention.

Referring to FIG. 4, for the initial chromosome set generated from the sentiment lexicon 404 and the training set 402, the candidate emotion rationale words are listed first. The candidates may be those words in the training set that are also included in the sentiment lexicon. As shown in the table in the middle of FIG. 4, each such word has a positive or negative score. On this basis, each chromosome in the initial chromosome set selects a different subset of these words, and each selection pattern is expressed as a string of 1 and 0 bits.
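The construction of the candidate list and the random initial chromosome set can be sketched as follows. The lexicon entries, the whitespace tokenization, and the helper names `candidate_rationale_words` and `random_population` are assumptions made for illustration; the source does not specify how documents are tokenized or which lexicon is used.

```python
import random


def candidate_rationale_words(training_docs, lexicon):
    """Words that occur in the training documents and also have a lexicon entry."""
    seen = {word for doc in training_docs for word in doc.split()}
    return sorted(seen & set(lexicon))


def random_population(num_words, population_size, rng):
    """Each chromosome is a list of 0/1 bits, one bit per candidate word."""
    return [[rng.randint(0, 1) for _ in range(num_words)]
            for _ in range(population_size)]


# Hypothetical toy lexicon: word -> (positive score, negative score).
lexicon = {"good": (0.8, 0.0), "bad": (0.0, 0.7), "great": (0.9, 0.0), "awful": (0.0, 0.9)}
training_docs = ["good movie great cast", "bad plot"]

words = candidate_rationale_words(training_docs, lexicon)
population = random_population(len(words), population_size=4, rng=random.Random(0))
```

Note that "awful" is in the lexicon but absent from the training documents, so it does not become a gene; bit `i` of every chromosome refers to `words[i]`.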

Returning to FIG. 3, the initial chromosome set may consist of a preset number of chromosomes 310. For example, each chromosome 310 of the n strings is generated randomly, as in a standard genetic algorithm. An initial sentiment classification is then performed, and fitness is evaluated through a suitable evaluation function.

In the genetic algorithm (genetic operation), certain adjacent chromosome pairs are selected from the n chromosome strings in the set for crossover, based on tournament selection. Genetic algorithms offer several selection schemes, such as roulette wheel selection and fitness-proportional selection, but tournament selection is preferably used in the present invention. Tournament selection randomly draws a fixed number of individuals from the set and selects the fittest among them; this operation is repeated as many times as needed to carry individuals over to the next generation.
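A minimal sketch of tournament selection as described above, assuming fitness values have already been computed for every chromosome. The tournament size `k` and the helper names are illustrative; the specification does not state a tournament size.

```python
import random

def tournament_select(population, fitness, k, rng):
    """Draw k distinct individuals at random and return the fittest of them."""
    contestants = rng.sample(range(len(population)), k)
    winner = max(contestants, key=lambda i: fitness[i])
    return population[winner]

def select_pairs(population, fitness, k, n_pairs, rng):
    """Run two tournaments per pair to obtain parents for crossover."""
    return [
        (tournament_select(population, fitness, k, rng),
         tournament_select(population, fitness, k, rng))
        for _ in range(n_pairs)
    ]

population = [[0, 0], [1, 0], [1, 1]]
fitness = [0.1, 0.5, 0.9]
parents = select_pairs(population, fitness, k=2, n_pairs=2, rng=random.Random(0))
```

A larger `k` increases selection pressure; with `k` equal to the population size the fittest chromosome always wins.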

Crossover is then performed on the selected chromosomes. Crossover mimics reproduction in algorithmic form; one-point crossover, simple crossover, multi-point crossover, uniform crossover, partially matched crossover, and the like may be used. According to a preferred embodiment of the present invention, the chromosome set is crossed using one-point crossover. That is, the apparatus may perform crossover by exchanging the substrings of a selected pair of strings at a crossover point x. Here, the crossover point x may be determined at random.
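One-point crossover on two bit strings can be sketched as follows. Restricting the crossover point to 1..len-1, so that both children actually mix material from both parents, is an assumption of this sketch.

```python
import random

def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length chromosomes at a random point x."""
    x = rng.randint(1, len(parent_a) - 1)  # requires at least two genes
    child_a = parent_a[:x] + parent_b[x:]
    child_b = parent_b[:x] + parent_a[x:]
    return child_a, child_b

a = [0, 0, 0, 0, 0, 0]
b = [1, 1, 1, 1, 1, 1]
child_a, child_b = one_point_crossover(a, b, random.Random(0))
```

Each child keeps the head of one parent and the tail of the other, so at every position the two children together carry exactly the two parental bits.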

Mutation is then performed. As the simulated evolution continues, the reproduction and crossover operators keep evolving the set, so the chromosomes come to resemble one another; this can lead to a loss of genetic diversity, and mutation is the mechanism for escaping such unwanted solutions. That is, bits in a chromosome are changed according to the mutation probability, which prevents a particular bit from being fixed to the same value in every chromosome of the initial generation. According to a preferred embodiment of the present invention, the mutation operator randomly mutates individual attribute values in the chromosome string based on the mutation probability constant P_m. This completes one cycle of the genetic operation, and the chromosomes are reconstructed.
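Bit-flip mutation with a per-bit probability P_m can be sketched as below. Treating P_m as an independent per-bit flip probability is the usual reading and is assumed here.

```python
import random

def mutate(chromosome, p_m, rng):
    """Flip each bit independently with probability p_m."""
    return [1 - bit if rng.random() < p_m else bit for bit in chromosome]

original = [1, 0, 1, 1, 0]
unchanged = mutate(original, 0.0, random.Random(0))  # p_m = 0 flips nothing
inverted = mutate(original, 1.0, random.Random(0))   # p_m = 1 flips every bit
```

With a small P_m most offspring pass through unchanged, but an occasional flip can reintroduce a word that has disappeared from every chromosome.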

FIG. 5 is a diagram for explaining how emotion evaluation and emotion classification are performed for a specific document in the emotion basis extraction method according to an embodiment of the present invention.

Referring to FIG. 5, the emotion basis extracting apparatus (not shown) evaluates the emotion score of a document through the evaluation function and thereby determines its polarity. As described above, the evaluation function is used to maximize accuracy. In particular, to identify the polarity of a document, a positive emotion score and a negative emotion score must each be obtained.

As shown in FIG. 5, the positive score and the negative score are obtained as follows.

D denotes a document, which consists of a set of words. P and N denote the sets of positive and negative emotion words, respectively. For example, w ∈ P means that w is a positive word, and w ∈ D means that w is a word belonging to the document in question. Two scores can thus be expressed for the whole document: the positive score Score_p(D) and the negative score Score_n(D). Each is calculated by summing the corresponding emotion scores over the document, and the label with the higher score is assigned as the label of the document. In the training process, the optimization problem for this evaluation function can be defined with two objectives: the first is to maximize accuracy, and the second is to maximize the gap between the positive and negative emotion scores of the document. This gap can be expressed as G(D).
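The equations themselves are not reproduced in this text, so the following Python sketch is only one reading of the definitions above. It assumes signed dictionary scores, sums only over words selected by the chromosome, breaks ties in favor of the positive label, and takes G(D) as the absolute difference of the two scores; all of these are assumptions.

```python
def document_scores(doc, lexicon, selected):
    """Return (Score_p(D), Score_n(D)) over the selected emotion basis words."""
    pos = sum(lexicon.get(w, 0.0) for w in doc
              if w in selected and lexicon.get(w, 0.0) > 0)
    neg = sum(-lexicon.get(w, 0.0) for w in doc
              if w in selected and lexicon.get(w, 0.0) < 0)
    return pos, neg

def classify(doc, lexicon, selected):
    """Assign the label whose score is higher (ties go to positive here)."""
    pos, neg = document_scores(doc, lexicon, selected)
    return "positive" if pos >= neg else "negative"

def gap(doc, lexicon, selected):
    """G(D), taken here as |Score_p(D) - Score_n(D)|."""
    pos, neg = document_scores(doc, lexicon, selected)
    return abs(pos - neg)

lexicon = {"good": 0.5, "great": 0.75, "bad": -0.25}   # toy scores
doc = ["good", "bad", "great", "plot"]
selected = {"good", "bad"}                              # words with bit 1
scores = document_scores(doc, lexicon, selected)
```

Because "great" is not selected in this toy chromosome, it contributes nothing even though it appears in both the document and the dictionary.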

By maximizing the difference between the positive and negative emotion scores of the documents, the most influential emotion basis words can be found. In addition, to prevent emotion bases from being extracted with certain emotion basis words left out, the overall accuracy, or the accuracy on the positive and negative training sets, may also be taken into account in the evaluation value.

The evaluation value may be computed not only from the accuracy and G(D) individually but also from combinations of the two. For example, there may be a first accuracy expressed as the product of three values: G(D); the positive accuracy, i.e., the number of documents correctly classified as positive using the emotion bases relative to the total number of actually positive documents; and the negative accuracy, i.e., the number of documents correctly classified as negative using the emotion bases relative to the total number of actually negative documents. There may also be a second accuracy expressed as the product of G(D) and the general accuracy (the number of documents correctly classified using the emotion bases relative to the total number of documents), and a third accuracy consisting of the general accuracy alone. The emotion bases can be extracted by finding the conditions under which these first to third accuracy values are maximized.
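The three evaluation values can be sketched as follows. How G(D) is aggregated over the training set is not stated in this text; the sketch simply takes a precomputed `mean_gap` argument, which is an assumption, as are the function and label names.

```python
def fitness_variants(true_labels, predicted, mean_gap):
    """Return the first, second, and third evaluation values described above."""
    pos_total = sum(1 for t in true_labels if t == "positive")
    neg_total = len(true_labels) - pos_total
    pos_hit = sum(1 for t, p in zip(true_labels, predicted)
                  if t == "positive" and p == "positive")
    neg_hit = sum(1 for t, p in zip(true_labels, predicted)
                  if t == "negative" and p == "negative")
    pos_acc = pos_hit / pos_total if pos_total else 0.0
    neg_acc = neg_hit / neg_total if neg_total else 0.0
    accuracy = (pos_hit + neg_hit) / len(true_labels)
    first = mean_gap * pos_acc * neg_acc   # gap x positive acc. x negative acc.
    second = mean_gap * accuracy           # gap x general accuracy
    third = accuracy                       # general accuracy only
    return first, second, third

truth = ["positive", "positive", "negative", "negative"]
guess = ["positive", "negative", "negative", "negative"]
first, second, third = fitness_variants(truth, guess, mean_gap=2.0)
```

The first variant drops to zero whenever one class is never classified correctly, which penalizes word sets that favor only one polarity.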

In emotion classification, the best chromosome is extracted according to the evaluation value after a number of genetic operations, and the emotion bases derived from that chromosome are then used to classify the emotion of a target document (e.g., a test document). The general classification accuracy used for evaluation with the best chromosome can be calculated as follows.

Classification accuracy = number of correctly classified documents / total number of documents

Additionally, a document may contain a mixture of positive and negative sentiment, so emotion classification may be performed on only some of the sentences in the document. Identifying the best key sentences in the document can therefore improve the performance of emotion classification. To identify them, an emotion score is calculated for each sentence from the emotion words it contains, and the sentence with the highest emotion score is regarded as the best sentence. Only the most important sentences (their number may be a preset value) are then used for emotion classification. As a result, the performance level of emotion classification can be maintained while keeping only certain sentences of the document.
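One way to pick key sentences is sketched below. The text does not say how a sentence's emotion score is formed from its words; summing the absolute dictionary scores is an assumption of this sketch, as are the names used.

```python
def top_sentences(sentences, lexicon, k):
    """Keep the k sentences with the strongest emotion score, in document order."""
    def strength(sentence):
        # Assumed scoring: sum of absolute dictionary scores of the words present.
        return sum(abs(lexicon.get(w, 0.0)) for w in sentence)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: strength(sentences[i]), reverse=True)
    keep = sorted(ranked[:k])  # restore the original sentence order
    return [sentences[i] for i in keep]

lexicon = {"good": 0.5, "bad": -0.5, "boring": -0.25}
sentences = [["good", "plot"], ["the", "end"], ["bad", "boring", "cast"]]
key = top_sentences(sentences, lexicon, k=2)
```

The retained sentences can then be concatenated and scored like any other document.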

FIG. 6 is a flowchart for explaining the different flows of the training set and the test set in the emotion basis extraction method according to an embodiment of the present invention.

Referring to FIG. 6, the training set 602 goes through data preprocessing (S610), which comprises data cleaning, i.e., removing unnecessary words such as stop words, and segmentation, i.e., separating each document into its individual words.

Then, data representation is performed using the emotion dictionary 606; that is, the initial data set is generated in the form of binary strings.

After the initial data set is generated, a training process (S630) comprising the genetic algorithm and the emotion basis extraction step assigns each individual in the initial data set a score based on the evaluation function, produces better individuals through evolution, and repeats this process to generate the final emotion basis set. The iteration aims at the best result (e.g., accuracy, including the first to third accuracies) obtainable using only the selected emotion bases, and the emotion basis set is finalized based on the number of iterations or on whether a predefined termination condition has converged.
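The training loop can be sketched end to end as below. This is a compact illustration, not the implementation of the specification: the `patience` stopping rule (stop when the best score has not improved for a number of generations) stands in for the convergence condition, and the toy fitness `sum` replaces the real evaluation function. All parameter names are assumptions.

```python
import random

def evolve(fitness_fn, num_genes, pop_size, p_c, p_m, max_gen, patience, rng, k=2):
    """Evolve bit-string chromosomes; return the best one evaluated and its score."""
    pop = [[rng.randint(0, 1) for _ in range(num_genes)] for _ in range(pop_size)]
    best, best_fit, stale = None, float("-inf"), 0
    for _ in range(max_gen):
        fits = [fitness_fn(c) for c in pop]
        top = max(range(pop_size), key=lambda i: fits[i])
        if fits[top] > best_fit:
            best, best_fit, stale = pop[top][:], fits[top], 0
        else:
            stale += 1
            if stale >= patience:  # assumed convergence test
                break
        nxt = []
        while len(nxt) < pop_size:
            # Tournament selection of two parents (copied so parents stay intact).
            a = pop[max(rng.sample(range(pop_size), k), key=lambda i: fits[i])][:]
            b = pop[max(rng.sample(range(pop_size), k), key=lambda i: fits[i])][:]
            if rng.random() < p_c:  # one-point crossover
                x = rng.randint(1, num_genes - 1)
                a, b = a[:x] + b[x:], b[:x] + a[x:]
            for child in (a, b):    # bit-flip mutation
                nxt.append([1 - g if rng.random() < p_m else g for g in child])
        pop = nxt[:pop_size]
    return best, best_fit

best, best_fit = evolve(sum, num_genes=8, pop_size=10, p_c=1.0, p_m=0.015,
                        max_gen=30, patience=5, rng=random.Random(0))
```

In a real run `fitness_fn` would classify the training documents with the words selected by the chromosome and return one of the evaluation values described earlier.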

The test set 604 likewise goes through data preprocessing (S610) and data representation (S620). In the test process (S640), the classifier is applied and emotion classification is performed using the emotion basis set generated from the training set.

Then, in the evaluation step (S650), cross-validation may be performed. This is a method of verification: for example, with 100 data items and 5-fold cross-validation, the data are split five times into 80 training items and 20 test items, the test is run on each split, and the average is used to verify the performance of the emotion classification method according to the present invention.
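The 5-fold split in the example above can be sketched as follows. Contiguous, unshuffled folds and the `run_fold` callback are assumptions of this sketch; in practice the data would normally be shuffled first.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_size = n_items // k
    for fold in range(k):
        start = fold * fold_size
        stop = n_items if fold == k - 1 else start + fold_size
        test = list(range(start, stop))
        train = [i for i in range(n_items) if i < start or i >= stop]
        yield train, test

def cross_validate(n_items, k, run_fold):
    """Average the score returned by run_fold(train, test) over all folds."""
    scores = [run_fold(train, test) for train, test in k_fold_splits(n_items, k)]
    return sum(scores) / len(scores)

splits = list(k_fold_splits(100, 5))
```

With 100 items and k = 5, each split has 80 training indices and 20 test indices, and every item is used for testing exactly once.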

Emotion basis extracting apparatus

FIG. 7 is a block diagram schematically showing the configuration of an emotion basis extracting apparatus according to an embodiment of the present invention. As shown in FIG. 7, the emotion basis extracting apparatus 700 according to an embodiment of the present invention may include an emotion basis set generation unit 710 and an emotion classification performing unit 720.

The emotion basis set generation unit 710 may include a preprocessing unit 715, an evaluation index calculation unit 716, a genetic operation unit 717, and an emotion basis extraction unit 718.

The preprocessing unit 715 may include an input unit 711, a word extraction unit 712, and a set generation unit 713. The input unit 711 receives a set of target documents including the training set and the test set. The word extraction unit 712 segments the input documents into words and removes stop words and the like, so that only content words remain. The set generation unit 713 generates the initial chromosome set from those words in the training documents that are also included in the emotion dictionary. A chromosome set may also be generated for the test documents. Each chromosome may represent the words that appear in both the training data set and the emotion dictionary. Each word has a positive or negative score according to the emotion dictionary, and each chromosome in the set may be expressed as a binary string indicating which candidate emotion basis words are selected.

The evaluation index calculation unit 716 calculates a fitness evaluation value for each chromosome through the evaluation function, based on the emotion dictionary.

The genetic operation unit 717 selects certain adjacent chromosome pairs from the generated chromosome strings for crossover, based on tournament selection. Tournament selection randomly draws a fixed number of individuals from the set and selects the fittest among them; this operation is repeated as many times as needed to carry individuals over to the next generation. The genetic operation unit 717 then performs crossover on the selected chromosomes. According to a preferred embodiment of the present invention, the chromosome set is crossed using one-point crossover. The genetic operation unit 717 may perform crossover by exchanging the substrings of a selected pair of strings at a crossover point x, which may be determined at random. As the simulated evolution continues, the reproduction and crossover operators keep evolving the set, so the chromosomes come to resemble one another; this can lead to a loss of genetic diversity, and the genetic operation unit 717 may therefore perform mutation, the mechanism for escaping such unwanted solutions. The genetic operation unit 717 changes bits in a chromosome according to the mutation probability, which prevents a particular bit from being fixed to the same value in every chromosome of the initial generation. According to a preferred embodiment of the present invention, the mutation operator of the genetic operation unit 717 randomly mutates individual attribute values in the chromosome string based on the mutation probability constant P_m. This completes one cycle of the genetic operation, and the chromosomes are reconstructed.

The emotion basis extraction unit 718 determines whether the chromosome set newly generated by the genetic operation unit 717 satisfies the termination condition. If it does not, the set is sent back to the evaluation index calculation unit 716; if it does, the set is determined to be the final emotion basis set.

The emotion classification performing unit 720 performs emotion classification on the words of a test document, which have been turned into a chromosome set by the set generation unit 713, based on the emotion bases extracted by the emotion basis extraction unit 718.

Simulation results

FIG. 8A is a diagram showing the data set used to verify the performance of the emotion basis extraction method according to an embodiment of the present invention, and FIG. 8B is a diagram showing the parameters used for that verification.

Referring to FIGS. 8A and 8B, the simulation was performed on multi-domain review data, using 6,000 reviews selected at random after filtering. The parameters were set as follows: chromosome set size 500, crossover probability 1.0, mutation probability 0.015, and number of generations 1,000.

FIG. 9 is a table comparing the accuracy of the three variants of the emotion basis extraction method according to an embodiment of the present invention with that of the conventional method.

Referring to FIG. 9, accuracy was measured on positive documents, on negative documents, and on average for three methods: a first method based on the difference between the negative and positive scores, described above as G(D), together with the positive accuracy and the negative accuracy; a second method based on that difference and the general accuracy; and a third method based on the general accuracy alone. The conventional baseline (emotion classification with a general emotion dictionary) achieved an accuracy of 60%, whereas the first to third methods achieved 82.9%, 81.4%, and 85.1%, respectively, a markedly better performance than the conventional method.

FIG. 10 is a graph comparing the accuracy of the three variants of the emotion basis extraction method according to an embodiment of the present invention with that of the conventional method.

Referring to FIG. 10, which plots the contents of the table of FIG. 9, the first to third methods according to the present invention show markedly better accuracy than the conventional baseline.

Although the invention has been described with reference to the drawings and embodiments, this does not mean that the scope of protection of the present invention is limited by those drawings or embodiments. Those skilled in the art will understand that the present invention may be variously modified and changed without departing from the spirit and scope of the invention as set forth in the following claims.

Claims

A method for extracting an emotion basis of at least one document, the method comprising:
arbitrarily generating an initial chromosome set based on words in the document, wherein the attribute of each gene of a chromosome denotes a word appearing in the training data set of the document, and the attribute value indicates whether the word is selected;
evaluating each chromosome of the initial chromosome set through an evaluation function, the evaluation function being a function that maximizes the accuracy of classifying the polarity of the document;
generating a new chromosome through a genetic operation based on the evaluation score of each chromosome; and
generating an emotion basis set by determining whether the generated chromosome satisfies a termination condition.

The method according to claim 1, wherein the step of generating an emotion basis set by determining whether the generated chromosome satisfies the termination condition comprises:
determining whether the generated chromosome satisfies the termination condition; and
adding the chromosome to the emotion basis set if the termination condition is satisfied, and repeating the chromosome evaluation step and the new chromosome generation step if the termination condition is not satisfied.

The method according to claim 1,
Wherein each word appearing in the training data set is a word that appears in both the target document and the emotion dictionary data and has a positive or negative score according to the emotion dictionary data.

The method according to claim 1,
Wherein each chromosome in the initial chromosome set is expressed as a binary string indicating whether or not each candidate emotion basis word is selected.

The method according to claim 4, wherein the step of generating a new chromosome through a genetic operation based on the evaluation score of each chromosome comprises:
selecting, from the plurality of chromosome binary strings in the chromosome set, predetermined adjacent string pairs for crossover based on tournament selection, the evaluation score being taken into account in the tournament selection;
performing one-point crossover on the selected string pair by exchanging substrings at a predetermined crossover point; and
mutating individual attribute values in the crossed string based on a mutation probability constant.

The method according to claim 1,
Wherein the termination condition is determined by whether the change in the evaluation score of the chromosomes generated through the genetic operation has converged.

The method according to claim 1,
Wherein the evaluation score is a difference value between a positive emotion index, being the sum of the positive scores of the training data set, and a negative emotion index, being the sum of the negative scores of the training data set, for identifying the polarity of the document.

The method according to claim 1,
A ratio of the number of correctly classified documents to the total number of documents using the generated emotion basis set;
A ratio of the number of correctly classified positive documents among the documents classified as positive using the generated emotion basis set; And
And a ratio of the number of correctly classified negative documents among the documents classified as negative using the generated emotion basis set.

An apparatus for extracting an emotion basis of at least one document, the apparatus comprising:
a chromosome set generation unit for arbitrarily generating an initial chromosome set based on words in the document, wherein the attribute of each gene of a chromosome denotes a word appearing in the training data set of the document, and the attribute value indicates whether the word is selected;
an evaluation unit for evaluating each chromosome of the initial chromosome set through an evaluation function, the evaluation function being a function that maximizes the accuracy of classifying the polarity of the document;
a genetic operation unit for generating a new chromosome through a genetic operation based on the evaluation score of each chromosome; and
an emotion basis set generation unit for generating an emotion basis set by determining whether the generated chromosome satisfies a termination condition.

A method for performing emotion analysis by extracting an emotion basis of at least one document, the method comprising:
arbitrarily generating an initial chromosome set based on words in the document, wherein the attribute of each gene of a chromosome denotes a word appearing in the training data set of the document, and the attribute value indicates whether the word is selected;
evaluating each chromosome of the initial chromosome set through an evaluation function, the evaluation function being a function that maximizes the accuracy of classifying the polarity of the document;
generating a new chromosome through a genetic operation based on the evaluation score of each chromosome;
generating an emotion basis set by determining whether the generated chromosome satisfies a termination condition; and
performing emotion classification of a document to be tested using the generated emotion basis set.

The method of claim 10,
further comprising calculating the emotion analysis accuracy for evaluation using the generated emotion basis set, as the ratio of the number of correctly classified documents to the total number of documents.

The method of claim 10,
further comprising calculating a positive emotion index and a negative emotion index based on the words of the emotion basis set included in the document to be tested, and classifying the emotion having the higher value as the emotion of the document to be tested.

An apparatus for performing emotion analysis by extracting an emotion basis of at least one document, the apparatus comprising:
an emotion basis set generation unit that evaluates each chromosome of an initial chromosome set, in which the attribute of each gene of a chromosome denotes a word appearing in the training data set of the document and the attribute value indicates whether the word is selected, through an evaluation function, the evaluation function being a function that maximizes the accuracy of classifying the polarity of the document, generates a new chromosome based on the evaluation score of each chromosome, and generates an emotion basis set by determining whether the generated chromosome satisfies a termination condition; and
an emotion classification unit for performing emotion classification of a document to be tested using the generated emotion basis set.