KR102594734B1 - Text analysis method using LDA topic modeling technique and text analysis apparatus performing the same

Info

Publication number: KR102594734B1
Application number: KR1020210167701A
Authority: KR
Inventors: 이형종; 이웅성
Original assignee: 주식회사 렉스퍼
Priority date: 2021-06-24
Filing date: 2021-11-29
Publication date: 2023-10-26
Also published as: KR20230000397A

Abstract

The present invention relates to a text analysis method performed by a device, the method including: obtaining, through an LDA model provided in the device, language output values each including unique identification information corresponding to at least one topic included in source data and an occurrence probability of the topic;
counting the number of occurrences of the topic corresponding to each of the language output values; extracting the occurrence probability from each of the language output values; calculating the occurrence probability of the language output values with respect to the number of topic occurrences; and
determining a semantic recognition rate and a structure recognition rate based on at least one of the semantic richness, clarity, and noise of the source data, which are determined from the language output values.

Description

Text analysis method using LDA (Latent Dirichlet Allocation) topic modeling technique and text analysis device for performing the same {TEXT ANALYSIS METHOD USING LDA TOPIC MODELING TECHNIQUE AND TEXT ANALYSIS APPARATUS PERFORMING THE SAME}

The present invention relates to a text analysis method, and more specifically to a text analysis method and a text analysis device using the Latent Dirichlet Allocation (LDA) topic modeling technique.

Natural language processing comprises component technologies such as natural language analysis, understanding, and generation, and is applied in fields such as information retrieval, machine translation, and question answering.

Natural language processing uses techniques such as natural language analysis, natural language understanding, and natural language generation. Natural language analysis can be divided into four levels according to its depth: morphological analysis, syntactic analysis, semantic analysis, and pragmatic analysis. Natural language understanding is a technology that enables a computer to act on input given in natural language, and natural language generation is a technology that converts content such as video or tables into natural language that humans can understand.

Recently, neural network models have been used for such natural language processing.

Such neural network models provide improved performance in semantic analysis, but when little source data is available they fail to provide high accuracy and behave inconsistently.

Korean Laid-open Patent Publication No. 10-2019-0046631

An object of the present invention is to provide a text analysis method and a text analysis device that use a neural network model and a non-neural network model together to perform natural language processing with high accuracy and consistency even when only a small amount of source data is available.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the description below.

To solve the above problem, the present invention provides a text analysis method performed by a device, the method including: obtaining, through an LDA model provided in the device, language output values each including unique identification information corresponding to at least one topic included in source data and an occurrence probability of the topic;

counting the number of occurrences of the topic corresponding to each of the language output values; extracting the occurrence probability from each of the language output values; calculating the occurrence probability of the language output values with respect to the number of topic occurrences; and determining, based on the language output values with respect to the number of topic occurrences, a semantic recognition rate corresponding to the at least one topic and a structure recognition rate of the source data,

wherein the determining of the semantic recognition rate and the structure recognition rate may include

determining the semantic recognition rate and the structure recognition rate based on at least one of the semantic richness, clarity, and noise of the source data, which are determined from the language output values.
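
The first four steps above can be illustrated with a minimal sketch. It assumes that each language output value is a (topic identifier, occurrence probability) pair, and that "the occurrence probability with respect to the number of topic occurrences" is read as a curve of (occurrence count, mean probability) points. The name `probability_by_count`, the sample data, and this reading of the curve are illustrative assumptions, not details given in the specification.

```python
from collections import Counter, defaultdict

# Hypothetical language output values: (topic_id, occurrence_probability) pairs,
# standing in for what an LDA model might emit for the source data.
language_outputs = [
    (0, 0.50), (1, 0.25), (0, 0.75), (2, 0.125), (1, 0.25), (0, 0.25),
]

def probability_by_count(outputs):
    """Return (occurrence_count, mean_probability) points sorted by count."""
    # Step 2: count how often each topic occurs among the output values.
    counts = Counter(topic_id for topic_id, _ in outputs)
    # Step 3: extract the probabilities, grouped by topic.
    probs = defaultdict(list)
    for topic_id, p in outputs:
        probs[topic_id].append(p)
    # Step 4: pair each topic's occurrence count with its mean probability.
    points = [(counts[t], sum(ps) / len(ps)) for t, ps in probs.items()]
    return sorted(points)

curve = probability_by_count(language_outputs)
print(curve)
```

The resulting list of points is the kind of input from which richness, clarity, and noise could then be derived.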

In addition, the determining of the semantic recognition rate corresponding to the at least one topic and the structure recognition rate of the source data may include:

calculating the semantic richness of the source data by integrating the occurrence probability of the language output values with respect to the number of topic occurrences over the number of topic occurrences, and determining the semantic recognition rate corresponding to the topic and the structure recognition rate of the source data based on the semantic richness of the source data.
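
As a minimal sketch of this integration, assuming the probability-versus-count relationship is available as discrete (occurrence count, probability) points, the integral can be approximated with the trapezoidal rule. The function name `richness`, the sample points, and the choice of the trapezoidal rule are assumptions made for illustration; the specification does not state a numerical method.

```python
def richness(curve):
    """Trapezoidal integral of probability over topic occurrence count.

    `curve` is a list of (occurrence_count, probability) points sorted by count.
    """
    total = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        total += (x1 - x0) * (y0 + y1) / 2.0
    return total

# Hypothetical points: probability observed at occurrence counts 1, 2 and 3.
curve = [(1, 0.125), (2, 0.25), (3, 0.5)]
print(richness(curve))
```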

In addition, in the determining of the recognition rate corresponding to the at least one topic and the structure recognition rate of the source data, the semantic recognition rate and the structure recognition rate may be determined to decrease as the semantic richness of the source data increases.

In addition, the determining of the recognition rate corresponding to the at least one topic and the structure recognition rate of the source data may include calculating the semantic clarity of the source data based on the amount of change in the occurrence probability of the language output values with respect to the number of topic occurrences, and determining the semantic recognition rate and the structure recognition rate of the source data based on the semantic clarity of the source data.
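
One possible reading of "the amount of change" is the slope of the probability curve between neighbouring occurrence counts. The sketch below takes the mean absolute slope as the clarity value; the function name `clarity`, the sample points, and the use of a mean absolute slope are all assumptions for illustration rather than the formula of the specification.

```python
def clarity(curve):
    """Mean absolute slope of probability with respect to occurrence count.

    `curve` is a list of (occurrence_count, probability) points sorted by count.
    """
    slopes = [
        abs(y1 - y0) / (x1 - x0)
        for (x0, y0), (x1, y1) in zip(curve, curve[1:])
        if x1 != x0  # skip points that share the same occurrence count
    ]
    return sum(slopes) / len(slopes) if slopes else 0.0

# Hypothetical points: probability observed at occurrence counts 1, 2 and 3.
curve = [(1, 0.125), (2, 0.25), (3, 0.5)]
print(clarity(curve))
```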

In addition, in the determining of the semantic recognition rate and the structure recognition rate,

the semantic recognition rate and the structure recognition rate may be determined to increase as the semantic clarity increases.

In addition, the determining of the recognition rate corresponding to the at least one topic and the structure recognition rate of the source data may include:

calculating an average value of the number of topic occurrences, and calculating the noise of the source data based on the amount of change in the occurrence probability of the language output values for topic occurrence counts at or below the average value; and

determining the semantic recognition rate and the structure recognition rate of the source data based on the noise of the source data.
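
The noise computation can be sketched in the same style: take the mean occurrence count, keep only the points at or below that mean, and measure the change in probability over that low-count region. The function name `noise`, the sample points, and the use of a mean absolute slope as the "amount of change" are illustrative assumptions.

```python
def noise(curve):
    """Mean absolute slope of probability over counts at or below the mean count.

    `curve` is a list of (occurrence_count, probability) points sorted by count.
    """
    if not curve:
        return 0.0
    mean_count = sum(x for x, _ in curve) / len(curve)
    low = [(x, y) for x, y in curve if x <= mean_count]  # rarely occurring topics
    slopes = [
        abs(y1 - y0) / (x1 - x0)
        for (x0, y0), (x1, y1) in zip(low, low[1:])
        if x1 != x0
    ]
    return sum(slopes) / len(slopes) if slopes else 0.0

# Hypothetical points: the mean count is 2, so only counts 1 and 2 are used.
curve = [(1, 0.125), (2, 0.25), (3, 0.5)]
print(noise(curve))
```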

In addition, in the determining of the semantic recognition rate and the structure recognition rate, the semantic recognition rate and the structure recognition rate may be determined to decrease as the noise of the source data increases.
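
The specification states only the direction of each dependency: the recognition rates fall as richness or noise rises, and rise as clarity rises. The sketch below shows one hypothetical scoring function with those monotonic properties. The formula and the name `recognition_score` are invented for illustration and should not be read as the formula used by the invention.

```python
def recognition_score(richness, clarity, noise):
    """Hypothetical score in [0, 1) for non-negative inputs.

    Increases with clarity; decreases with richness and with noise.
    """
    clarity_term = clarity / (1.0 + clarity)   # increasing in clarity, below 1
    penalty = 1.0 + richness + noise           # grows with richness and noise
    return clarity_term / penalty

base = recognition_score(richness=1.0, clarity=0.5, noise=0.1)
print(base)
```

Any other function with the same monotonic behaviour would serve equally well as an illustration.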

In addition, in the obtaining of the language output values, the language output values may be obtained by extracting the at least one topic independently of its order in the source data.
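
Order independence is consistent with the bag-of-words representation that LDA conventionally operates on, in which a document is reduced to token counts. A minimal sketch, with a hypothetical helper `bag_of_words` and made-up sample text, shows that reordering the tokens leaves the representation unchanged:

```python
from collections import Counter

def bag_of_words(tokens):
    """Reduce a token sequence to token counts, discarding order."""
    return Counter(tokens)

tokens = "topic models ignore word order in the source text".split()
reordered = list(reversed(tokens))

# Both orderings yield the same counts, so anything derived from the counts
# alone cannot depend on where a token appeared in the source data.
same = bag_of_words(tokens) == bag_of_words(reordered)
print(same)
```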

The text analysis method according to an embodiment of the present invention may further include performing natural language recognition of the source data based on the semantic recognition rate and the structure recognition rate to extract advanced semantic features, and training a non-neural network model using the extracted advanced semantic features.

A text analysis device according to an embodiment of the present invention is a device that performs natural language processing, the device including: a memory configured to store source data and an LDA model; and

at least one processor configured to communicate with the memory,

wherein the at least one processor may obtain, from an LDA model formed in advance using source data, language output values each including unique identification information and an occurrence probability corresponding to at least one topic included in the source data; count the number of occurrences of the topic corresponding to each of the language output values; extract the occurrence probability from each of the language output values; calculate the occurrence probability of the language output values with respect to the number of topic occurrences; and determine the semantic recognition rate and the structure recognition rate based on at least one of the semantic richness, clarity, and noise of the source data, which are calculated from the language output values with respect to the number of topic occurrences.

Other specific details of the present invention are included in the detailed description and the drawings.

The text analysis method and text analysis device according to an embodiment of the present invention use a neural network model and a non-neural network model together, and can therefore perform natural language processing with high accuracy and consistency even when only a small amount of source data is available.

The text analysis method and text analysis device according to an embodiment of the present invention use statistical methods and the LDA (Latent Dirichlet Allocation) topic modeling technique in semantic analysis to derive the recognition difficulty of text, and can thereby perform natural language processing efficiently.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is a block diagram of a text analysis device according to an embodiment of the present invention.
FIGS. 2 and 3 are diagrams for explaining the operation of a neural network model and a non-neural network model according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining a neural network model according to an embodiment of the present invention.
FIGS. 5A and 5B are diagrams for explaining advanced semantic analysis according to an embodiment of the present invention.
FIGS. 6 to 10 are flowcharts according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only to make the disclosure of the present invention complete and to fully inform those skilled in the art to which the present invention pertains of the scope of the invention, and the present invention is defined only by the scope of the claims.

The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention. In this specification, singular forms also include plural forms unless the context specifically states otherwise. As used in this specification, "comprises" and/or "comprising" does not exclude the presence or addition of one or more components other than those mentioned. Like reference numerals refer to like components throughout the specification, and "and/or" includes each of the mentioned components and every combination of one or more of them. Although "first", "second", and the like are used to describe various components, these components are not limited by these terms. These terms are used only to distinguish one component from another. Accordingly, a first component mentioned below may also be a second component within the technical idea of the present invention.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) have the meanings commonly understood by those skilled in the art to which the present invention pertains. In addition, terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless explicitly and specifically defined.

In this specification, source data may refer to data, including text data, that is used in natural language processing.

In this specification, advanced semantic features may refer to features extracted by performing semantic analysis of the source data.

In this specification, a hybrid model may refer to a natural language processing model that uses both a neural network model and a non-neural network model.

In this specification, a language output value may refer to output data in which identification information corresponding to a topic extracted from the source data is matched with an occurrence probability.

In this specification, the semantic recognition rate may refer to the rate at which the text analysis device recognizes the meaning of the text included in the source data.

In this specification, the structure recognition rate may refer to the rate at which the text analysis device recognizes the structure that the text forms in the source data, such as paragraph structure.

In this specification, modification features may refer to features of language forms in which the text used in the source data has been modified from commonly used language.

In this specification, semantic richness may refer to an indicator corresponding to the amount of meaningful text within the text included in the source data.

In this specification, semantic clarity may refer to the degree to which the topics of the text included in the source data are distinguishable from one another.

In this specification, the noise of the source data may refer to the proportion of the text included in the source data that carries unnecessary meaning.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a control block diagram of a text analysis device 10 according to an embodiment of the present invention.

Referring to FIG. 1, the text analysis device according to an embodiment of the present invention may include a memory, at least one processor, and a communication unit.

Meanwhile, the text analysis device may receive multiple sets of source data D11, D12, and D13.

The first source data D11 may refer to a database containing advanced semantics.

The second source data D12 may refer to a database containing ordinary text.

The third source data D13 may refer to a database of language in a form modified from ordinary language.

The text analysis device may use the data included in the third source data D13 to train both the non-neural network model and the neural network model.

The communication unit 130 may receive source data from a database.

The communication unit 130 may include one or more components that enable communication with an external device, for example at least one of a short-range communication module, a wired communication module, and a wireless communication module.

The memory 110 may store various data for natural language processing, including the source data D11, D12, and D13.

The memory 110 may be implemented as at least one of a cache; a non-volatile memory element such as ROM (Read Only Memory), PROM (Programmable ROM), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), or flash memory; a volatile memory element such as RAM (Random Access Memory); or a storage medium such as a hard disk drive (HDD) or CD-ROM, but is not limited thereto.

The at least one processor 120 may include a natural language recognition unit 121, a neural network model training unit 122, and a non-neural network model training unit 123.

In addition, the at least one processor 120 may include a central processor 124 capable of performing the operations of the natural language recognition unit 121, the neural network model training unit 122, and the non-neural network model training unit 123 described above.

The at least one processor 120 may perform natural language recognition of the first source data D11 through a pre-formed LDA model to extract advanced semantic features.

The LDA model may refer to a model based on latent Dirichlet allocation (LDA).

The advanced semantic features may refer to semantic features of the text corresponding to the first source data D11.

LDA may refer to one of the probabilistic topic modeling techniques that describe which topics are present in each document of a given collection of documents. LDA is described in more detail below.

The at least one processor 120 may use the second source data D12 to extract natural language features including advanced semantic features.

That is, the second source data D12 is data used to form the training data required for ordinary natural language processing, and from it natural language features can be extracted that include not only the advanced semantic features handled with the first source data D11 but also discourse-based features, syntactic features, modification features, and surface features. These are described in detail later.

In addition, the at least one processor 120 may train a non-neural network model for natural language recognition using the advanced semantic features and the natural language features.

Meanwhile, the at least one processor 120 may train a neural network model for natural language recognition using the second source data D12.

In addition, the at least one processor 120 may use the database of the third source data D13, which contains a large amount of modified text, to additionally extract natural language features including modification features (lxSem), and may use these to train the non-neural network model.

The at least one processor 120 may further train the non-neural network model using result data learned by the neural network model.

The at least one processor 120 may then form a hybrid model for natural language recognition using both the non-neural network model and the neural network model. This is described in detail with reference to FIG. 2.

Meanwhile, the at least one processor 120 may obtain, from an LDA model formed in advance using source data, language output values each including unique identification information and an occurrence probability corresponding to at least one topic included in the source data.

The unique identification number may refer to identification information corresponding to a meaningful unit of the text included in the source data.

The occurrence probability may refer to the probability that text occurs in each meaningful unit.

The at least one processor 120 may count the number of occurrences of the topic corresponding to each of the language output values.

In addition, the at least one processor 120 may extract only the occurrence probability from each of the language output values and then calculate the occurrence probability of the language output values with respect to the number of topic occurrences.

The at least one processor 120 may determine the semantic recognition rate corresponding to the at least one topic and the structure recognition rate of the source data based on the language output values with respect to the number of topic occurrences.

The detailed operations by which the text analysis device 10 calculates each of these are described below.

At least one component may be added or removed depending on the performance of the components of the text analysis device 10 shown in FIG. 1. It will also be readily understood by those of ordinary skill in the art that the relative positions of the components may be changed according to the performance or structure of the system.

Meanwhile, each component shown in FIG. 1 refers to software and/or a hardware component such as a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC).

도2 및 도3은 본 발명의 일 실시예에 따른 신경망 모델과 비 신경망 모델의 동작을 설명하기위한 도면이다.Figures 2 and 3 are diagrams for explaining the operation of a neural network model and a non-neural network model according to an embodiment of the present invention.

도2를 참고하면 텍스트 분석 장치는 제1 소스 데이터(D21)와 제2 소스 데이터(D23) 그리고 제3 소스 데이터(D22)를 이용하여 신경망 학습 및 비 신경망 학습을 수행할 수 있다.Referring to Figure 2, the text analysis device can perform neural network learning and non-neural network learning using first source data (D21), second source data (D23), and third source data (D22).

구체적으로 텍스트 분석 장치는 제1 소스 데이터(D21)를 이용하여 LDA기법(M21)을 통하여 고급 의미론 특징(F21)을 추출할 수 있다. 텍스트 분석 장치는 위키 피디아와 같은 데이터 베이스 상에서 각 텍스트의 의미량(richness), 명확도(clarity), 노이즈(noise), 발견된 주제의 총 개수를 추출할 수 있다.Specifically, the text analysis device can extract advanced semantic features (F21) using the first source data (D21) through the LDA technique (M21). A text analysis device can extract the richness, clarity, noise, and total number of discovered topics of each text from a database such as Wikipedia.

또한 텍스트 분석 장치는 제2 소스 데이터(D23)에 포함된 자연어의 문장의 상위 구조를 포함하는 담화 기반 특징(Disco, F22)을 포함하는 자연어 특징을 추출할 수 있다.Additionally, the text analysis device may extract natural language features including discourse-based features (Disco, F22) that include the high-level structure of natural language sentences included in the second source data (D23).

담화 기반 특징(F22)은 상위 수준의 종속성 구조를 포함할 수 있다.Discourse-based features (F22) may contain higher-level dependency structures.

담화 기반 특징(F22)은 미시적 및 거시적 수준에서 텍스트 구조의 추세적 특징을 포함할 수 있다. 담화 기반 특징(F22)에는 엔티티 밀도 기능(EnDF)과 엔티티 그리드(EnGF)기능이 포함될 수 있다.Discourse-based features (F22) may include trending features of text structure at micro and macro levels. Discourse-based features (F22) may include entity density function (EnDF) and entity grid (EnGF) functions.

엔티티 밀도 기능은 개체 수에 따른 인식 정도에 관련이 있으며 엔티티 그리드 기능은 해당 구문의 일관성과 관련이 있다.The entity density feature is related to the degree of recognition based on the number of entities, and the entity grid feature is related to the consistency of the corresponding syntax.

The at least one processor may extract natural language features including syntactic features (Synta, F24), which correspond to the characteristics and structure of the phrases and clauses that make up the natural language contained in the second source data (D23).

The syntactic features (F24) are associated with longer text processing times.

These syntactic features (F24) can affect the overall complexity of a text, which is an important indicator of readability.

The syntactic features (F24) may be implemented in several variants, including counts of noun, verb, and adverbial phrases.

The syntactic features (F24) may also capture the structural shape of the parsed trees, for example through the average parse tree height.
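As a rough illustration of the parse-tree features described above, the following Python sketch counts phrase labels and computes tree height. The nested-tuple tree encoding, the Penn-Treebank-style labels, and the function names are assumptions made for this example; the source does not specify a parser or a data structure.

```python
from collections import Counter

# Assumed encoding: a parse tree node is (label, child, child, ...);
# a leaf (a word) is a plain string.
SAMPLE_TREE = (
    "S",
    ("NP", ("DT", "the"), ("NN", "cat")),
    ("VP", ("VBD", "sat"), ("ADVP", ("RB", "quietly"))),
)


def tree_height(node):
    """Number of labelled levels above the deepest word (a bare word has height 0)."""
    if isinstance(node, str):
        return 0
    return 1 + max((tree_height(child) for child in node[1:]), default=0)


def phrase_counts(node, counts=None):
    """Count how often each label (NP, VP, ADVP, ...) appears in the tree."""
    counts = Counter() if counts is None else counts
    if not isinstance(node, str):
        counts[node[0]] += 1
        for child in node[1:]:
            phrase_counts(child, counts)
    return counts


def average_tree_height(trees):
    """Average parse tree height over the sentences of a text."""
    return sum(tree_height(tree) for tree in trees) / len(trees)
```

In practice the trees would come from a constituency parser; the hand-written `SAMPLE_TREE` only stands in for that output.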

The at least one processor may also extract natural language features including variation features (lxSem, F23), which correspond to the variation in the use of nouns, verbs, adjectives, and adverbs in the natural language contained in the second source data (D23).

The variation features (F23) may include attributes related to the meaning of the vocabulary in the source data, word difficulty, and word unfamiliarity.

Specifically, the variation features (F23) may include the degree of noun, verb, adjective, and adverb variation, each expressed as a ratio of words.

These variation features (F23) may also be extracted using pre-built third source data (D22).

The at least one processor may also train the non-neural network model (M24) using the variation features extracted from the third source data (D22).

That is, the processor may perform non-neural network training (M24) using a database other than the one used for general training, and then use the result to perform natural language processing.

The processor may also extract natural language features including surface features (ShaTr, F25), which correspond to the syllable characteristics of the natural language contained in the second source data (D23).

The surface features (F25) may refer to features that correspond to text difficulty.

The surface features (F25) may include the average number of tokens, syllables, and characters in the text contained in the source data.
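The surface features listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the regular-expression tokenizer, the vowel-group syllable heuristic (which only roughly fits English), and the feature names are hypothetical and are not taken from the source.

```python
import re


def count_syllables(word):
    """Crude English heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def surface_features(text):
    """Average tokens per sentence, syllables per token, and characters per token."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = [tok for s in sentences for tok in re.findall(r"[A-Za-z']+", s)]
    if not sentences or not tokens:
        raise ValueError("text must contain at least one sentence with one token")
    return {
        "avg_tokens_per_sentence": len(tokens) / len(sentences),
        "avg_syllables_per_token": sum(count_syllables(t) for t in tokens) / len(tokens),
        "avg_chars_per_token": sum(len(t) for t in tokens) / len(tokens),
    }
```

A production system would replace the heuristics with a language-appropriate tokenizer and syllabifier; only the averaging logic is the point here.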

Referring to FIGS. 2 and 3 together, the at least one processor may perform neural network training (M22, M23) using the second source data (D23), and may perform non-neural network training (M24) using the first source data (D21), the second source data (D23), and the third source data (D22).

The processor may then use both of these models to form a hybrid model.

Referring to FIG. 3, the processor may determine a first weight (W1) and a second weight (W2), corresponding to the non-neural network model (M31) and the neural network model (M32) respectively, based on the amount of text contained in the first source data and the second source data.

In general, the neural network model (M32) performs poorly on small data sets, and when source data is scarce its results may be inconsistent across corpus classification schemes.

Therefore, when the amount of source data is small, the weight (W1) of the non-neural network model (M31) is increased so that natural language processing remains reliable even with little data; when the amount of source data is large, the weight (W2) of the neural network model (M32) is increased to improve natural language processing performance.

The processor may form the hybrid model (M33) by combining the result data output by the non-neural network model (M31) and the result data output by the neural network model (M32) according to the first weight (W1) and the second weight (W2).

The hybrid model (M33) formed in this way can perform effective and consistent natural language processing even with a small amount of source data.

In summary, the processor may form the non-neural network model (M31) by performing non-neural network training using the first source data, the second source data, and the third source data.

The processor may also form the neural network model (M32) by performing neural network training using the second source data.

When a large amount of source data is available for training, the processor may rely mainly on the neural network model (M32); when little is available, it may rely mainly on the non-neural network model (M31). This balance is set by changing the weights (W1, W2) corresponding to each model.
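The weighting described above can be sketched as follows. The source does not state how W1 and W2 are computed from the amount of text, so the saturating function, the `pivot` parameter, and the function names below are assumptions chosen only to show the intended behaviour: little data favours the non-neural model, much data favours the neural model.

```python
def hybrid_weights(num_tokens, pivot=10_000):
    """Return (w1, w2) for the non-neural and neural models.

    Assumed rule: w2 grows from 0 toward 1 as the amount of text grows,
    reaching 0.5 when num_tokens equals the hypothetical `pivot`.
    """
    if num_tokens < 0 or pivot <= 0:
        raise ValueError("num_tokens must be >= 0 and pivot must be > 0")
    w2 = num_tokens / (num_tokens + pivot)
    return 1.0 - w2, w2


def hybrid_predict(non_neural_probs, neural_probs, num_tokens, pivot=10_000):
    """Blend the class probabilities of both models with the weights above."""
    w1, w2 = hybrid_weights(num_tokens, pivot)
    return [w1 * a + w2 * b for a, b in zip(non_neural_probs, neural_probs)]
```

Because the two weights sum to 1, the blended output stays a valid probability distribution whenever both inputs are.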

The operations described with reference to FIGS. 2 and 3 are merely one embodiment of the present invention, and there is no limitation on how the processor forms a hybrid model to improve natural language processing capability.

FIG. 4 is a diagram for explaining a neural network model using an RNN according to an embodiment of the present invention.

In general, a recurrent neural network (RNN) may be used for natural language processing.

An RNN is a type of artificial neural network in which the connections between units form a cyclic structure.

This structure lets the network store state internally so that it can model time-varying dynamics; unlike a feedforward neural network, it can use this internal memory to process sequential input.

Recurrent neural networks can therefore be applied to data with time-varying characteristics, such as natural language, and for this reason the processor may use an RNN to perform natural language processing.

That is, the processor may use an RNN-based neural network model when training the neural network model.

Specifically, rather than simply concatenating the vectors of the tokens (T41, T42, T43, T4n), the RNN used by the processor takes in sequential information and can thereby learn the sequence information of the text contained in the source data.

However, because an RNN uses only the value of the immediately preceding cell as input, information is gradually lost even though that value incorporates the results computed by earlier cells.

Therefore, when the processor trains with an RNN, accuracy is high for short sentence structures but low for long sentence structures.
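The gradual loss of information can be seen in a toy example. The sketch below uses a single-unit recurrent cell with fixed, hand-picked weights (`w_h`, `w_x`), which are assumptions for illustration and not values from the source; it measures how much the first token still affects the final state as the sequence gets longer.

```python
import math


def run_rnn(inputs, w_h=0.5, w_x=1.0):
    """Single-unit recurrent cell: each state depends only on the previous state and the current input."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
    return h


def first_token_influence(length, delta=1.0):
    """Change in the final state caused by changing only the first input of the sequence."""
    baseline = [0.0] * length
    changed = [delta] + [0.0] * (length - 1)
    return abs(run_rnn(changed) - run_rnn(baseline))
```

With these weights the influence of the first token shrinks by roughly half at every step, which mirrors why a plain RNN handles short sentences better than long ones.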

Overcoming this requires training on a large amount of source data; when only a small amount is available, reliable natural language processing is difficult.

To address this shortcoming, a hybrid model may be formed from a neural network model using an RNN and a non-neural network model trained in parallel.

As described above, the processor may perform non-neural network training in parallel with the RNN-based neural network training and fuse the two to form the final hybrid model.

The operation described with reference to FIG. 4 is merely one example of an RNN used for natural language processing; in addition to a plain RNN, long short-term memory (LSTM) and gated recurrent unit (GRU) networks may be used.

FIGS. 5A and 5B are diagrams for explaining advanced semantic analysis according to an embodiment of the present invention.

Referring to FIG. 5A, the processor may use an LDA model, formed in advance from source data, to extract at least one topic contained in the source data.

Latent Dirichlet allocation (LDA) may refer to a probabilistic generative model for discrete data.

LDA can be applied to text-based data.

LDA may refer to a topic modeling technique in the same family as latent semantic indexing (LSI) and probabilistic latent semantic analysis (pLSA).

LDA makes several assumptions, an important one being the exchangeability of words.

Exchangeability is the assumption that the order of words does not matter and only their presence or absence is significant.

Under LDA, once word order is ignored, a document can be represented solely by the frequencies of the words it contains.

Based on this assumption, LDA provides a mixture model that incorporates the exchangeability of words and documents. The exchangeability assumption can also be extended by treating a group of specific words (an n-gram), rather than a single word, as the unit.
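The exchangeability assumption can be made concrete with a small Python example. The sample sentences and function names are illustrative assumptions; the point is that two documents with the same words in a different order have identical word-frequency representations, while an n-gram unit keeps some local order.

```python
from collections import Counter


def bag_of_words(tokens):
    """Order-free representation: only word frequencies are kept."""
    return Counter(tokens)


def bag_of_ngrams(tokens, n=2):
    """Same idea with n consecutive words as the unit, which preserves local order."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


doc_a = "the model reads the text".split()
doc_b = "the text reads the model".split()

same_unigrams = bag_of_words(doc_a) == bag_of_words(doc_b)
same_bigrams = bag_of_ngrams(doc_a) == bag_of_ngrams(doc_b)
```

Here `same_unigrams` is true and `same_bigrams` is false, so the bigram representation distinguishes the two documents where the single-word representation cannot.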

That is, the processor of the text analysis device may obtain the language output value by extracting at least one topic independently of its order in the source data.

The text analysis device may match each topic with a unique identification number and an occurrence probability, and form the language output value on that basis.

In FIG. 5A, the first topic is matched with identification number 3 and an occurrence probability of 0.45, and the second topic with identification number 7 and an occurrence probability of 0.25.

The third topic is matched with identification number 42 and an occurrence probability of 0.2, and the last topic with identification number 45 and an occurrence probability of 0.1 (O51).

In addition to deriving these occurrence probabilities, the processor may count the number of topic occurrences, that is, how many topics actually appear in the source data.

The processor may output a language output value in which each topic is matched with its identification number and occurrence probability.

The processor may then extract only the occurrence probability of each topic from the language output value (S51).

The processor may compute the occurrence probability of the language output value with respect to the number of topic occurrences, thereby deriving the relationship between the number of topic occurrences and the topic occurrence probability (R51).
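The steps above can be sketched with the values given for FIG. 5A. The `TopicOutput` type, the function names, and the choice to index the probabilities from 1 in descending order are assumptions for illustration; the source only states that each topic carries an identification number and an occurrence probability.

```python
from typing import NamedTuple


class TopicOutput(NamedTuple):
    """One entry of the language output value: topic identifier and occurrence probability."""
    topic_id: int
    probability: float


# Values taken from the FIG. 5A description (O51).
O51 = [
    TopicOutput(3, 0.45),
    TopicOutput(7, 0.25),
    TopicOutput(42, 0.20),
    TopicOutput(45, 0.10),
]


def topic_count(outputs):
    """Number of topics that appear in the source data."""
    return len(outputs)


def extract_probabilities(outputs):
    """Keep only the occurrence probability of each topic (step S51)."""
    return [entry.probability for entry in outputs]


def probability_by_index(outputs):
    """Pair each index i = 1..n with p_i, sorted in descending order (one reading of R51)."""
    probabilities = sorted(extract_probabilities(outputs), reverse=True)
    return list(enumerate(probabilities, start=1))
```

The resulting list of `(i, p_i)` pairs corresponds to the points that would be plotted as the curve in FIG. 5B under this reading.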

FIG. 5B shows the relationship between the number of topic occurrences and the topic occurrence probability as a graph.

As described below, the processor may determine the semantic recognition rate corresponding to at least one topic and the structure recognition rate of the source data based on the language output value with respect to the number of topic occurrences.

As described above, the semantic recognition rate may refer to the degree to which the text analysis device recognizes the text itself, and the structure recognition rate may refer to the degree to which it recognizes the syntax of the text.

First, the processor may compute the semantic amount (R52) of the source data by integrating the occurrence probability of the language output value over the number of topic occurrences.

The semantic amount (R52) of the source data may be computed based on Equation 1 below.

[Equation 1]

In Equation 1, S_R denotes the semantic amount (R52) of the source data.

The semantic amount may refer to the amount of meaningful text within the text contained in the source data.

p_i denotes the topic occurrence probability corresponding to each topic, and i denotes the number of topic occurrences, which corresponds to the x-coordinate of the graph in FIG. 5B.

Specifically, the area under the graph shown in FIG. 5B may represent the semantic amount (R52) of the source data.

A large semantic amount (R52) means that there is a large amount of text to be recognized, so the semantic recognition rate and the structure recognition rate may be determined to decrease as the semantic amount of the source data increases.

That is, when the semantic amount is large, the semantic recognition rate and structure recognition rate of the source data may be determined to be low, and the source data may be determined to have high difficulty and low readability.
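Equation 1 is not reproduced in this text, so the following Python sketch only approximates the description that the semantic amount is the area under the curve of occurrence probability against topic index. The trapezoidal rule, the unit spacing between indices, the descending sort, and the function name are all assumptions and should not be read as the formula of the invention.

```python
def semantic_amount(probabilities):
    """Trapezoidal estimate of the area under the curve through (i, p_i), i = 1..n.

    Assumes unit spacing between consecutive topic indices; a single topic
    spans no width and therefore yields 0.0.
    """
    p = sorted(probabilities, reverse=True)
    return sum((p[k] + p[k + 1]) / 2.0 for k in range(len(p) - 1))


fig_5a = [0.45, 0.25, 0.20, 0.10]    # probabilities from the FIG. 5A description
many_topics = [0.10] * 10            # hypothetical text spread over ten topics
```

Under this reading, a text whose probability mass is spread over many topics (`many_topics`) yields a larger area than one concentrated on a few (`fig_5a`), which matches the statement that a larger semantic amount corresponds to harder text.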

The processor may also compute the semantic clarity (C52) of the source data based on the change in the occurrence probability of the language output value with respect to the number of topic occurrences.

The semantic clarity (C52) of the source data may be determined based on Equation 2 below.

[Equation 2]

In Equation 2, S_c denotes the semantic clarity, max(p) denotes the maximum of the topic occurrence probabilities, and p_i denotes the topic occurrence probability corresponding to each number of topic occurrences.

That is, the processor may determine the semantic clarity (C52) based on how far the occurrence probability at each number of topic occurrences departs from the maximum occurrence probability.

A high semantic clarity (C52) means that the meaning of each text is clearly distinguished, and may therefore indicate a high semantic recognition rate or structure recognition rate.

That is, the processor may determine that there is a positive correlation in which the semantic recognition rate and the structure recognition rate increase as the semantic clarity increases.
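Equation 2 is likewise not reproduced in this text. As a hedged sketch built only from the symbols the description names, max(p) and p_i, the Python function below averages the gap between the maximum probability and each topic probability. The averaging and the function name are assumptions.

```python
def semantic_clarity(probabilities):
    """Mean gap between the largest topic probability and each topic probability.

    A flat distribution gives 0.0 (no topic stands out); a single dominant
    topic gives a value close to the maximum probability.
    """
    if not probabilities:
        raise ValueError("at least one topic probability is required")
    peak = max(probabilities)
    return sum(peak - p for p in probabilities) / len(probabilities)


clear_text = [0.45, 0.25, 0.20, 0.10]   # probabilities from the FIG. 5A description
flat_text = [0.25, 0.25, 0.25, 0.25]    # hypothetical text with no dominant topic
```

With these inputs `clear_text` scores about 0.2 and `flat_text` scores 0.0, consistent with the description that clearly separated topics give higher clarity.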

The processor may further take noise (N52) into account when determining the semantic recognition rate and structure recognition rate of the source data.

Specifically, the processor may compute the average of the number of topic occurrences, and compute the noise (N52) of the source data based on the change in the occurrence probability of the language output value for numbers of topic occurrences at or below that average.

The processor may determine the noise (N52) of the source data based on Equation 3 below.

[Equation 3]

In Equation 3, S_N denotes the noise (N52) of the source data, and p̄ denotes the mean of the topic occurrence probabilities.

High noise can mean that it is difficult to recognize meaning in the source data, so the processor may determine that the semantic recognition rate and structure recognition rate decrease as the noise (N52) of the source data increases.
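Equation 3 is not reproduced in this text either. The sketch below substitutes a standard statistic, the kurtosis of the topic probabilities around their mean, purely as an illustration of a noise-like score that depends on the mean of the topic occurrence probabilities. This is a swapped-in technique: it does not implement the restriction to topic counts at or below the average that the description mentions, and the function name is an assumption.

```python
def semantic_noise(probabilities):
    """Kurtosis of the topic probabilities: n * sum(d^4) / (sum(d^2))^2, with d = p_i - mean.

    Returns 0.0 when every probability equals the mean, since the ratio is
    undefined for a distribution with no spread.
    """
    n = len(probabilities)
    if n == 0:
        raise ValueError("at least one topic probability is required")
    mean = sum(probabilities) / n
    second = sum((p - mean) ** 2 for p in probabilities)
    fourth = sum((p - mean) ** 4 for p in probabilities)
    if second == 0.0:
        return 0.0
    return n * fourth / second ** 2


sample = [0.45, 0.25, 0.20, 0.10]   # probabilities from the FIG. 5A description
```

Any other dispersion measure could be dropped into the same place; the choice of kurtosis here is illustrative only.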

Through the operations described above, the processor may compute the semantic amount (R52), clarity (C52), and noise (N52) of the source data, and on that basis compute the semantic recognition rate of the source data or the structure recognition rate of the syntax that makes up the source data.

The processor may compute the semantic recognition rate using Equation 4 below.

[Equation 4]

In Equation 4, R_m denotes the semantic recognition rate, S_c the semantic clarity, S_R the semantic amount, and S_N the noise. k1 is a constant that may be determined in advance based on the type and amount of the source data.

The processor may compute the structure recognition rate based on Equation 5 below.

[Equation 5]

In Equation 5, R_S denotes the structure recognition rate, S_c the semantic clarity, S_R the semantic amount, and S_N the noise. k2 is a constant that may be determined in advance based on the type and amount of the source data.

Computing the semantic amount (R52), clarity (C52), and noise (N52) from the language output value produced by the LDA model is merely one embodiment of the present invention, and there is no limitation on the operation of determining the semantic recognition rate and structure recognition rate on that basis.

FIGS. 6 to 10 are flowcharts according to an embodiment of the present invention.

Referring to FIG. 6, the text analysis device may acquire source data (S601).

The text analysis device may then extract advanced semantic features through LDA (S602).

The text analysis device may also extract various natural language features other than the advanced semantic features (S603).

Based on the data derived in this way, the text analysis device may train the non-neural network model (S604).

The text analysis device that has acquired the source data may also train the neural network model (S605).

The text analysis device may then form a hybrid model using the trained non-neural network model and the neural network model (S606).

Next, referring to FIG. 7, the text analysis device may acquire source data (S701).

The text analysis device may extract advanced semantic features from the source data (S702).

The text analysis device may extract discourse-based features (S703) and may also extract syntactic features (S704).

It may further extract variation features and surface features from the source data (S705, S706). The text analysis device may then train the non-neural network model based on the features extracted in this way (S707).

FIG. 8 is a flowchart for explaining the characteristics of the neural network model and the non-neural network model described above.

Referring to FIG. 8, the text analysis device may acquire source data (S801). This source data may be used for training through the neural network model and through the non-neural network model. When training is performed through the neural network model (S802), a large amount of source data is required, recognition is less consistent across corpora, and the recognition results are often difficult to explain.

Therefore, the text analysis device may perform non-neural network model training and neural network model training in parallel (S803).

Through this operation, the neural network model and the non-neural network model may be combined to form a hybrid model (S804).

FIG. 9 is a flowchart showing the operation of determining the advanced semantic features and the corresponding recognition rates for the non-neural network model.

Referring to FIG. 9, the text analysis device may acquire source data (S901).

The text analysis device may then obtain a language output value including a topic, identification information, and an occurrence probability (S902).

The text analysis device may count the number of topic occurrences (S903).

It may also extract the occurrence probability corresponding to the number of occurrences of each topic (S904). It may then determine the semantic amount (richness), clarity, and noise corresponding to each topic (S905).

Finally, the text analysis device may determine the semantic recognition rate and structure recognition rate of the source data (S906).

Referring to FIG. 10, the text analysis device may use the source data (S1001) to compute the semantic amount (S1002), clarity (S1003), and noise (S1004) in order to determine the advanced semantic features. The text analysis device may determine the semantic recognition rate and structure recognition rate of the source data (S1005). The text analysis device may then perform non-neural network training (S1006).

The steps of the method or algorithm described in connection with the embodiments of the present invention may be implemented directly in hardware, in a software module executed by hardware, or in a combination of the two. The software module may reside in RAM (Random Access Memory), ROM (Read Only Memory), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable recording medium well known in the art to which the present invention pertains.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

10: Text analysis device
110: Memory
120: Processor
130: Communication unit

Claims

A text analysis method performed by a device, the method comprising:
obtaining, through an LDA model provided in the device, a language output value including unique identification information corresponding to at least one topic included in source data and an occurrence probability of the topic;
counting the number of occurrences of the topic corresponding to each language output value;
extracting the occurrence probability from each language output value;
calculating the occurrence probability of the language output value with respect to the number of occurrences of the topic; and
determining a semantic recognition rate corresponding to the at least one topic and a structure recognition rate of the source data based on the language output value with respect to the number of occurrences of the topic,
wherein determining the semantic recognition rate and the structure recognition rate comprises:
obtaining a language output value including unique identification information corresponding to at least one topic included in the source data and an occurrence probability of the topic;
determining a semantic amount, a semantic clarity, and noise based on the occurrence probability of the language output value with respect to the number of occurrences of the topic;
determining the semantic recognition rate corresponding to the source data based on the semantic amount and the noise; and
determining the structure recognition rate of the source data based on the semantic clarity and the noise.

The text analysis method of claim 1,
wherein determining the semantic recognition rate corresponding to the at least one topic and the structure recognition rate of the source data comprises:
calculating the semantic amount (richness) of the source data by integrating the occurrence probability of the language output value over the number of occurrences of the topic; and
determining the semantic recognition rate corresponding to the topic and the structure recognition rate of the source data based on the semantic amount of the source data.

The text analysis method of claim 2,
wherein determining the semantic recognition rate corresponding to the at least one topic and the structure recognition rate of the source data comprises:
determining that the semantic recognition rate and the structure recognition rate decrease as the semantic amount of the source data increases.

The text analysis method of claim 1,
wherein determining the semantic recognition rate corresponding to the at least one topic and the structure recognition rate of the source data comprises:
calculating the semantic clarity of the source data based on the change in the occurrence probability of the language output value with respect to the number of occurrences of the topic; and
determining the semantic recognition rate and the structure recognition rate of the source data based on the semantic clarity of the source data.

The text analysis method of claim 4,
wherein determining the semantic recognition rate and the structure recognition rate comprises:
determining that the semantic recognition rate and the structure recognition rate increase as the semantic clarity increases.

The text analysis method of claim 1,
wherein determining the semantic recognition rate corresponding to the at least one topic and the structure recognition rate of the source data comprises:
calculating the average of the number of occurrences of the topic;
calculating the noise of the source data based on the change in the occurrence probability of the language output value for numbers of topic occurrences at or below the average; and
determining the semantic recognition rate and the structure recognition rate of the source data based on the noise of the source data.

The text analysis method of claim 6,
wherein determining the semantic recognition rate and the structure recognition rate comprises:
determining that the semantic recognition rate and the structure recognition rate decrease as the noise of the source data increases.

The text analysis method of claim 1,
wherein obtaining the language output value comprises:
extracting the at least one topic independently of its order in the source data to obtain the language output value.

A text analysis device that performs natural language processing, the device comprising:
a memory configured to store source data and an LDA model; and
at least one processor configured to communicate with the memory,
wherein the at least one processor is configured to:
obtain, from an LDA model formed in advance using source data, a language output value including unique identification information and an occurrence probability corresponding to at least one topic included in the source data;
count the number of occurrences of the topic corresponding to each language output value;
extract the occurrence probability from each language output value;
calculate the occurrence probability of the language output value with respect to the number of occurrences of the topic;
obtain a language output value including unique identification information corresponding to at least one topic included in the source data and an occurrence probability of the topic;
determine a semantic amount, a semantic clarity, and noise based on the occurrence probability of the language output value with respect to the number of occurrences of the topic;
determine a semantic recognition rate corresponding to the source data based on the semantic amount and the noise; and
determine a structure recognition rate of the source data based on the semantic clarity and the noise.