KR102384694B1

KR102384694B1 - Natural language processing method and natural language processing device using neural network model and non-neural network model

Info

Publication number: KR102384694B1
Application number: KR1020210082665A
Authority: KR
Inventors: 이형종; 이웅성
Original assignee: 주식회사 렉스퍼
Priority date: 2021-06-24
Filing date: 2021-06-24
Publication date: 2022-04-08
Anticipated expiration: 2041-06-24

Abstract

A natural language processing method according to the present invention may include: extracting advanced semantic features by performing natural language recognition on first source data through an LDA model provided in an apparatus; extracting natural language features, including the advanced semantic features, using second source data; training a non-neural network model for natural language recognition using the advanced semantic features and the natural language features; training a neural network model for the natural language recognition using the second source data; additionally training the non-neural network model using result data learned by the neural network model; and forming a hybrid model for the natural language recognition using both the non-neural network model and the neural network model.

Description

NATURAL LANGUAGE PROCESSING METHOD AND NATURAL LANGUAGE PROCESSING DEVICE USING NEURAL NETWORK MODEL AND NON-NEURAL NETWORK MODEL

The present invention relates to a natural language processing method and, more particularly, to a natural language processing method and a natural language processing apparatus that use a neural network model and a non-neural network model.

Natural language processing (NLP) comprises element technologies such as natural language analysis, understanding, and generation, and is applied in fields such as information retrieval, machine translation, and question answering.

Natural language processing uses techniques such as natural language analysis, natural language understanding, and natural language generation. Natural language analysis can be divided, according to its depth, into four types: morphological analysis, syntactic analysis, semantic analysis, and pragmatic analysis. Natural language understanding is a technology that makes a computer operate according to input given in natural language, and natural language generation is a technology that converts content such as videos or tables into natural language that humans can understand.

Recently, neural network models have been used in such natural language processing.

Although such neural network models improve performance in semantic analysis, they fail to provide high accuracy and behave inconsistently when little source data is available.

Unexamined Patent Publication No. 10-2019-0046631

The problem to be solved by the present invention is to provide a natural language processing method and a natural language processing apparatus that use a neural network model and a non-neural network model together to perform natural language processing with high accuracy and consistency even when only a small amount of source data is available.

The problems to be solved by the present invention are not limited to the problem mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

To solve the above problem, a natural language processing method according to the present invention may include:

extracting advanced semantic features by performing natural language recognition on first source data through an LDA model provided in an apparatus;

extracting natural language features, including the advanced semantic features, using second source data; training a non-neural network model for natural language recognition using the advanced semantic features and the natural language features; training a neural network model for the natural language recognition using the second source data; additionally training the non-neural network model using result data learned by the neural network model; and forming a hybrid model for the natural language recognition using both the non-neural network model and the neural network model.

In addition, the step of extracting the advanced semantic features may include: obtaining a language output value that includes unique identification information corresponding to at least one topic included in the first source data and an occurrence probability of the topic; and determining a semantic recognition rate corresponding to the first source data and a structure recognition rate of the first source data based on the occurrence probability of the language output value relative to the number of occurrences of the topic.

In addition, the extracted natural language features may include a discourse-based feature (Disco) that captures the higher-level structure of the natural language sentences included in the second source data.

In addition, the extracted natural language features may include a syntactic feature (Synta) corresponding to the characteristics and structure of the phrases and clauses that make up the natural language included in the second source data.

In addition, the extracted natural language features may include a variation feature (lxSem) corresponding to the variant use of nouns, verbs, adjectives, and adverbs of the natural language included in the second source data.

The natural language processing method according to an embodiment of the present invention may further include: extracting natural language features including a variation feature (lxSem) corresponding to the variant use of nouns, verbs, adjectives, and adverbs of the natural language included in third source data; and training the non-neural network model based on those natural language features.

In addition, the extracted natural language features may include a surface feature (ShaTr) corresponding to the syllable characteristics of the natural language included in the second source data.

In addition, in the step of forming the hybrid model, the hybrid model may be formed using the non-neural network model trained in parallel with the neural network model, which uses an RNN.

In addition, in the step of forming the hybrid model, a first weight and a second weight corresponding to the non-neural network model and the neural network model, respectively, may be determined based on the amount of text included in the first source data and the second source data, and the hybrid model may be formed by combining the result data output by the non-neural network model and the result data output by the neural network model according to the first weight and the second weight.

An apparatus for performing natural language processing according to an embodiment of the present invention includes: a memory that stores first source data, second source data, and an LDA model; and at least one processor that communicates with the memory.

The at least one processor may extract advanced semantic features by performing natural language recognition on the first source data through the LDA model, extract natural language features including the advanced semantic features using the second source data, train a non-neural network model for natural language recognition using the advanced semantic features and the natural language features, train a neural network model for the natural language recognition using the second source data, additionally train the non-neural network model using result data learned by the neural network model,

and form a hybrid model for the natural language recognition using both the non-neural network model and the neural network model.

Other specific details of the present invention are included in the detailed description and the drawings.

The natural language processing method and natural language processing apparatus according to an embodiment of the present invention use a neural network model and a non-neural network model together and can therefore perform natural language processing with high accuracy and consistency even when only a small amount of source data is available.

The natural language processing method and natural language processing apparatus according to an embodiment of the present invention can also perform natural language processing efficiently by deriving the recognition difficulty of a text through statistical methods and the latent Dirichlet allocation (LDA) topic modeling technique in semantic analysis.

The effects of the present invention are not limited to the effects mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a block diagram of a natural language processing apparatus according to an embodiment of the present invention.
FIGS. 2 and 3 are diagrams for explaining the operation of a neural network model and a non-neural network model according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining a neural network model according to an embodiment of the present invention.
FIGS. 5A and 5B are diagrams for explaining advanced semantic analysis according to an embodiment of the present invention.
FIGS. 6 to 10 are flowcharts according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of its scope; the present invention is defined only by the scope of the claims.

The terminology used herein is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular also includes the plural unless specifically stated otherwise. As used herein, "comprises" and/or "comprising" does not exclude the presence or addition of one or more components other than those stated. Like reference numerals refer to like components throughout the specification, and "and/or" includes each of the stated components and every combination of one or more of them. Although "first", "second", and so on are used to describe various components, the components are not limited by these terms. The terms are used only to distinguish one component from another. Accordingly, a first component mentioned below may also be a second component within the technical spirit of the present invention.

Unless otherwise defined, all terms used herein (including technical and scientific terms) have the meaning commonly understood by those of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are explicitly and specifically defined.

In this specification, source data may refer to data, including text data, that is used in natural language processing.

In this specification, an advanced semantic feature may refer to a feature extracted by performing semantic analysis on source data.

In this specification, a hybrid model may refer to a natural language processing model that uses both a neural network model and a non-neural network model.

In this specification, a language output value may refer to output data in which identification information corresponding to a topic extracted from source data is matched with an occurrence probability.

In this specification, the semantic recognition rate may refer to the rate at which the natural language processing apparatus recognizes the meaning of text included in source data.

In this specification, the structure recognition rate may refer to, for example, the structure of the paragraphs that the text forms in the source data.

In this specification, a variation feature may refer to a characteristic of language forms in which the text used in the source data departs from commonly used language.

In this specification, semantic richness may refer to an index corresponding to the amount of meaningful text within the text included in source data.

In this specification, semantic clarity may refer to the degree to which the topics of the text included in source data are distinguished from one another.

In this specification, the noise of source data may refer to the proportion of text that corresponds to unnecessary meaning within the text included in the source data.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a control block diagram of a natural language processing apparatus 10 according to an embodiment of the present invention.

Referring to FIG. 1, the natural language processing apparatus according to an embodiment of the present invention may include a memory, at least one processor, and a communication unit.

The natural language processing apparatus may receive several sets of source data (D11, D12, D13).

The first source data D11 may refer to a database containing advanced semantics.

The second source data D12 may refer to a database containing ordinary text.

The third source data D13 may refer to a database of language in a form that departs from ordinary language.

The natural language processing apparatus may perform both the training of the non-neural network model and the training of the neural network model using the data included in the third source data D13.

The communication unit 130 may receive source data from a database.

The communication unit 130 may include one or more components that enable communication with an external device, for example at least one of a short-range communication module, a wired communication module, and a wireless communication module.

The memory 110 may store various data for natural language processing, including the source data (D11, D12, D13).

The memory 110 may be implemented as at least one of a nonvolatile memory device such as a cache, ROM (read-only memory), PROM (programmable ROM), EPROM (erasable programmable ROM), EEPROM (electrically erasable programmable ROM), or flash memory; a volatile memory device such as RAM (random access memory); or a storage medium such as a hard disk drive (HDD) or CD-ROM, but is not limited thereto.

The at least one processor 120 may include a natural language recognition unit 121, a neural network model learning unit 122, and a non-neural network model learning unit 123.

The at least one processor 120 may also include a central processor 124 that carries out the operations of the modules described above, namely the natural language recognition unit 121, the neural network model learning unit 122, and the non-neural network model learning unit 123.

The at least one processor 120 may extract advanced semantic features by performing natural language recognition on the first source data D11 through a pre-built LDA model.

The LDA model may refer to a model based on latent Dirichlet allocation (LDA).

The advanced semantic features may refer to semantic features of the text corresponding to the first source data D11.

LDA may refer to one of the probabilistic topic modeling techniques that describe which topics are present in each document of a given set of documents. LDA is described in more detail below.

The at least one processor 120 may extract natural language features, including the advanced semantic features, using the second source data D12.

That is, the second source data D12 is the data used to form the training data required for ordinary natural language processing. From it, natural language features can be extracted that include not only the advanced semantic features handled through the first source data D11 but also the discourse-based, syntactic, variation, and surface features. These are described in detail later.

The at least one processor 120 may also train a non-neural network model for natural language recognition using the advanced semantic features and the natural language features.

The at least one processor 120 may, on the other hand, train a neural network model for natural language recognition using the second source data D12.

The at least one processor 120 may, however, additionally extract natural language features including the variation feature (lxSem) from the database of third source data D13, which contains many variant texts, and use them to train the non-neural network model.

The at least one processor 120 may perform additional training of the non-neural network model using result data learned by the neural network model.
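One possible reading of this additional training step is that the output of the neural network model for each text is appended to that text's hand-crafted feature vector before the non-neural network model is trained again. The passage does not fix the mechanism, so the following Python sketch rests on that assumption; the function name `augment_features` and all values are hypothetical.

```python
from typing import List, Sequence


def augment_features(
    handcrafted: Sequence[Sequence[float]],
    neural_outputs: Sequence[float],
) -> List[List[float]]:
    """Append each text's neural-model output to its hand-crafted features.

    Assumption (not stated in the source): the "result data learned by the
    neural network model" is a single score per text.
    """
    if len(handcrafted) != len(neural_outputs):
        raise ValueError("one neural output is required per feature vector")
    return [list(row) + [score] for row, score in zip(handcrafted, neural_outputs)]


# Toy data: two texts with three hand-crafted features each (hypothetical).
features = [[0.2, 1.5, 3.0], [0.7, 0.9, 4.0]]
neural_scores = [0.81, 0.34]
augmented = augment_features(features, neural_scores)
```

The augmented vectors could then be passed to whatever non-neural learner is in use; that learner is deliberately left out of the sketch.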

The at least one processor 120 may then form a hybrid model for natural language recognition using both the non-neural network model and the neural network model. This is described in detail with reference to FIG. 2.

Meanwhile, the at least one processor 120 may obtain, from an LDA model built in advance from the source data, language output values that each include unique identification information and an occurrence probability corresponding to at least one topic included in the source data.

The unique identification number may refer to identification information corresponding to a meaningful unit of the text included in the source data.

The occurrence probability may refer to the probability that text occurs in each meaningful unit.

The at least one processor 120 may count the number of occurrences of the topic corresponding to each language output value.

The at least one processor 120 may also extract only the occurrence probability from each language output value and then compute the occurrence probability of the language output values relative to the number of occurrences of the topic.

Based on the language output values relative to the number of topic occurrences, the at least one processor 120 may determine the semantic recognition rate corresponding to the at least one topic and the structure recognition rate of the source data.
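The counting and ratio steps above can be illustrated with a short Python sketch. It assumes that each language output value is a `(topic ID, probability)` pair and that "the occurrence probability relative to the number of occurrences" means the summed probability of a topic divided by its occurrence count; the passage does not give the exact formula, so both points are assumptions, and the names used are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed shape of a language output value: (topic ID, occurrence probability).
LanguageOutput = Tuple[int, float]


def probability_per_occurrence(outputs: Iterable[LanguageOutput]) -> Dict[int, float]:
    """For each topic, divide its summed probability by its occurrence count."""
    counts: Dict[int, int] = defaultdict(int)
    prob_sums: Dict[int, float] = defaultdict(float)
    for topic_id, prob in outputs:
        counts[topic_id] += 1         # count occurrences of the topic
        prob_sums[topic_id] += prob   # keep only the probability part
    return {topic: prob_sums[topic] / counts[topic] for topic in counts}


# Toy language output values (hypothetical topic IDs and probabilities).
outputs = [(3, 0.5), (7, 0.25), (3, 0.25), (7, 0.25), (3, 0.75)]
ratios = probability_per_occurrence(outputs)
```

How these per-topic ratios map to the semantic recognition rate and the structure recognition rate is not specified at this point in the text, so the sketch stops at the ratio.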

The detailed operations by which the natural language processing apparatus 10 computes each of these values are described below.

At least one component may be added or removed according to the performance of the components of the natural language processing apparatus 10 shown in FIG. 1. Those of ordinary skill in the art will also readily understand that the relative positions of the components may be changed according to the performance or structure of the system.

Each component shown in FIG. 1 refers to a software component and/or a hardware component such as a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).

FIGS. 2 and 3 are diagrams for explaining the operation of the neural network model and the non-neural network model according to an embodiment of the present invention.

Referring to FIG. 2, the natural language processing apparatus may perform neural network training and non-neural network training using first source data D21, second source data D23, and third source data D22.

Specifically, the natural language processing apparatus may extract advanced semantic features F21 from the first source data D21 through the LDA technique M21. From a database such as Wikipedia, the apparatus may extract the semantic richness, clarity, and noise of each text, together with the total number of topics discovered.

The natural language processing apparatus may also extract natural language features including a discourse-based feature (Disco, F22) that captures the higher-level structure of the natural language sentences included in the second source data D23.

The discourse-based feature F22 may include higher-level dependency structures.

The discourse-based feature F22 may include trend characteristics of the text structure at the micro and macro levels. The discourse-based feature F22 may include entity density features (EnDF) and entity grid features (EnGF).

The entity density features relate to the degree of recognition as a function of the number of entities, and the entity grid features relate to the consistency of the corresponding phrases.

The at least one processor may extract natural language features including a syntactic feature (Synta, F24) corresponding to the characteristics and structure of the phrases and clauses that make up the natural language included in the second source data D23.

The syntactic feature F24 is associated with longer text processing time.

The syntactic feature F24 can affect the overall complexity of a text, which is an important indicator of readability.

The syntactic feature F24 may be implemented in several variants, including the number of noun, verb, and adverb phrases.

The syntactic feature F24 may also include the structural shape of the parse trees, as captured for example by the average parse tree height.
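As an illustration of the average parse tree height mentioned above, the following Python sketch computes it for trees that have already been produced by some parser. The nested-tuple tree encoding, the convention that a bare token has height 0, and the function names are assumptions made for the example; no particular parser is implied by the source.

```python
from typing import Sequence, Union

# Assumed encoding: a leaf is a token string; an inner node is
# (label, child, child, ...).
Tree = Union[str, tuple]


def tree_height(tree: Tree) -> int:
    """Height of a parse tree; a bare leaf token has height 0."""
    if isinstance(tree, str):
        return 0
    _, *children = tree
    return 1 + max((tree_height(child) for child in children), default=0)


def average_tree_height(trees: Sequence[Tree]) -> float:
    """Mean height over the parsed sentences (0.0 for an empty input)."""
    if not trees:
        return 0.0
    return sum(tree_height(t) for t in trees) / len(trees)


# Two hand-written toy parses (hypothetical).
trees = [
    ("S",
     ("NP", ("DT", "the"), ("NN", "cat")),
     ("VP", ("VBD", "sat"),
      ("PP", ("IN", "on"), ("NP", ("DT", "the"), ("NN", "mat"))))),
    ("S", ("NP", ("PRP", "it")), ("VP", ("VBD", "slept"))),
]
avg_height = average_tree_height(trees)
```

Counts of noun, verb, and adverb phrases could be gathered with a similar recursive walk over the node labels.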

The at least one processor may also extract natural language features including a variation feature (lxSem, F23) corresponding to the variant use of nouns, verbs, adjectives, and adverbs of the natural language included in the second source data D23.

The variation feature F23 may include attributes related to the meaning of the vocabulary in the source data, the difficulty of the words, and the unfamiliarity of the words.

Specifically, the variation feature F23 may include the degree of variation of nouns, verbs, adjectives, and adverbs, expressed as the ratio of each word class.
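The per-class ratios can be sketched as follows in Python. The sketch assumes that the text has already been part-of-speech tagged, that the tag names are `NOUN`, `VERB`, `ADJ`, and `ADV`, and that each ratio is taken over all tokens; none of these details is specified in the source.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

# Assumed tag names for the four word classes named in the text.
CONTENT_TAGS = ("NOUN", "VERB", "ADJ", "ADV")


def pos_ratios(tagged: Sequence[Tuple[str, str]]) -> Dict[str, float]:
    """Share of all tokens that fall into each of the four word classes."""
    counts = Counter(tag for _, tag in tagged)
    total = len(tagged)
    return {tag: (counts[tag] / total if total else 0.0) for tag in CONTENT_TAGS}


# Hand-tagged toy sentence (hypothetical tags).
tagged = [
    ("the", "DET"), ("quick", "ADJ"), ("fox", "NOUN"), ("jumps", "VERB"),
    ("very", "ADV"), ("high", "ADV"), ("today", "NOUN"), (".", "PUNCT"),
]
ratios_by_class = pos_ratios(tagged)
```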

한편 이러한 변형 특징(F23)은 미리 형성된 제3 소스 데이터(D22)를 이용하여 추출할 수도 있다. Meanwhile, the deformable feature F23 may be extracted using the previously formed third source data D22.

또한 적어도 하나의 프로세서는 제3 소스 데이터(D22)를 통하여 추출된 변형 특징을 이용하여 비 신경망 모델(M24)의 학습을 수행할 수 있다.In addition, at least one processor may perform learning of the non-neural network model M24 by using the deformed feature extracted through the third source data D22.

즉 프로세서는 일반적인 학습에 이용되는 데이터 베이스 이외에 다른 데이터 베이스를 이용하여 비 신경망 학습(M24)을 수행하고 이후 자연어 처리를 수행하는데 이용할 수 있다.That is, the processor may perform non-neural network learning (M24) using a database other than the database used for general learning and then use it to perform natural language processing.

또한 프로세서는, 제2 소스 데이터(D23)에 포함된 자연어의 음절 특성에 대응되는 표면적 특징(ShaTr, F25)을 포함하는 자연어 특징을 추출할 수 있다.Also, the processor may extract the natural language feature including the surface features ShaTr and F25 corresponding to the syllable feature of the natural language included in the second source data D23.

표면적 특징(F25)은 난이도에 대응되는 특징을 의미할 수 있다.The surface area feature F25 may mean a feature corresponding to difficulty.

표면적 특징(F25)에는 소스 데이터에 포함된 텍스트의 토큰(Token), 음절 및 문자의 평균 개수가 포함될 수 있다.The surface feature F25 may include an average number of tokens, syllables, and characters of text included in the source data.
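
As a rough illustration of the surface feature F25, the sketch below computes average token, syllable, and character counts. Whitespace tokenization, the vowel-run syllable heuristic for English, and all function names are assumptions for illustration only; the specification does not fix a tokenizer or a syllable counter.

```python
import re

def count_syllables(word):
    """Rough English heuristic (an assumption): one syllable per vowel run."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def surface_features(sentences):
    """Average tokens per sentence, syllables per token, characters per token."""
    tokens = [tok for sentence in sentences for tok in sentence.split()]
    return {
        "avg_tokens_per_sentence": len(tokens) / len(sentences),
        "avg_syllables_per_token": sum(map(count_syllables, tokens)) / len(tokens),
        "avg_chars_per_token": sum(map(len, tokens)) / len(tokens),
    }

features = surface_features(["the cat sleeps", "dogs bark"])
```

For Korean text, a syllable count could instead be taken from the number of Hangul syllable blocks per token, which would replace `count_syllables` without changing the rest of the sketch.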

Referring to FIGS. 2 and 3 together, the at least one processor may perform neural network learning M22 and M23 using the second source data D23, and may perform non-neural network learning M24 using the first source data D21, the second source data D23, and the third source data D32.

The processor may form a hybrid model using both of these models.

Referring to FIG. 3, the processor may determine a first weight W1 and a second weight W2, corresponding to the non-neural network model M31 and the neural network model M32 respectively, based on the amount of text included in the first source data and the second source data.

In general, the neural network model M32 performs poorly on small data sets, and when source data is scarce its results may lack consistency across corpus classification methods.

Accordingly, when the amount of source data is small, the weight W1 corresponding to the non-neural network model M31 is increased so that natural language processing works smoothly even with little source data; when the amount of source data is large, the weight W2 corresponding to the neural network model M32 is increased to improve natural language processing performance.

Based on the first weight W1 and the second weight W2, the processor may form the hybrid model M33 by combining the result data output by the non-neural network model M31 with the result data output by the neural network model M32.
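
The weighting described for FIG. 3 can be sketched as a weighted average of the two models' outputs. The specification does not state how W1 and W2 are derived from the amount of text, so the rule in `model_weights`, the `pivot` constant, and the function names are purely hypothetical.

```python
def model_weights(num_tokens, pivot=10_000):
    """Hypothetical rule: the neural weight W2 grows with the amount of text.

    `pivot` is an illustrative constant at which W1 and W2 are both 0.5.
    """
    w2 = num_tokens / (num_tokens + pivot)
    w1 = 1.0 - w2
    return w1, w2

def hybrid_predict(non_neural_probs, neural_probs, w1, w2):
    """Weighted average of two per-class probability lists."""
    return [w1 * a + w2 * b for a, b in zip(non_neural_probs, neural_probs)]

# With little text the non-neural model dominates; with more text the
# neural model takes over.
small_w1, small_w2 = model_weights(num_tokens=1_000)
w1, w2 = model_weights(num_tokens=10_000)
combined = hybrid_predict([0.8, 0.2], [0.4, 0.6], w1, w2)
```

Because W1 and W2 sum to one in this sketch, the combined output remains a probability distribution whenever both inputs are.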

The hybrid model M33 formed in this way can perform effective and consistent natural language processing even with a small amount of source data.

In summary, the processor may form the non-neural network model M31 by performing non-neural network learning using the first source data, the second source data, and the third source data.

The processor may also form the neural network model M32 by performing neural network learning using the second source data.

When a large amount of source data is available for learning, the processor may rely mainly on the neural network model M32; when only a small amount is available, it may rely mainly on the non-neural network model M31. This balance is set by changing the weights W1 and W2 corresponding to each model.

The operations described with reference to FIGS. 2 and 3 are merely one embodiment of the present invention, and there is no limitation on the embodiments in which the processor forms a hybrid model to improve natural language processing capability.

FIG. 4 is a diagram for explaining a neural network model using an RNN according to an embodiment of the present invention.

In general, a recurrent neural network (RNN) may be used for natural language processing.

An RNN is a type of artificial neural network in which the connections between units form a cyclic structure.

This structure allows state to be stored inside the network so that time-varying dynamic features can be modeled. Unlike a feedforward neural network, an RNN can therefore use its internal memory to process input in the form of a sequence.

Accordingly, an RNN can be applied to data with time-varying characteristics, such as natural language recognition. Because of this property, the processor may use an RNN to perform natural language processing.

That is, the processor may use an RNN-based neural network model when performing learning through the neural network model.

Specifically, rather than simply concatenating the vectors of the tokens T41, T42, T43, and T4n, the RNN used by the processor takes in consecutive sequence information and can thereby learn the sequence information of the text included in the source data.

However, because an RNN uses only the value of the immediately preceding cell as input, information may gradually be lost even though that value incorporates the computation results of earlier cells.

Therefore, when the processor performs learning using an RNN, accuracy is high for short sentence structures but poor for long sentence structures.
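
The two properties described above, sensitivity to token order and gradual loss of early information, can be seen in a toy recurrence. This is a minimal sketch with a scalar hidden state and fixed, untrained weights chosen only for illustration; it is not the network of FIG. 4.

```python
import math

def rnn_final_state(inputs, w_in=0.5, w_rec=0.5):
    """Scalar Elman-style recurrence: h_t = tanh(w_in * x_t + w_rec * h_{t-1}).

    The weights are fixed illustrative values, not trained parameters.
    """
    h = 0.0
    for x in inputs:
        # Only the previous state h is carried forward to the next step.
        h = math.tanh(w_in * x + w_rec * h)
    return h

# Same inputs in a different order give a different final state.
forward = rnn_final_state([1.0, 0.0, 0.0])
backward = rnn_final_state([0.0, 0.0, 1.0])

# The influence of the first input fades as the sequence grows longer.
short_gap = abs(rnn_final_state([1.0] + [0.0] * 2) - rnn_final_state([0.0] * 3))
long_gap = abs(rnn_final_state([1.0] + [0.0] * 20) - rnn_final_state([0.0] * 21))
```

Here `long_gap` is far smaller than `short_gap`, which mirrors the loss of accuracy on long sentence structures.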

Overcoming this requires learning from a large amount of source data; when only a small amount of source data has been learned, smooth natural language processing is difficult.

To overcome this drawback, a hybrid model can be formed using an RNN-based neural network model together with a non-neural network model that is trained in parallel.

As described above, the processor may perform non-neural network learning in parallel with the model that performs RNN-based neural network learning, and fuse the two to form the final hybrid model.

The operation described with reference to FIG. 4 is merely one embodiment of an RNN used for natural language processing; in addition to an RNN, a long short-term memory (LSTM) network or a gated recurrent unit (GRU) may be used for natural language processing.

FIGS. 5A and 5B are diagrams for explaining advanced semantic analysis according to an embodiment of the present invention.

Referring to FIG. 5A, the processor may extract at least one topic included in the source data using an LDA model formed in advance from source data.

Latent Dirichlet allocation (LDA) may refer to a probabilistic generative model for discrete data.

LDA can be applied to text-based data.

LDA may refer to a topic modeling technique that follows latent semantic indexing (LSI) and probabilistic latent semantic analysis (pLSA).

LDA rests on several assumptions, the most important of which is the exchangeability of words.

Exchangeability may mean the assumption that the order of words does not matter and only their presence matters.

Accordingly, when word order is ignored, a document can be represented solely by the frequencies of the words it contains.

On this assumption, LDA can provide a mixture model that incorporates the exchangeability of words and documents. The exchangeability assumption can also be extended by treating a group of specific words, rather than a single word, as one unit (n-gram).
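
The exchangeability assumption and its n-gram extension can be illustrated with plain frequency counts. This is a minimal sketch using only the Python standard library; the function names and example sentences are illustrative and do not come from the specification.

```python
from collections import Counter

def bag_of_words(tokens):
    """Word frequencies only; word order is discarded."""
    return Counter(tokens)

def bag_of_ngrams(tokens, n=2):
    """Frequencies of n consecutive tokens treated as a single unit."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

a = "dog bites man".split()
b = "man bites dog".split()

# Under word-level exchangeability the two sentences are indistinguishable.
same_unigrams = bag_of_words(a) == bag_of_words(b)

# With bigrams as the unit, some local order information is retained.
same_bigrams = bag_of_ngrams(a) == bag_of_ngrams(b)
```

The unigram representations are identical while the bigram representations differ, which is the sense in which the n-gram unit relaxes the assumption.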

That is, the processor of the natural language processing apparatus may obtain a language output value by extracting at least one topic independently of its order in the source data.

In the natural language processing apparatus, each topic may be matched with a unique identification number and an occurrence probability, and the language output value may be formed on this basis.

In FIG. 5A, the first topic is matched with identification number 3 and an occurrence probability of 0.45, and the second topic with identification number 7 and an occurrence probability of 0.25.

The third topic is matched with identification number 42 and an occurrence probability of 0.2, and the last topic with identification number 45 and an occurrence probability of 0.1 (O51).

In addition to deriving these occurrence probabilities, the processor may count the number of topic occurrences, which corresponds to how many times topics actually appear in the source data.

The processor may output a language output value in which each topic is matched with an identification number and an occurrence probability.

The processor may then extract only the occurrence probability of each topic from the language output value (S51).
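
The language output value of FIG. 5A and the extraction step S51 can be pictured as follows. Representing the output as a list of `(identification number, occurrence probability)` pairs is an assumption made for illustration; the numbers are the ones given for FIG. 5A.

```python
# (identification number, occurrence probability) pairs from the FIG. 5A example.
language_output = [(3, 0.45), (7, 0.25), (42, 0.2), (45, 0.1)]

def occurrence_probabilities(output):
    """Keep only the probabilities, preserving the order of the topics."""
    return [prob for _topic_id, prob in output]

def probability_by_topic(output):
    """Lookup table from identification number to occurrence probability."""
    return dict(output)

probs = occurrence_probabilities(language_output)
lookup = probability_by_topic(language_output)
```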

The processor may derive the relationship between the number of topic occurrences and the topic occurrence probability by computing the occurrence probability of the language output value against the number of topic occurrences (R51).

FIG. 5B shows the relationship between the number of topic occurrences and the topic occurrence probability as a graph.

Referring to FIG. 5B, and as described below, the processor may determine a semantic recognition rate corresponding to at least one topic and a structure recognition rate of the source data based on the language output value with respect to the number of topic occurrences.

As described above, the semantic recognition rate may mean the degree to which the natural language processing apparatus recognizes the text itself, and the structure recognition rate may mean the degree to which it recognizes the syntax of the text.

First, the processor may calculate the semantic amount R52 of the source data by integrating the occurrence probability of the language output value over the number of topic occurrences.

The semantic amount R52 of the source data may be calculated based on Equation 1.

In Equation 1, S_R may denote the semantic amount R52 of the source data.

The semantic amount may mean the amount of meaningful text within the text included in the source data.

p_i may denote the topic occurrence probability corresponding to each topic, and i denotes the number of topic occurrences, which may correspond to the x-coordinate of the graph in FIG. 5B.

Specifically, the area under the curve in the graph of FIG. 5B may represent the semantic amount R52 of the source data.
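
The description of the semantic amount as an area under the probability curve can be sketched with numerical integration. Equation 1 itself is not reproduced in this text, so the trapezoidal rule used here is only one reading of "integrating the occurrence probability over the number of topic occurrences"; the function name and the x-values 1 to 4 are hypothetical.

```python
def semantic_amount(counts, probs):
    """Trapezoidal area under the (occurrence count, probability) curve.

    This is an assumed reading of the description, not Equation 1 itself.
    """
    points = sorted(zip(counts, probs))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Probabilities from the FIG. 5A example on hypothetical x-values 1..4.
richness = semantic_amount([1, 2, 3, 4], [0.45, 0.25, 0.2, 0.1])
```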

A large semantic amount R52 means that there is a large amount of text to be recognized, so the semantic recognition rate and the structure recognition rate may be determined to decrease as the semantic amount of the source data increases.

That is, when the semantic amount is large, the semantic recognition rate and the structure recognition rate of the source data may be determined to be low, and the source data may be determined to have high difficulty and low readability.

The processor may also calculate the semantic clarity C52 of the source data based on the amount of change in the occurrence probability of the language output value with respect to the number of topic occurrences.

The semantic clarity C52 of the source data may be determined based on Equation 2.

In Equation 2, S_c denotes the semantic clarity, max(p) denotes the maximum of the topic occurrence probabilities, and p_i may denote the topic occurrence probability corresponding to each number of topic occurrences.

That is, the processor may determine the semantic clarity C52 based on how much the occurrence probability at each number of topic occurrences differs from the maximum occurrence probability.
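
One way to turn this description into a number is to average the gap between the maximum probability and each p_i. Equation 2 is not reproduced in this text, so the averaging step and the function name are assumptions; only max(p) and p_i come from the description.

```python
def semantic_clarity(probs):
    """Mean gap between the largest probability and each probability.

    An assumed form based on the description of max(p) and p_i.
    """
    peak = max(probs)
    return sum(peak - p for p in probs) / len(probs)

# A flat distribution has no dominant topic; a peaked one does.
flat = semantic_clarity([0.25, 0.25, 0.25, 0.25])
peaked = semantic_clarity([0.45, 0.25, 0.2, 0.1])
```

Under this assumed form, a distribution with one dominant topic scores higher than a flat one, which matches the stated positive relationship between clarity and recognition rate.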

A large semantic clarity C52 means that the meaning of each text is clearly distinguished, and may therefore indicate a high semantic recognition rate or structure recognition rate.

That is, the processor may determine that there is a positive correlation in which the semantic recognition rate and the structure recognition rate increase as the semantic clarity increases.

The processor may further take noise N52 into account when determining the topic recognition rate and the structure recognition rate of the source data.

Specifically, the processor may calculate the average of the number of topic occurrences, and calculate the noise N52 of the source data based on the amount of change in the occurrence probability of the language output value for numbers of topic occurrences at or below that average.

The processor may determine the noise N52 of the source data based on Equation 3.

In Equation 3, S_N denotes the noise N52 of the source data, and the remaining term may denote the average of the occurrence probabilities of the topics.

Large noise may mean that it is difficult to recognize meaning in the source data, so the processor may determine that the semantic recognition rate and the structure recognition rate decrease as the noise N52 of the source data increases.

Through the operations described above, the processor calculates the semantic amount R52, the clarity C52, and the noise N52 of the source data, and based on these may calculate the semantic recognition rate of the source data or the structure recognition rate of the syntax constituting the source data.

Calculating the semantic amount R52, the clarity C52, and the noise N52 from the language output value produced by the LDA model is merely one embodiment of the present invention, and there is no limitation on the operation of determining the semantic recognition rate and the structure recognition rate on that basis.

FIGS. 6 to 10 are flowcharts according to an embodiment of the present invention.

Referring to FIG. 6, the natural language processing apparatus may acquire source data (S601).

The natural language processing apparatus may then extract advanced semantic features through LDA (S602).

The natural language processing apparatus may also extract various natural language features in addition to the advanced semantic features (S603).

Based on the data derived in this way, the natural language processing apparatus may train the non-neural network model (S604).

Meanwhile, the natural language processing apparatus that has acquired the source data may train the neural network model (S605).

The natural language processing apparatus may then form a hybrid model using the trained non-neural network model and the neural network model (S606).

Referring next to FIG. 7, the natural language processing apparatus may acquire source data (S701).

The natural language processing apparatus may extract advanced semantic features from the source data (S702).

The natural language processing apparatus may extract discourse-based features (S703). It may also extract syntax features (S704).

It may further extract variation features and surface features from the source data (S705, S706). The natural language processing apparatus may then train the non-neural network model based on the features extracted in this way (S707).

FIG. 8 is a flowchart for explaining the characteristics of the neural network model and the non-neural network model described above.

Referring to FIG. 8, the natural language processing apparatus may acquire source data (S801). The source data may be used for learning through a neural network model and for learning through a non-neural network model. When learning is performed through the neural network model (S802), a large amount of source data is required, recognition is less consistent across corpora, and the recognition is often difficult to explain.

Accordingly, the natural language processing apparatus may perform non-neural network model learning and neural network model learning in parallel (S803).

Through this operation, the neural network model and the non-neural network model can be balanced to form a hybrid model (S804).

FIG. 9 is a flowchart illustrating an operation of determining advanced semantic features in the non-neural network model.

Referring to FIG. 9, the natural language processing apparatus may acquire source data (S901).

The natural language processing apparatus may then obtain a language output value including topics, identification information, and occurrence probabilities (S902).

The natural language processing apparatus may count the number of topic occurrences (S903).

It may also extract the occurrence probability corresponding to the number of occurrences of each topic (S904). It may then determine the semantic amount (Richness), clarity (Clarity), and noise (Noise) corresponding to each topic (S905).

Finally, the natural language processing apparatus may determine the semantic recognition rate and the structure recognition rate of the source data (S906).

Referring to FIG. 10, the natural language processing apparatus may use the source data (S1001) to calculate the semantic amount (S1002), the clarity (S1003), and the noise (S1004) in order to determine advanced semantic features. The natural language processing apparatus may determine the topic recognition rate and the structure recognition rate of the source data (S1005). The natural language processing apparatus may perform non-neural network learning (S1006).

The steps of a method or algorithm described in connection with the embodiments of the present invention may be implemented directly in hardware, in a software module executed by hardware, or in a combination of the two. A software module may reside in random access memory (RAM), read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable recording medium well known in the art to which the present invention pertains.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the invention may be practiced in other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

10: natural language processing apparatus
110: memory
120: processor
130: communication unit

Claims

1. A natural language processing method performed by an apparatus, the method comprising:
extracting advanced semantic features, including a semantic amount, a semantic clarity, and noise corresponding to first source data, by performing natural language recognition of the first source data through an LDA model provided in the apparatus;
extracting natural language features including the advanced semantic features by using second source data;
performing learning of a non-neural network model for natural language recognition, the non-neural network model including at least one of a support vector machine (SVM), a naive Bayes classifier (NBC), and a random forest, using the advanced semantic features and the natural language features;
performing learning of a neural network model for the natural language recognition using the second source data;
performing additional learning of the non-neural network model using result data learned by the neural network model; and
forming a hybrid model for the natural language recognition using both the non-neural network model and the neural network model.

2. (Deleted)

3. The method of claim 1,
wherein the extracted natural language features include a discourse-based feature (Disco) including a higher-order structure of the sentences of the natural language included in the second source data.

4. The method of claim 1,
wherein the extracted natural language features include a syntax feature (Synta) corresponding to the characteristics of the phrases and clauses constituting the natural language included in the second source data and to the structure of the phrases and clauses.

5. The method of claim 1,
wherein the extracted natural language features include a variation feature (lxSem) corresponding to the varied use of nouns, verbs, adjectives, and adverbs of the natural language included in the second source data.

6. The method of claim 5, further comprising:
extracting the natural language features including a variation feature (lxSem) corresponding to the varied use of nouns, verbs, adjectives, and adverbs of natural language included in third source data; and
performing learning of the non-neural network model based on the natural language features.

7. The method of claim 1,
wherein the extracted natural language features include a surface feature (ShaTr) that corresponds to the syllable characteristics of the natural language included in the second source data and includes the average number of tokens, syllables, and characters of the text of the second source data.

8. The method of claim 1,
wherein forming the hybrid model comprises
forming the hybrid model using the non-neural network model, which performs learning in parallel with the neural network model that uses an RNN.

9. (Deleted)

10. An apparatus for performing natural language processing, comprising:
a memory configured to store first source data, second source data, and an LDA model; and
at least one processor configured to communicate with the memory,
wherein the at least one processor is configured to:
perform natural language recognition of the first source data through the LDA model to extract advanced semantic features including a semantic amount, a semantic clarity, and noise corresponding to the first source data;
extract natural language features including the advanced semantic features using the second source data;
perform learning of a non-neural network model for natural language recognition, the non-neural network model including at least one of a support vector machine (SVM), a naive Bayes classifier (NBC), and a random forest, using the advanced semantic features and the natural language features;
perform learning of a neural network model for the natural language recognition using the second source data, and perform additional learning of the non-neural network model using result data learned by the neural network model; and
form a hybrid model for the natural language recognition using both the non-neural network model and the neural network model.