KR20200057206A

KR20200057206A - Method and system for automatic visualization of the untold information

Info

Publication number: KR20200057206A
Application number: KR1020180141265A
Authority: KR
Inventors: 박종철; 양원석; 김정호
Original assignee: 한국과학기술원
Priority date: 2018-11-16
Filing date: 2018-11-16
Publication date: 2020-05-26

Abstract

In natural language processing, methods and systems have been studied for automatically extracting information that is not stated directly in a document but can be inferred by combining pieces of information mentioned in the surrounding text. However, no specific method has been proposed for building a system that automatically makes the judgment, which humans make naturally, that "specific information appears to be omitted from this document," and that outputs all the information predicted to be omitted to the user in natural language form. The present invention relates to a method and a system for automatically identifying information that was most likely to be mentioned together with a given document but is not mentioned in it, and providing that information to a user. Specifically, from among the information not mentioned in a document input by the user, the method and system add, in sentence form, the information that is highly likely to be mentioned together with the information that is mentioned in the input document, and output to the user a document in which the added sentences are merged with the sentences of the original input document. The method and system thereby help the user determine, in an automated manner, what information may have been intentionally or unintentionally omitted while a given document was written, easing the cognitive burden of close reading and of researching related information.

Description

Method and system for automatic visualization of the untold information

The present invention relates to a natural language processing technology that automatically identifies information that was most likely to be mentioned together with a given document but is not mentioned in it, and provides that information to a user.

In natural language processing, research has moved beyond extracting information that is directly mentioned in a document to methods and systems for automatically extracting information that is not mentioned in the document but can be inferred through the human ability to read between the lines.

Liu and Singh (2004) built ConceptNet, a map of words and concepts that people naturally connect and associate through commonsense reasoning, and demonstrated that this map can be used across natural language processing to model how related words and concepts are linked. Mikolov et al. (2013) developed a technique that uses unsupervised learning to encode, as vectors, the patterns in which words appear together in the same sentence, thereby identifying which words are likely to co-occur with a given word, and showed experimentally that this supports modeling of word meaning in general. Le and Mikolov (2014), in a manner similar to Mikolov et al. (2013), developed a technique that uses unsupervised learning to encode, as vectors, the patterns in which phrases appear together in a document, thereby identifying which phrases are likely to co-occur with a given phrase, and showed experimentally that this supports modeling of phrase meaning in general. Tang et al. (2014) further showed experimentally that co-occurrence analysis and semantic modeling based on such vector encodings can improve the performance of sentiment word analysis.

U.S. Patent US7778819B2 (granted 2010-08-17), "Method and apparatus for predicting word prominence in speech synthesis," proposed a method and apparatus that enable high-quality speech synthesis by analyzing the words, and the word combinations, that appear most frequently or carry the most essential information within the same document or conversation. U.S. Patent US7149695B1 (granted 2006-12-12), "Method and apparatus for speech recognition using semantic inference and word agglomeration," proposed a method and apparatus that perform semantic inference and word agglomeration by analyzing the associations between word meanings and the patterns of words that frequently appear together, thereby improving speech recognition performance.

Boltužić and Šnajder (2015) used text similarity measures to analyze which arguments appear most frequently in online discussion forums and which related arguments always accompany a particular argument. Bernardy et al. (2018) proposed, and tested experimentally, that a reader's acceptability judgment for a sentence and the effort needed to read and understand it can change depending on which other sentences it appears with.

Lee et al. (2014) proposed and developed a system that automatically identifies gene-cancer relations that are not directly stated in a document but can be inferred by logically connecting descriptions of a change in a particular gene with statements about changes in, and causal relations to, particular cancer symptoms or observations. You et al. (2017) proposed and developed a system that automatically infers associations between genes and environmental factors that are expressed in a document without being directly stated, but can be identified indirectly through descriptions of statistical information and experimental results. Chung et al. (2017) proposed and developed a system that automatically identifies event-location information that is not directly stated in a document but can be understood by reading between the lines, by combining pieces of information partially revealed across several sentences of the document.

As described above, there has been active research on automatically extracting specific information that is not mentioned in a document but can be inferred, and on semantic modeling through analysis of the co-occurrence patterns of words or phrases that appear together in a sentence. However, no specific method has yet been proposed for building a system that automatically makes the judgment, which humans make naturally, that "specific information appears to be missing from this document," and that outputs the information predicted to be missing to the user in natural language form.

To achieve the above object, an untold-information visualization system according to an embodiment of the present invention includes:

a co-mention reference corpus 110 that stores the set of documents serving as the basis for analyzing the correlations among sentences that appear together in the same document;

a co-mention information processing unit 120 that extracts elementary discourse units and event information from each document of the co-mention reference corpus 110; converts each sentence contained in each document into a sentence semantic vector; converts the elementary discourse units and event information contained in each document into discourse semantic vectors and event semantic vectors, respectively; calculates the frequency with which two different texts corresponding to two different semantic vectors occur together in the same document; and stores that frequency, across the whole co-mention reference corpus 110, using a plurality of co-occurrence relations over a specific hash map; and

a user input document processing unit 130 that receives from a user the input document to be analyzed; extracts elementary discourse units and event information from the input document; retrieves from the co-mention information processing unit 120 the co-occurrence information for the texts present in the input document and, from it, generates the set of texts that are not present in the input document and have the maximum co-mention probability with respect to the texts that are present; converts every elementary discourse unit and event in that set into sentence form, so that every item in the set is a sentence; calculates the relative positions at which the additional sentences should be placed so that merging them into the existing sentences of the input document yields a document with maximal coherence; outputs to the user a merged document in which the sentences that were in the input document and the newly added sentences are distinguished by means such as boldface or color coding; outputs the original source of each newly added sentence; and lets the user reach, by clicking on each source, the document from which the newly added information was taken. The user input document processing unit 130 provides, from among the information not mentioned in the input document, the information that has a high probability of being mentioned together with the information that is mentioned in it. The reader thus receives information that the author may have hidden intentionally or omitted unintentionally because of bias, which helps keep the reader's view from being swayed by the author's biased view.

Preferably, the plurality of co-occurrence relations over the specific hash map includes one or more of the following first to sixth items.

A first item: the relation between the two vectors of a vector pair made up of two different sentence semantic vectors. Given a specific hash map whose input contains a sentence, an elementary discourse unit, or an event and whose output contains a semantic vector, such a pair exists when, for two different sentences in one document, one vector is contained in the output for the input containing one sentence and the other vector is contained in the output for the input containing the other sentence.

A second item: the relation between the two vectors of a vector pair made up of two different discourse semantic vectors. Given a specific hash map whose input contains a sentence, an elementary discourse unit, or an event and whose output contains a semantic vector, such a pair exists when, for two different elementary discourse units in one document, one vector is contained in the output for the input containing one elementary discourse unit and the other vector is contained in the output for the input containing the other elementary discourse unit.

A third item: the relation between the two vectors of a vector pair made up of two different event semantic vectors. Given a specific hash map whose input contains a sentence, an elementary discourse unit, or an event and whose output contains a semantic vector, such a pair exists when, for two different events in one document, one vector is contained in the output for the input containing one event and the other vector is contained in the output for the input containing the other event.

A fourth item: the relation between the two vectors of a vector pair made up of one sentence semantic vector and one discourse semantic vector. Given a specific hash map whose input contains a sentence, an elementary discourse unit, or an event and whose output contains a semantic vector, such a pair exists when, for one sentence and one elementary discourse unit in one document, one vector is contained in the output for the input containing that sentence and the other vector is contained in the output for the input containing that elementary discourse unit.

A fifth item: the relation between the two vectors of a vector pair made up of one sentence semantic vector and one event semantic vector. Given a specific hash map whose input contains a sentence, an elementary discourse unit, or an event and whose output contains a semantic vector, such a pair exists when, for one sentence and one event in one document, one vector is contained in the output for the input containing that sentence and the other vector is contained in the output for the input containing that event.

A sixth item: the relation between the two vectors of a vector pair made up of one discourse semantic vector and one event semantic vector. Given a specific hash map whose input contains a sentence, an elementary discourse unit, or an event and whose output contains a semantic vector, such a pair exists when, for one elementary discourse unit and one event in one document, one vector is contained in the output for the input containing that elementary discourse unit and the other vector is contained in the output for the input containing that event.
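Taken together, the six items cover every unordered pairing of the three kinds of semantic vector (sentence, discourse, event), including same-kind pairs. The following Python sketch only illustrates that enumeration. The names `UNIT_KINDS`, `pair_kinds`, and `relation_key` are hypothetical and do not come from the specification, and the order produced by `pair_kinds` differs from the numbering of the items.

```python
from itertools import combinations_with_replacement

# The three kinds of unit text whose semantic vectors can be paired
# (labels are illustrative, not taken from the specification).
UNIT_KINDS = ("sentence", "discourse", "event")


def pair_kinds():
    """Return every unordered pair of unit kinds, same-kind pairs included."""
    return list(combinations_with_replacement(UNIT_KINDS, 2))


def relation_key(kind_a, kind_b):
    """Normalise a pair so (event, sentence) and (sentence, event) share a key."""
    rank = {kind: i for i, kind in enumerate(UNIT_KINDS)}
    return tuple(sorted((kind_a, kind_b), key=rank.__getitem__))


if __name__ == "__main__":
    for kinds in pair_kinds():
        print(kinds)
```

A key normalised this way could index a separate co-occurrence table for each of the six relations, which is one possible reading of "a plurality of co-occurrence relations over a specific hash map."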

Through the present invention, a user can automatically be provided, for a given document, with information that the author may have hidden intentionally or omitted unintentionally because of bias. This eases the cognitive burden of the close reading and related-information research needed to reach the judgment, which humans make naturally, that "specific information appears to be missing from this document," and thereby helps, in an automated manner, to keep the reader's view from being swayed by the author's biased view.

FIG. 1 is a configuration diagram of an untold-information visualization system according to an embodiment of the present invention.
FIG. 2 is a detailed configuration diagram of an embodiment of the co-mention information processing unit shown in FIG. 1.
FIG. 3 is a detailed configuration diagram of an embodiment of the user input document processing unit shown in FIG. 1.
FIG. 4 shows an example of the input and output of an untold-information visualization system according to an embodiment of the present invention.
FIG. 5 shows an example of the document corpus included in the untold-information visualization system.
FIG. 6 shows an example of the co-occurrence information database included in the untold-information visualization system.
FIG. 7 shows an example of the sentence conversion database included in the untold-information visualization system.
FIG. 8 is a flowchart detailing the part of the untold-information visualization method performed by the co-mention information processing unit, according to an embodiment of the present invention.
FIG. 9 is a flowchart detailing the part of the untold-information visualization method performed by the user input document processing unit, according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure is complete and fully conveys the scope of the invention to those of ordinary skill in the art to which it pertains; the invention is defined only by the scope of the claims.

The terminology used herein is for describing the embodiments and is not intended to limit the present invention. In this specification, singular forms include plural forms unless the context clearly indicates otherwise. As used herein, "comprises" and/or "comprising" do not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those stated.

Unless otherwise defined, all terms used herein (including technical and scientific terms) have the meanings commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are clearly and specifically defined.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. The same reference numerals are used for the same components in the drawings, and duplicate descriptions of the same components are omitted.

FIG. 1 is a configuration diagram of an untold-information visualization system according to an embodiment of the present invention.

As shown in FIG. 1, the untold-information visualization system 100 according to an embodiment of the present invention includes a co-mention reference corpus 110, a co-mention information processing unit 120, and a user input document processing unit 130.

The co-mention reference corpus 110 stores the set of documents that serves as the basis for analyzing the correlations among sentences that appear together in the same document. FIG. 5 shows an example of the co-mention reference corpus included in the untold-information visualization system. As shown in FIG. 5, the co-mention reference corpus stores a plurality of documents and, for each document, stores (1) a text string containing the whole document, (2) document information consisting of the publication date, original source, and author, and (3) a list of the sentences in the document in the order in which they appear.
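The three stored fields can be pictured as a simple record type. The following Python sketch is illustrative only: the class names, field names, and the period-based sentence splitter are assumptions and are not part of the specification, which does not state how sentences are segmented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DocumentInfo:
    """Item (2): publication date, original source, and author."""
    publication_date: str
    source: str
    author: str


@dataclass
class CorpusDocument:
    """One entry of the reference corpus (field names are hypothetical)."""
    text: str                  # item (1): the whole document as one string
    info: DocumentInfo         # item (2): document information
    sentences: List[str] = field(default_factory=list)  # item (3): in order


def make_document(text: str, info: DocumentInfo) -> CorpusDocument:
    """Build an entry using a naive splitter on '.' (a stand-in assumption)."""
    parts = [part.strip() for part in text.split(".")]
    sentences = [part + "." for part in parts if part]
    return CorpusDocument(text=text, info=info, sentences=sentences)


if __name__ == "__main__":
    info = DocumentInfo("2018-11-16", "https://example.org/article", "Anonymous")
    doc = make_document("Prices rose. Demand fell.", info)
    print(doc.sentences)
```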

The co-mention information processing unit 120 calculates, for each document in the co-mention reference corpus 110, how often the sentences, elementary discourse units, or events appearing in the document (hereinafter, unit texts) appear together. From these frequencies it calculates, for a given unit text appearing in a document, the probability that each other unit text appears together with it, so that the most probable co-occurring unit texts can be identified, and stores the results in the co-occurrence information database 125.
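One simple way to read this step is as document-level co-occurrence counting followed by a conditional relative frequency. The sketch below assumes that each unit text has already been reduced to a hashable identifier (a plain string here), whereas the specification operates on semantic vectors. The function names and the estimate P(b | a) = count(a, b) / count(a) are illustrative assumptions, not the formula given in the patent.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(documents):
    """Count, per unit and per unordered pair, the documents containing them.

    `documents` is an iterable of lists of unit-text identifiers.
    """
    unit_counts = Counter()
    pair_counts = Counter()
    for units in documents:
        unique = sorted(set(units))          # count each unit once per document
        unit_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    return unit_counts, pair_counts


def cooccurrence_probability(a, b, unit_counts, pair_counts):
    """Estimate P(b appears in a document | a appears in it)."""
    if unit_counts[a] == 0:
        return 0.0
    key = (a, b) if a < b else (b, a)        # pairs are stored in sorted order
    return pair_counts[key] / unit_counts[a]


if __name__ == "__main__":
    docs = [["a", "b"], ["a", "c"], ["a", "b", "c"]]
    units, pairs = count_cooccurrences(docs)
    print(cooccurrence_probability("a", "b", units, pairs))
```

Keeping the pair keys in sorted order means each unordered pair is stored once, which matches the idea of storing a relation between two vectors rather than a directed link.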

The user input document processing unit 130 receives a document from a user (hereinafter, the input document) and identifies, among the unit texts most likely to be mentioned together with the sentences in the input document, those that are not actually mentioned in it. For each such unmentioned unit text, it determines between which pair of consecutive sentences of the input document the text should be added to maximize the coherence of the document, and adds the unmentioned unit texts to the input document in the sentence order that maximizes coherence. It outputs the result to the user with the added sentences distinguished from the original sentences by means such as boldface or color coding, and provides the original source of each added sentence so that the user can reach the source document by clicking.
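The position search can be sketched as trying every gap between consecutive sentences and keeping the one with the highest coherence score. The coherence measure below (summed word-overlap between neighbouring sentences) is a toy stand-in chosen only to make the example runnable; the passage above does not say how coherence is computed, and all function names are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two sentences (toy measure)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def coherence(sentences) -> float:
    """Toy document coherence: sum of similarities of adjacent sentences."""
    return sum(jaccard(x, y) for x, y in zip(sentences, sentences[1:]))


def best_insertion(sentences, new_sentence):
    """Return the index at which inserting `new_sentence` maximises coherence.

    Only gaps between consecutive sentences are tried, so the result lies in
    1..len(sentences)-1, or is None when there is no such gap.
    """
    best_index, best_score = None, float("-inf")
    for i in range(1, len(sentences)):
        candidate = sentences[:i] + [new_sentence] + sentences[i:]
        score = coherence(candidate)
        if score > best_score:
            best_index, best_score = i, score
    return best_index


if __name__ == "__main__":
    doc = ["the cat sat", "the dog ran", "birds fly south"]
    print(best_insertion(doc, "birds fly north"))
```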

FIG. 4 shows an example of the input and output of an untold-information visualization system according to an embodiment of the present invention. As shown in FIG. 4, the untold-information visualization system 100 according to an embodiment of the present invention adds, in sentence form, the information that is not mentioned in the input document received from the user but has a high probability of being mentioned together with the information that is mentioned in it, and outputs to the user a document that merges the added sentences with the sentences of the original input document. Each additional sentence is inserted between a particular pair of consecutive sentences, and that pair is chosen so as to maximize the coherence of the whole document after the insertion. Through an external link to the source, the user can access the original text containing the sentence that corresponds to the added information.
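As a rough illustration of such merged output, the sketch below marks added sentences in bold and attaches a link to their source. The `OutputSentence` type, the `render` function, and the Markdown-style markers are assumptions made for this example; the specification only requires that added sentences be visually distinguished and linked to their origin.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OutputSentence:
    text: str
    source_url: Optional[str] = None  # set only for sentences added by the system


def render(sentences: List[OutputSentence]) -> str:
    """Render the merged document, one sentence per line.

    Original sentences are printed as-is; added sentences are wrapped in
    bold markers and followed by a link to the document they came from.
    """
    lines = []
    for sentence in sentences:
        if sentence.source_url is None:
            lines.append(sentence.text)
        else:
            lines.append(f"**{sentence.text}** [source]({sentence.source_url})")
    return "\n".join(lines)


if __name__ == "__main__":
    merged = [
        OutputSentence("The company reported record profits."),
        OutputSentence("It also cut a tenth of its staff.", "https://example.org/report"),
    ]
    print(render(merged))
```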

The co-mention information processing unit 120 is described in detail as follows.

As shown in FIG. 2, the co-mention information processing unit 120 consists of a mentioned-phrase extraction unit 121, a mentioned-event extraction unit 122, a semantic vector conversion unit 123, a co-mention frequency calculation unit 124, and a co-occurrence information database 125.

The mentioned-phrase extraction unit 121 performs discourse parsing on each document of the co-mention reference corpus 110 and extracts elementary discourse units. At the same time, it attempts to store every extracted elementary discourse unit, together with the sentence from which it was extracted, in the sentence conversion database 137. If the elementary discourse unit already exists in the sentence conversion database 137, the database entry is updated with the new sentence only when the new sentence is shorter than the sentence already stored for that unit; otherwise the entry is left unchanged. If the elementary discourse unit does not exist in the sentence conversion database 137, an index corresponding to the unit is added to the database and the source sentence is stored as the value for that index.
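The update rule amounts to keeping, for each elementary discourse unit, the shortest sentence seen so far that contains it. A minimal sketch, assuming the sentence conversion database can be modelled as an in-memory dictionary (the function name and the boolean return value are illustrative):

```python
from typing import Dict


def store_source_sentence(db: Dict[str, str], unit: str, sentence: str) -> bool:
    """Store `sentence` as the source of `unit`, keeping the shortest one.

    Returns True when the entry was created or updated, False when the
    existing entry was kept because it is not longer than the new sentence.
    """
    current = db.get(unit)
    if current is None or len(sentence) < len(current):
        db[unit] = sentence
        return True
    return False


if __name__ == "__main__":
    db: Dict[str, str] = {}
    store_source_sentence(db, "prices rose", "Prices rose sharply in May, analysts said.")
    store_source_sentence(db, "prices rose", "Prices rose.")
    print(db["prices rose"])
```

Preferring the shorter sentence plausibly keeps the stored sentence close to the unit itself, so that converting a unit back to sentence form brings in as little unrelated material as possible.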

The mentioned event extraction unit 122 performs semantic role labeling on each document of the co-mention reference corpus 110 and extracts the event information defined on that basis. Preferably, an event is defined as one predicate appearing in the text, the semantic roles analyzed for that predicate, and the set of words filling each semantic role. At the same time, the mentioned event extraction unit 122 stores every extracted event, together with the sentence from which it was extracted, in the sentence conversion database 137, in the same manner as the elementary discourse units are stored.

The semantic vector conversion unit 123 converts each sentence in each document of the co-mention reference corpus 110 into a sentence semantic vector. It also receives the extracted elementary discourse units from the mentioned phrase extraction unit 121 and the extracted event information from the mentioned event extraction unit 122, and converts them into discourse semantic vectors and event semantic vectors, respectively. Preferably, sentence and discourse semantic vectors are obtained with a distributed representation technique such as the one proposed by Le and Mikolov (2014). For event semantic vectors, a distributed representation technique such as the one proposed by Mikolov et al. (2014) is preferably used: the event semantic vector is defined as the concatenation, in order, of the word semantic vector of the predicate and the word semantic vectors of the words filling each semantic role, and when no word fills a semantic role, a vector whose components are all 0 is used as the word semantic vector for that role.
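
The construction of an event semantic vector by concatenation can be sketched as follows. This is an illustration only: the role inventory `ROLES`, the three-dimensional toy vectors in `WORD_VECTORS`, and the function name `event_vector` are assumptions made for the example, and a real system would look the word vectors up in a trained distributed representation model.

```python
from typing import Dict, List

DIM = 3  # toy dimensionality (assumption)
ROLES = ["ARG0", "ARG1", "ARGM-LOC"]  # hypothetical fixed order of semantic roles

# Hypothetical pre-trained word semantic vectors.
WORD_VECTORS: Dict[str, List[float]] = {
    "acquired": [0.4, 0.5, 0.6],
    "company": [0.1, 0.2, 0.3],
    "startup": [0.7, 0.8, 0.9],
}


def event_vector(predicate: str, role_words: Dict[str, str]) -> List[float]:
    """Concatenate the predicate vector with one word vector per role, in order."""
    vec = list(WORD_VECTORS[predicate])
    for role in ROLES:
        word = role_words.get(role)
        if word is None:
            # No word fills this role: use the all-zero vector.
            vec.extend([0.0] * DIM)
        else:
            vec.extend(WORD_VECTORS[word])
    return vec


v = event_vector("acquired", {"ARG0": "company", "ARG1": "startup"})
```

Because every role always contributes exactly `DIM` components, all event vectors have the same length, `DIM * (1 + len(ROLES))`, and can share one vector space.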

In addition, the semantic vector conversion unit 123 creates a sentence semantic vector space, a phrase semantic vector space, and an event semantic vector space for the sentence, phrase, and event semantic vectors obtained as described above. All sentence semantic vectors extracted and converted from all documents in the co-mention reference corpus 110 are loaded into the sentence semantic vector space, and that space is then evenly partitioned into M subspaces. Here, M is a natural number set as a system initial value. Likewise, all phrase semantic vectors are loaded into the phrase semantic vector space, which is evenly partitioned into M subspaces, and all event semantic vectors are loaded into the event semantic vector space, which is evenly partitioned into M subspaces. Preferably, the partitioning of each vector space is performed with k-means clustering.

In addition, for each of the three semantic spaces described above, the semantic vector conversion unit 123 generates a hash map (hereinafter, the representative vector and unit text conversion hash map). The hash map takes as input the unit text from which a semantic vector lying inside one of the M evenly partitioned subspaces (hereinafter, partitioned spaces) was extracted, and outputs the semantic vector closest to the center of that partitioned space among the semantic vectors loaded into the semantic space (hereinafter, the representative vector), together with the unit text from which the representative vector was extracted (hereinafter, the representative unit text). Preferably, the semantic vector closest to the center of a partitioned space is defined as the semantic vector closest to the "nearest mean" that the k-means clustering computation derives for that partitioned space. Preferably, the search for the semantic vector closest to the "nearest mean" uses greedy search in proximity neighborhood graphs or locality-sensitive hashing so that the computation completes quickly.
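
The hash map described above can be sketched as follows. The sketch assumes that k-means clustering has already been run and has produced one centroid (the "nearest mean") per partitioned space, and it replaces the approximate search methods named in the text with an exhaustive search, which is only practical for small examples. The function names and the two-dimensional sample data are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]


def nearest_index(point: Sequence[float], candidates: Sequence[Sequence[float]]) -> int:
    """Index of the candidate with the smallest Euclidean distance to point."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))


def build_representative_map(
    vectors: List[Vector], texts: List[str], centroids: List[Vector]
) -> Dict[str, Tuple[Vector, str]]:
    """Map each unit text to the representative vector and text of its partition."""
    # For each centroid, the stored vector closest to it is the representative.
    representatives = [nearest_index(c, vectors) for c in centroids]
    mapping: Dict[str, Tuple[Vector, str]] = {}
    for vec, text in zip(vectors, texts):
        partition = nearest_index(vec, centroids)  # partition the vector falls in
        rep = representatives[partition]
        mapping[text] = (vectors[rep], texts[rep])
    return mapping


# Hypothetical data: four unit texts in two well-separated partitions.
vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 12.0)]
texts = ["a", "b", "c", "d"]
centroids = [(0.0, 0.4), (10.0, 10.5)]
rep_map = build_representative_map(vectors, texts, centroids)
```

In this example every unit text in the first partition maps to "a" and every unit text in the second partition maps to "c", so later frequency counting operates on at most M distinct representatives per space.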

The co-mention frequency calculation unit 124 receives the representative vector and unit text conversion hash map from the semantic vector conversion unit 123. It also receives from the semantic vector conversion unit 123 the sentence semantic vector of every sentence, the discourse semantic vector of every elementary discourse unit, and the event semantic vector of every event in each document of the co-mention reference corpus 110. For every pair that can be formed from two different representative vectors, it defines an absolute co-occurrence frequency and identifies the representative vector pairs whose relation is included in the plural co-occurrence relation for the representative vector and unit text conversion hash map. Preferably, the co-mention frequency calculation unit 124 sets the absolute co-occurrence frequency of every pair to 0 and then traverses all documents in the co-mention reference corpus 110, incrementing the absolute co-occurrence frequency of a pair by 1 each time it finds a representative vector pair that satisfies the plural co-occurrence relation. Preferably, the co-mention frequency calculation unit 124 defines an M×M matrix for each item of the plural co-occurrence relation. The row and column indices are the indices of the representative vectors (and of the corresponding representative unit texts), and the absolute co-occurrence frequency of each pair is the component at the intersection of the row and column indexed by the two representative vectors of the pair. After the traversal, each matrix is normalized by column: in each column, every component is divided by the sum of the components in that column, so that the column sums to 1. If the sum of the components in a column is 0, all components in that column are set to 0. Here, M is the same M defined above as a system initial value, namely the number of partitioned spaces in each semantic vector space.
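
The counting and column normalization can be sketched as follows. The sketch assumes that each document has already been reduced to the indices of the representatives it contains, that a pair co-occurs when both representatives appear in the same document, and that each unordered pair is counted once per document and stored symmetrically. These choices, the function names, and the sample data are assumptions for illustration.

```python
from itertools import combinations
from typing import List


def count_cooccurrences(docs: List[List[int]], m: int) -> List[List[int]]:
    """Absolute co-occurrence counts between m representatives."""
    counts = [[0] * m for _ in range(m)]  # every pair starts at 0
    for reps in docs:
        for i, j in combinations(sorted(set(reps)), 2):
            counts[i][j] += 1
            counts[j][i] += 1
    return counts


def normalize_columns(counts: List[List[int]]) -> List[List[float]]:
    """Divide each column by its sum; an all-zero column stays all zero."""
    m = len(counts)
    col_sums = [sum(counts[r][c] for r in range(m)) for c in range(m)]
    return [
        [counts[r][c] / col_sums[c] if col_sums[c] else 0.0 for c in range(m)]
        for r in range(m)
    ]


# Hypothetical corpus of three documents over M = 4 representatives.
docs = [[0, 1], [0, 1, 2], [0, 2]]
counts = count_cooccurrences(docs, 4)
relative = normalize_columns(counts)
```

After normalization, column c can be read as a distribution over the representatives that appear together with representative c; representative 3 never occurs in the sample corpus, so its column remains zero.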

In addition, the co-mention frequency calculation unit 124 stores each matrix computed as described above in the co-occurrence information database 125. For each matrix, the item of the plural co-occurrence relation to which the matrix corresponds serves as the major category, the representative unit text at each column index serves as the middle category, the representative unit text at each row index serves as the minor category, and the component at the intersection of that column and row is stored as the relative occurrence frequency (the database categories are described in detail in the next paragraph).

The co-occurrence information database 125 stores every pair of unit texts found in the co-mention reference corpus 110 that satisfies an item of the plural co-occurrence relation for the representative vector and unit text conversion hash map, together with the corresponding relative occurrence frequency. FIG. 6 illustrates an example of the co-occurrence information database included in the unmentioned information visualization system. As shown in FIG. 6, the co-occurrence information database stores sentence-based, phrase-based, and event-based co-occurrence information as major categories. Each major category is mapped according to which kinds of unit text are involved in the relation defined by the corresponding item of the plural co-occurrence relation. Each component of each matrix computed as described above is stored twice in the co-occurrence information database, so that specific values can be retrieved from the database quickly. The co-occurrence information database also uses a classification of the form "From X To Y", where X is the middle category described above and Y can be interpreted as the minor category described above; each of X and Y is one of the unit text types, namely sentence, discourse, or event.

The user input document processing unit 130 is described in detail below.

As illustrated in FIG. 3, the user input document processing unit 130 comprises an input unit 131, a preprocessing unit 132, a co-occurrence information processing unit 133, a sentence conversion unit 134, a sentence insertion position calculation unit 135, an output unit 136, and a sentence conversion database 137.

The input unit 131 receives from the user the input document to be analyzed.

The preprocessing unit 132 receives the input document from the input unit 131, performs discourse parsing and semantic role labeling on it, and thereby extracts elementary discourse units and event information. Preferably, as in the event processing of the mentioned event extraction unit 122, an event is defined as one predicate appearing in the text, the semantic roles analyzed for that predicate, and the set of words filling each semantic role.

The co-occurrence information processing unit 133 receives from the preprocessing unit 132 the input document and the elementary discourse units and event information extractable from it, and also receives the representative vector and unit text conversion hash map. It converts each unit text in the input document (each sentence, elementary discourse unit, and event) into a representative semantic vector and a representative unit text through the hash map. If the hash map contains no input value corresponding to a unit text of the input document, the unit searches the semantic vector space defined by the semantic vector conversion unit 123 for the representative semantic vector and representative unit text closest to that unit text, and uses them for the conversion. Preferably, this search for the closest representative is made fast in the same way as described above for the semantic vector conversion unit 123.

The co-occurrence information processing unit 133 retrieves from the co-occurrence information database 125 the co-occurrence information for each representative unit text obtained from the unit texts in the input document. It traverses the retrieved subset of the co-occurrence information database 125 for each unit text in the input document and accumulates the relative occurrence frequencies per candidate text, that is, per possible co-occurring sentence, co-occurring discourse unit, and co-occurring event. The co-occurrence information processing unit 133 then sorts the unit texts in descending order of accumulated relative occurrence frequency and selects the top L items as the set of unit texts with the maximum co-mention probability. Here, L is a natural number set as a system initial value.
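
The accumulation and top-L selection can be sketched as follows. The sketch assumes that the retrieved subset of the co-occurrence information database 125 is available as a nested dictionary of the form `{X: {Y: relative frequency}}`, that candidates already present in the input document are excluded, and that ties are broken alphabetically; the function name and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def top_co_mentioned(
    input_units: List[str], cooc: Dict[str, Dict[str, float]], limit: int
) -> List[str]:
    """Rank candidate unit texts by accumulated relative occurrence frequency."""
    present = set(input_units)
    scores: Dict[str, float] = defaultdict(float)
    for x in input_units:
        for y, freq in cooc.get(x, {}).items():
            if y not in present:  # keep only texts missing from the input
                scores[y] += freq
    # Descending by score; alphabetical order breaks ties deterministically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [text for text, _ in ranked[:limit]]


# Hypothetical database subset for an input document containing A and B.
cooc = {
    "A": {"B": 0.5, "C": 0.3, "D": 0.2},
    "B": {"A": 0.6, "C": 0.4},
}
selected = top_co_mentioned(["A", "B"], cooc, limit=2)
```

Here C accumulates 0.3 + 0.4 from both input units and therefore ranks above D, which is supported by only one of them.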

The sentence conversion unit 134 receives the set of unit texts with the maximum co-mention probability from the co-occurrence information processing unit 133 and retrieves from the sentence conversion database 137 the sentences corresponding to the elementary discourse units and events in that set. It converts all elementary discourse units and events into sentence form so that every item in the set is a sentence, and the result is defined as the set of sentences with the maximum co-mention probability.

The sentence insertion position calculation unit 135 receives the set of sentences with the maximum co-mention probability from the sentence conversion unit 134 and the input document from the input unit 131. It merges the sentences of the input document and the set of sentences with the maximum co-mention probability into a single unordered sentence set (hereinafter, the merged sentence set), and determines a sentence order for the merged sentence set using a sentence ordering model that maximizes document coherence. From this order, it calculates where the additional sentences that were not in the input document (in other words, the sentences in the set with the maximum co-mention probability) must be placed relative to the sentences that were in the input document, in order to produce a document of maximal coherence from the merged sentence set. The result is defined as the sentence merge order for maximizing document coherence. Preferably, the sentence ordering model is built in the manner proposed by Logeswaran et al. (2018).
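
Once an order for the merged sentence set is available, the relative insertion positions can be read off directly. The sketch below assumes the sentence ordering model has already returned the ordered list, and it does not implement that model; expressing each position as "the original sentence that immediately precedes the added one" is an assumption made for illustration, as are the function name and sample sentences.

```python
from typing import Dict, List, Optional


def insertion_positions(ordered: List[str], original: List[str]) -> Dict[str, Optional[str]]:
    """For each added sentence, the original sentence it should follow.

    None means the added sentence comes before every original sentence.
    """
    original_set = set(original)
    positions: Dict[str, Optional[str]] = {}
    last_original: Optional[str] = None
    for sentence in ordered:
        if sentence in original_set:
            last_original = sentence
        else:
            positions[sentence] = last_original
    return positions


# Hypothetical output of the ordering model for two original and two added sentences.
ordered = ["S1", "A1", "S2", "A2"]
positions = insertion_positions(ordered, original=["S1", "S2"])
```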

The output unit 136 receives the merged sentence set and the sentence merge order for maximizing document coherence from the sentence insertion position calculation unit 135 and converts the merged sentence set into a single document according to that order. It presents the document to the user with the sentences that were in the user's original document and the newly added sentences distinguished by means such as boldface or color coding. It also searches the co-mention reference corpus 110 for the document from which each newly added sentence originates, retrieves the original source of that document, and presents the original source to the user for each newly added sentence, so that by clicking on a source the user can access the document containing the original text of the newly added information.
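
The presentation step can be sketched as follows. The specification does not fix an output format, so HTML is assumed here: added sentences are set in bold and followed by a clickable source link. The function name, the `sources` mapping, and the example URL are hypothetical.

```python
import html
from typing import Dict, List


def render_merged_document(ordered: List[str], original: List[str], sources: Dict[str, str]) -> str:
    """Render the merged document, marking sentences absent from the input."""
    original_set = set(original)
    parts = []
    for sentence in ordered:
        text = html.escape(sentence)
        if sentence in original_set:
            parts.append(text)  # sentence from the user's document, unmarked
        else:
            url = html.escape(sources.get(sentence, "#"), quote=True)
            parts.append('<b>{}</b> <a href="{}">[source]</a>'.format(text, url))
    return " ".join(parts)


# Hypothetical merged document with one added sentence.
page = render_merged_document(
    ordered=["The plant opened in 2010.", "It closed in 2015."],
    original=["The plant opened in 2010."],
    sources={"It closed in 2015.": "https://example.com/doc1"},
)
```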

The sentence conversion database 137 stores the discourse-sentence and event-sentence pairs that the sentence conversion unit 134 uses to convert unit texts corresponding to elementary discourse units and events into sentences. FIG. 7 illustrates an example of the sentence conversion database included in the unmentioned information visualization system. As shown in FIG. 7, the sentence conversion database stores a representative sentence for each elementary discourse unit and a representative sentence for each event.

FIGS. 8 and 9 are flowcharts illustrating a method of visualizing unmentioned information according to an embodiment of the present invention. As shown in FIG. 8, the part of the method performed by the co-mention information processing unit 120 comprises a mentioned phrase extraction step (S310), a mentioned event extraction step (S320), a semantic vector conversion step (S330), and a co-mention frequency calculation step (S340). As shown in FIG. 9, the part of the method performed by the user input document processing unit 130 comprises an input step (S410), a preprocessing step (S420), a co-occurrence information processing step (S430), a sentence conversion step (S440), a sentence insertion position calculation step (S450), and an output step (S460).

The mentioned phrase extraction step (S310) extracts elementary discourse units from each document of the co-mention reference corpus 110. The mentioned event extraction step (S320) extracts event information from each document of the co-mention reference corpus 110. The semantic vector conversion step (S330) converts each sentence in each document of the co-mention reference corpus 110 into a sentence semantic vector, and converts the elementary discourse units and event information in the documents into discourse semantic vectors and event semantic vectors, respectively. The co-mention frequency calculation step (S340) determines how often two different texts corresponding to two different semantic vectors occur together in the same document.

The input step (S410) receives from the user the input document to be analyzed. The preprocessing step (S420) extracts elementary discourse units and event information from the input document. The co-occurrence information processing step (S430) retrieves from the co-occurrence information database 125 the co-occurrence information for the texts in the input document and uses it to generate, from the texts not present in the input document, the set of texts with the maximum co-mention probability with respect to the texts in the input document. The sentence conversion step (S440) retrieves from the sentence conversion database 137 the sentences corresponding to the elementary discourse units and events in that set, and converts all elementary discourse units and events into sentence form so that every item in the set is a sentence. The sentence insertion position calculation step (S450) calculates the relative positions that the additional sentences must take when the set of texts with the maximum co-mention probability is merged with the sentences already in the input document, so as to produce a document of maximal coherence. The output step (S460) presents the merged document to the user with the sentences that were in the user's document and the newly added sentences distinguished by means such as boldface or color coding, presents the original source of each newly added sentence, and lets the user access, by clicking on a source, the document containing the original text of the newly added information.

The apparatus described above may be implemented with hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications running on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the processing device is sometimes described as a single device, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored in one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to limited embodiments and drawings, those of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the claims that follow.

Explanation of Reference Numerals

100: unmentioned information visualization system

110: co-mention reference corpus

120: co-mention information processing unit

130: user input document processing unit

121: mentioned discourse unit extraction unit

122: mentioned event extraction unit

123: semantic vector conversion unit

124: co-mention frequency calculation unit

125: co-occurrence information database

131: input unit

132: preprocessing unit

133: co-occurrence information processing unit

134: sentence conversion unit

135: sentence insertion position calculation unit

136: output unit

137: sentence conversion database

Claims

A method for visualizing unmentioned information,
wherein, among the information not directly mentioned in an input document received from a user, the information most likely to be mentioned together with the information mentioned in the input document is added in sentence form, and a document in which the added sentences are merged with the sentences of the original input document is provided to the user, thereby automatically providing the user with information that is highly likely to be mentioned together with the given document but is not directly mentioned in it.

The method of claim 1,
wherein the method for visualizing unmentioned information,
in calculating the information likely to be mentioned together, converts sentences, basic discourse units, and events into semantic vectors in order to measure the mutual similarity between a pair of text sets or a pair of texts, and, by using a hashmap in the semantic vector conversion, utilizes a plurality of co-occurrence relations for a specific hashmap.
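
As an illustration of the hashmap and similarity measurement recited above, the following is a minimal sketch, not the claimed implementation. It assumes the hashmap is a plain dictionary from a text unit (sentence, basic discourse unit, or event) to its semantic vector, and that similarity is measured by cosine similarity; the claim names neither the vector model nor the similarity measure. The vectors, the example texts, and the names `semantic_map`, `cosine_similarity`, and `most_similar` are hypothetical.

```python
import math
from typing import Dict, Tuple

Vector = Tuple[float, ...]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical hashmap: text unit (input) -> semantic vector (output).
# Real vectors would come from an embedding model; these are made up.
semantic_map: Dict[str, Vector] = {
    "The company reported record profits.": (0.9, 0.1, 0.0),
    "profits rose": (0.8, 0.2, 0.0),
    "layoffs were announced": (0.1, 0.9, 0.2),
}


def most_similar(query: str, table: Dict[str, Vector]) -> str:
    """Return the other text in the table whose vector is closest to the query's."""
    query_vec = table[query]
    others = [text for text in table if text != query]
    return max(others, key=lambda text: cosine_similarity(query_vec, table[text]))
```

Calling `most_similar("The company reported record profits.", semantic_map)` selects the text whose vector points in the most similar direction, which is how a text-to-text similarity lookup could work under these assumptions.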

The method of claim 2,
wherein the plurality of co-occurrence relations for the specific hashmap include one or more of the following first to sixth items:
a first item which, for a vector pair consisting of two different sentence semantic vectors, and for a specific hashmap that takes a sentence, basic discourse unit, or event as input and contains semantic vectors as output, corresponds to the relation between the two vectors of the vector pair when, for two different sentences present in one document, the vector pair consists of a semantic vector contained in the output for the input containing one of the sentences and a semantic vector contained in the output for the input containing the other sentence;
a second item which, for a vector pair consisting of two different discourse semantic vectors, and for a specific hashmap that takes a sentence, basic discourse unit, or event as input and contains semantic vectors as output, corresponds to the relation between the two vectors of the vector pair when, for two different basic discourse units present in one document, the vector pair consists of a semantic vector contained in the output for the input containing one of the basic discourse units and a semantic vector contained in the output for the input containing the other basic discourse unit;
a third item which, for a vector pair consisting of two different event semantic vectors, and for a specific hashmap that takes a sentence, basic discourse unit, or event as input and contains semantic vectors as output, corresponds to the relation between the two vectors of the vector pair when, for two different events present in one document, the vector pair consists of a semantic vector contained in the output for the input containing one of the events and a semantic vector contained in the output for the input containing the other event;
a fourth item which, for a vector pair consisting of a sentence semantic vector and a discourse semantic vector, and for a specific hashmap that takes a sentence, basic discourse unit, or event as input and contains semantic vectors as output, corresponds to the relation between the two vectors of the vector pair when, for one sentence and one basic discourse unit present in one document, the vector pair consists of a semantic vector contained in the output for the input containing the sentence and a semantic vector contained in the output for the input containing the basic discourse unit;
a fifth item which, for a vector pair consisting of a sentence semantic vector and an event semantic vector, and for a specific hashmap that takes a sentence, basic discourse unit, or event as input and contains semantic vectors as output, corresponds to the relation between the two vectors of the vector pair when, for one sentence and one event present in one document, the vector pair consists of a semantic vector contained in the output for the input containing the sentence and a semantic vector contained in the output for the input containing the event; and
a sixth item which, for a vector pair consisting of a discourse semantic vector and an event semantic vector, and for a specific hashmap that takes a sentence, basic discourse unit, or event as input and contains semantic vectors as output, corresponds to the relation between the two vectors of the vector pair when, for one basic discourse unit and one event present in one document, the vector pair consists of a semantic vector contained in the output for the input containing the basic discourse unit and a semantic vector contained in the output for the input containing the event.
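
The first to sixth items amount to the six unordered combinations of three unit types (sentence, basic discourse unit, event). The sketch below shows one hypothetical way to label every pair of units found in a single document with its item number. The type names, the `ITEM_BY_TYPES` table, and both function names are illustrative assumptions, and plain strings stand in for the semantic vectors.

```python
from itertools import combinations
from typing import List, Tuple

# Item number for each unordered pair of unit types, in the order the items are listed.
ITEM_BY_TYPES = {
    frozenset(["sentence"]): 1,               # sentence / sentence
    frozenset(["discourse"]): 2,              # discourse unit / discourse unit
    frozenset(["event"]): 3,                  # event / event
    frozenset(["sentence", "discourse"]): 4,  # sentence / discourse unit
    frozenset(["sentence", "event"]): 5,      # sentence / event
    frozenset(["discourse", "event"]): 6,     # discourse unit / event
}


def relation_item(type_a: str, type_b: str) -> int:
    """Return which of the six items a pair of unit types falls under."""
    return ITEM_BY_TYPES[frozenset((type_a, type_b))]


def pairs_in_document(units: List[Tuple[str, str]]) -> List[Tuple[int, str, str]]:
    """Label every pair of distinct units in one document.

    `units` is a list of (unit_type, text); the result is a list of
    (item_number, text_a, text_b), one entry per unordered pair.
    """
    return [
        (relation_item(kind_a, kind_b), text_a, text_b)
        for (kind_a, text_a), (kind_b, text_b) in combinations(units, 2)
    ]
```

Because the lookup key is a `frozenset`, the order of the two units does not matter, which matches the unordered nature of a vector pair.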

The method of claim 1,
wherein the method for visualizing unmentioned information comprises:
a mentioned discourse unit extraction step (S310) of extracting basic discourse units from a given document set;
a mentioned event extraction step (S320) of extracting event information from the given document set;
a semantic vector conversion step (S330) of converting each sentence included in the given document set into a sentence semantic vector, and converting the basic discourse units and event information included in each document into discourse semantic vectors and event semantic vectors, respectively;
a co-mention frequency calculation step (S340) of determining the frequency with which two different texts corresponding to two different semantic vectors occur together in the same document;
an input step (S410) of receiving an input document to be analyzed from a user;
a preprocessing step (S420) of extracting basic discourse units and event information from the input document;
a co-occurrence information processing step (S430) of generating, for the texts present in the input document, a text set having the maximum co-mention probability among the texts not present in the input document;
a sentence conversion step (S440) of converting every basic discourse unit and event in the text set having the maximum co-mention probability into sentence form, so that all items included in the text set are in sentence form;
a sentence insertion position calculation step (S450) of calculating the relative position at which each additional text is to be placed, so that a document with maximized coherence is generated when the text set having the maximum co-mention probability among the texts not present in the input document is added to and merged with the existing texts of the input document; and
an output step (S460) of outputting the merged document to the user with the text included in the document received from the user distinguished from the newly added text by bold type, color coding, or the like, providing the user with the source of the original text for each newly added text, and allowing the user to access the document corresponding to the original text of the newly added information through a click-type interaction on each source.
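
Steps S340 and S430 can be pictured with the following minimal sketch, which is not the claimed implementation. It assumes that each document has already been reduced to a list of text units, that units are matched by exact string rather than by semantic vector, and that the co-mention probability of a candidate given a present unit is estimated as count(both in the same document) / count(present unit). The claims do not specify an estimator, so that formula, the `top_k` parameter, and all function names are assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def count_co_mentions(corpus: List[List[str]]) -> Tuple[Counter, Counter]:
    """S340 sketch: count documents containing each unit and each pair of units."""
    pair_counts: Counter = Counter()
    unit_counts: Counter = Counter()
    for document in corpus:
        units = sorted(set(document))  # count each unit at most once per document
        unit_counts.update(units)
        for a, b in combinations(units, 2):  # a < b because units are sorted
            pair_counts[(a, b)] += 1
    return pair_counts, unit_counts


def co_mention_probability(given: str, candidate: str,
                           pair_counts: Counter, unit_counts: Counter) -> float:
    """Assumed estimator: P(candidate mentioned | given mentioned)."""
    if unit_counts[given] == 0:
        return 0.0
    key = (given, candidate) if given < candidate else (candidate, given)
    return pair_counts[key] / unit_counts[given]


def suggest_unmentioned(input_units: List[str], pair_counts: Counter,
                        unit_counts: Counter, top_k: int = 2) -> List[str]:
    """S430 sketch: rank units absent from the input by their best co-mention probability."""
    present = set(input_units)
    scores: Dict[str, float] = {}
    for candidate in unit_counts:
        if candidate in present:
            continue
        scores[candidate] = max(
            (co_mention_probability(unit, candidate, pair_counts, unit_counts)
             for unit in present),
            default=0.0,
        )
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [text for text, score in ranked[:top_k] if score > 0.0]
```

In a fuller pipeline the counts would be stored once for the whole corpus and queried for each input document, and the string lookup would presumably be replaced by a nearest-vector lookup over the semantic vectors of step S330.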

A system for visualizing unmentioned information,
wherein, among the information not mentioned in an input document received from a user, the information most likely to be mentioned together with the information mentioned in the input document is added in sentence form, and a document in which the added sentences are merged with the sentences of the original input document is provided to the user, thereby automatically providing the user with information that is highly likely to be mentioned together with the given document but is not mentioned in it.

The system of claim 5,
wherein the system for visualizing unmentioned information,
in calculating the information likely to be mentioned together, converts sentences, basic discourse units, and events into semantic vectors in order to measure the mutual similarity between a pair of text sets or a pair of texts, and, by using a hashmap in the semantic vector conversion, utilizes a plurality of co-occurrence relations for a specific hashmap.

The system of claim 6,
wherein the plurality of co-occurrence relations for the specific hashmap include one or more of the following first to sixth items:
a first item which, for a vector pair consisting of two different sentence semantic vectors, and for a specific hashmap that takes a sentence, basic discourse unit, or event as input and contains semantic vectors as output, corresponds to the relation between the two vectors of the vector pair when, for two different sentences present in one document, the vector pair consists of a semantic vector contained in the output for the input containing one of the sentences and a semantic vector contained in the output for the input containing the other sentence;
a second item which, for a vector pair consisting of two different discourse semantic vectors, and for a specific hashmap that takes a sentence, basic discourse unit, or event as input and contains semantic vectors as output, corresponds to the relation between the two vectors of the vector pair when, for two different basic discourse units present in one document, the vector pair consists of a semantic vector contained in the output for the input containing one of the basic discourse units and a semantic vector contained in the output for the input containing the other basic discourse unit;
a third item which, for a vector pair consisting of two different event semantic vectors, and for a specific hashmap that takes a sentence, basic discourse unit, or event as input and contains semantic vectors as output, corresponds to the relation between the two vectors of the vector pair when, for two different events present in one document, the vector pair consists of a semantic vector contained in the output for the input containing one of the events and a semantic vector contained in the output for the input containing the other event;
a fourth item which, for a vector pair consisting of a sentence semantic vector and a discourse semantic vector, and for a specific hashmap that takes a sentence, basic discourse unit, or event as input and contains semantic vectors as output, corresponds to the relation between the two vectors of the vector pair when, for one sentence and one basic discourse unit present in one document, the vector pair consists of a semantic vector contained in the output for the input containing the sentence and a semantic vector contained in the output for the input containing the basic discourse unit;
a fifth item which, for a vector pair consisting of a sentence semantic vector and an event semantic vector, and for a specific hashmap that takes a sentence, basic discourse unit, or event as input and contains semantic vectors as output, corresponds to the relation between the two vectors of the vector pair when, for one sentence and one event present in one document, the vector pair consists of a semantic vector contained in the output for the input containing the sentence and a semantic vector contained in the output for the input containing the event; and
a sixth item which, for a vector pair consisting of a discourse semantic vector and an event semantic vector, and for a specific hashmap that takes a sentence, basic discourse unit, or event as input and contains semantic vectors as output, corresponds to the relation between the two vectors of the vector pair when, for one basic discourse unit and one event present in one document, the vector pair consists of a semantic vector contained in the output for the input containing the basic discourse unit and a semantic vector contained in the output for the input containing the event.

The system of claim 5,
wherein the system for visualizing unmentioned information comprises a co-mention reference corpus 110, a co-mention information processing unit 120, and a user input document processing unit 130,
wherein the co-mention reference corpus 110 stores the set of documents that serves as the basis for analyzing the correlation between sentences appearing together in the same document,
wherein the co-mention information processing unit 120 comprises:
a mentioned discourse unit extraction unit 121 for extracting basic discourse units from each document of the co-mention reference corpus 110;
a mentioned event extraction unit 122 for extracting event information from each document of the co-mention reference corpus 110;
a semantic vector conversion unit 123 for converting each sentence included in each document of the co-mention reference corpus 110 into a sentence semantic vector, and converting the basic discourse units and event information included in the document into discourse semantic vectors and event semantic vectors, respectively;
a co-mention frequency calculation unit 124 for determining the frequency with which two different texts corresponding to two different semantic vectors occur together in the same document; and
a co-occurrence information database 125 for storing the frequency with which two different texts corresponding to two different semantic vectors occur together in the same document across the entire co-mention reference corpus 110,
and wherein the user input document processing unit 130 comprises:
an input unit 131 for receiving an input document to be analyzed from a user;
a preprocessing unit 132 for extracting basic discourse units and event information from the input document;
a co-occurrence information processing unit 133 for generating, for the texts present in the input document, a text set having the maximum co-mention probability among the texts not present in the input document, using the co-occurrence information for the texts present in the input document obtained from the co-occurrence information database 125;
a sentence conversion unit 134 for retrieving, from the sentence conversion database 137, the sentences corresponding to the basic discourse units and events in the text set having the maximum co-mention probability among the texts not present in the input document, and converting every basic discourse unit and event into sentence form, so that all items included in the text set are in sentence form;
a sentence insertion position calculation unit 135 for calculating the relative position at which each additional text is to be placed, so that a document with maximized coherence is generated when the text set having the maximum co-mention probability among the texts not present in the input document is added to and merged with the existing texts of the input document;
an output unit 136 for outputting the merged document to the user with the sentences included in the input document distinguished from the newly added sentences by bold type, color coding, or the like, providing the user with the source of the original text for each newly added sentence, and allowing the user to access the document corresponding to the original text of the newly added information through a click-type interaction on each source; and
a sentence conversion database 137 for storing the discourse-sentence and event-sentence pairs used by the sentence conversion unit 134 to convert unit texts corresponding to basic discourse units and events into sentences.
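
The roles of the position calculation unit 135 and the output unit 136 can be pictured with the following minimal sketch. It rests on stated assumptions: document coherence is approximated by placing each added sentence directly after the original sentence with which it shares the most words (Jaccard overlap), and added sentences are marked with `**...**` plus a source tag in place of bold type or color coding. The claims do not specify a coherence measure or a markup format, so the measure, the markup, and all function names here are hypothetical.

```python
import re
from typing import List, Optional, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two sentences, ignoring case and punctuation."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def best_insert_index(sentences: List[str], new_sentence: str) -> int:
    """Return the index at which to insert: right after the most similar original sentence."""
    if not sentences:
        return 0
    best = max(range(len(sentences)), key=lambda i: jaccard(sentences[i], new_sentence))
    return best + 1


def merge_and_mark(sentences: List[str], additions: List[Tuple[str, str]]) -> List[str]:
    """Merge (sentence, source) additions into the document and mark them as new."""
    placed = [(best_insert_index(sentences, text), text, source)
              for text, source in additions]
    merged: List[Tuple[str, Optional[str]]] = [(s, None) for s in sentences]
    # Insert from the end of the document so that earlier indices stay valid.
    for index, text, source in sorted(placed, key=lambda p: p[0], reverse=True):
        merged.insert(index, (text, source))
    return [text if source is None else f"**{text}** [source: {source}]"
            for text, source in merged]
```

An interactive front end could turn each source tag into a link to the originating document, which corresponds to the click-type interaction described for the output unit.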