KR20010004090A

KR20010004090A - Hyperlink generator for korean language terminology based HTML

Info

Publication number: KR20010004090A
Application number: KR1019990024696A
Authority: KR
Inventors: 홍기채; 문병주; 정현수; 김홍배
Original assignee: 정선종; 한국전자통신연구원
Priority date: 1999-06-28
Filing date: 1999-06-28
Publication date: 2001-01-15
Also published as: KR100374114B1

Abstract

PURPOSE: An HTML-based hyperlink generator for Korean terms and abbreviations is provided that extracts terms/abbreviations simply and quickly by applying HTML analysis and morphological analysis. CONSTITUTION: An HTML analyzer (20) extracts the actual content of an HTML document so that hyperlink tags can be generated automatically for HTML/text documents. A morpheme analyzer (30) extracts terms/abbreviations through morphological analysis. A document formatter (40) generates the hyperlinked document. Morphemes are separated into independent and dependent morphemes, and terms/abbreviations are then extracted using only a term/abbreviation dictionary (70) and a dependent morpheme dictionary (60).

Description

Hyperlink generator for Korean language terminology based on HTML

The present invention relates to a hyperlink generator for Korean terms and abbreviations. In particular, it relates to an HTML-based Korean term/abbreviation hyperlink generator that, in order to provide more accurate information in Internet-based specialized information systems and the like, extracts terms/abbreviations using morphological analysis and, based on them, generates hyperlinked documents that can be readily used in the Internet's HTML environment.

With the rapid growth of the Internet and of its user base, many new technologies have recently emerged. In information retrieval in particular, Internet search technologies for efficient retrieval have been developed and are used in specialized information systems, and indexing techniques are used to make such searches fast and accurate.

Indexing is a technique that helps retrieve information by acting as an intermediary between the user and the information. Methods for performing indexing automatically include statistical methods, methods based on morphological analysis, and methods based on syntactic and semantic analysis.

The statistical indexing method computes how often each word occurs in a document, discards stop words and words that are unlikely in practice to serve as index terms, and selects the high-frequency words as index terms. It is difficult to apply to Korean, however, because predicates (verbs and adjectives) are conjugated and substantives (nouns) are combined with postpositional particles.

The morphological indexing method separates sentences or words into content morphemes that carry meaning (nouns, pronouns, numerals, adverbs, adjectives, verbs) and functional morphemes that carry only grammatical relations (particles, endings, pre-final endings, prefixes, suffixes), discards the functional morphemes, and selects the remaining content morphemes as index terms. Because every content morpheme becomes an index term, memory is wasted and processing is slow. For compound nouns in particular, every candidate word produced by the n-gram method (extracting n adjacent syllables from each index segment) and by a noun dictionary is selected as an index term, so accuracy drops for similar terms/abbreviations. The method also requires a separate dictionary for each kind of morpheme.

Indexing by syntactic analysis is one step beyond morphological indexing, and indexing by semantic analysis is one step beyond that. Both still perform morphological analysis, but whereas morphological indexing uses only surface morphological information, syntactic and semantic indexing analyze text using the grammatical and semantic information needed at the sentence level, where several words combine. They are therefore more effective than morphological indexing, but they are difficult to implement in practice and are either used only in restricted environments or proposed only as theoretical models.

Meanwhile, conventional Internet search technology, which focuses on retrieving information quickly and effectively in response to a user's specific request, does not address the ambiguity that arises in the content itself from unfamiliar terms/abbreviations and from the same term/abbreviation being used in different fields. This ambiguity problem is even more serious in Korean, which has a descriptive character.

To solve these problems, the present invention aims to provide an HTML-based Korean term/abbreviation hyperlink generator that applies HTML analysis and morphological analysis to extract terms/abbreviations simply and quickly, and that generates hyperlinked documents so the extracted terms can be readily used in the Internet's HTML environment.

To achieve this object, the Korean term/abbreviation hyperlink generator provided by the present invention automatically generates hyperlink tags that give supplementary explanations of the terms/abbreviations appearing in HTML and text documents. It comprises an HTML analyzer that extracts the actual content of an HTML document, a new morpheme analyzer that extracts terms/abbreviations through morphological analysis taking the descriptive character of Korean into account, and a document formatter that generates the hyperlinked document. It is characterized in that morphemes are divided into independent morphemes, which can stand alone, and dependent morphemes, which cannot, and the terms/abbreviations contained in an Internet document are then extracted using only a term/abbreviation dictionary and a dependent morpheme dictionary, which keeps processing simple and fast.

FIG. 1 is an overall block diagram of the Korean term/abbreviation hyperlink generator of the present invention.

FIG. 2 is a block diagram of the HTML analyzer of the present invention.

FIG. 3 is an example of the content tag definitions in the HTML DTD.

FIG. 4 is an example of the processing performed by the HTML analyzer and its result.

FIG. 5 is a block diagram of the morpheme analyzer of the present invention.

FIG. 6 is a block diagram of the document formatter of the present invention.

FIG. 7 is an example of the final result document generated by the hyperlink generator of the present invention and of the term/abbreviation dictionary.

<Description of reference numerals for the main parts of the drawings>

10: source document
20: HTML analyzer
21: document analyzer
22: HTML specification
23: document converter
30: morpheme analyzer
31: morpheme group globe generator
33: hyperlink generator
40: document formatter
41: globe searcher
42: dictionary searcher
43: document generator
50: hyperlinked document
60: morpheme dictionary
70: Korean term/abbreviation dictionary

The Korean term/abbreviation hyperlink generator of the present invention is described in more detail below with reference to the accompanying drawings.

FIG. 1 is an overall block diagram of the Korean term/abbreviation hyperlink generator of the present invention. Referring to FIG. 1, the hyperlink generator operates on an input source document (10) in HTML or plain text and comprises: an HTML analyzer (20) that analyzes and extracts the actual content of the HTML document (10) and generates a text object list; a morpheme analyzer (30) that uses a pre-built morpheme dictionary (60) and Korean term/abbreviation dictionary (70) to extract terms/abbreviations from the text object list output by the HTML analyzer (20) through morphological analysis and generates hyperlink object information; and a document formatter (40) that uses the result of the morpheme analyzer (30) to output a hyperlinked document.

Configured in this way, the Korean hyperlink generator analyzes and extracts the actual content of HTML and text documents, performs morphological analysis on it, and generates Internet web-based hyperlinks that provide supplementary explanations of the relevant terms/abbreviations in specialized information systems and the like. The individual components are described in detail below.

First, to extract the actual content of an HTML document, the HTML analyzer (20) designates, among the tags defined in the HTML DTD (Document Type Definition), the content tags that hold the document's actual content, and on that basis cuts out and extracts the document content within each tag to generate and output a text list object. The morpheme analyzer (30), which receives the text or text list object, uses the morpheme dictionary (60) and the Korean term/abbreviation dictionary (70) to extract hyperlink objects from it and then generates a morpheme group globe, a structured collection of these objects. Because morphemes are divided into independent morphemes, which can stand alone, and dependent morphemes, which cannot, and terms/abbreviations are extracted using only the term/abbreviation dictionary and the dependent morpheme dictionary, the analyzer is simple and fast.

The morpheme dictionary (60) is used for morphological analysis of the output of the HTML analyzer (20), and the term/abbreviation dictionary (70) is used to look up the terms/abbreviations contained in that output and to generate the hyperlinked document. Both dictionaries are stored as text documents so that they are easy to use; the term/abbreviation dictionary (70) in particular must be built by the user for each subject field.

The dictionaries (60, 70) are loaded and searched using an AVL tree (Adelson-Velskii and Landis tree), which ensures search speed and stability when terms/abbreviations are added to or deleted from a dictionary.
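As an illustration of the dictionary structure described above, the following is a minimal Python sketch of an AVL tree with insertion and search. The names `Node`, `insert`, and `search` are hypothetical and do not come from the patent, deletion is omitted for brevity, and the patent does not disclose its actual implementation.

```python
class Node:
    """One dictionary entry in the tree (hypothetical layout)."""
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None
        self.height = 1

def _h(node):
    return node.height if node else 0

def _update(node):
    node.height = 1 + max(_h(node.left), _h(node.right))

def _rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y
    _update(y)
    _update(x)
    return x

def _rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    _update(x)
    _update(y)
    return y

def insert(node, key, value):
    """Insert key/value and return the (possibly new) subtree root."""
    if node is None:
        return Node(key, value)
    if key < node.key:
        node.left = insert(node.left, key, value)
    elif key > node.key:
        node.right = insert(node.right, key, value)
    else:
        node.value = value  # existing term: overwrite its value
        return node
    _update(node)
    balance = _h(node.left) - _h(node.right)
    if balance > 1:  # left subtree too tall
        if key > node.left.key:  # left-right case
            node.left = _rotate_left(node.left)
        return _rotate_right(node)
    if balance < -1:  # right subtree too tall
        if key < node.right.key:  # right-left case
            node.right = _rotate_right(node.right)
        return _rotate_left(node)
    return node

def search(node, key):
    """Return the stored value for key, or None if it is absent."""
    while node is not None and node.key != key:
        node = node.left if key < node.key else node.right
    return node.value if node else None
```

Because the tree rebalances on every insertion, lookups stay logarithmic in the number of entries even when terms are loaded in sorted order, which is the typical case for a dictionary kept as a text file.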

Once the morpheme analyzer (30) has generated the morpheme group globe, the document formatter (40) uses it, together with dictionary lookups, to generate the hyperlinked document.

Each component of the hyperlink generator of the present invention is described in more detail below with reference to FIGS. 2 to 6.

FIG. 2 is a block diagram of the HTML analyzer of the present invention. Referring to FIG. 2, the HTML analyzer comprises a document analyzer (21), which performs element analysis on an externally supplied HTML document (10) according to an internal HTML specification (22), and a document converter (23), which tokenizes and converts the result of that analysis to create text objects carrying object information such as index, position, and length, and generates a text object list (24), the collection of those objects.

Among the tags defined in the HTML document type definition (DTD), the document analyzer (21) treats the tags that hold the actual content of an HTML document as "content tags". The content tags are defined as <P>, <PRE>, <DT>, <DD>, <LI>, <TH>, <TD>, and so on, and the text contained in these tags is extracted.

FIG. 3 shows an example of the content tag definitions in the HTML DTD: among the tags defined in the HTML 4.0 DTD (W3C), it shows the part that the present invention defines as content tags.

Referring to FIG. 3, the content tags, namely paragraphs (P), preformatted text (PRE), lists (DT, DD, LI), and tables (TH, TD), are composed of "%inline" and "%flow". "%inline" consists of "#PCDATA", "%fontstyle", "%phrase", "%special", "%formctrl", and so on, and "%flow" consists of "%block" and "%inline". That is, "%inline" has the subordinate tags #PCDATA, TT, I, B, U, S, STRIKE, BIG, SMALL, EM, STRONG, DFN, CODE, SAMP, KBD, VAR, CITE, ABBR, ACRONYM, A, IMG, APPLET, OBJECT, FONT, BASEFONT, BR, SCRIPT, MAP, Q, SUB, SUP, SPAN, BDO, IFRAME, INPUT, SELECT, TEXTAREA, LABEL, BUTTON, and so on, and "%flow" has the subordinate tags "%inline", "%heading", "%list", "%preformatted", DL, DIV, CENTER, NOSCRIPT, NOFRAMES, BLOCKQUOTE, FORM, ISINDEX, HR, TABLE, FIELDSET, ADDRESS, and so on.

Document analysis must therefore take these subordinate tags into account. Some subordinate tags are themselves hyperlink objects, so both the tag and its content are treated as absent (for example <A>). Others only affect font rendering, so the tag is treated as absent and only its content is extracted (for example <SUB> and </SUB>). The present invention defines the former as "ignore tags" and the latter as "virtual tags".

The ignore tags and virtual tags shown in FIG. 3 are listed separately below.

1. Ignore tags

: DFN, CODE, SAMP, KBD, VAR, CITE, ABBR, ACRONYM, A, IMG, APPLET, OBJECT, SCRIPT, MAP, SPAN, BDO, INPUT, SELECT, TEXTAREA, LABEL, BUTTON, FIELDSET, ADDRESS

2. Virtual tags

: TT, I, B, U, S, STRIKE, BIG, SMALL, EM, STRONG, FONT, BASEFONT, BR, Q, SUB, SUP, IFRAME, DL, DIV, CENTER, NOSCRIPT, NOFRAMES, BLOCKQUOTE, FORM, ISINDEX, HR, TABLE, and any other tag that is neither a content tag nor an ignore tag

FIG. 4 illustrates the processing performed by the HTML analyzer of the present invention and its result. It shows an HTML document (400) that undergoes document analysis using content tags, ignore tags, and virtual tags, together with an example of the text extracted from it (410).

Referring to FIG. 4, the <HTML> and </HTML> tags in the HTML document (400) mark the start and end of the HTML document. The first <P> and </P> pair is a content tag, so the content between them is extracted (401). The second <P> and </P> pair contains other kinds of tags. When the virtual tag <BR> is encountered, only the content from the <P> tag up to the <BR> is extracted (402). The <IMG> tag, which links to an image file, and the <A> and </A> tags, which link to a document file, are ignore tags, so their content is not extracted into the text. Extraction therefore resumes after the ignore tag </A>, and the content from there to </P> is extracted (403).

Between the <TABLE> and </TABLE> tags that mark the start and end of a table, the content is extracted (404) with the tags that delimit rows and cells (<TR>, </TR>, <TD>, </TD>, <TH>, </TH>) removed.

In other words, the content extracted as the analysis result of the document analyzer (21) of FIG. 2 is the text region (410) of FIG. 4.
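The tag handling described above could be sketched roughly as follows with Python's standard `html.parser` module. The class name `ContentExtractor`, the helper `extract_text_segments`, the `VOID_TAGS` set, and the choice to start a new text segment at BR and at every ignore tag are assumptions drawn from the FIG. 4 walk-through; this is not the patent's implementation, and it assumes content tags are explicitly closed.

```python
from html.parser import HTMLParser

# Tag classes as listed in the description (lowercase, as HTMLParser reports them).
CONTENT_TAGS = {"p", "pre", "dt", "dd", "li", "th", "td"}
IGNORE_TAGS = {
    "dfn", "code", "samp", "kbd", "var", "cite", "abbr", "acronym", "a",
    "img", "applet", "object", "script", "map", "span", "bdo", "input",
    "select", "textarea", "label", "button", "fieldset", "address",
}
# Assumption: elements without an end tag must not open an "ignored" region.
VOID_TAGS = {"img", "input", "br", "hr", "basefont", "isindex"}

class ContentExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.segments = []       # extracted text segments, in document order
        self._buf = []           # text collected for the current segment
        self._content_depth = 0  # > 0 while inside a content tag
        self._ignore_depth = 0   # > 0 while inside an ignore tag

    def _flush(self):
        text = " ".join("".join(self._buf).split())
        if text:
            self.segments.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in IGNORE_TAGS:
            self._flush()                      # segment ends before the ignore tag
            if tag not in VOID_TAGS:
                self._ignore_depth += 1
        elif tag in CONTENT_TAGS:
            self._flush()
            self._content_depth += 1
        elif tag == "br":
            self._flush()                      # BR closes the current segment
        # every other tag is a virtual tag: the tag vanishes, its text is kept

    def handle_endtag(self, tag):
        if tag in IGNORE_TAGS:
            if tag not in VOID_TAGS and self._ignore_depth:
                self._ignore_depth -= 1
        elif tag in CONTENT_TAGS:
            self._flush()
            if self._content_depth:
                self._content_depth -= 1

    def handle_data(self, data):
        if self._content_depth and not self._ignore_depth:
            self._buf.append(data)

def extract_text_segments(html):
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return parser.segments
```

Each returned segment corresponds to one extracted span such as (401) to (404) in FIG. 4; a document converter could then attach index, position, and length information to each segment.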

FIG. 5 is a block diagram of the morpheme analyzer of the present invention. Referring to FIG. 5, the morpheme analyzer comprises a morpheme group globe generator (31) and a hyperlink generator (33). When the text object list (24) produced by the HTML analyzer (20) is input, the morpheme group globe generator (31) analyzes morphemes one text object at a time with reference to the morpheme dictionary (60) and generates a morpheme group globe. When the resulting globe (32) is input, the hyperlink generator (33) extracts terms/abbreviations with reference to the term/abbreviation dictionary (70) and links the objects, producing a result globe (34).

During morphological analysis, the morpheme group globe generator (31) divides morphemes into independent morphemes, which can stand alone, and dependent morphemes, which cannot, and treats anything with no dependent morpheme as an independent morpheme. The morpheme dictionary (60) therefore needs to contain only dependent morphemes such as "~가", "~는", "~의", and "~에서".

The 80 dependent morphemes proposed by the present invention are as follows.

〈 가, 고, 과, 과는, 그려, 까지, 께, 께서, 께옵서, 나, 나마, 는, 더러, 도, 되는, 되면, 된다면, 될때, 들이, 들은, 들과, 등은, 등이, 라, 라고, 로서, 로써, 를, 마다, 마저, 만, 만큼, 보다, 부터, 뿐, 서도, 시여, 시피, 아, 야, 에, 에게, 에게는, 에는, 에서, 에서는, 에서도, 에서부터, 여, 와, 와는, 와의, 요, 으로, 으로는, 으며, 은, 을, 의, 이, 이나, 이라면, 이며, 이면, 이시여, 이어서, 이어야, 이야, 이여, 인, 조차, 처럼, 커녕, 키, 하게, 하고, 하기, 하는, 하면, 한테, "." 〉

A morpheme group is defined as the span from the start of a sentence, or from the end of the preceding dependent morpheme, up to the next dependent morpheme. For example, in the following text object the dependent morphemes (emphasized in the original publication) delimit the morpheme groups.

〈 국내에서도 상용 소프트웨어 개발에 자바언어응용을 채택 하는 개발업체들이 잇따르고 있다. 〉 ("In Korea, too, a growing number of developers are adopting Java language applications for commercial software development.")

The morpheme groups extracted from this example are therefore the seven groups "국내" (domestic), "상용소프트웨어 개발" (commercial software development), "자바 언어 응용" (Java language application), "채택" (adoption), "개발업체" (developer), "잇따르고" (following), and "있다" (are); the morpheme groups are thus delimited by dependent morphemes.
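The segmentation rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `split_dependent` and `morpheme_groups` are hypothetical, the set `DEPENDENT` is a small subset of the 80 listed morphemes, and the longest-suffix matching on whitespace-separated words is an assumed strategy that the patent does not spell out.

```python
# Subset of the dependent morpheme dictionary, enough for the example sentence.
DEPENDENT = {"에서도", "에", "을", "하는", "들이", "이", "고", "."}

def split_dependent(token, dependent):
    """Split a word into (stem, suffix) using the longest dependent suffix."""
    for start in range(len(token)):      # start = 0 tries the whole word first
        if token[start:] in dependent:
            return token[:start], token[start:]
    return token, ""                     # no dependent morpheme found

def morpheme_groups(sentence, dependent=DEPENDENT):
    """Group consecutive stems; a dependent morpheme closes the current group."""
    groups, current = [], []
    for token in sentence.split():
        stem, suffix = split_dependent(token, dependent)
        if stem:
            current.append(stem)
        if suffix and current:
            groups.append(" ".join(current))
            current = []
    if current:
        groups.append(" ".join(current))
    return groups

example = "국내에서도 상용 소프트웨어 개발에 자바언어응용을 채택 하는 개발업체들이 잇따르고 있다."
print(morpheme_groups(example))
```

With this naive suffix stripping the sixth group comes out as "잇따르", because "고" is in the dependent morpheme list, whereas the description lists it as "잇따르고"; the patent's analyzer evidently applies further rules that are not disclosed here.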

The extracted morpheme groups then become the elements of a morpheme group globe, which is built using the AVL tree algorithm.

Once the morpheme group globe generator (31) has produced the resulting globe (32), the hyperlink generator (33) extracts terms/abbreviations from each morpheme group in that globe and generates a morpheme group globe (34) that contains the hyperlink information.

To extract terms/abbreviations, the words of a morpheme group are separated and combined into ordered combinations, each ordered combination is compared against the term/abbreviation dictionary, and a combination found in the dictionary is chosen as the hyperlink object. When several combinations are possible, as with compound nouns, priority is assigned by the number of words in the combination and by its order.

For example, the morpheme group "자바 언어 응용" (Java language application) has the following six ordered combinations, with priority assigned in the order shown.

(자바 언어 응용) > (자바 언어) > (언어 응용) > (자바) > (언어) > (응용)

If the term/abbreviation dictionary contains "자바언어", "언어응용", "언어", and "자바", the highest-priority entry among them is "자바언어", so the combination extracted as the hyperlink object is (자바언어).
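The priority rule in this example can be sketched in Python as follows. The function names `ordered_combinations` and `pick_term` are hypothetical, and matching with whitespace removed is an assumption based on the description's statement that spacing variants of a combination are treated as identical.

```python
def ordered_combinations(words):
    """Contiguous word runs, longest first, then left to right."""
    n = len(words)
    return [
        tuple(words[i:i + size])
        for size in range(n, 0, -1)
        for i in range(n - size + 1)
    ]

def pick_term(words, dictionary):
    """Return the highest-priority combination found in the dictionary."""
    keys = {"".join(entry.split()) for entry in dictionary}  # ignore spacing
    for combo in ordered_combinations(words):
        if "".join(combo) in keys:
            return combo
    return None  # no term/abbreviation in this morpheme group

words = ["자바", "언어", "응용"]
print(ordered_combinations(words))
print(pick_term(words, {"자바언어", "언어응용", "언어", "자바"}))
```

Scanning the combinations in priority order and stopping at the first dictionary hit means that at most n(n+1)/2 lookups are needed for a morpheme group of n words.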

Accordingly, for the morpheme group "자바 언어 응용", the hyperlink generator (33) records hyperlink information for the "자바 언어" part in the globe. In addition, a combination of two or more words has several internal combinations, as shown below, and all of them are treated as identical.

(자바 언어 응용) : (자바 언어 응용), (자바언어응용), (자바 언어응용), (자바언어 응용)

(자바 언어) : (자바 언어), (자바언어)

(언어 응용) : (언어 응용), (언어응용)
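The internal combinations listed above are the spacing variants of a word sequence. A brief Python sketch, in which the names `spacing_variants` and `normalize` are hypothetical and comparing by a whitespace-free key is one assumed way of treating the variants as identical:

```python
from itertools import product

def spacing_variants(words):
    """All ways of writing the words with or without a space at each gap."""
    if not words:
        return []
    variants = []
    for seps in product((" ", ""), repeat=len(words) - 1):
        text = words[0]
        for sep, word in zip(seps, words[1:]):
            text += sep + word
        variants.append(text)
    return variants

def normalize(text):
    """Key under which every spacing variant compares equal."""
    return "".join(text.split())

print(spacing_variants(["자바", "언어", "응용"]))
```

A sequence of n words has 2^(n-1) variants, so comparing normalized keys avoids enumerating them during dictionary lookup.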

FIG. 6 is a block diagram of the document formatter of the present invention. Referring to FIG. 6, the document formatter comprises a globe searcher (41) that searches the morpheme group globe (34) output by the morpheme analyzer (30), a dictionary searcher (42) that consults the term/abbreviation dictionary (70) in order to add hypertext tags for the hyperlink objects, and a document generator (43) that renders the output of the globe searcher (41) as an HTML document, text, or a string variable.

The globe searcher (41) finds the hyperlink objects in the morpheme group globe and passes them to the dictionary searcher (42). The dictionary searcher (42) looks them up in the term/abbreviation dictionary (70) to fix the exact query used in the hyperlink tag. The document generator (43) traverses all the information in the morpheme group globe output by the globe searcher (41) in sequence and writes it to a file or a string variable.

FIG. 7 shows an example of the final result document (71) generated by the hyperlink generator of the present invention and of the term/abbreviation dictionary (72). Referring to FIG. 7, the terms/abbreviations in the final result document (71) that are registered in the term/abbreviation dictionary (72) have been hyperlinked.

The dictionary (72) can easily be built as a text document. Commas (,), semicolons (;), quotation marks (" "), and the like serve as delimiters for terms/abbreviations, and terms/abbreviations delimited on the same line are recognized as synonyms. In the term/abbreviation dictionary (72), for instance, "인터넷" and "INTERNET" are synonyms, as are "컴퓨터" and "COMPUTER".

In the HTML document (71), consider the tag on the last line of the content between the first content tags <P> and </P>: <A Href="TermSearch?key=문서 표준">문서표준</A>. The hyperlink object is "문서표준", but the query is "문서 표준". Because all internal combinations of a hyperlink object are treated as identical, the query used to look up a term/abbreviation must be generated in the form that appears in the term/abbreviation dictionary.
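The dictionary format and the query generation described above might look roughly like this in Python. The function names `load_dictionary` and `make_link`, the exact line syntax of the dictionary file, and the percent-encoding of the query are assumptions; only the `TermSearch?key=` form of the link is taken from the example tag.

```python
import re
from html import escape
from urllib.parse import quote

def squeeze(term):
    """Whitespace-free key, so that spacing variants compare equal."""
    return "".join(term.split())

def load_dictionary(lines):
    """Map each term's key to (dictionary spelling, synonyms on the same line)."""
    lookup = {}
    for line in lines:
        # Commas, semicolons, and quotation marks delimit terms (assumed syntax).
        terms = [t.strip() for t in re.split(r'[,;"]', line) if t.strip()]
        for term in terms:
            lookup[squeeze(term)] = (term, terms)
    return lookup

def make_link(surface, lookup, endpoint="TermSearch"):
    """Wrap the text as it appears in the document; query with the dictionary form."""
    hit = lookup.get(squeeze(surface))
    if hit is None:
        return escape(surface)           # not a registered term: leave as text
    entry, _synonyms = hit
    return f'<A Href="{endpoint}?key={quote(entry)}">{escape(surface)}</A>'

lookup = load_dictionary(['"문서 표준"', '"인터넷", "INTERNET"'])
print(make_link("문서표준", lookup))
```

The link text keeps the spelling found in the document ("문서표준"), while the query carries the dictionary spelling ("문서 표준"), which mirrors the behavior described for FIG. 7.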

By applying HTML analysis and morphological analysis, the present invention extracts terms/abbreviations simply and quickly and generates hyperlinked documents that can be readily used in the Internet's HTML environment. It can thereby resolve the ambiguity users face in Internet web-based specialized information systems and the like, which arises from unfamiliar terms/abbreviations in the content and from the same terms being used across different fields.

Furthermore, by automating the term/abbreviation hyperlinking of HTML documents that is currently done by hand, the invention avoids errors and can be expected to reduce costs. Technically, by presenting a new Korean morphological analysis method for term/abbreviation extraction, it can contribute to the accumulation of base technology in Korea's Internet-related information services.

Claims

An HTML-based Korean term/abbreviation hyperlink generator comprising: an HTML analyzer that operates on an input source document composed of HTML or text and generates a text object list by analyzing and extracting the actual content portion of the HTML document;

a morpheme analyzer that, using a pre-built morpheme dictionary and Korean term/abbreviation dictionary, extracts terms/abbreviations through morphological analysis of the text object list output from the HTML analyzer and generates hyperlink object information; and

a document formatter that outputs a hyperlinked document using the result of the morpheme analyzer.

The HTML-based Korean term/abbreviation hyperlink generator of claim 1, wherein the HTML analyzer comprises:

a document analyzer that performs element analysis on an externally supplied HTML document according to an internal HTML specification; and

a document converter that tokenizes and converts the processing result of the document analyzer to create text objects carrying object information such as index, position, and length, and generates a text object list that is the collection of those text objects.

The HTML-based Korean term/abbreviation hyperlink generator of claim 1, wherein the morpheme analyzer comprises:

a morpheme group globe generator that, when the text object list produced by the HTML analyzer is input, analyzes morphemes one text object at a time with reference to the morpheme dictionary and generates a morpheme group globe; and

a hyperlink generator that extracts terms/abbreviations from the morpheme group globe output by the morpheme group globe generator with reference to the term/abbreviation dictionary and performs object linking.

The HTML-based Korean term/abbreviation hyperlink generator of claim 1 or 3, wherein the morpheme analyzer

separates morphemes into independent morphemes, which can stand alone, and dependent morphemes, which cannot, and extracts terms/abbreviations using only the term/abbreviation dictionary and a dependent morpheme dictionary.

The HTML-based Korean term/abbreviation hyperlink generator of claim 3, wherein the hyperlink generator of the morpheme analyzer

separates the words of a morpheme group to form ordered combinations, compares each ordered combination with the term/abbreviation dictionary, extracts a combination present in the dictionary as a term/abbreviation, and determines it to be a hyperlink object.

The HTML-based Korean term/abbreviation hyperlink generator of claim 3, wherein the hyperlink generator of the morpheme analyzer,

when forming the ordered combinations by separating the words of the morpheme group, assigns priority according to the number of words in each combination and its order if several combinations arise, as with compound nouns, and determines the highest-priority combination present in the term/abbreviation dictionary to be the hyperlink object.

The HTML-based Korean term/abbreviation hyperlink generator of claim 1, wherein the document formatter comprises:

a globe searcher that searches for and outputs the hyperlink objects of the morpheme group globe output from the morpheme analyzer;

a dictionary searcher that searches the term/abbreviation dictionary in order to add hypertext tags for the hyperlink objects output from the globe searcher, and returns the search result to the globe searcher; and

a document generator that sequentially traverses all the information of the morpheme group globe output from the globe searcher and writes it to a file or string variable, so as to produce it as an HTML document, text, or a string variable.