JPH1021249A

JPH1021249A - Method for generating key word extraction rule

Info

Publication number: JPH1021249A
Application number: JP8186877A
Authority: JP
Inventors: Yoshifumi Sato (里 佳史); Masanori Kato (加藤 雅則); Hisafumi Azuma (東 尚史)
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1996-06-28
Filing date: 1996-06-28
Publication date: 1998-01-23
Anticipated expiration: 2016-06-28
Also published as: JP3724878B2

Abstract

PROBLEM TO BE SOLVED: To provide a method for generating keyword extraction rules, which are used to extract the character strings (keywords) that indicate logical structure when an unstructured document is converted into a structured document. SOLUTION: A logical structure extraction part 102 generates, from a given logical structure definition 101, element information 103 that represents the adjacency relations between the logical structure elements that contain character strings. An output format information extraction part 105 generates, from a given output format definition 104, output format information 106 concerning the layout and character strings used when the logical structure elements are output. A display part 108 invokes a verification part 111 and an input part 109 to determine the keyword-corresponding elements, that is, the character-string-corresponding elements to be extracted as keywords, and to generate the format conditions for extracting them, thereby generating an extraction rule 113. The verification part 111 verifies whether sufficient keyword-corresponding elements have been set at a given point in time, and the complementary information input part 109 lets the user generate the format conditions for keyword extraction from the output format information 106.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a method for generating keyword extraction rules. The rules are used when a document that is input by means such as a character recognition device or a word processor and that contains no information explicitly representing its structure (hereinafter an "unstructured document") is converted into a structured document, which does contain such information.

[0002]

[Description of the Prior Art] One form of structured document embeds information that explicitly represents the logical structure in the text itself. A structured document created by a user (hereinafter a "document instance") typically consists of two parts: a part that specifies the file containing the logical structure definition, which prescribes the logical structure of the document, and a content text part that holds the content of the document. The logical structure definition defines the logical structure of the document and the marks (hereinafter "tags") that represent its components. The tags defined in the logical structure definition are inserted into the content text part so that the character string forming the content of the logical structure corresponding to each tag is uniquely determined, which expresses the logical structure of the document explicitly.

[0003] When a document instance structured in this way is output, the image to be output is generated by referring to a file containing an output format definition, which prescribes the format in which each component of the logical structure (hereinafter an "element") is output. Because the document instance and the output format definition are independent under this scheme, document instances can be exchanged regardless of the particular devices or systems used for output.

[0004] In addition, the content of a character string in such a structured document is expressed explicitly by inserted tags, such as "<author name>" or "<title>", that correspond one-to-one to elements. By combining document instances with tools such as a full-text search system that supports structured documents, a collection of document instances can therefore be used directly as a database. Structured documents and their uses are explained in detail in references such as "SGML no Susume" (SGMLのススメ; edited by Makoto Yoshioka, Ohmsha).

[0005] Because of these advantages, structured document formats are increasingly being adopted as the document management format in document processing systems that store and use large volumes of documents. Alongside this, techniques for converting unstructured documents, such as existing paper documents and word-processor documents, into structured documents are being studied.

[0006] Conventional techniques for converting an unstructured document into a structured document include the methods found in Japanese Patent Application Laid-Open No. 62-24970 and in "Conversion method of document image to ODA logical structured document" (Transactions of IEICE, D-II, Vol. J76-DII, No. 11, pp. 2274-2284). These methods extract from the unstructured document the characteristic character strings that express its logical structure, such as "第１章" (Chapter 1) or "１．１" (hereinafter "keywords"), and generate a structured document by recognizing the logical structure of the document with the extracted keywords as clues.

[0007] The conventional techniques, however, do not address support for creating the rules used to extract keywords, and no means for doing so has yet been published. Consequently, both the decision as to which elements should be keywords and the setting of the layout and character string conditions needed to extract them must be carried out entirely by hand.

[0008]

[Problems to be Solved by the Invention] The conventional methods have the following problems. (1) There is no means of supporting the decision as to which elements are to be extracted as keywords (hereinafter "keyword-corresponding elements"). Not every element whose content is a character string is extracted as a keyword. In particular, an element with no distinctive layout or character string is not extracted as a keyword; it is instead treated as a character string lying between keywords, that is, as a non-keyword.

[0009] The choice of keyword-corresponding elements is subject to the constraint that non-keywords must not be adjacent to each other in a document instance. Because a non-keyword is by definition a character string lying between keywords, it must always be adjacent to keywords. The conventional methods, however, provide no means of automatically verifying whether the set of elements chosen as keyword-corresponding elements satisfies this constraint. If the chosen set does not satisfy it, problems arise when the rules for logical structure recognition are created or when logical structure recognition is performed, and the keyword-corresponding elements must then be set again. This cycle has to be repeated until an appropriate set of keyword-corresponding elements has been set.

[0010] (2) There is no method of supporting the setting of the layout and character string conditions needed for keyword extraction. The information needed for keyword extraction must therefore be extracted by hand from the target unstructured documents themselves or from sources such as rule books that prescribe how the unstructured documents are written, which takes a great deal of labor.

[0011] An object of the present invention is to solve the above problems and to provide a method for generating the keyword extraction rules used to extract keywords from an unstructured document when a structured document is generated from it.

[0012]

[Means for Solving the Problems] To achieve the above object, the present invention is a keyword extraction rule generation method for generating keyword extraction rules. These are rules for extracting from an unstructured document the characteristic character strings, that is, keywords, that represent the components of the logical structure of a document, and they are used when a structured document is generated from an unstructured document. The method comprises: a character-string-corresponding element information generation step of extracting logical structure information from the logical structure definition given for the target document and generating character-string-corresponding element information; an output format information generation step of extracting output format information from the output format definition given for the target document and generating output format information; and a keyword extraction rule generation step of generating keyword extraction rules on the basis of the generated character-string-corresponding element information and output format information.

[0013] Further, the character-string-corresponding element information generation step generates, as the character-string-corresponding element information, pairs each consisting of a character-string-corresponding element and a character-string-corresponding element that can immediately follow it. The output format information generation step extracts, as the output format information, information on the layout and character strings used when the components of the logical structure of the document are output.

[0014] Further, the keyword extraction rule generation step displays the output format information to the user for each item needed for keyword extraction and, in accordance with the user's input, corrects the output format information so that it conforms to the output style of the unstructured document and supplements it with any missing information.

[0015] Further, when the user decides which of the components of the logical structure are to be extracted as keywords on the basis of the output format information, the keyword extraction rule generation step supports the user's decision by indicating and displaying, on the basis of the character-string-corresponding element information, the components of the logical structure that should be extracted.

[0016]

[Embodiment of the Invention] An embodiment of the present invention will be described with reference to the drawings. This embodiment adopts SGML (Standard Generalized Markup Language) as an example of a structured document format and uses, as the logical structure definition, the DTD (Document Type Definition), which is the SGML document type definition set for the target document. The processing and description rules of SGML and DTDs are specified in ISO 8879, a standard of the ISO (International Organization for Standardization), and are explained in detail in "SGML Nyumon" (SGML入門; Martin Bryan, ASCII Publishing).

[0017] This embodiment concerns the keyword extraction rule generation method shown in FIG. 1. The flow of the structured document generation method is described first, and the position of the present invention within it is explained using concrete examples.

[0018] FIG. 2 is a block diagram showing the flow of the structured document generation method. This flow itself is common to prior art such as Japanese Patent Application No. 7-223017, Japanese Patent Application Laid-Open No. 62-24970, and "Conversion method of document image to ODA logical structured document" (Transactions of IEICE, D-II, Vol. J76-DII, No. 11, pp. 2274-2284).

[0019] Reference numeral 201 denotes an unstructured document, that is, a document that is input by means such as a character recognition device or a word processor and contains no information explicitly representing its structure. FIG. 3 shows an example of an unstructured document. It is the result of character recognition performed on a paper document, here a piece of legislation. There is no explicit notation indicating the logical structure, but the components of the document are laid out with spaces and the like so that they are easy to read.

[0020] So that such plain-text electronic documents can be used in a document processing system, a logical structure definition (207 in FIG. 2) is set. FIG. 4 shows an example of the logical structure definition (DTD) corresponding to the unstructured document of FIG. 3. Reference numeral 401 at the top indicates that this logical structure definition is named "ordinance". 402 to 415 are element definitions: the element name follows "!ELEMENT", and after it comes the model group, the collection of members that make up the element, enclosed between "(" and ")". Examples of model groups are (promulgation date, regulation number, promulgation text?) and (#PCDATA). The element "title" has "#PCDATA" as the member of its model group, and the element "ordinance" has "title", "promulgation", and "main rules" as the members of its model group. A model group is a set whose members are one or more elements or content tokens representing data, such as "#PCDATA", and it can also contain nested model groups as members.

[0021] 402 indicates that the element "ordinance" consists of the sequence of elements "title", "promulgation", and "main rules". 409 indicates that the element "article" consists of the sequence "heading?", "article number", "article provision", and "item*". A member followed by an asterisk ("*") can occur zero or more times, and a member followed by a question mark ("?") may be present or absent. For example, 409 expresses that "heading" need not be present and that "item" can occur zero or more times. Definitions whose model group is (#PCDATA), such as 403 and 405 to 407, mean that the corresponding elements ("title", "promulgation date", "regulation number", and "promulgation text", respectively) hold a character string as the content of their model group.
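The occurrence indicators described above can be illustrated with a small parser. This is a minimal sketch, not the DTD processing used in the embodiment: `parse_model_group` is a hypothetical helper, the element names are English renderings of those in FIG. 4, and only flat (non-nested) model groups are handled. The "+" indicator (one or more occurrences) is part of SGML but is not mentioned in the text above.

```python
import re

def parse_model_group(text):
    """Split a flat model group such as '(a?, b, c*)' into (name, indicator) pairs.

    The indicator is '' (exactly once), '?' (optional), '*' (zero or more)
    or '+' (one or more). Nested groups are deliberately not supported.
    """
    inner = text.strip()
    if not (inner.startswith("(") and inner.endswith(")")):
        raise ValueError("model group must be enclosed in parentheses")
    members = []
    for token in inner[1:-1].split(","):
        token = token.strip()
        match = re.fullmatch(r"(#?\w+)([?*+]?)", token)
        if match is None:
            raise ValueError(f"unsupported token: {token!r}")
        members.append((match.group(1), match.group(2)))
    return members

# Model group of the element "article" (409 in FIG. 4), with English names.
article = parse_model_group("(heading?, article_number, article_provision, item*)")
```

Here `article` records that "heading" is optional and "item" is repeatable, which is the information the later adjacency analysis relies on.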

[0022] FIG. 5 shows the logical structure of 401 to 416 represented as a tree. In FIG. 5, each element defined in 402 to 415 of FIG. 4 is represented as a node, and the element corresponding to a higher node is composed of the elements corresponding to the nodes below it. A node with "#PCDATA" beneath it means that the corresponding element holds a character string as its model group.

[0023] FIG. 6 shows a structured document obtained by structuring the content of the unstructured document of FIG. 3 in accordance with the logical structure definition of FIG. 4. In the structured document of FIG. 6, the content of an element defined in FIG. 4 (for example, "promulgation" of 404) is expressed by the symbols and character strings enclosed between a symbol marking the start of the element (here, <promulgation> at 601) and a symbol marking its end (here, </promulgation> at 602). The purpose of the structured document generation processing shown in FIG. 2 is to generate such a structured document from an unstructured document like that of FIG. 3.

[0024] The structured document generation processing of FIG. 2 is broadly divided into two parts: keyword extraction processing 202 and logical structure recognition processing 204. Keyword extraction processing 202 refers to keyword extraction rules 203 and extracts from the unstructured document the keywords, that is, the characteristic character strings that represent the logical structure, such as "第１条" (Article 1) or "２．１．１". Logical structure recognition processing 204 refers to logical structure recognition rules 205, recognizes the logical structure using the keywords extracted by keyword extraction processing 202 as clues, and assigns a tree-shaped logical structure like that of FIG. 5 to the document, thereby generating a structured document 206 like that of FIG. 6.

[0025] Keyword extraction processing 202 is described in detail below. FIG. 7 shows an example of keyword extraction rules 203. The keyword extraction rules are a set of rules, each combining the name of an element to be extracted as a keyword with a format condition, which is a condition on the layout and character string used to extract it. FIG. 8 explains the descriptive elements of the format conditions in FIG. 7. In FIG. 7, the first item on each line is the name of the keyword, and the second and subsequent items are the format condition. 701 in FIG. 7 means that the format condition for the keyword "title" is: "the line is centered, begins with the character '○', continues with a character string of arbitrary length, and ends with the character string '条例' (ordinance) or '規則' (regulation)". Likewise, 702 means that the format condition for the keyword "promulgation date" is: "after any number of spaces from the start of the line comes the character string '大正' (Taisho) or '昭和' (Showa), followed in order by an integer, '年' (year), an integer, '月' (month), an integer, and '日' (day), after which the line ends".
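The two format conditions quoted above (701 and 702) can be approximated with regular expressions. This is only a sketch under stated assumptions: regular expressions stand in for the rule notation of FIGS. 7 and 8, which is not reproduced here; the function names, the `width` parameter, and the centering tolerance are hypothetical; and every character is assumed to occupy one column.

```python
import re

SPACES = " \u3000"  # ASCII space and full-width (ideographic) space

# 701: begins with "○", any string, ends with "条例" or "規則".
TITLE = re.compile(r"○.*(?:条例|規則)")
# 702: optional leading spaces, era name, then integer/年/integer/月/integer/日.
DATE = re.compile(r"[ \u3000]*(?:大正|昭和)\d+年\d+月\d+日")

def is_centered(line, width, tolerance=1):
    """Treat a line as centered if its left and right margins differ by at most `tolerance`."""
    text = line.strip(SPACES)
    left = len(line) - len(line.lstrip(SPACES))
    right = width - left - len(text)
    return bool(text) and abs(left - right) <= tolerance

def matches_title(line, width):
    """Format condition 701: centered line with the title pattern."""
    return is_centered(line, width) and TITLE.fullmatch(line.strip(SPACES)) is not None

def matches_promulgation_date(line):
    """Format condition 702: the whole line must be a date in the given form."""
    return DATE.fullmatch(line) is not None
```

Note that `\d` matches full-width digits as well as ASCII digits in Python, but dates written with kanji numerals would need a different integer pattern.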

[0026] Keyword extraction processing 202 of FIG. 2 determines whether the unstructured document contains a character string that satisfies the format condition of a keyword extraction rule and, if so, extracts that character string as a keyword. FIG. 9 shows an example of the keywords extracted from the unstructured document of FIG. 3. Logical structure recognition processing 204 of FIG. 2 generates the structured document by recognizing the logical structure with the extracted keywords as clues. The details of this processing are disclosed in Japanese Patent Application No. 7-223017, Japanese Patent Application Laid-Open No. 62-24970, and "Conversion method of document image to ODA logical structured document" (Transactions of IEICE, D-II, Vol. J76-DII, No. 11, pp. 2274-2284).
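The extraction step can be sketched as a loop that tests each line of the unstructured document against each rule. This is an illustration under assumptions, not the processing of FIG. 2: the rule list, the pattern for "article number", and the `extract_keywords` function are hypothetical, and each format condition is simplified to a pattern matched from the start of the line.

```python
import re

# (keyword name, format condition) pairs; both patterns are illustrative assumptions.
RULES = [
    ("promulgation_date", re.compile(r"[ \u3000]*(?:大正|昭和)\d+年\d+月\d+日$")),
    ("article_number", re.compile(r"第\d+条")),
]

def extract_keywords(lines, rules=RULES):
    """Return (line number, keyword name, matched string) for each line that satisfies a rule."""
    found = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in rules:
            match = pattern.match(line)
            if match:
                found.append((number, name, match.group(0).strip(" \u3000")))
                break  # at most one keyword per line in this sketch
    return found

document = [
    "  昭和50年4月1日",
    "第1条　この条例は、公布の日から施行する。",
    "附則",
]
keywords = extract_keywords(document)
```

Lines that satisfy no rule, such as the last one, yield no keyword and are left for the logical structure recognition step to treat as non-keyword text.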

[0027] The keyword extraction rule generation method detailed in this embodiment supports the creation of keyword extraction rules 203 of FIG. 2. Keyword extraction rules have conventionally been created entirely by hand, whereas the present system supports their creation by using the given logical structure definition and output format definition.

[0028] FIG. 1 is a block diagram showing the configuration of the keyword extraction rule generation method. The processing of the present system is first outlined with reference to FIG. 1. Reference numeral 101 denotes the logical structure definition set for the target document, which defines the elements that appear in the structured document and the relations between them. The logical structure information extraction unit 102 refers to logical structure definition 101 and generates character-string-corresponding element information 103. This information describes the elements that are logical structure components corresponding directly to character strings (hereinafter "character-string-corresponding elements") and the adjacency information between them. The character-string-corresponding elements are the elements of the nodes in FIG. 5 that have "#PCDATA" beneath them: title, promulgation date, regulation number, ..., item number, and item provision. In FIG. 4 they are the elements of 403, 405, 406, 407, 410, 411, 412, 414, and 415. Reference numeral 104 denotes the output format definition set for the target document, which defines the format in which each element is output. The output format information extraction unit 105 refers to output format definition 104 and extracts, from the information on the layout and output character strings of each element at output time, as many of the items needed to create keyword extraction rules as possible. Below, such an item is called a "requirement item", and the information extracted for each item is called the "requirement item content". Output format information 106 describes the requirement item contents for each character-string-corresponding element.

[0029] The keyword extraction rule creation unit 107 presents the requirement item contents in output format information 106 for each character-string-corresponding element to the user through input/output device 112. It then accepts the information entered by the user, corrects the requirement item contents, and generates keyword extraction rules 113 from the corrected requirement item contents.

[0030] The processing in the keyword extraction rule creation unit 107 is now described more concretely. The keyword information display unit 108 displays to the user the names of the character-string-corresponding elements described in character-string-corresponding element information 103. If a character-string-corresponding element has been set as a keyword-corresponding element and given a format condition, the format condition is displayed together with the element name. The format condition for each character-string-corresponding element is set in the supplementary information input unit 109. The supplementary information input unit 109 refers to output format information 106 and displays the requirement item contents for the character-string-corresponding element selected by the user. The user corrects any displayed requirement item content that differs from the layout and character strings of the unstructured document, and supplies content for requirement items that the output format information extraction unit 105 could not extract. The requirement item contents are edited in this way until all of them match the layout and character strings of the unstructured document. Reference numeral 110 denotes a character string condition input unit, which assists the user in editing the character string condition, one of the requirement item contents. When all requirement items have been edited, the supplementary information input unit 109 generates from the requirement item contents the format condition used for keyword extraction, and returns control to the keyword information display unit with the format condition as the return value. The keyword information display unit 108 sets the character-string-corresponding element for which the supplementary information input unit 109 generated a format condition as a keyword-corresponding element and displays the format condition together with the element name.

[0031] Keyword-corresponding elements are determined by the above procedure. Whether the set of keyword-corresponding elements determined at a given point satisfies the constraint that non-keywords must not be adjacent is verified by the element adjacency verification unit 111. The element adjacency verification unit 111 refers to the adjacency information between character-string-corresponding elements described in character-string-corresponding element information 103 and verifies whether any character-string-corresponding elements other than the keyword-corresponding elements (hereinafter "non-keyword-corresponding elements") can be adjacent to each other. If two non-keyword-corresponding elements can be adjacent, a format condition is generated for one of them and that element is set as a keyword-corresponding element. Conversely, if no non-keyword-corresponding elements can be adjacent to each other, sufficient keyword-corresponding elements have been set at that point. The set of combinations of the name and format condition of each keyword-corresponding element then becomes keyword extraction rules 113. This concludes the outline of the keyword extraction rule generation method. Each process in FIG. 1 is described in detail below.
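The adjacency check performed by the element adjacency verification unit 111 amounts to searching the adjacency pairs for one in which neither member is a keyword-corresponding element. The following is a minimal sketch of that idea; the function name and the data layout (a set of ordered pairs with English element names) are assumptions made for illustration. The three pairs follow from the model groups of "ordinance" and "promulgation" quoted in the description.

```python
def adjacent_non_keywords(adjacency, keywords):
    """Return the adjacency pairs in which neither element is a keyword-corresponding element."""
    return sorted(
        pair for pair in adjacency
        if pair[0] not in keywords and pair[1] not in keywords
    )

# (a, b) means that b can immediately follow a in a document instance.
ADJACENCY = {
    ("title", "promulgation_date"),
    ("promulgation_date", "regulation_number"),
    ("regulation_number", "promulgation_text"),
}

# With only two keywords set, one violating pair remains.
violations = adjacent_non_keywords(ADJACENCY, {"title", "promulgation_date"})
# Promoting one element of that pair to a keyword removes the violation.
resolved = adjacent_non_keywords(
    ADJACENCY, {"title", "promulgation_date", "regulation_number"}
)
```

An empty result corresponds to the state in which sufficient keyword-corresponding elements have been set; a non-empty result lists the pairs from which the user would pick one element to promote.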

【００３２】論理構造情報抽出部１０２では、図４およ
び図５に具体例を示したような論理構造定義１０１を参
照して、文字列対応要素と、文字列対応要素間の隣接可
能性についての情報を抽出し、文字列対応要素情報１０
３として出力する。文字列対応要素とは、論理構造定義
において、文字列を意味する（＃ＰＣＤＡＴＡ）をモデ
ルグループの要素とするエレメントのことである。図４
の論理構造定義における文字列対応要素を図１０に示
す。図１０の例では、エレメント「題名」「公布年月
日」「例規番号」「公布文」「見出し」「条番号」「条
規定」「号番号」「号規定」が文字列対応要素として抽
出される。The logical structure information extraction unit 102 refers to the logical structure definition 101, specific examples of which are shown in FIGS. 4 and 5, extracts the character string corresponding elements and information on the possibility of adjacency between them, and outputs the result as character string corresponding element information 103. A character string corresponding element is an element whose model group in the logical structure definition contains (#PCDATA), which denotes a character string. FIG. 10 shows the character string corresponding elements in the logical structure definition of FIG. 4. In the example of FIG. 10, the elements "title", "promulgation date", "regulation number", "promulgation text", "heading", "article number", "article rule", "item number", and "item rule" are extracted as character string corresponding elements.

【００３３】論理構造情報抽出部１０２では、文字列対
応要素間での隣接の可能性を調べる。具体的には、以下
の二つの処理を行なう。１．各エレメント毎に、その冒頭及び末尾に現われうる
文字列対応要素の集合を求める。例えば図６の構造化文
書において、エレメント「公布」の冒頭に文字列対応要
素「公布年月日」が現われており、またエレメント「公
布」の末尾には文字列対応要素「公布文」が現われてい
る。ここでの処理は、このようなエレメントの冒頭及び
末尾に現われうる要素を、図４に示すような論理構造
定義から導くものである。２．論理構造定義のモデルグループ内で隣接するエレメ
ントの組み合わせを求める。各組み合わせについて、前
側のエレメントの最後に現われうる文字列対応要素と、
後ろ側のエレメントの最初に現われうる文字列対応要素
とが、隣接する可能性を有することになる。The logical structure information extraction unit 102 examines the possibility of adjacency between character string corresponding elements. Specifically, it performs the following two processes.
1. For each element, obtain the set of character string corresponding elements that can appear at its beginning and at its end. For example, in the structured document of FIG. 6, the character string corresponding element "promulgation date" appears at the beginning of the element "promulgation", and the character string corresponding element "promulgation text" appears at its end. This process derives the elements that can appear at the beginning and end of each element from a logical structure definition such as the one shown in FIG. 4.
2. Obtain the combinations of elements that are adjacent within a model group of the logical structure definition. For each combination, a character string corresponding element that can appear last in the preceding element and a character string corresponding element that can appear first in the following element may be adjacent to each other.

【００３４】本実施例においては、この二つの処理を容
易にするための準備として、図４に示した論理構造定義
をＢＮＦ（ＢｕｃｋｕｓＮａｕｒＦｏｒｍ）記法を
用いた表現に変換する。ＢＮＦは「生成規則」と呼ばれ
るルールの集合である。各生成規則は「Ａ：ＢＣ」
というようにコロン’：’によって区切られた左辺と右
辺から成り、左辺の要素が、右辺に記述された要素の並
びによって成り立つことを意味する。「Ａ：ＢＣ」
という生成規則の例では、要素Ａが、「ＢＣ」という
要素の並びによって構成されることを意味する。また、
記号’｜’は並列を表す記号であり、例えば「Ａ：
Ｂ｜Ｃ」という生成規則は、要素Ａが要素Ｂまたは
要素Ｃから成り立つことを意味する。ＢＮＦの詳細につ
いては、文献「ｙａｃｃとｌｅｘの使い方」（斉藤孝
著、ＨＢＪ出版局）等において解説されている。In the present embodiment, as preparation to facilitate these two processes, the logical structure definition shown in FIG. 4 is converted into a representation in BNF (Backus-Naur Form) notation. A BNF description is a set of rules called "production rules". Each production rule, such as "A: B C", consists of a left side and a right side separated by a colon ':', and means that the element on the left side is made up of the sequence of elements written on the right side. In the example "A: B C", element A consists of the element sequence "B C". The symbol '|' denotes alternation; for example, the production rule "A: B | C" means that element A consists of either element B or element C. BNF is explained in detail in references such as "How to Use yacc and lex" (Takashi Saito, HBJ Publishing).
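As an illustration of the notation, a production rule written in this form can be split into its left side and its alternative element sequences. The helper name `parse_production` and the list-of-lists representation are assumptions made for this sketch; the patent does not prescribe a data structure.

```python
def parse_production(rule):
    """Split 'A: B C | D' into ('A', [['B', 'C'], ['D']])."""
    left, right = rule.split(":", 1)
    # Each alternative separated by '|' becomes one element sequence.
    alternatives = [alt.split() for alt in right.split("|")]
    return left.strip(), alternatives


print(parse_production("A: B C"))    # ('A', [['B', 'C']])
print(parse_production("A: B | C"))  # ('A', [['B'], ['C']])
```

The right-hand list is what the later paragraphs call the content model of the left-side element.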

【００３５】図１１に、図４に示した論理構造定義（Ｄ
ＴＤ）をＢＮＦ記法を用いて表現する際の変換規則を示
し、図１２にＢＮＦ記法で表現した論理構造定義の例を
示す。例えば、図４における４０４の定義は、図１２の
１２０３および１２０４に示した生成規則に変換され
る。ここでは、図４の「公布文」が、図１１の変換規則
１１０１によって、図１２の１２０３の「ｏｐｔ０」に
置き換えられている。そして、「ｏｐｔ０」の定義が１
２０４に記述されている。以下、本実施例においては、
ＢＮＦ記法によって表現した論理構造定義の各生成規則
における右辺を、左辺のエレメントの「内容モデル」と
呼ぶことにする。FIG. 11 shows the conversion rules for expressing the logical structure definition (DTD) shown in FIG. 4 in BNF notation, and FIG. 12 shows an example of a logical structure definition expressed in BNF notation. For example, the definition 404 in FIG. 4 is converted into the production rules shown at 1203 and 1204 in FIG. 12. Here, "promulgation text" in FIG. 4 is replaced with "opt0" at 1203 in FIG. 12 by conversion rule 1101 of FIG. 11, and the definition of "opt0" is given at 1204. Hereinafter, in the present embodiment, the right side of each production rule of the logical structure definition expressed in BNF notation is called the "content model" of the element on the left side.

【００３６】ＢＮＦ記法によって表現した論理構造定義
から、各エレメント毎にその冒頭と末尾に現われうる文
字列対応要素の集合を求める手続きについて説明する。
この手続きのアルゴリズムを図１３に示す。図１３にお
いてＡから始まる手続きは、エレメントを入力引数と
し、そのエレメントの冒頭に現われうる文字列対応要素
の集合を返り値とする手続きであり、再帰呼び出しを含
む。ここで、この手続き内で用いられている変数ｍｇ及
びｅｌｅｍは、Ａに手続きが進むごとに新たに生成され
る局所的な変数である。また、Ｆｉｒｓｔ［××］は、
エレメント××の冒頭に現われうる文字列対応要素の集
合を表す大域的な変数である。The following describes the procedure for obtaining, for each element, the sets of character string corresponding elements that can appear at its beginning and end from the logical structure definition expressed in BNF notation. FIG. 13 shows the algorithm of this procedure. In FIG. 13, the procedure starting at A takes an element as its input argument and returns the set of character string corresponding elements that can appear at the beginning of that element; it includes recursive calls. The variables mg and elem used in this procedure are local variables that are newly created each time the procedure enters A. First[XX] is a global variable representing the set of character string corresponding elements that can appear at the beginning of element XX.

【００３７】あるエレメントの冒頭に現われうる文字列
対応要素の集合を求めるには、そのエレメントを引数
（図１３中のｎｔ）として手続きＡを実行する。手続き
Ａでは、まずｎｔの冒頭に現われうる文字列対応要素の
集合を表すＦｉｒｓｔ［ｎｔ］を空集合にセットする
（１３０１）。また、ｎｔの内容モデルにおいて、並列
記号’｜’で区切られたエレメント列のうち、最初のエ
レメント列を変数ｍｇに代入する（１３０２）。並列記
号が存在しない場合は、内容モデル全体をｍｇとする。
そして、変数ｅｌｅｍに、ｍｇの最初のエレメントを代
入する（１３０３）。次に、１３０４において、ｅｌｅ
ｍが文字列対応要素であるか否かを調べる。ｅｌｅｍが
文字列対応要素である場合には、Ｆｉｒｓｔ［ｎｔ］に
ｅｌｅｍを加え（１３０５）、１３０９に進む。逆にｅ
ｌｅｍが文字列対応要素でない場合には、Ｆｉｒｓｔ
［ｅｌｅｍ］が設定されていれば（１３０６）Ｆｉｒｓ
ｔ［ｅｌｅｍ］の内容をＦｉｒｓｔ［ｎｔ］に加え（１
３０８）、１３０９に進む。また、１３０６においてＦ
ｉｒｓｔ［ｅｌｅｍ］が設定されていない場合には、ｅ
ｌｅｍを引数として、手続きＡを再帰的に実行する（１
３０７）。そして、その返り値すなわちＦｉｒｓｔ［ｅ
ｌｅｍ］の内容をＦｉｒｓｔ［ｎｔ］に加え（１３０
８）、１３０９に進む。１３０９では、ｎｔの内容モデ
ルにおいてｍｇが並列記号で区切られた最後のエレメン
ト列であるか否かを調べる。ｍｇが最後のエレメント列
でない場合には、変数ｍｇに次のエレメント列を代入し
（１３１０）、１３０３に戻る。逆にｍｇが最後のエレ
メント列である場合には、Ｆｉｒｓｔ［ｎｔ］を返り値
として、この手続きを呼び出した手続きに処理を戻す
（１３１１）。To obtain the set of character string corresponding elements that can appear at the beginning of an element, procedure A is executed with that element as the argument (nt in FIG. 13). Procedure A first sets First[nt], the set of character string corresponding elements that can appear at the beginning of nt, to the empty set (1301). Next, among the element sequences separated by the alternation symbol '|' in the content model of nt, the first element sequence is assigned to the variable mg (1302). If there is no alternation symbol, the entire content model is assigned to mg. The first element of mg is then assigned to the variable elem (1303). Next, at 1304, it is checked whether elem is a character string corresponding element. If it is, elem is added to First[nt] (1305) and processing proceeds to 1309. If elem is not a character string corresponding element and First[elem] has already been set (1306), the contents of First[elem] are added to First[nt] (1308) and processing proceeds to 1309. If First[elem] has not been set at 1306, procedure A is executed recursively with elem as the argument (1307); its return value, that is, the contents of First[elem], is added to First[nt] (1308), and processing proceeds to 1309. At 1309, it is checked whether mg is the last element sequence separated by the alternation symbol in the content model of nt. If mg is not the last element sequence, the next element sequence is assigned to mg (1310) and processing returns to 1303. If mg is the last element sequence, First[nt] is returned to the calling procedure as the return value (1311).
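Procedure A as described above can be sketched in Python as follows. This is a hedged illustration only: the name `compute_first`, the dictionary-based grammar, and the small sample grammar are assumptions, and the sketch assumes that no alternative in a content model is empty. The step numbers in the comments refer to the steps named in the text.

```python
def compute_first(nt, grammar, text_elements, first):
    """Fill and return first[nt], following steps 1301 to 1311 of the text."""
    first[nt] = set()                                   # 1301
    for mg in grammar[nt]:                              # 1302, 1309, 1310
        elem = mg[0]                                    # 1303
        if elem in text_elements:                       # 1304
            first[nt].add(elem)                         # 1305
        else:
            if elem not in first:                       # 1306
                compute_first(elem, grammar, text_elements, first)  # 1307
            first[nt] |= first[elem]                    # 1308
    return first[nt]                                    # 1311


# Hypothetical grammar: element -> list of alternative element sequences.
grammar = {
    "ordinance": [["title", "promulgation", "main rule"]],
    "promulgation": [["promulgation date", "regulation number"]],
    "main rule": [["article"]],
    "article": [["heading", "article number", "article rule"],
                ["article number", "article rule"]],
}
text_elements = {"title", "promulgation date", "regulation number",
                 "heading", "article number", "article rule"}

first = {}
for nt in grammar:
    if nt not in first:
        compute_first(nt, grammar, text_elements, first)

print(sorted(first["main rule"]))  # ['article number', 'heading']
```

Because First[nt] is set to the empty set on entry, an element that is still being processed already counts as "set" at step 1306, which keeps a self-referencing content model from recursing without end in this sketch.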

【００３８】以上、図１３に示した手続きを、全てのエ
レメントについてＦｉｒｓｔ［］が設定されるまで実施
することにより、各エレメントについて、冒頭に現われ
うる文字列対応要素の集合を求めることができる。ま
た、末尾に現われうる文字列対応要素の集合Ｌａｓ
ｔ［］を求めるには、図１３に対して以下の２つの置換
を行なう事により、図１３と同様の手順で求めることが
出来る。ａ．図１３中のＦｉｒｓｔ［ＸＸＸ］をＬａｓｔ［ＸＸ
Ｘ］に置き換える。ｂ．１３０３の「最初のエレメント」を「最後のエレメ
ント」に置き換える。By executing the procedure shown in FIG. 13 until First[] has been set for every element, the set of character string corresponding elements that can appear at the beginning of each element is obtained. The set Last[] of character string corresponding elements that can appear at the end of an element is obtained by the same procedure as FIG. 13 after making the following two substitutions.
a. Replace First[XXX] in FIG. 13 with Last[XXX].
b. Replace "the first element" at 1303 with "the last element".
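The two substitutions amount to changing which end of each element sequence is inspected. A minimal sketch, assuming a hypothetical helper `edge_set` whose `index` parameter selects the first element (0) or the last element (-1); encoding the optional "promulgation text" as two alternatives is an assumption made for the example and is not necessarily how FIG. 12 expresses it.

```python
def edge_set(nt, grammar, text_elements, table, index):
    """Compute First[] (index=0) or Last[] (index=-1) for nt into table."""
    table[nt] = set()
    for mg in grammar[nt]:
        elem = mg[index]  # substitution b: first element or last element
        if elem in text_elements:
            table[nt].add(elem)
        else:
            if elem not in table:
                edge_set(elem, grammar, text_elements, table, index)
            table[nt] |= table[elem]
    return table[nt]


# Hypothetical content model in which "promulgation text" may be absent.
grammar = {
    "promulgation": [
        ["promulgation date", "regulation number", "promulgation text"],
        ["promulgation date", "regulation number"],
    ],
}
text_elements = {"promulgation date", "regulation number", "promulgation text"}

last = {}   # substitution a: the table plays the role of Last[]
edge_set("promulgation", grammar, text_elements, last, -1)
print(sorted(last["promulgation"]))  # ['promulgation text', 'regulation number']
```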

【００３９】図１４に、図４に示した論理構造定義中の
エレメントについて、冒頭及び末尾に現われうる文字列
対応要素の集合、すなわちＦｉｒｓｔ［］とＬａｓ
ｔ［］とを求めた結果を示す。以上の手続きにより、各
エレメントについて冒頭に現われうる文字列対応要素の
集合Ｆｉｒｓｔ［］と、末尾に現われうる文字列対応要
素の集合Ｌａｓｔ［］を求めることができる。FIG. 14 shows the result of obtaining First[] and Last[], the sets of character string corresponding elements that can appear at the beginning and end, for the elements in the logical structure definition shown in FIG. 4. With the above procedure, the set First[] of character string corresponding elements that can appear at the beginning and the set Last[] of those that can appear at the end are obtained for each element.

【００４０】次に、論理構造定義の内容モデル内で隣接
するエレメントの組み合わせを求める。各組み合わせに
ついて、前側のエレメントのＬａｓｔ［］の要素と、後
ろ側のエレメントのＦｉｒｓｔ［］の要素とが、隣接す
る可能性を有することになる。この処理例を図１５に示
す。本図は、図１２の１２０１の「条例：題名公布
本則」という生成規則についての処理例である。この
生成規則では、エレメント「条例」の内容モデルにおい
て、題名と公布が隣接し、また公布と本則が隣接してい
る（１５０１）。そのため、Ｌａｓｔ［題名］の要素に
Ｆｉｒｓｔ［公布］の要素が後接しうる（１５０２）。
すなわち、文字列対応要素「題名」には、文字列対応要
素「公布年月日」が後接しうる（１５０４）。また、Ｌ
ａｓｔ［公布］の要素にＦｉｒｓｔ［本則］の要素が後
接しうる（１５０３）。すなわち、文字列対応要素「公
布文」と「例規番号」には、どちらも文字列対応要素
「見出し」及び「条番号」が後接しうる（１５０５）。
この手続きを、ＢＮＦ記法で表現した論理構造定義中の
全ての生成規則に対して適用することにより、全ての文
字列対応要素について後接しうる文字列対応要素の集合
を求めることができ、これがすなわち文字列対応要素情
報（図１の１０３）になる。文字列対応要素情報１０３
の例を図１６に示す。以上、図１１〜図１５に示した手
続きによって、図１の論理構造情報抽出部１０２におい
て文字列対応要素情報１０３が生成される。Next, the combinations of elements that are adjacent within the content models of the logical structure definition are obtained. For each combination, an element of Last[] of the preceding element and an element of First[] of the following element may be adjacent. FIG. 15 shows an example of this process for the production rule "ordinance: title promulgation main-rule" at 1201 in FIG. 12. In this production rule, within the content model of the element "ordinance", title is adjacent to promulgation, and promulgation is adjacent to main rule (1501). Therefore, an element of Last[title] can be followed by an element of First[promulgation] (1502); that is, the character string corresponding element "title" can be followed by the character string corresponding element "promulgation date" (1504). Likewise, an element of Last[promulgation] can be followed by an element of First[main rule] (1503); that is, the character string corresponding elements "promulgation text" and "regulation number" can each be followed by the character string corresponding elements "heading" and "article number" (1505). Applying this procedure to every production rule in the logical structure definition expressed in BNF notation yields, for every character string corresponding element, the set of character string corresponding elements that can follow it; this is the character string corresponding element information (103 in FIG. 1). FIG. 16 shows an example of the character string corresponding element information 103. In this way, the logical structure information extraction unit 102 of FIG. 1 generates the character string corresponding element information 103 by the procedures shown in FIGS. 11 to 15.
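The derivation of followers from Last[] and First[] can be sketched as below. The function name `follow_pairs` and the data layout are assumptions made for illustration; the First[] and Last[] values are those stated in the text for the "ordinance" production rule.

```python
def follow_pairs(grammar, first, last, text_elements):
    """Map each character string element to the elements that can follow it."""
    followers = {e: set() for e in text_elements}

    def firsts(x):
        return {x} if x in text_elements else first[x]

    def lasts(x):
        return {x} if x in text_elements else last[x]

    for alternatives in grammar.values():
        for seq in alternatives:
            # Every pair of neighbours in a content model contributes pairs.
            for left, right in zip(seq, seq[1:]):
                for a in lasts(left):
                    followers[a] |= firsts(right)
    return followers


grammar = {"ordinance": [["title", "promulgation", "main rule"]]}
text_elements = {"title", "promulgation date", "regulation number",
                 "promulgation text", "heading", "article number"}
first = {"promulgation": {"promulgation date"},
         "main rule": {"heading", "article number"}}
last = {"promulgation": {"promulgation text", "regulation number"}}

followers = follow_pairs(grammar, first, last, text_elements)
print(sorted(followers["title"]))              # ['promulgation date']
print(sorted(followers["regulation number"]))  # ['article number', 'heading']
```

Running the same function over every production rule of a full grammar would give the kind of table that the text calls character string corresponding element information.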

【００４１】次に、図１の出力書式情報抽出部１０５に
おいて、出力書式定義１０４から出力書式情報１０６を
抽出する処理について説明する。１０４は、対象文書に
対して設定された出力書式定義であり、各エレメントを
どのような書式で出力するのかが定義されている。図１
７に、図４の論理構造定義に沿った構造化文書のために
用意された出力書式定義の例の一部を示す。１７０１
は、１７０１〜１７１１がエレメント「題名」の出力書
式に関する定義であることを示す。［フォント種類］１７０２は、「題名」を出力する際の
フォントの種類がゴシック体であることを示し、［フォ
ントサイズ］１７０３は、そのフォントのサイズが１２
ｐｔであることを示す。ｐｔ（ポイント）は長さの単位
であり、１ｐｔ＝１／７２インチである。［文字ピッチ］１７０４は、「題名」の文字ピッチが１
４ｐｔであることを示す。１７０５の［オフセット１］
と１７０６の［オフセット２］は、それぞれこの文書を
出力する領域の左端および右端から、最低どれくらいの
スペースを空けて「題名」の内容を出力するかを表すも
のである。１７０７の［冒頭変位］は、他の行と比べて
特殊なオフセットを取ることが多い第一行目の、［オフ
セット１］との差を表す。１７０８の［前要素との接
続］は、直前に現れる要素との間にどのような文字列を
出力するかを表す。１７０８の例では、直前に現れる要
素を出力した後、改行して「題名」を出力することを示
している。１７０９の［文字列情報］は、どのような文
字列を出力するかを記述するものであり、１７０９の例
では、題名に相当する文字列（ＣＯＮＴＥＮＴ）、つま
り構造化文書においてタグ＜題名＞とタグ＜／題名＞に
挟まれる文字列をそのまま出力することを意味してい
る。１７１０の［配置］は、［オフセット１］と［オフ
セット２］によって指定された区間内に、内容文字列を
どのように配置するかを示すものである。左寄せ、右寄
せ、センタリング、均等割り付けの４種類の割り付け方
法に応じて、それぞれｓｔａｒｔ、ｅｎｄ、ｃｅｎｔｅ
ｒ、ｊｕｓｔｉｆｙの４つの値をとる。１７１０の例で
は、「題名」の内容文字列をセンタリングして出力する
ことを表している。Next, the process by which the output format information extraction unit 105 of FIG. 1 extracts the output format information 106 from the output format definition 104 is described. The output format definition 104 is set for the target document and defines the format in which each element is output. FIG. 17 shows part of an example output format definition prepared for structured documents that follow the logical structure definition of FIG. 4. 1701 indicates that 1701 to 1711 are the definitions of the output format of the element "title". [Font type] 1702 indicates that "title" is output in a Gothic typeface, and [Font size] 1703 indicates that the font size is 12 pt. The pt (point) is a unit of length; 1 pt = 1/72 inch. [Character pitch] 1704 indicates that the character pitch of "title" is 14 pt. [Offset 1] 1705 and [Offset 2] 1706 give the minimum space to leave from the left and right edges, respectively, of the area in which the document is output when outputting the content of "title". [Head displacement] 1707 gives the difference from [Offset 1] for the first line, which often has a different offset from the other lines. [Connection with previous element] 1708 specifies the character string to output between this element and the element that appears immediately before it. In the example at 1708, "title" is output on a new line after the immediately preceding element. [Character string information] 1709 describes what character string is output. In the example at 1709, the character string corresponding to the title (CONTENT), that is, the character string enclosed between the tags <title> and </title> in the structured document, is output as is. [Arrangement] 1710 specifies how the content character string is placed within the interval specified by [Offset 1] and [Offset 2]. It takes one of four values, start, end, center, and justify, corresponding to left alignment, right alignment, centering, and justification, respectively. The example at 1710 indicates that the content character string of "title" is centered.

【００４２】このような出力書式定義は、本来構造化文
書を出力するためのものであり、非構造化文書の書式を
表現するためのものではない。しかし、例えば法規文書
のように記述様式に規則性のある文書については、出力
書式定義がその規則に即して定義されていることが多
い。このような文書については、出力書式定義中のレイ
アウトや文字列に関する情報の多くを、非構造化文書か
らキーワードを抽出するための情報として利用すること
ができる。Such an output format definition is originally for outputting a structured document, and is not for expressing the format of an unstructured document. However, for a document such as a legal document having a regular description format, the output format definition is often defined in accordance with the rules. For such documents, much of the information on the layout and character strings in the output format definition can be used as information for extracting keywords from unstructured documents.

【００４３】出力書式情報抽出部１０５では、出力書式
定義１０４を参照して、各エレメントの出力時のレイア
ウトに関する情報と出力文字列に関する情報の中から、
キーワードの抽出に必要な項目を可能な限り抽出する。
前述したように、この項目自体を「要件項目」と呼び、
各項目について抽出される情報を「要件項目内容」と呼
ぶ。The output format information extraction unit 105 refers to the output format definition 104 and extracts, as far as possible, the items needed for keyword extraction from the layout information and output character string information for each element. As described earlier, these items are called "requirement items", and the information extracted for each item is called the "requirement item content".

【００４４】図１８に、図７に示したキーワード抽出ル
ールを作成する際に、各キーワード毎に必要な要件項目
の例を示す。［論理構造要素名］１８０１は、対象とする文字列対応
要素の名称であり、文字列を値とする。１８０２の［左
スペース］と１８０３の［右スペース］は、このエレメ
ントを出力する領域に対して、それぞれ左端および右端
から最低何文字分のスペースを空けて内容文字列が記述
されているかを表す条件である。１８０４の［第一行ス
ペース］は、他の行と比べて特殊なオフセットを取るこ
とが多い第一行目が、左側に何文字分のスペースを空け
て始まるかを表す。１８０５の［文字列条件］は、この
キーワードがどのような文字列によって記述されている
かを示す。１８０６の［割り付け］は、左スペース１８
０２と右スペース１８０３によって定まる領域におい
て、キーワードがどのように割り付けられているかを示
す項目であり、右寄せ、左寄せ、センタリング、均等、
の４種類の値をとる。１８０７の［前接文字列］および
１８０８［後接文字列］は、それぞれ注目しているキー
ワードの前後に現われる文字列対応要素との間に、どの
ような文字列が挟まれるのかを表す文字列である。FIG. 18 shows an example of the requirement items needed for each keyword when creating the keyword extraction rule shown in FIG. 7. [Logical structure element name] 1801 is the name of the target character string corresponding element; its value is a character string. [Left space] 1802 and [Right space] 1803 are conditions giving the minimum number of characters of space left between the content character string and the left and right edges, respectively, of the area in which the element is output. [First line space] 1804 gives the number of characters of space at the left of the first line, which often has a different offset from the other lines. [Character string condition] 1805 describes what character string the keyword is written as. [Assignment] 1806 indicates how the keyword is placed within the area determined by the left space 1802 and the right space 1803, and takes one of four values: right-aligned, left-aligned, centered, or justified. [Preceding character string] 1807 and [Following character string] 1808 are character strings describing what is inserted between the keyword of interest and the character string corresponding elements that appear before and after it, respectively.

【００４５】出力書式情報抽出部１０５では、出力書式
定義１０４を参照して、図１８に示したような要件項目
に関する情報、すなわち要件項目内容を可能な限り抽出
する。以下、図１７に示した出力書式定義から、要件項
目内容を抽出する例を図１９に示す。ある文字列対応要
素についての要件項目内容を抽出するには、出力書式定
義中の、その文字列対応要素に関する定義を利用する。
例えば、条番号に関する要件項目は、図１７の１７１２
〜１７２２の条番号に関する定義から抽出する。要件項
目［左スペース］及び［右スペース］は、それぞれ出力
書式定義中の［オフセット１］および［オフセット２］
と同じ内容を表す項目であるため、長さの単位をｐｔか
ら文字数へと変換するだけでよい。具体的には、［オフ
セット１］および［オフセット２］の値を［文字ピッ
チ］の値で割ればよい（１９０１および１９０２）。要
件項目［第一行スペース］は、出力書式定義中の［オフ
セット１］に［冒頭変位］を加えたものに相当する。そ
こで、その両者の和を［文字ピッチ］で割った値を内容
とする（１９０３）。要件項目［文字列条件］は、出力
書式定義中の［文字列情報］を参照して作成する（１９
０４）が、図１７の例では全ての要素について［文字列
情報］が”ＣＯＮＴＥＮＴ”、つまり文書インスタンス
中の内容文字列をそのまま出力することになっているた
め、出力書式定義から文字列に関する具体的な情報は得
られない。要件項目［割り付け］は、出力書式定義中の
［配置］と同じ概念を表す項目であるため、１９０５の
規則に従って値を変換する。要件項目［前接文字列］
は、出力書式定義中の［前要素との接続］の内容をその
まま代入する（１９０６）。要件項目：［後接文字列］は、文字列対応要素情報と、
出力書式定義中の他の要素の［前要素との接続］を利用
して求める（１９０７）。具体的には、まず文字列対応
要素情報を用いて、注目する文字列対応要素に後接する
文字列対応要素（以下、「後接要素」と呼ぶ）を求め
る。次に、全ての後接要素について、その要素の［前要
素との接続］を調べ、その内容がどの後接要素について
も同じであれば、その内容を注目する文字列対応要素の
［後接文字列］として設定する。後接要素によって［前
要素との接続］の内容が異る場合には、［後接文字列］
は設定しない。例えば条番号については、図１６の文字
列対応要素情報の１６０６より、条番号の後接要素は条
規定だけであることが分る。従って、条規定の［前要素
との接続］である「” ”」が条番号の［後接文字列］
の内容となる。以上の手続きを全ての文字列対応要素に
対して適用することにより、図１の出力書式情報１０６
が生成される。The output format information extraction unit 105 refers to the output format definition 104 and extracts, as far as possible, the information for the requirement items shown in FIG. 18, that is, the requirement item contents. FIG. 19 shows an example of extracting requirement item contents from the output format definition shown in FIG. 17. To extract the requirement item contents for a character string corresponding element, the definitions for that element in the output format definition are used. For example, the requirement items for the article number are extracted from the article number definitions 1712 to 1722 in FIG. 17. The requirement items [Left space] and [Right space] express the same content as [Offset 1] and [Offset 2] in the output format definition, so only the unit of length needs to be converted from pt to a number of characters. Specifically, the values of [Offset 1] and [Offset 2] are divided by the value of [Character pitch] (1901 and 1902). The requirement item [First line space] corresponds to [Offset 1] plus [Head displacement] in the output format definition, so its content is the sum of the two divided by [Character pitch] (1903). The requirement item [Character string condition] is created by referring to [Character string information] in the output format definition (1904). In the example of FIG. 17, however, [Character string information] is "CONTENT" for every element, meaning that the content character string in the document instance is output as is, so no specific information about the character string can be obtained from the output format definition. The requirement item [Assignment] expresses the same concept as [Arrangement] in the output format definition, so its value is converted according to rule 1905. The requirement item [Preceding character string] is assigned the content of [Connection with previous element] in the output format definition unchanged (1906). The requirement item [Following character string] is obtained from the character string corresponding element information and the [Connection with previous element] of other elements in the output format definition (1907). Specifically, the character string corresponding element information is first used to find the character string corresponding elements that can follow the element of interest (hereinafter "following elements"). Next, [Connection with previous element] is examined for every following element; if its content is the same for all following elements, that content is set as the [Following character string] of the element of interest. If the content of [Connection with previous element] differs among the following elements, [Following character string] is not set. For example, for the article number, 1606 in the character string corresponding element information of FIG. 16 shows that the only following element of the article number is the article rule. Therefore, " ", the [Connection with previous element] of the article rule, becomes the content of [Following character string] for the article number. Applying the above procedure to every character string corresponding element generates the output format information 106 of FIG. 1.
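The unit conversion and the rule for the following character string can be sketched as below. The function names, the dictionary layout, and the numeric offsets are hypothetical values chosen for illustration; only the 14 pt character pitch, the division by the character pitch, and the "same for all following elements" rule come from the text.

```python
def to_chars(length_pt, pitch_pt):
    """Convert a length in points to a number of characters (1901 to 1903)."""
    return length_pt / pitch_pt


def following_string(element, followers, connection):
    """Return the shared [Connection with previous element] of all following
    elements, or None when there are none or they differ (1907)."""
    conns = {connection[f] for f in followers.get(element, ())}
    return conns.pop() if len(conns) == 1 else None


# Hypothetical definition values in pt.
offset1, offset2, head_displacement, pitch = 28, 14, 14, 14
left_space = to_chars(offset1, pitch)                            # 2.0
right_space = to_chars(offset2, pitch)                           # 1.0
first_line_space = to_chars(offset1 + head_displacement, pitch)  # 3.0

followers = {"article number": {"article rule"},
             "regulation number": {"heading", "article number"}}
connection = {"article rule": " ", "heading": "\n", "article number": " "}

print(repr(following_string("article number", followers, connection)))     # ' '
print(repr(following_string("regulation number", followers, connection)))  # None
```

Returning None models the case in which [Following character string] is left unset for the user to fill in later.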

【００４６】図１のキーワード抽出ルール作成部１０７
では、文字列対応要素情報１０３と出力書式情報１０６
の情報を、入出力装置１１２を通じてユーザに提示す
る。そして、ユーザから補完情報の入力を受け、要件項
目情報の追加，修正を行なうことにより、キーワード抽
出ルール１１３を生成する。以下、キーワード抽出ルー
ル作成部１０７における具体的な処理について説明す
る。キーワード情報表示部１０８では、ユーザに対し
て、文字列対応要素名と、ある時点でどの文字列対応要
素がキーワード対応要素として設定されているかを示す
情報を提示する。そして、ある文字列対応要素をキーワ
ード対応要素として設定するようユーザから指示された
場合には、補完情報入力部１０９を起動し、その文字列
対応要素の要件項目内容を補完して書式条件を生成す
る。また、その時点において「非キーワードが隣接して
はならない」という拘束条件を満たすのに十分なキーワ
ード対応要素が設定されているか否か検定するようユー
ザから指示された場合には、要素隣接検定部１１１を起
動し、検定を行なう。The keyword extraction rule creation unit 107 of FIG. 1 presents the character string corresponding element information 103 and the output format information 106 to the user through the input/output device 112. It then receives supplementary information from the user and adds to or corrects the requirement item information, thereby generating the keyword extraction rule 113. The specific processing in the keyword extraction rule creation unit 107 is described below. The keyword information display unit 108 presents the user with the character string corresponding element names and information showing which character string corresponding elements are currently set as keyword corresponding elements. When the user instructs it to set a character string corresponding element as a keyword corresponding element, it starts the supplementary information input unit 109, which supplements the requirement item contents of that element and generates a format condition. When the user instructs it to test whether enough keyword corresponding elements have been set at that point to satisfy the constraint that "non-keywords must not be adjacent", it starts the element adjacency test unit 111 and performs the test.

【００４７】キーワード情報表示部１０８が入出力装置
１１２を通じてユーザに表示するインタフェースの例を
図２０に示し、処理フローを図２１に示す。この二つの
図を用いて、キーワード情報表示部１０８の動作を説明
する。キーワード情報表示部１０８は、起動時に文字列
対応要素情報１０３を読み込み、各文字列対応要素の名
称を得る（２１０１）。２００１は、キーワード情報表
示窓であり、文字列対応要素名を全て表示する要素名表
示領域２００２と、キーワード対応要素として設定され
た文字列対応要素について、その書式条件を表示する書
式条件表示領域２００３から構成される。処理２１０２
において、文字列対応要素名と、その時点においてキー
ワード対応要素として設定された要素の書式条件とを表
示するが、最初はどの要素についても書式条件が設定さ
れていないため、書式条件表示領域２００３には何も表
示されない。ある文字列対応要素に対して書式条件を付
与し、その要素をキーワード対応要素として設定するに
は、ユーザが例えばマウスを用いて要素名表示領域２０
０２中の要素名をダブルクリックすることにより、補完
情報入力部（図１の１０９）を起動する（２１０４）。
補完情報入力部１０９の動作については後述するが、文
字列対応要素名を補完情報入力部１０９に渡し、その書
式条件を返り値として受けとる。そして、ユーザの指示
した文字列対応要素をキーワード対応要素として設定し
（２１０５）、その書式条件を書式条件表示領域２００
３に表示する（２１０２）。図２０の例は、ある時点に
おけるインタフェースの表示例を示したものである。こ
の時点では、２００６の題名と２００７の項番号の二つ
の文字列対応要素に書式条件が付与されており、これは
この二つの文字列対応要素がキーワード対応要素として
設定されていることを意味する。FIG. 20 shows an example of the interface that the keyword information display unit 108 displays to the user through the input/output device 112, and FIG. 21 shows its processing flow. The operation of the keyword information display unit 108 is described with reference to these two figures. On startup, the keyword information display unit 108 reads the character string corresponding element information 103 and obtains the name of each character string corresponding element (2101). 2001 is the keyword information display window, which consists of an element name display area 2002 showing all character string corresponding element names and a format condition display area 2003 showing the format conditions of the character string corresponding elements that have been set as keyword corresponding elements. Process 2102 displays the character string corresponding element names and the format conditions of the elements set as keyword corresponding elements at that point; initially no element has a format condition, so nothing is displayed in the format condition display area 2003. To give a format condition to a character string corresponding element and set it as a keyword corresponding element, the user double-clicks the element name in the element name display area 2002, for example with a mouse, which starts the supplementary information input unit (109 in FIG. 1) (2104). The operation of the supplementary information input unit 109 is described later; it is passed the character string corresponding element name and returns the format condition. The character string corresponding element specified by the user is then set as a keyword corresponding element (2105), and its format condition is displayed in the format condition display area 2003 (2102). FIG. 20 shows an example of the interface display at a certain point in time. At this point, format conditions have been given to two character string corresponding elements, the title at 2006 and the paragraph number at 2007, which means that these two elements are set as keyword corresponding elements.

【００４８】２００４は隣接チェックボタンであり、こ
のボタンをクリックすると、その時点で設定されている
キーワード対応要素の集合が「非キーワードが隣接して
はならない」という拘束条件を満たすのに十分であるか
否かを検定する要素隣接検定部（図１の１１１）が起動
される（２１０６）。要素隣接検定部１１１の動作につ
いては後述するが、その検定を行ない、拘束条件を満た
すのに十分なキーワード対応要素が設定されていること
が判明した場合、ユーザは終了ボタンをクリックし、キ
ーワード情報表示部１０８の処理を終了することを指示
する。キーワード情報表示部１０８は、キーワード対応
要素名とその書式条件とを、キーワード抽出ルール（図
１の１１３）として出力し、処理を終了する（２１０
７）。以上がキーワード情報表示部１０８の処理内容で
ある。2004 is the adjacency check button. Clicking it starts the element adjacency test unit (111 in FIG. 1), which tests whether the set of keyword corresponding elements set at that point is sufficient to satisfy the constraint that "non-keywords must not be adjacent" (2106). The operation of the element adjacency test unit 111 is described later. If the test shows that enough keyword corresponding elements have been set to satisfy the constraint, the user clicks the end button to instruct the keyword information display unit 108 to finish. The keyword information display unit 108 outputs the keyword corresponding element names and their format conditions as the keyword extraction rule (113 in FIG. 1) and ends processing (2107). This concludes the processing of the keyword information display unit 108.

【００４９】次に、キーワード情報表示部１０８におい
て、要素名をダブルクリックした際に起動される補完情
報入力部１０９のインタフェースを図２２に示し、その
処理フローを図２３に示す。補完情報入力部１０９で
は、キーワード情報表示部１０８から渡された、キーワ
ード対応要素として書式条件を設定すべき要素名を読み
込み（２３０１）、その要素に対応する要件項目内容を
出力書式情報（図１の１０６）から読み込む（２３０
２）。そして、要件項目内容を要件項目編集窓２２０１
に表示する（２３０３）。要件項目編集窓２２０１は、
表示内容を編集できる窓であり、表示内容が非構造化文
書上の記述様式と異なる場合は、ユーザがその内容を変
更する。また、出力書式情報抽出部１０５において抽出
できなかった要件項目内容（例えば、図１８および図１
９の抽出例における「文字列条件」）については要件項
目編集窓が空白になっているため、ユーザはその編集窓
に要件項目内容を入力する（２３０４→２３０３）。文
字列条件についても要件項目編集窓上で編集してもよい
が、文字列条件入力ボタン２２０２をクリックして文字
列条件入力部（図１の１１０）を起動することにより
（２３０５）、より容易に入力することができる。文字
列条件入力部１１０の処理については後述する。文字列
条件を入力した後の表示例を図２２中の「文字列条件入
力後」に示す。Next, FIG. 22 shows the interface of the supplementary information input unit 109, which is started when an element name is double-clicked in the keyword information display unit 108, and FIG. 23 shows its processing flow. The supplementary information input unit 109 reads the name, passed from the keyword information display unit 108, of the element for which a format condition is to be set as a keyword corresponding element (2301), and reads the requirement item contents for that element from the output format information (106 in FIG. 1) (2302). It then displays the requirement item contents in the requirement item editing windows 2201 (2303). A requirement item editing window 2201 is a window whose displayed content can be edited; if the displayed content differs from the style used in the unstructured document, the user changes it. For requirement item contents that the output format information extraction unit 105 could not extract (for example, the "character string condition" in the extraction example of FIGS. 18 and 19), the requirement item editing window is blank, so the user enters the content there (2304 → 2303). The character string condition may also be edited in the requirement item editing window, but it can be entered more easily by clicking the character string condition input button 2202 to start the character string condition input unit (110 in FIG. 1) (2305). The processing of the character string condition input unit 110 is described later. A display example after the character string condition has been entered is shown under "After character string condition input" in FIG. 22.

[0050] When editing of the requirement item contents is finished and all of the requirement item contents conform to the description format in the unstructured document, the user clicks the end button 2203 to instruct the supplementary information input unit 109 to end its processing. The supplementary information input unit 109 generates a format condition from the requirement item contents of the character string corresponding element whose requirement items were edited (2306), and returns processing to the keyword information display unit 108 with that format condition as the return value (2307). FIG. 24 shows the processing flow for generating a format condition from the requirement item contents. In that flow, an example of converting the requirement item contents for the article number shown under "After inputting character string condition" in FIG. 22 into a format condition is attached in dotted frames. First, the contents of the requirement item [character string condition] (for example: "第" NUM1 "条") are assigned to the format condition. Next, it is checked whether the contents of the requirement item [preceding character string] are a line break (2401). If they are, processing proceeds to 2403. If they are not, the format condition is enclosed in '[' and ']', and '+' and the contents of [preceding character string] are added immediately before it (2402). Blanks are converted to SPC{integer} at this point. Next, in process 2403, it is checked whether the contents of the requirement item [following character string] are a line break. If they are, '$' is appended to the end of the format condition (2405) and processing proceeds to 2406. If they are not, the format condition is enclosed in '[' and ']' unless it already contains them, and the contents of [following character string] and '+' are added immediately after it (2404; for example: ["第" NUM1 "条"]SPC1+). In process 2406, it is checked whether the contents of the requirement item [alignment] are centering. If they are, 'C' is added to the beginning of the format condition (2407) and generation of the format condition ends. If they are not, processing proceeds to 2408, where processes A and B are performed according to the contents of [alignment]: A if the contents are left-aligned, B if right-aligned, and both A and B if justified, after which generation of the format condition ends. In A, '^SPCx' is added to the head of the format condition (2409), where x is the contents of [leading indent] (for example: ^SPC0["第" NUM1 "条"]SPC1+). In B, 'SPCy$' is first added to the tail of the format condition (2410), where y is the contents of [right space]. Then, if neither '^' nor '+' is present at the beginning of the format condition, '!' is added to the beginning (2411). The supplementary information input unit 109 returns processing to the keyword information display unit 108 with the format condition obtained by the above procedure as the return value (2307 in FIG. 23). The above is the processing of the supplementary information input unit 109.
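The format condition generation flow of FIG. 24 can be illustrated with a short Python sketch. This is only one possible reading of the text, not the patent's implementation: the `Requirements` class, its field names, the alignment labels and the helper `encode_blanks` are hypothetical, and the rule that SPC{integer} counts the blanks in a run is an assumption. The steps are applied literally in flow order; the text does not say how overlapping markers (for example a '$' from 2405 followed by 'SPCy$' from 2410) should combine, so the sketch simply concatenates them.

```python
import re
from dataclasses import dataclass

NEWLINE = "\n"  # assumed encoding of a "line break" requirement value


@dataclass
class Requirements:
    """Hypothetical container for the requirement items of one element."""
    string_condition: str
    preceding: str = NEWLINE      # [preceding character string]
    following: str = NEWLINE      # [following character string]
    alignment: str = "left"       # "center", "left", "right" or "justified"
    leading_indent: int = 0       # [leading indent], x in ^SPCx
    right_space: int = 0          # [right space], y in SPCy$


def encode_blanks(text: str) -> str:
    """Replace each run of blanks with SPC{n} (n = run length, an assumption)."""
    return re.sub(r" +", lambda m: f"SPC{len(m.group())}", text)


def build_format_condition(req: Requirements) -> str:
    cond = req.string_condition
    # 2401/2402: a non-newline preceding string is prefixed together with '+'.
    if req.preceding != NEWLINE:
        cond = "+" + encode_blanks(req.preceding) + f"[{cond}]"
    # 2403/2405/2404: a newline adds '$'; otherwise the following string and '+'.
    if req.following == NEWLINE:
        cond += "$"
    else:
        if "[" not in cond:
            cond = f"[{cond}]"
        cond += encode_blanks(req.following) + "+"
    # 2406/2407: centering only adds 'C' at the beginning.
    if req.alignment == "center":
        return "C" + cond
    # 2408/2409: process A for left-aligned or justified text.
    if req.alignment in ("left", "justified"):
        cond = f"^SPC{req.leading_indent}" + cond
    # 2408/2410/2411: process B for right-aligned or justified text.
    if req.alignment in ("right", "justified"):
        cond += f"SPC{req.right_space}$"
        if not cond.startswith(("^", "+")):
            cond = "!" + cond
    return cond


article_number = Requirements('"第" NUM1 "条"', following=" ", alignment="left")
print(build_format_condition(article_number))  # ^SPC0["第" NUM1 "条"]SPC1+
```

With a single blank as the following character string and a leading indent of 0, the sketch reproduces the article number example given in the text.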

[0051] Next, FIG. 25 shows the interface of the character string condition input unit 110, which is activated when the character string condition input button is clicked in the supplementary information input unit 109, and FIG. 26 shows its processing flow. The purpose of the character string condition input unit 110 is to reduce input effort by providing buttons for character strings that are frequently used in character string conditions. Reference numeral 2501 denotes the character string condition display window, in which the user edits the character string condition. Reference numeral 2502 denotes the cursor in the character string condition display window; characters inserted by the user are inserted at the cursor position. Reference numerals 2503 to 2508 denote editing buttons; clicking one of them performs the corresponding processing shown in the table of FIG. 26 (2602). Characters that cannot be entered with these buttons, such as the characters that follow NUM or SPC, are entered by the user from the keyboard. Reference numeral 2509 denotes the clear button; when the user clicks it, the contents of the character string condition display window are cleared (2603). Reference numeral 2510 denotes the end button; when the user clicks it, the character string condition input unit 110 returns processing to the supplementary information input unit 109 with the contents of the character string condition display window 2501 as the return value (2604). The above is the processing of the character string condition input unit 110.

[0052] Next, FIG. 27 shows the processing flow of the element adjacency test unit 111, which is activated when the adjacency check button is clicked in the keyword information display unit (108 in FIG. 1), and FIG. 28 shows a processing example. The element adjacency test unit 111 first reads the keyword corresponding element names supplied by the keyword information display unit 108 (2701; for example, 2801). Next, it reads the character string corresponding element information (103 in FIG. 1) (2702). It then obtains the group of non-keyword corresponding elements as the set of all character string corresponding elements minus the keyword corresponding elements (2703; for example, 2802). In process 2704, it refers to the character string corresponding element information and checks whether any non-keyword corresponding element has another non-keyword corresponding element among its following elements (for example, 2803). If so, it presents the adjacent non-keyword corresponding elements to the user (2705; for example, 2804) and ends processing. If not, it informs the user that no non-keyword corresponding elements are adjacent (2706) and ends processing. The above is the processing of the element adjacency test unit 111.
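The adjacency test of FIG. 27 amounts to a set difference followed by a lookup of following elements, which the Python sketch below illustrates. It is a minimal sketch under stated assumptions: the function name `find_adjacent_non_keywords`, the dictionary representation of the character string corresponding element information, and the element names in the usage example are all hypothetical and do not come from the patent's figures.

```python
from typing import Dict, List, Set, Tuple


def find_adjacent_non_keywords(
    following: Dict[str, Set[str]], keywords: Set[str]
) -> List[Tuple[str, str]]:
    """Return (element, following element) pairs where both are non-keywords.

    `following` maps each character string corresponding element to the set
    of elements that may directly follow it (assumed representation).
    """
    # Collect every element that appears as a key or as a following element.
    all_elements = set(following)
    for successors in following.values():
        all_elements |= successors
    # 2703: non-keyword elements = all elements minus keyword elements.
    non_keywords = all_elements - keywords
    # 2704: look for a non-keyword element followed by another non-keyword.
    return sorted(
        (element, successor)
        for element in non_keywords
        for successor in following.get(element, set())
        if successor in non_keywords
    )


# Hypothetical adjacency information for a small legal-style document.
following = {
    "title": {"article_no"},
    "article_no": {"article_title"},
    "article_title": {"paragraph"},
    "paragraph": {"article_no"},
}

# 2705: with only article_no as a keyword, one adjacent pair is reported.
print(find_adjacent_non_keywords(following, {"article_no"}))
# [('article_title', 'paragraph')]

# 2706: adding article_title as a keyword leaves no adjacent non-keywords.
print(find_adjacent_non_keywords(following, {"article_no", "article_title"}))
# []
```

An empty result corresponds to branch 2706, and a non-empty result lists the pairs that would be shown to the user in branch 2705.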

[0053] As described above, the form shown in this embodiment makes it possible to support the creation of keyword extraction rules.

【００５４】[0054]

[Effects of the invention] As described above, according to the present invention, the adjacency information between character string corresponding elements extracted from a given logical structure definition is used to support the decision of which character string corresponding elements to extract as keywords, and the layout and character string conditions for extracting keywords are extracted from a given output format definition. This greatly reduces the labor required to create keyword extraction rules.

[Brief description of the drawings]

FIG. 1 is a block diagram outlining the keyword extraction rule generation method according to an embodiment of the present invention.

FIG. 2 is a diagram showing the overall flow of structured document generation.

FIG. 3 is a diagram showing an example of an unstructured document.

FIG. 4 is a diagram showing a DTD, which is a logical structure definition in SGML format, set for the document shown in FIG. 3.

FIG. 5 is a diagram representing the DTD shown in FIG. 4 as a tree.

FIG. 6 is an example in which the unstructured document shown in FIG. 2 is converted into a structured document conforming to the logical structure definition shown in FIG. 4.

FIG. 7 is a diagram showing an example of a keyword extraction rule.

FIG. 8 is a diagram showing the description elements of the format conditions in the keyword extraction rule shown in FIG. 7.

FIG. 9 is a diagram showing an example of keyword extraction.

FIG. 10 is a diagram showing an example of extracting character string corresponding elements.

FIG. 11 is a diagram showing an example of the conversion rules used when describing a DTD in BNF notation.

FIG. 12 is an example in which the DTD of FIG. 4 is described in BNF notation.

FIG. 13 is a diagram showing the procedure for obtaining the character string corresponding elements that can appear at the beginning of an element.

FIG. 14 is a diagram showing, for the logical structure definition of FIG. 12, the character string corresponding elements that can appear at the beginning and end of each element.

FIG. 15 is a diagram showing an example of the process of obtaining the adjacency relations between character string corresponding elements for the logical structure definition of FIG. 12.

FIG. 16 is a diagram showing an example of character string corresponding element information.

FIG. 17 is a diagram showing an example of an output format definition.

FIG. 18 is a diagram showing an example of the requirement items necessary for extracting a keyword.

FIG. 19 is a diagram showing an example of the process of extracting the contents of requirement items from an output format definition.

FIG. 20 is a diagram showing an example of the interface of the keyword information display unit.

FIG. 21 is a diagram showing the processing flow of the keyword information display unit.

FIG. 22 is a diagram showing an example of the interface of the supplementary information input unit.

FIG. 23 is a diagram showing the processing flow of the supplementary information input unit.

FIG. 24 is a diagram showing the processing flow for generating a format condition.

FIG. 25 is a diagram showing the interface of the character string condition input unit.

FIG. 26 is a diagram showing the processing flow of the character string condition input unit.

FIG. 27 is a diagram showing the processing flow of the element adjacency test unit.

FIG. 28 is a diagram showing a processing example of the element adjacency test unit.

[Explanation of symbols]

101 Logical structure definition
102 Logical structure information extraction unit
103 Character string corresponding element information
104 Output format definition
105 Output format information extraction unit
106 Output format information
107 Element adjacency test unit
108 Keyword information display unit
109 Supplementary information input unit
110 Character string condition input unit
201 Unstructured document
202 Keyword extraction process
203 Keyword extraction rule
204 Logical structure recognition process
205 Logical structure recognition rule
206 Structured document
207 Logical structure definition

Claims

[Claims]

1. A keyword extraction rule generation method for generating a keyword extraction rule, the keyword extraction rule being a rule for extracting from an unstructured document a characteristic character string, that is, a keyword, representing a component of the logical structure of the document, and being used when a structured document is generated from the unstructured document, the method comprising: a character string corresponding element information generation step of extracting logical structure information from a logical structure definition given for a target document and generating character string corresponding element information; an output format information generation step of extracting output format information from an output format definition given for the target document and generating output format information; and a keyword extraction rule generation step of generating a keyword extraction rule based on the generated character string corresponding element information and output format information.

2. The keyword extraction rule generation method according to claim 1, wherein the character string corresponding element information generation step extracts, as the character string corresponding element information, a character string corresponding element and a character string corresponding to the character string corresponding element, and the output format information generation step extracts, as the output format information, information on the layout and character strings used when outputting the components of the logical structure of the document.

3. The keyword extraction rule generation method according to claim 2, wherein the keyword extraction rule generation step displays the output format information to a user for each item necessary for keyword extraction, so that the output format information can be corrected to conform to the description format on the unstructured document and information missing from the output format information can be supplemented.

4. The keyword extraction rule generation method according to claim 2, wherein in the keyword extraction rule generation step, the user specifies, from the output format information, which components of the logical structure are to be extracted as keywords, and the components of the logical structure to be extracted are displayed based on the character string corresponding element information to assist the user in determining the keywords.