JP3220865B2

JP3220865B2 - Full text search method

Info

Publication number: JP3220865B2
Application number: JP05831191A
Authority: JP
Inventors: 敦畠山; 浩道藤澤; 寛次加藤; 川口　　久光; 直材嶺岸
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1991-02-28
Filing date: 1991-02-28
Publication date: 2001-10-22
Anticipated expiration: 2016-10-22
Also published as: JPH04274557A

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Field of Industrial Application] The present invention relates to a full-text search scheme in which a document database is searched over the full text of its documents by specifying a character string, and more particularly to a search method and apparatus that use auxiliary search files to make full-text search processing equivalently faster.

【０００２】[0002]

[Prior Art] Conventional document retrieval systems index each registered document by words that represent its content (called keywords). In this scheme, however, a keyword-assignment specialist called an indexer had to read each document one by one, understand its content, and assign appropriate keywords. To avoid this laborious work at registration time, a method has also been proposed, as in JP-A-63-198124, in which every word appearing in the text is registered in an index file as a keyword. With that method, however, it is difficult to determine the smallest meaningful word units when the index file is created, and deficiencies in the word dictionary or the grammar rules can cause sentence analysis to fail, so that important words are not extracted as keywords.

【０００３】[0003]

To solve this problem, full-text search has been proposed, in which documents are registered directly in the computer as character-coded text and, at search time, the contents of all documents in the text database are read to find the documents that contain a given keyword (hereinafter called a search term, to distinguish it from the controlled keywords of conventional systems). As stated at the beginning of Section 2 of "Information Processing Society of Japan Research Report, vol. 89, no. 66, Fundamentals of Informatics 14-7, Text Database Management System SIGMA and Its Applications (1989.7.27)", the major characteristic of this full-text search scheme is that the entire text file is scanned one character at a time from the beginning. This makes it possible to search using the text body of the text database itself, even without an index file that records the document identifiers corresponding to each keyword. That is, a string search for the given search term is run over the entire text data, and only the documents in which the search term appears are output as the search result.

However, because it scans the entire text file one character at a time from the beginning, this full-text search scheme takes a long processing time and cannot be applied to large databases. As Section 2 of the same reference shows, even a large general-purpose computer achieves a search speed of only about 2 MB/s. At this speed, the search time is within the practical range for a database of a few megabytes. A practical-scale database, such as one for an office, however, requires a capacity of several hundred megabytes, and in that case an adequate search response cannot be obtained.

【０００４】[0004]

[Problem to Be Solved by the Invention] The problem to be solved by the present invention is to provide a high-speed full-text search method and apparatus that search the full text of documents and return results within a search time that is acceptable in practice, even for a text database of practical scale.

【０００５】[0005]

[Means for Solving the Problem] To solve the above problem, a full-text search method consisting of the following processing steps is used, and an apparatus that carries out the method is configured.

(1) A step of storing the text itself.
(2) A step of decomposing the stored text into substrings at the word level, examining the containment relationships among the resulting substrings, and creating a condensed text consisting of the set of substrings that remains after every string contained in another substring has been removed.
(3) A step of creating a character component table that collects, without duplication, the characters used in the text.
(4) A character component table search step of decomposing a given search term at the character level and extracting only the documents that contain all of the characters making up the search term.
(5) A condensed text search step of referring to the condensed text of the documents extracted by the character component table search and extracting the documents that contain the given search term.
(6) A text search step, performed when the given search condition expression specifies a positional relationship in the text between multiple search terms, of referring to the text data of the documents extracted by the condensed text search and extracting only those that contain the given search terms and also satisfy the search conditions, such as the positional relationship, imposed between the search terms.

【０００６】[0006]

[Operation] By providing a hierarchical presearch means that narrows the candidates in stages, first by the character component table search and then by the condensed text search, and performs the text search last, documents that do not satisfy the given condition expression can be discarded before the text body is consulted, which reduces the amount of text body that has to be scanned. That is, the overall search processing time can be shortened by reducing the text search processing time, which accounts for a large share of it.

For example, suppose a condition expression is given that specifies even the positional relationship of two search terms within the text, such as: find the documents in which "画像" ("image") and "処理" ("processing") appear in the same sentence. With the conventional method, which refers to the text directly, and assuming a search speed of 2 MB/s, searching all 500 MB of full text takes 250 seconds, that is, about 4 minutes. With hierarchical presearch, assume a typical case in which the character component table narrows the candidates to 10% of all documents in the database and the condensed text narrows them to a further 10% of those. If the condensed text is 30% of the size of the text, and since the character component table is negligibly small relative to the whole database, the condensed text to be searched amounts to 15 MB and the text data to be searched amounts to 1% of the whole database, that is, 5 MB. Even at a search speed of 2 MB/s, the search therefore completes in 10 seconds.

In this way, the "hierarchical presearch scheme" performs two presearch stages in advance, the "character component table" and the "condensed text", which sieve the documents at the "character level" and the "word level" respectively, and so minimizes the number of documents subject to the text search, the most time-consuming stage. Because this reduces the volume of documents scanned, an equivalently very fast full-text search is achieved.

Furthermore, when the condition expression is a single search term, or an AND, OR, or NOT condition over multiple search terms, the result of the condensed text search can be used as the final search result as it is, because a word present in the condensed text is always present in the text and the text need not be searched again. For such "word-level" searches the time-consuming text search can be omitted entirely, which shortens the overall search time further. With the full-text search method consisting of the above processing steps, the amount of text that is searched directly can be reduced in advance, so high-speed full-text search becomes possible.
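The figures in this example can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the variable names are illustrative, and adding the two scanned volumes together is an assumption about how the 10-second figure is reached.

```python
# Figures taken from the example in the text; names are illustrative only.
SCAN_RATE_MB_PER_S = 2        # scan speed assumed in the text
DATABASE_MB = 500             # size of the full text
CONDENSED_PERCENT = 30        # condensed text size as a percentage of the text
TABLE_PASS_PERCENT = 10       # documents left after the character component table
CONDENSED_PASS_PERCENT = 10   # of those, documents left after the condensed text

# Conventional method: scan the whole text.
direct_seconds = DATABASE_MB / SCAN_RATE_MB_PER_S

# Hierarchical presearch: scan the condensed text of the documents that pass
# the table, then the text of the documents that also pass the condensed text.
condensed_mb = DATABASE_MB * CONDENSED_PERCENT / 100 * TABLE_PASS_PERCENT / 100
text_mb = DATABASE_MB * TABLE_PASS_PERCENT / 100 * CONDENSED_PASS_PERCENT / 100
presearch_seconds = (condensed_mb + text_mb) / SCAN_RATE_MB_PER_S

print(direct_seconds, condensed_mb, text_mb, presearch_seconds)
```

Under these assumptions the scanned volume falls from 500 MB to 20 MB, which is where the 25-fold reduction in search time comes from.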

【０００７】[0007]

[Embodiments] A first embodiment of the present invention is described below with reference to FIG. 1. The apparatus comprises a display 100, a keyboard 101, a central processing unit (CPU) 102, a file 110 for storing a character component table 105, a condensed text 104, and a text 103, a floppy disk drive 106, and a main memory 200.

【０００８】[0008]

The main memory 200 holds a text registration program 201, a condensed text creation and registration program 202, a character component table creation and registration program 203, a character component table search program 204, a condensed text search program 205, a text search program 206, and a hierarchical presearch control program 207, and a data area 208 is also reserved in it. These programs are executed by the CPU 102. When a document is registered, in response to a command entered from the keyboard 101, the CPU 102 reads the document data from a floppy disk inserted in the floppy disk drive 106, executes the text registration program 201, and stores the document data in the file 110 as the text 103. Next, the CPU 102 executes the condensed text creation and registration program 202, which divides the text 103 into substrings at the word level, examines the containment relationships among the substrings, creates a condensed text consisting of the set of substrings that remains after every string contained in another substring has been removed, and stores it in the file 110 as the condensed text 104. Finally, the CPU 102 executes the character component table creation and registration program 203, which creates a character component table collecting, without duplication, the characters used in the text 103, and stores it in the file 110 as the character component table 105.

【０００９】[0009]

At search time, the search condition expression entered from the keyboard 101 is sent to the CPU 102. The CPU 102 first executes the hierarchical presearch control program 207 and, under its control, executes the character component table search program 204, the condensed text search program 205, and the text search program 206 in sequence. In the character component table search, each search term in the search condition expression is decomposed at the character level, and only the documents that contain all of the characters making up the search term are extracted. The condensed text of the documents extracted by the character component table search is then consulted, and the documents that contain the given search term are extracted. If the search condition expression specifies only a single search term, or only logical relationships between multiple search terms, and no positional relationship within the text, the search ends here and the result of the condensed text search is output as the search result. Otherwise, that is, when the expression specifies a positional relationship in the text between multiple search terms, the text data of the documents extracted by the condensed text search is consulted, and only the documents that contain the given search terms and also satisfy the search conditions, such as the positional relationship, imposed between them are extracted and output as the search result. This concludes the outline of the full-text search apparatus of the first embodiment of the present invention.
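The control flow described in this paragraph can be sketched as follows. This is only an illustration: the function name `hierarchical_presearch`, its parameters, and the toy stage functions are assumptions, and the three stages are passed in as callables rather than implemented.

```python
from typing import Callable, List

Stage = Callable[[List[int]], List[int]]  # document numbers in, document numbers out

def hierarchical_presearch(all_docs: List[int],
                           table_search: Stage,
                           condensed_search: Stage,
                           text_search: Stage,
                           has_positional_condition: bool) -> List[int]:
    """Run the stages in order, skipping the text search when it is not needed."""
    candidates = table_search(all_docs)          # character-level sieve
    candidates = condensed_search(candidates)    # word-level sieve
    if not has_positional_condition:
        return candidates                        # logical conditions only: done
    return text_search(candidates)               # check positions in the text

# Toy stages that record when they are called (hypothetical data).
calls = []

def toy_table(docs):
    calls.append("table")
    return [d for d in docs if d != 3]

def toy_condensed(docs):
    calls.append("condensed")
    return [d for d in docs if d in (1, 4)]

def toy_text(docs):
    calls.append("text")
    return [d for d in docs if d == 4]

logical_only = hierarchical_presearch([1, 2, 3, 4], toy_table, toy_condensed, toy_text, False)
positional = hierarchical_presearch([1, 2, 3, 4], toy_table, toy_condensed, toy_text, True)
```

In the first call the text stage is never invoked, which mirrors the early termination described for conditions without a positional relationship.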

【００１０】[0010]

The following outlines the registration and search methods of the hierarchical presearch scheme, the feature of the present invention, which narrows the candidates through the character component table search, the condensed text search, and the text search. First, the "condensed text" and the "character component table" are created automatically when a document is registered. FIG. 2 shows this processing. When a document to be registered is input, it is first stored as is as the "text". Next, the "condensed text" is created from the "text". The "condensed text" is created by dividing the "text" into character strings by character type (kanji, hiragana, katakana, alphabetic characters, and so on) and eliminating the duplication of words that appear repeatedly. For document 1, whose text is "あいまい検索のための検索技術・・・・・" ("search technology for fuzzy search ..."), "検索" ("search") is discarded as a duplicate word, and "あいまい" ("fuzzy"), "検索技術" ("search technology"), and "のための" ("for") remain as the "condensed text". The "character component table" is also created from the "text". Here, the presence of each character in the "text" is represented by one bit of information. In the example of document 1, '1' is set for 'あ' and for 'い' because both occur, and '0' is set for 'う' because it does not. '1' is likewise set for '検' and for '索'. In the same way, the entry for each character in the character component table is set to '1' if the character occurs in the "text" and to '0' if it does not. The "condensed text" and the "character component table" are thus created automatically at registration time, in preparation for the hierarchical presearch.

【００１１】[0011]

At search time, as shown in FIG. 3, these auxiliary files are consulted in the reverse of the order in which they were registered. First, the character component table search consults the character component table and selects the documents for which '1' is set in the entries corresponding to all of the characters in the search term. Second, the condensed text search consults the condensed text of the documents selected by the character component table search and selects the documents that contain the search term given in the condition expression. Finally, the text search selects only the documents in which the positions at which the search terms occur in the text satisfy the condition expression. The example in the figure shows a search with the condition expression

検索[4C]理解

that is, find the documents in which "検索" ("search") and "理解" ("understanding") appear within four characters of each other in the text. As a result, document 4, in which "検索" and "理解" are four characters apart in the text, is extracted.
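A proximity condition of this kind can be checked in the text search stage roughly as follows. The text does not define exactly how the distance is counted, so this sketch assumes that it is the number of characters between the end of one term and the start of the other; the function names and sample sentences are hypothetical.

```python
def find_all(text: str, term: str) -> list:
    """Return every start position of term in text."""
    positions = []
    start = text.find(term)
    while start != -1:
        positions.append(start)
        start = text.find(term, start + 1)
    return positions

def within_chars(text: str, a: str, b: str, max_gap: int) -> bool:
    """True if some occurrence of a and of b are separated by at most max_gap characters."""
    for pa in find_all(text, a):
        for pb in find_all(text, b):
            if pa < pb:
                gap = pb - (pa + len(a))   # characters between a and b
            else:
                gap = pa - (pb + len(b))   # characters between b and a
            if 0 <= gap <= max_gap:        # negative gap means the terms overlap
                return True
    return False

near = within_chars("検索文の意味理解", "検索", "理解", 4)                  # 4 characters apart
far = within_chars("検索システムにおける自然言語の理解", "検索", "理解", 4)  # 13 characters apart
```

Because this check needs character positions, it can only be done on the text itself, which is why positional conditions require the final stage.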

【００１２】[0012]

The following describes in detail how the character-type-division, duplicate-word-elimination condensed text and the character-code-dependent character component table used in this embodiment are created, and how the hierarchical presearch that uses them is controlled. First, the method of creating the condensed text is described. As shown in FIG. 4, the text is first divided into character strings by character type. The character types here are kanji, hiragana, katakana, alphabetic characters, digits, and symbols and others. The text is divided into strings each consisting of a run of characters of a single type. Next, each substring that is wholly contained in another substring of the same document is excluded from the condensed text as a duplicate string. For example, the substring "検索" ("search") is completely contained in another substring of the same document, "知的検索技術" ("intelligent search technology"), so this "検索" is not registered in the condensed text. In the condensed text search, however, the string "検索" still produces a hit as a substring of "知的検索技術" even though it is not itself registered in the condensed text. In the substrings obtained by eliminating duplicate registration in this way, a separator is inserted between the strings for each document, as shown in FIG. 5. In that figure the symbol ',' is used as the separator. In FIGS. 2 and 3 the separator is shown as the symbol '|', but the separator need not be represented by a character; a special code not assigned to any character can also be used.
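The creation of the condensed text can be sketched as below. This is a minimal illustration, not the patented implementation: the Unicode ranges used to classify character types, the function names, and the '|' separator are assumptions chosen for the example.

```python
from itertools import groupby

def char_type(ch: str) -> str:
    """Classify a character by type using approximate Unicode ranges (an assumption)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:   # includes the long-vowel mark
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    return "other"

def condense(text: str, separator: str = "|") -> str:
    """Split text into single-type runs and drop runs contained in another run."""
    runs = ["".join(group) for _, group in groupby(text, key=char_type)]
    unique = list(dict.fromkeys(runs))          # drop exact repeats, keep order
    kept = [s for s in unique
            if not any(s != t and s in t for t in unique)]
    return separator.join(kept)

condensed = condense("あいまい検索のための検索技術")
print(condensed)
```

For the sample sentence the run "検索" is dropped because it is contained in "検索技術", yet a substring scan of the condensed text for "検索" still succeeds.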

【００１３】[0013]

Next, the method of creating the character-code-dependent character component table used in this embodiment is described. As shown in FIG. 6, a character-code-dependent character component table is one in which the position of the bit that is set to 1 to indicate the presence of a character is determined by the character code. The figure uses Shift JIS code as an example. In the figure, (XXXX)H denotes a character code in hexadecimal. For example, to indicate that the string "検索" is present in the text of document 1, 1 is set at positions (8C9F)H and (8DF5)H of the bit list of document 1. The bit position in the bit list that corresponds to a character is called the entry number of the character component table. For example, the entry number of '検' is (8C9F)H, or 35999 in decimal.
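The mapping from a character to its entry number can be reproduced with Python's built-in `shift_jis` codec. The sketch below assumes a fixed 65536-bit list per document, which is one possible layout rather than the one in the figure; the function names are hypothetical.

```python
TABLE_BITS = 0x10000  # one bit per possible 16-bit code value (an assumption)

def entry_number(ch: str) -> int:
    """Return the Shift JIS code of a character as an integer entry number."""
    return int.from_bytes(ch.encode("shift_jis"), "big")

def build_bit_list(text: str) -> bytearray:
    """Set the bit at each entry number whose character occurs in the text."""
    bits = bytearray(TABLE_BITS // 8)
    for ch in set(text):
        n = entry_number(ch)
        bits[n >> 3] |= 1 << (n & 7)
    return bits

def contains_char(bits: bytearray, ch: str) -> bool:
    """Test the bit for a character."""
    n = entry_number(ch)
    return bool(bits[n >> 3] & (1 << (n & 7)))

doc1 = build_bit_list("あいまい検索のための検索技術")
print(hex(entry_number("検")), entry_number("検"))
```

Characters that cannot be encoded in Shift JIS raise `UnicodeEncodeError` in this sketch; a real system would need a policy for them.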

【００１４】[0014]

The control and search operation of the hierarchical presearch using the above character component table and condensed text are now described. First, each search term in the search condition expression is decomposed into single characters and the character component table search is performed. This finds the documents whose bit lists have 1 at every entry number corresponding to the character codes that make up the given search term. For example, when the string "検索" is given as the search term, documents 1, 2, 3, 4, ..., for which the bits at positions (8C9F)H and (8DF5)H corresponding to '検' and '索' are all 1, are the result of the character component table search. That is, as shown in FIG. 7, a bitwise AND is taken between the bit list 701 for entry number (8C9F)H, representing '検', and the bit list 702 for entry number (8DF5)H, representing '索', giving the bit AND result 703. The document numbers corresponding to the bit positions that are 1 in the bit AND result 703 are the hit documents of the character component table search. In other words, the documents that contain both '検' and '索' are extracted. When the search term consists of only one character, such as "湖" ("lake"), the result of the character component table search can be output at this point and the search can end.
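The bitwise AND between the two bit lists can be illustrated with plain Python lists, one element per document. The sample documents are hypothetical and the names echo the reference numerals 701 to 703 only for readability.

```python
def bit_list(docs: list, ch: str) -> list:
    """One bit per document: 1 if the character occurs in that document."""
    return [1 if ch in text else 0 for text in docs]

def bit_and(a: list, b: list) -> list:
    """Element-wise AND of two bit lists of equal length."""
    return [x & y for x, y in zip(a, b)]

# Hypothetical documents, numbered 1 to 4 in list order.
docs = [
    "あいまい検索のための検索技術",
    "検出と探索",
    "点検の記録",
    "検索と理解",
]

bits_701 = bit_list(docs, "検")
bits_702 = bit_list(docs, "索")
bits_703 = bit_and(bits_701, bits_702)
hits = [number for number, bit in enumerate(bits_703, start=1) if bit]
```

Document 2 is a hit at this stage even though it does not contain the word "検索"; the character-level sieve only guarantees that both characters occur somewhere.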

【００１５】[0015]

Next, the condensed text of the documents extracted by the character component table search is searched. Here the condensed text registered for each document, as in FIG. 5, is scanned to extract the documents that contain the given search term as a word. That is, only the documents in which the two characters '検' and '索' appear consecutively as "検索" are extracted. Documents that contain '検' and '索' only in separate words, such as "検出" ("detection") and "探索" ("exploration"), are discarded at this stage. To do this, the condensed text of each document that survived the character component table search is scanned one character at a time, in the same way as text data. Only the condensed text of the document numbers obtained from the character component table search is scanned. For example, if the result of the character component table search is document numbers 1, 2, 3, 4, ..., the condensed text search scans the condensed text of documents 1, 2, 3, 4, .... The documents whose condensed text actually contains the search term are then output as the result of the condensed text search.
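A sketch of this stage is shown below. The condensed texts are hand-written sample data using '|' as the separator, and the function name is hypothetical; Python's `in` operator stands in for the character-by-character scan.

```python
def condensed_text_search(condensed_by_doc: dict, candidates: list, term: str) -> list:
    """Scan only the condensed text of candidate documents for the term."""
    hits = []
    for doc_no in candidates:                    # documents that passed the table search
        if term in condensed_by_doc[doc_no]:     # substring scan of the condensed text
            hits.append(doc_no)
    return hits

# Hypothetical condensed texts, keyed by document number.
condensed_by_doc = {
    1: "あいまい|のための|検索技術",
    2: "検出|と|探索",
    3: "点検|の|記録",
    4: "検索|と|理解",
}

candidates = [1, 2, 4]   # assumed result of the character component table search
word_hits = condensed_text_search(condensed_by_doc, candidates, "検索")
```

Document 2 is dropped here because '検' and '索' occur only in different words, and document 3 is never scanned because it was already excluded.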

【００１６】[0016]

In this way, the "hierarchical presearch scheme" performs two presearch stages in advance, the "character component table" and the "condensed text", which sieve the documents at the "character level" and the "word level" respectively, and so minimizes the number of documents subject to the text search, the most time-consuming stage. Because this reduces the volume of documents scanned, an equivalently very fast full-text search is achieved. Specifically, because the character component table represents the presence of a character by one bit of information, the volume of data searched in the character component table search is extremely small, and the search time is correspondingly short. Moreover, taking the logical AND of the bit lists of the characters that make up the keyword discards most of the documents unrelated to the keyword and greatly narrows the set of documents passed to later stages. In the condensed text search, the search time is shortened to the extent that there is less data than when the text is scanned directly.

【００１７】[0017]

Next, a second embodiment of the present invention is described. This embodiment provides a full-text search method that performs hierarchical presearch efficiently even when multiple search terms are specified. For example, when the condition expression

"検索" AND "理解"

is given, the character component table is searched as the first step. Here, for each given search term, the documents containing all of its characters are found, and then the documents that satisfy the condition imposed between the search terms are output. For the condition expression above, the documents that contain the two characters of "検索" ("search") and also the two characters of "理解" ("understanding") are sought. That is,

('検' AND '索') AND ('理' AND '解')

and therefore

'検' AND '索' AND '理' AND '解'

In other words, the documents that contain all four of these characters are retrieved. Next, the condensed text of the documents narrowed down by this character component table search is searched. Here only the documents in which the specified keywords appear as words are extracted; that is, the documents that contain both "検索" and "理解" are retrieved.
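The two stages for an AND condition can be sketched as follows. For brevity this example uses Python sets in place of bit lists, which is a substitution for illustration only; the documents, their condensed texts, and the function names are hypothetical.

```python
def char_stage(docs: dict, terms: list) -> list:
    """Keep documents containing every character of every term (flattened AND)."""
    needed = set("".join(terms))
    return [no for no, text in docs.items() if needed <= set(text)]

def word_stage(condensed: dict, candidates: list, terms: list) -> list:
    """Keep candidates whose condensed text contains every term as a word."""
    return [no for no in candidates
            if all(term in condensed[no] for term in terms)]

# Hypothetical documents and hand-written condensed texts ('|' as separator).
docs = {
    1: "検索と理解",
    2: "検出と探索の理解",
    3: "検索技術",
    4: "理由の解明と検索",
}
condensed = {
    1: "検索|と|理解",
    2: "検出|と|探索|の|理解",
    3: "検索技術",
    4: "理由|の|解明|と|検索",
}

terms = ["検索", "理解"]
after_chars = char_stage(docs, terms)
after_words = word_stage(condensed, after_chars, terms)
```

Documents 2 and 4 contain all four characters but not both words, so they pass the first stage and are removed by the second.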

【００１８】[0018]

As in this example, when the relationships between the search terms are only logical conditions such as AND and OR, and no condition on the positional relationship between the keywords is specified, the search ends here and the result of the condensed text search is output as the final search result. If a positional condition is specified, the text of the documents extracted by the condensed text search is searched, the documents that satisfy the specified condition are extracted, and these are output as the final search result. This concludes the description of the search operation in this embodiment. By taking the logical AND between search terms in the character component table search and the condensed text search in this way, hierarchical presearch is performed efficiently and high-speed full-text search is achieved even when multiple search terms are specified.

【００１９】これより第三の実施例として，さらに一般
的に階層型プリサーチの検索制御について詳細に説明す
る。図８にこのときの階層型プリサーチの制御の手順を
ＰＡＤ図にて説明する。ここでは「“計算機”と“知的
インタフェース”のどちらかを含む文書を探せ」すなわ
ち「“計算機”ＯＲ“知的インタフェース”」という検索式を例にあげて説明する。まず，最初にステ
ップ８０００で文字成分表サーチを行う。ここでは与え
られた検索ターム毎にそのすべての文字を含む文書を探
し，その後検索ターム間に与えられた複合条件を満たす
ような文書を出力する。この例では，図９に示すように
“計算機”を構成する３個の文字のそれぞれについて文
字成分表の該当するエントリ番号間のビットＡＮＤ演算
を行い，次に同様に“知的インタフェース”を構成する
９個の文字のそれぞれについて文字成分表の該当するエ
ントリ番号間のビットＡＮＤ演算を行い，最後に先に作
成した“計算機”に対するのときのビットＡＮＤ演算結
果とそのビット列のＯＲ演算を行う。すなわち，「（‘計’ＡＮＤ‘算’ＡＮＤ‘機’）ＯＲ（‘知’ＡＮＤ‘的’ＡＮＤ‘イ’ＡＮＤ ‘ン’ＡＮＤ‘タ’ＡＮＤ‘フ’ＡＮＤ ‘ェ’ＡＮＤ‘ー’ＡＮＤ‘ス’）」という検索式を実行することになる。これにより，“計
算機”を構成する３個の文字をすべて含む文書，もしく
は“知的インタフェース”を構成する９個の文字をすべ
て含む文書が抽出される。以上の文字成分表サーチの結
果件数が０件であれば，第８図に示すようにここで０件
という検索結果を出力して検索を終了する。また，
‘湖’のように検索タームがただ１個の文字から構成さ
れる場合も，ここで文字成分表サーチの結果を出力して
検索を終了する。As a third embodiment, the search control of the hierarchical pre-search is described in more general terms and in detail. FIG. 8 shows the control procedure of the hierarchical pre-search as a PAD diagram. The description uses the example query "find documents containing either “計算機” (computer) or “知的インタフェース” (intelligent interface)", that is, the search expression 「“計算機” OR “知的インタフェース”」. First, a character component table search is performed in step 8000. For each given search term, the documents containing all of its characters are found, and then the documents satisfying the compound condition given between the search terms are output. In this example, as shown in FIG. 9, the character component table entries corresponding to the three characters of “計算機” are combined by a bitwise AND; likewise, the entries corresponding to the nine characters of “知的インタフェース” are combined by a bitwise AND; finally, the two resulting bit strings are combined by a bitwise OR. In other words, the search expression "(‘計’ AND ‘算’ AND ‘機’) OR (‘知’ AND ‘的’ AND ‘イ’ AND ‘ン’ AND ‘タ’ AND ‘フ’ AND ‘ェ’ AND ‘ー’ AND ‘ス’)" is executed. This extracts the documents that contain all three characters of “計算機” or all nine characters of “知的インタフェース”. If the character component table search yields zero hits, a result of zero hits is output at this point and the search ends, as shown in FIG. 8. Likewise, when a search term consists of a single character, such as ‘湖’ (lake), the result of the character component table search is output at this point and the search ends.
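
The bitwise AND/OR combination described in this paragraph can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names (`build_table`, `term_mask`, `table_search_or`) and the sample documents are hypothetical, and the table is stored here as one bit mask per character (bit i set when document i contains the character), which is an assumed layout chosen for brevity.

```python
from functools import reduce

def build_table(docs):
    """Map each character to a bit mask; bit i is set when docs[i] contains it."""
    table = {}
    for i, text in enumerate(docs):
        for ch in set(text):
            table[ch] = table.get(ch, 0) | (1 << i)
    return table

def term_mask(table, term, all_docs):
    """AND the bit masks of every character of one search term."""
    return reduce(lambda acc, ch: acc & table.get(ch, 0), term, all_docs)

def table_search_or(table, terms, n_docs):
    """OR the per-term masks and return the indices of candidate documents."""
    all_docs = (1 << n_docs) - 1
    mask = 0
    for term in terms:
        mask |= term_mask(table, term, all_docs)
    return [i for i in range(n_docs) if mask >> i & 1]

# Hypothetical sample documents.
docs = [
    "計算機による文書検索",      # contains the word 計算機
    "知的インタフェースの研究",  # contains all nine characters of the second term
    "機械の計画と予算",          # contains 計, 算, 機 but not the word 計算機
    "湖の水質調査",              # contains neither term
]
table = build_table(docs)
candidates = table_search_or(table, ["計算機", "知的インタフェース"], len(docs))
print(candidates)  # [0, 1, 2]
```

Document 2 is a false candidate: it holds the three characters but not the word, which is why the later condensed text search is needed to confirm the hit.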

【００２０】もし，検索タームが複数の文字で構成され
ていて，かつ文字成分表サーチの結果件数が０件でなけ
れば，次に凝縮本文サーチを行う。凝縮本文に登録され
ている内容は，文字種ごとに分割された文字列である。
例えば，“知的インタフェース”のように，途中で文字
種が異なれば凝縮本文では部分文字列へ分解され，「知
的，インタフェース」のように分割点にセパレータが入
る。したがって，“知的インタフェース”のように異な
る文字種から構成される検索タームの場合，このままで
は凝縮本文をサーチしても該当する文字列が存在しない
ことになる。そこで，凝縮本文サーチに入る前に検索タ
ームをチェックし，異なる文字種で構成される検索ター
ムはこれを文字種毎に分割する。このように文字種で分
割するという処理を施した検索タームを元々の検索ター
ムと区別して，分割検索タームと呼ぶ。そして凝縮本文
サーチは，例えば“計算機”，“知的”，“インタフェ
ース”のように分割検索タームで検索する。ただし，分
割検索タームに関しては，分割元を同じくするターム間
でＡＮＤ条件で検索を行う。例えば，「“計算機”ＯＲ“知的インタフェース”」という条件式の場合，凝縮本文サーチでは「“計算機”ＯＲ（“知的”ＡＮＤ“インタフェース”）」すなわち，「“知的”と“インタフェース”が同一文書
内に存在するか，または“計算機”が存在する文書を探
せ」という条件式として検索を行うことになる。If a search term consists of more than one character and the character component table search returns at least one hit, a condensed text search is performed next. The condensed text stores character strings that have been split by character type. A string whose character type changes partway through, such as “知的インタフェース” (intelligent interface), is decomposed into substrings in the condensed text, with a separator inserted at the split point, as in "知的，インタフェース". Consequently, if a search term composed of different character types, such as “知的インタフェース”, were used as is, no matching string would be found in the condensed text. The search terms are therefore checked before the condensed text search, and any term composed of different character types is split by character type. A search term that has been split in this way is called a split search term, to distinguish it from the original search term. The condensed text search is then performed with the split search terms, for example “計算機”, “知的”, and “インタフェース”. Split search terms derived from the same original term are combined with an AND condition. For example, for the conditional expression 「“計算機” OR “知的インタフェース”」, the condensed text search is performed with 「“計算機” OR (“知的” AND “インタフェース”)」, that is, "find documents in which “知的” and “インタフェース” both occur, or in which “計算機” occurs".
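
The splitting of search terms by character type can be sketched as below. This is only an illustration under stated assumptions: the patent classifies characters by their character code, whereas this sketch uses Unicode block ranges; the names `char_type`, `split_by_char_type`, `condensed_query`, and `matches` are hypothetical, and matching by substring within a condensed word is also an assumption.

```python
from itertools import groupby

def char_type(ch):
    """Classify one character (assumption: Unicode ranges stand in for the code ranges)."""
    code = ord(ch)
    if 0x3041 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"   # includes the long-vowel mark ー
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    return "symbol"

def split_by_char_type(term):
    """Split a search term wherever the character type changes."""
    return ["".join(group) for _, group in groupby(term, key=char_type)]

def condensed_query(terms):
    """OR-combined terms; each term becomes an AND group of split search terms."""
    return [split_by_char_type(t) for t in terms]

def matches(condensed_words, query):
    """True if, for some OR branch, every split term occurs in some condensed word."""
    return any(
        all(any(part in word for word in condensed_words) for part in group)
        for group in query
    )

query = condensed_query(["計算機", "知的インタフェース"])
print(query)  # [['計算機'], ['知的', 'インタフェース']]
print(matches(["知的", "インタフェース", "研究"], query))  # True
```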

【００２１】凝縮本文サーチの結果が０件であれば，こ
こで０件という検索結果を出力して検索を終了する。
また近傍条件，または文脈条件の指定の有る場合，ある
いは“知的インタフェース”のような分割される検索タ
ームがある場合，つまり検索タームと分割検索タームが
異なる場合に限り本文サーチを行う。そうでない場合，
ここで階層型プリサーチを終了し凝縮本文の結果を検索
結果として出力する。ここで，文脈条件とは例えば，「“計算機”［Ｓ］“知的インタフェース”」のように示される条件式でこれは，「“計算機”と“知
的インタフェース”が同一の文（センテンス）内にある
ものを探せ」という意味を表す。あるいは近傍条件と
は，例えば，「“計算機” ［１０Ｃ］ “知的インタフェース”」のように記述されるもので，これは，「“計算機”と
“知的インタフェース”が１０文字以内に近接して現れ
る文書を探せ」という意味を表す。すなわち，文脈条
件，近傍条件とも文書中に出現する検索タームの位置関
係を指定する検索条件のことである。If the condensed text search yields zero hits, a result of zero hits is output at this point and the search ends.
A text search is performed only when a proximity condition or a context condition is specified, or when there is a search term that is split, such as “知的インタフェース”, that is, when the search terms and the split search terms differ. Otherwise,
the hierarchical pre-search ends here and the result of the condensed text search is output as the search result. A context condition is a conditional expression such as 「“計算機” [S] “知的インタフェース”」, which means "find documents in which “計算機” and “知的インタフェース” occur in the same sentence". A proximity condition is written, for example, as 「“計算機” [10C] “知的インタフェース”」, which means "find documents in which “計算機” and “知的インタフェース” appear within 10 characters of each other". Both context conditions and proximity conditions are thus search conditions that specify the positional relationship of search terms appearing in a document.
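
The two positional conditions can be illustrated with a small check on the document text. This is a sketch under assumptions the patent does not state: the distance for `[10C]` is taken to be the number of characters between the end of one term and the start of the other, sentences are assumed to end at "。", and the function names and sample texts are hypothetical.

```python
import re

def occurrences(text, term):
    """Start offsets of every occurrence of term in text."""
    return [m.start() for m in re.finditer(re.escape(term), text)]

def within_chars(text, a, b, limit):
    """Proximity condition, e.g. [10C]: a and b separated by at most `limit` characters."""
    for i in occurrences(text, a):
        for j in occurrences(text, b):
            gap = j - (i + len(a)) if i <= j else i - (j + len(b))
            if 0 <= gap <= limit:
                return True
    return False

def same_sentence(text, a, b):
    """Context condition [S]: a and b occur in the same sentence."""
    return any(a in s and b in s for s in text.split("。"))

text1 = "計算機と人間をつなぐ知的インタフェースを研究する。"
text2 = "計算機を導入した。知的インタフェースは未対応である。"

print(within_chars(text1, "計算機", "知的インタフェース", 10))  # True
print(same_sentence(text2, "計算機", "知的インタフェース"))     # False
```

The second text satisfies the proximity condition but not the context condition, because the two terms are close together yet sit in different sentences.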

【００２２】このような本文中に現れる検索タームの位
置関係を指定した検索条件が与えられた場合，もしくは
凝縮本文中ではセパレータで区切られた途中で文字種の
変わる検索タームが与えられた場合には，凝縮本文サー
チの結果に対応する本文データを参照し，与えられた条
件通りに本文中に検索タームが出現するもののみを検索
結果として出力し，検索を終了することになる。このよ
うに，検索タームが異なる文字種で構成されている場
合，或いは検索ターム間の本文中での出現位置に関する
条件指定がある場合についても，効率的に階層型プリサ
ーチを行い，高速なフルテキストサーチを実現すること
ができる。When a search condition specifying the positional relationship of search terms in the text is given, or when a search term is given whose character type changes partway through (and which is therefore divided by a separator in the condensed text), the text data corresponding to the result of the condensed text search is consulted, only the documents in which the search terms appear in the text exactly as specified are output as the search result, and the search ends. In this way, the hierarchical pre-search is performed efficiently and a high-speed full-text search is realized even when search terms are composed of different character types or when conditions on the positions of the search terms in the text are specified.
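
The control flow of the third embodiment (paragraphs 0019 to 0022) can be summarized in a short driver. This is a sketch only: the three stage functions are passed in as parameters and stubbed out here, every name is hypothetical, and the single-character shortcut is assumed to apply when every search term is one character long.

```python
def hierarchical_presearch(terms, positional, split, table_search,
                           condensed_search, text_search):
    """Run the three stages in order, stopping as early as the procedure allows."""
    hits = table_search(terms)
    # Stop on zero hits, or when every term is a single character.
    if not hits or all(len(t) == 1 for t in terms):
        return hits
    split_terms = [split(t) for t in terms]
    hits = condensed_search(split_terms, hits)
    if not hits:
        return hits
    # The text itself is scanned only for positional conditions or split terms.
    was_split = any(len(parts) > 1 for parts in split_terms)
    if positional is None and not was_split:
        return hits
    return text_search(terms, positional, hits)

# Stub stages (hypothetical) that record the order in which they are called.
calls = []

def table_stub(terms):
    calls.append("table")
    return [1, 2, 3]

def condensed_stub(split_terms, hits):
    calls.append("condensed")
    return hits[:2]

def text_stub(terms, positional, hits):
    calls.append("text")
    return hits[:1]

def no_split(term):
    return [term]

result = hierarchical_presearch(["計算機"], None, no_split,
                                table_stub, condensed_stub, text_stub)
print(result, calls)  # [1, 2] ['table', 'condensed']
```

With a positional condition such as `"[10C]"`, or with a split function that returns more than one part, the same driver goes on to call the text search stage.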

【００２３】次に，本発明の第四の実施例について説明
する。本実施例は，第一の実施例における文字成分表の
容量を削減し，コンパクトにしたものである。第一の実
施例で用いた文字コード依存型文字成分表は，処理が簡
単であるが，文字成分表の１文書あたりのビットリスト
が長いため文字成分表が大きくなるという問題がある。
また，該当する文字コードが存在しないのにエントリ番
号を割当てているためむだな部分が多いという問題があ
る。例えばシフトＪＩＳの場合，（００００）Ｈから
（８１４０）Ｈの間，及び（Ａ０００）Ｈから（Ｅ０４
０）Ｈの間，つまり０番目から３３０８７番目までと４
０９６０番目から５７４０８番目までのエントリ番号に
は該当する文字コードがない。それにもかかわらず，文
字コードによってエントリ番号を決定するためにこの部
分も全て表のエントリとして持っている必要がある。こ
のビットリスト中のむだな部分を排除するために一旦文
字コードを変換し，ビット位置を０番目からすきまなく
使用できるように文字成分表を作成する。この文字コー
ド変換型文字成分表を用いた実施例の詳細について以下
説明する。Next, a fourth embodiment of the present invention is described. This embodiment reduces the size of the character component table of the first embodiment, making it more compact. The character-code-dependent character component table used in the first embodiment is simple to process, but its bit list per document is long, so the table becomes large.
In addition, entry numbers are assigned even where no character code exists, so much of the table is wasted. In Shift JIS, for example, no character codes correspond to the entry numbers between (0000)H and (8140)H or between (A000)H and (E040)H, that is, entries 0 through 33087 and entries 40960 through 57408. Nevertheless, because the entry number is determined directly from the character code, all of these must be kept as table entries. To eliminate this waste in the bit list, the character code is first converted so that bit positions are used contiguously from position 0, and the character component table is built from the converted codes. An embodiment using this character-code-conversion character component table is described in detail below.

【００２４】文字コード変換型文字成分表を作成するた
めの文字コードへの変換式の例として次式をあげる。ま
た，対応するＰＡＤ図を図１０に示す。ｉｆＳＪＩＳ＜（Ａ０００）ＨｔｈｅｎＳＣＯＤＥ＝ＳＪＩＳ − （８０４０）ＨｅｌｓｅＳＣＯＤＥ＝ＳＪＩＳ − （Ｃ０４０）ＨＳＣＯＤＥ＝ＳＣＯＤＥ − （ＳＣＯＤＥ／２５
６）×６４・・・・・・・（４−１）式（但し、通常文字コードの小さい値の部分は制御コード
として用いることが多いために、本式では（８１４０）
Ｈとはせずに（８０４０）Ｈとして多少の余裕を持たせ
ている。また、（ＳＣＯＤＥ／２５６）の演算結果の小
数点以下は切り捨て、切り捨てた結果と６４との乗算を
行う。）式中でＳＪＩＳがもとのシフトＪＩＳコードを示し，Ｓ
ＣＯＤＥは変換後の文字コードを示す。ＫＥＩＳコード
や他のコード体系についてもシフトＪＩＳコードとの対
応がとれているので同様の式でＳＣＯＤＥへの変換が可
能である。（４−１）式は，文字コード表に表すと図１
１のような変換を意味している。すなわち，（０００
０）Ｈから（ＦＦＦＦ）Ｈまでの間に（８１４０）Ｈ〜（９ＦＦＣ）Ｈ及び（Ｅ０４０）
Ｈ〜（ＦＥＦＣ）Ｈと分散して配置されている文字コードを，（００００）
Ｈからすきまなく配置するように文字コードを変換す
ることになる。この（４−１）式を用いてコード変換す
ることにより，図１２に示すようにビットリストの長さ
を非常に短くすることができ，文字成分表の全体の容量
を小さくすることができる。The following is an example of a conversion formula for creating the character-code-conversion character component table. The corresponding PAD diagram is shown in FIG. 10.
  if SJIS < (A000)H then SCODE = SJIS - (8040)H
  else SCODE = SJIS - (C040)H
  SCODE = SCODE - (SCODE / 256) × 64 ... (4-1)
(Because the low values of a character code are usually used as control codes, this formula subtracts (8040)H rather than (8140)H to leave some margin. The fractional part of (SCODE / 256) is discarded, and the truncated result is multiplied by 64.) Here SJIS denotes the original Shift JIS code and SCODE denotes the converted character code. Because KEIS codes and other code systems have a correspondence with Shift JIS codes, they can be converted to SCODE with a similar formula. In terms of the character code table, formula (4-1) performs the conversion shown in FIG. 11. That is, the character codes scattered between (0000)H and (FFFF)H, in the ranges (8140)H to (9FFC)H and (E040)H to (FEFC)H, are converted so that they are arranged contiguously starting from (0000)H. Converting codes with formula (4-1) makes the bit list very short, as shown in FIG. 12, and reduces the overall size of the character component table.
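
Formula (4-1) translates almost directly into code. The following is a minimal sketch; the function names are hypothetical, and the helper `char_to_scode` assumes the character has a two-byte Shift JIS encoding and relies on Python's built-in `shift_jis` codec.

```python
def sjis_to_scode(sjis):
    """Formula (4-1): pack two-byte Shift JIS codes into a dense range."""
    if sjis < 0xA000:
        scode = sjis - 0x8040
    else:
        scode = sjis - 0xC040
    # Integer division discards the fractional part of SCODE / 256.
    return scode - (scode // 256) * 64

def char_to_scode(ch):
    """Convert one character via its two-byte Shift JIS code (assumption: two bytes)."""
    raw = ch.encode("shift_jis")
    if len(raw) != 2:
        raise ValueError("expected a two-byte Shift JIS character")
    return sjis_to_scode(int.from_bytes(raw, "big"))

print(hex(sjis_to_scode(0x9FFC)))  # 0x17fc, end of the first code range
print(hex(sjis_to_scode(0xE040)))  # 0x1800, start of the second code range
print(hex(sjis_to_scode(0xFEFC)))  # 0x2f3c, the largest converted code
print(hex(char_to_scode("検")))    # 0x95f
```

The second range starts right after the first one ends, and the largest converted code is (2F3C)H, so a bit list of roughly 12,000 positions suffices instead of 65,536.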

【００２５】階層型プリサーチの制御は，第一の実施例
と同じである。すなわち，図８の制御手順をそのまま使
用し，第１に検索ターム中の文字を使い文字成分表サー
チを行い，第２に検索タームを用いて凝縮本文サーチを
行う。文脈条件の指定がなければここで検索結果を出力
し，検索を終了する。文脈条件の指定があれば第３に本
文サーチを行いその結果を出力する。但し，文字成分表
サーチのときには入力された検索タームは全て（４−
１）式に基づいて文字コード変換を施して用いることに
なる。以上，文字コード変換型文字成分表を用いた第四
の実施例について説明した。本実施例によれば，文字コ
ードをコード変換し，ビット位置を０番目からすきまな
く並べた文字成分表を作成することにより，文字成分表
の文字の割り振られていないエントリを無くすことがで
き，文字成分表の容量を非常に小さくすることができ
る。The control of the hierarchical pre-search is the same as in the first embodiment. That is, the control procedure of FIG. 8 is used as is: first, a character component table search is performed using the characters in the search terms; second, a condensed text search is performed using the search terms. If no context condition is specified, the search result is output at this point and the search ends. If a context condition is specified, a text search is performed third and its result is output. In the character component table search, however, every character of the input search terms is first converted according to formula (4-1). This concludes the description of the fourth embodiment, which uses the character-code-conversion character component table. According to this embodiment, converting the character codes and building a character component table whose bit positions are packed contiguously from position 0 eliminates the entries to which no character is assigned, so the character component table can be made much smaller.

【００２６】次に，本発明の第五の実施例について説明
する。本実施例は，第四の実施例における文字成分表の
容量をハッシング手法を用いてさらに削減したものであ
る。第四の実施例の文字成分表の容量をさらに小さくす
るために，本実施例ではビットリスト中の一つのエント
リ番号に複数の文字を割り当てる。すなわち，ハッシュ
関数を用いて検索ターム中の文字とビットリスト中のビ
ット位置を対応付ける方法をとる。このハッシュ関数と
して例えば次の式を用いることができる。ｈ（ＳＣＯＤＥ）＝ｍｏｄ（ＳＣＯＤＥ，Ｎ）・・・・・・（５−１）式式中でＳＣＯＤＥは（４−１）式によってシフトＪＩＳ
から変換した文字コードである。ｍｏｄは第１引き数を
第２引き数で割った余りを出力する関数である。Ｎは任
意の整数値である。Ｎとして，例えば５１２を用いる
と，‘あ’はエントリ番号４８０，‘ま’はエントリ番
号１９３となる。Next, a fifth embodiment of the present invention is described. This embodiment further reduces the size of the character component table of the fourth embodiment by means of hashing. To make the table smaller still, this embodiment assigns multiple characters to a single entry number in the bit list. That is, a hash function maps each character in a search term to a bit position in the bit list. For example, the following can be used as the hash function:
  h(SCODE) = mod(SCODE, N) ... (5-1)
Here SCODE is the character code converted from Shift JIS by formula (4-1), mod is a function that returns the remainder of dividing its first argument by its second, and N is an arbitrary integer. With N = 512, for example, ‘あ’ is assigned entry number 480 and ‘ま’ entry number 193.
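
Registration and lookup with the hash function (5-1) can be sketched as follows. The names `entry_number`, `document_bits`, and `may_contain` are hypothetical, each document's bit list is held as a Python integer, and all characters are assumed to be two-byte Shift JIS characters.

```python
def sjis_to_scode(sjis):
    """Formula (4-1): pack two-byte Shift JIS codes into a dense range."""
    scode = sjis - 0x8040 if sjis < 0xA000 else sjis - 0xC040
    return scode - (scode // 256) * 64

def entry_number(ch, n=512):
    """Formula (5-1): h(SCODE) = mod(SCODE, N) for one two-byte character."""
    sjis = int.from_bytes(ch.encode("shift_jis"), "big")
    return sjis_to_scode(sjis) % n

def document_bits(text, n=512):
    """Build the N-bit list of one document: set the bit of every character it uses."""
    bits = 0
    for ch in text:
        bits |= 1 << entry_number(ch, n)
    return bits

def may_contain(bits, term, n=512):
    """True when the bit of every character of the term is set (false hits possible)."""
    return all(bits >> entry_number(ch, n) & 1 for ch in term)

print(entry_number("あ"), entry_number("い"))  # 480 482
bits = document_bits("あいさつ")
print(may_contain(bits, "あい"))  # True
```

A `True` result only marks the document as a candidate; because several characters share one entry, the condensed text still has to confirm the hit.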

【００２７】このようにして作成した文字成分表の例を
図１３に示す。この場合は，Ｎを５１２と設定したが，
１文書を登録するのに５１２ビットしか必要としないこ
とが分かる。検索時には，与えられた検索タームの各文
字について登録時と同じように，（５−１）式のハッシ
ュ関数を用いてエントリ番号を求め，これに対応する文
字成分表のビット位置を参照する。例えば，“あいま
い”という文字列の場合図１３のようにエントリ番号４
８０，４８２，１９３の位置のビットがすべて１の文書
を文字成分表サーチの検索結果とする。こうして文字成
分表サーチで求められた文書について，次にその凝縮本
文をサーチする。FIG. 13 shows an example of a character component table created in this way. Here N is set to 512, and it can be seen that only 512 bits are needed to register one document. At search time, the entry number of each character of a given search term is obtained with the hash function of formula (5-1), exactly as at registration time, and the corresponding bit positions of the character component table are examined. For the character string “あいまい” (ambiguous), for example,
the documents whose bits at entry numbers 480, 482, and 193 are all 1 form the result of the character component table search, as shown in FIG. 13. The condensed text of the documents found by the character component table search is then searched.

【００２８】以下，凝縮本文サーチ及び本文サーチの制
御手順について，図１４を用いて説明する。第一の実施
例では，文字成分表サーチの後検索タームが一文字から
なる場合には，文字成分表サーチの結果を検索結果とし
て出力して階層型プリサーチを終了していた。しかし，
この本実施例で用いた文字成分表の文字成分表サーチで
は，検索ノイズの生じる可能性があるために，凝縮本文
サーチまで階層型プリサーチを継続する必要がある。例
えば，ひらがなの‘は’（シフトＪＩＳコード（８２Ｃ
Ｄ）Ｈ）は，（５−１）式でエントリ番号１３である
が，漢字の‘艦’（シフトＪＩＳコード（８ＡＣＤ）
Ｈ）も同じエントリ番号１３となる。このことは，検索
タームとして“艦”が与えられた場合，“は”を含む文
書もすべて文字成分表サーチの結果として検索されてく
ることになる。したがってさらに，凝縮本文をスキャン
して実際に漢字の“艦”を含む文書を抽出し，これを検
索結果として出力することになる。以上，第五の実施例
について説明した。本実施例ではハッシュ関数を使っ
て，文字成分表の１エントリに複数個の文字を割り当て
ることにより，文字成分表の容量を格段に小さくできる
という効果が得られる。The control procedure of the condensed text search and the text search is described below with reference to FIG. 14. In the first embodiment, when a search term consisted of a single character, the result of the character component table search was output as the search result and the hierarchical pre-search ended. With the character component table used in this embodiment, however,
the character component table search may produce search noise (false hits), so the hierarchical pre-search must continue through the condensed text search. For example, hiragana ‘は’ (Shift JIS code (82CD)H) maps to entry number 13 under formula (5-1), but so does the kanji ‘艦’ (Shift JIS code (8ACD)H).
Consequently, when “艦” is given as a search term, every document containing “は” is also returned by the character component table search. The condensed text is therefore scanned as well, and only the documents that actually contain the kanji “艦” are extracted and output as the search result. This concludes the description of the fifth embodiment. In this embodiment, using a hash function to assign multiple characters to one entry of the character component table makes the table dramatically smaller.
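
The collision between ‘は’ and ‘艦’ and its resolution by the condensed text can be reproduced in a few lines. This is an illustrative sketch: the sample documents, their condensed texts, and the function names are hypothetical, and the condensed text scan is simplified to a substring test on each stored word.

```python
def sjis_to_scode(sjis):
    """Formula (4-1)."""
    scode = sjis - 0x8040 if sjis < 0xA000 else sjis - 0xC040
    return scode - (scode // 256) * 64

def entry_number(ch, n=512):
    """Formula (5-1) applied to one two-byte Shift JIS character."""
    sjis = int.from_bytes(ch.encode("shift_jis"), "big")
    return sjis_to_scode(sjis) % n

# Hypothetical documents and their condensed texts (already split into words).
documents = {1: "これは試験です", 2: "艦隊の記録"}
condensed = {1: ["これは", "試験", "です"], 2: ["艦隊", "の", "記録"]}

# Character component table: the set of entry numbers used by each document.
table = {doc_id: {entry_number(ch) for ch in text}
         for doc_id, text in documents.items()}

def table_search(term):
    """Candidate documents: every character's entry is present (noise possible)."""
    wanted = {entry_number(ch) for ch in term}
    return [doc_id for doc_id, entries in table.items() if wanted <= entries]

def condensed_search(term, candidates):
    """Keep only candidates whose condensed text really contains the term."""
    return [d for d in candidates if any(term in word for word in condensed[d])]

print(entry_number("は"), entry_number("艦"))  # 13 13
table_hits = table_search("艦")
print(table_hits)                           # [1, 2]
print(condensed_search("艦", table_hits))   # [2]
```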

【００２９】次に第六の実施例について説明する。第五
の実施例のように単純にハッシングした場合，ひらがな
のように文書中に出現しやすい文字と，ＪＩＳ第２水準
の漢字のようにめったに出現しない文字とが同じエント
リ番号となる可能性がでてくる。例えば，ひらがなの
‘は’と，漢字の‘艦’は同じエントリ番号１３とな
り，検索タームとして“艦”が与えられた場合‘は’を
含む文書はすべて文字成分表サーチの結果としてヒット
することになる。ひらがなの‘は’は日本語の文書では
非常に使用頻度の高い文字のためほぼ全件の文書が文字
成分表サーチでヒットする。このように文字成分表サー
チでの絞り込みの率が低下すると，凝縮本文もスキャン
する文書量が増えるために全体の検索処理時間が増大す
ることになる。Next, a sixth embodiment is described. With simple hashing as in the fifth embodiment, a character that appears frequently in documents, such as a hiragana character, may receive the same entry number as a character that rarely appears, such as a JIS level 2 kanji. For example, hiragana ‘は’ and kanji ‘艦’ share entry number 13, so when “艦” is given as a search term, every document containing ‘は’ is hit by the character component table search. Because ‘は’ is used very frequently in Japanese text, nearly all documents are hit. When the narrowing-down rate of the character component table search drops like this, more documents must be scanned in the condensed text, and the overall search time increases.

【００３０】このような絞り込み率の低下を防ぐために
は，ハッシュ関数を文字の使用頻度を考慮して定める必
要がある。本実施例で用いる文字成分表を文字種別ハッ
シング型文字成分表と呼ぶ。文字種別ハッシング型文字
成分表を作成するには，例えば図１５に示すように，各
文字種毎に文字成分表のエントリ領域を割り当て，その
領域内で文字コードにより折り返すようなハッシュ関数
を作る。このようなハッシュ関数を実現するには，文字
コードによって文字種を判定した後，ｍｏｄ関数で折り
返してもよいし，文字コードとエントリ番号との対応表
により実現することもできる。このハッシュ関数の一例
を図１６にＰＡＤ図で示す。本実施例では，ひらがな，
カタカナ，英字のエントリ数をそれぞれ２０とし，記号
のエントリ数を１０，数字のエントリ数を１０，ＪＩＳ
第１水準のエントリ数を３７０，ＪＩＳ第２水準のエン
トリ数を６１としている。まず，入力された検索ターム
に対して，文字コードにより文字種を判定し，それぞれ
の文字種ごとに文字成分表の割り当てられたエントリの
部分をｍｏｄ関数を用いて折り返す。To prevent this drop in the narrowing-down rate, the hash function must be chosen with character usage frequency in mind. The character component table used in this embodiment is called a character-type hashing character component table. To create it, an entry region of the table is allocated to each character type, as shown for example in FIG. 15, and a hash function is constructed that folds the character codes back within that region. Such a hash function can be realized by determining the character type from the character code and then folding with the mod function, or by a lookup table that maps character codes to entry numbers. An example of this hash function is shown as a PAD diagram in FIG. 16. In this embodiment, hiragana,
katakana, and alphabetic characters are given 20 entries each, symbols 10 entries, digits 10 entries, JIS
level 1 kanji 370 entries, and JIS level 2 kanji 61 entries. The character type of each character of the input search term is first determined from its character code, and the code is then folded into the region of the table allocated to that character type using the mod function.

【００３１】すなわち，ＳＣＯＤＥが（０１ＤＦ）Ｈか
ら（０２３１）Ｈの範囲にあれば，ひらがな文字列であ
るので，そのＳＣＯＤＥを２０でｍｏｄをとってこれを
エントリ番号とする。ＳＣＯＤＥが（０２４０）Ｈから
（０２９６）Ｈの範囲にあれば，カタカナ文字列である
ので，そのＳＣＯＤＥを２０でｍｏｄをとって，これに
カタカナのハッシング領域の先頭である２０を足した値
をエントリ番号とする。ＳＣＯＤＥが（０１Ａ０）Ｈか
ら（０１ＤＡ）Ｈの範囲にあれば，英字文字列であるの
で，そのＳＣＯＤＥを２０でｍｏｄをとって，これに英
字のハッシング領域の先頭である４０を足した値をエン
トリ番号とする。ＳＣＯＤＥが（０１８Ｆ）Ｈから（０
１９８）Ｈの範囲にあれば，数字文字列であるので，そ
のＳＣＯＤＥを１０でｍｏｄをとって，これに数字のハ
ッシング領域の先頭である７０を足した値をエントリ番
号とする。ＳＣＯＤＥが（０６５Ｆ）Ｈから（１２３
２）Ｈの範囲にあれば，ＪＩＳ第１水準の漢字文字列で
あるので，そのＳＣＯＤＥを３７０でｍｏｄをとって，
これにＪＩＳ第１水準の漢字文字列のハッシング領域の
先頭である８０を足した値をエントリ番号とする。ＳＣ
ＯＤＥが（１２５Ｆ）Ｈから（１ＦＤＥ）Ｈの範囲にあ
れば，ＪＩＳ第２水準の漢字文字列であるので，そのＳ
ＣＯＤＥを６１でｍｏｄをとって，これにＪＩＳ第２水
準の漢字文字列のハッシング領域の先頭である４５０を
足した値をエントリ番号とする。以上のＳＣＯＤＥ以外
の場合には，記号その他の文字種による文字列とみな
し，そのＳＣＯＤＥを１０でｍｏｄをとって，これに記
号のハッシング領域の先頭である６０を足した値をエン
トリ番号とする。That is, if SCODE is in the range (01DF)H to (0231)H, the character is hiragana, and SCODE mod 20 is used as the entry number. If SCODE is in the range (0240)H to (0296)H, the character is katakana, and the entry number is SCODE mod 20 plus 20, the start of the katakana hashing region. If SCODE is in the range (01A0)H to (01DA)H, the character is alphabetic, and the entry number is SCODE mod 20 plus 40, the start of the alphabetic hashing region. If SCODE is in the range (018F)H to (0198)H,
the character is a digit, and the entry number is SCODE mod 10 plus 70, the start of the digit hashing region. If SCODE is in the range (065F)H to (1232)H,
the character is a JIS level 1 kanji, and the entry number is SCODE mod 370 plus 80,
the start of the JIS level 1 kanji hashing region. If
SCODE is in the range (125F)H to (1FDE)H, the character is a JIS level 2 kanji,
and the entry number is SCODE mod 61 plus 450, the start of the JIS level 2 kanji hashing region. Any other SCODE is treated as a symbol or other character type, and the entry number is SCODE mod 10 plus 60, the start of the symbol hashing region.
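
The per-character-type hash described above can be written as a range table plus a fold. The ranges, moduli, and region offsets below are taken from this paragraph; the names `REGIONS` and `type_hash` are hypothetical, and `sjis_to_scode` is formula (4-1).

```python
def sjis_to_scode(sjis):
    """Formula (4-1)."""
    scode = sjis - 0x8040 if sjis < 0xA000 else sjis - 0xC040
    return scode - (scode // 256) * 64

# (lowest SCODE, highest SCODE, modulus, first entry of the region)
REGIONS = [
    (0x01DF, 0x0231, 20, 0),     # hiragana
    (0x0240, 0x0296, 20, 20),    # katakana
    (0x01A0, 0x01DA, 20, 40),    # alphabetic characters
    (0x018F, 0x0198, 10, 70),    # digits
    (0x065F, 0x1232, 370, 80),   # JIS level 1 kanji
    (0x125F, 0x1FDE, 61, 450),   # JIS level 2 kanji
]

def type_hash(scode):
    """Fold SCODE into the entry region allocated to its character type."""
    for low, high, modulus, start in REGIONS:
        if low <= scode <= high:
            return scode % modulus + start
    # Everything else is treated as a symbol: 10 entries starting at 60.
    return scode % 10 + 60

print(type_hash(sjis_to_scode(0x82CD)))  # 5   (hiragana は)
print(type_hash(sjis_to_scode(0x8ACD)))  # 291 (kanji 艦)
```

Under this scheme ‘は’ and ‘艦’ no longer share an entry, since each is folded only within the region of its own character type. The seven regions together occupy entries 0 through 510.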

【００３２】この文字種別ハッシング型文字成分表を用
いた階層型プリサーチの制御手順は，第五の実施例と同
じである。すなわち，第１に検索ターム中の文字を用い
て文字成分表サーチを行い，第２に検索タームを用いて
凝縮本文サーチを行う。文脈条件等が指定されていない
場合には，ここで検索を終了するが，そうでない場合に
は，第３に本文サーチを行い結果を出力する。以上説明
したように，本実施例によれば，使用頻度を考慮して文
字種ごとに文字成分表のエントリ番号を対応させた文字
種別ハッシング型文字成分表を用いることにより，文字
成分表サーチでのノイズを少なくできるため，凝縮本文
における文書のスキャン量が減り，その分高速なフルテ
キストサーチが可能となる。The control procedure of the hierarchical pre-search using this character-type hashing character component table is the same as in the fifth embodiment. First, a character component table search is performed using the characters in the search terms; second, a condensed text search is performed using the search terms. If no context condition or the like is specified, the search ends here; otherwise a text search is performed third and its result is output. As described above, this embodiment uses a character-type hashing character component table in which entry numbers are allotted per character type with usage frequency in mind. This reduces the noise in the character component table search, so fewer documents need to be scanned in the condensed text and the full-text search becomes correspondingly faster.

【００３３】次に第七の実施例として，さらに文字成分
表サーチにおける絞り込みの率を向上させ，凝縮本文の
スキャン量を減らすことのできる頻度情報ハッシング型
文字成分表を用いた階層型プリサーチの制御方法を説明
する。頻度情報ハッシング型文字成分表を作成するに
は，データベースに登録してある文書の文字の使用頻度
を調べ，頻度情報によりハッシュ関数を決定する。頻度
の大きい文字については，同一エントリにできるだけ他
の文字が入らないようにし，頻度の少ない文字について
同一エントリに複数個の文字が入るようにハッシュ関数
を調整する。こうすることにより，平均的に安定した絞
り込み率が文字成分表サーチで得られることになる。具
体的には，図１７に示すように（４−１）式で得られる
ＳＣＯＤＥをもとに一度データベース中で該当する文字
を使用している文書数を調べ頻度順に並べ替える。次
に，頻度の大きいものから文字成分表のエントリ数分Ｎ
ｔだけとる。そしてＮｔ以内の頻度数分布のうち最も上
位の頻度を持つエントリだけを残して，その他のエント
リに順次Ｎｔ以上のエントリ番号を割り付けていく。こ
のＮｔ以上のエントリ番号の割付けには（Ｎｔ＋１）番
目のエントリをＮｔのエントリとし，（Ｎｔ＋２）番目
を（Ｎｔ−１）番目のエントリとするように，Ｎｔより
順次頻度の大きいエントリを割り付けていく。割り付け
ていく過程では，常に最上位の頻度を持つエントリの上
には，他のエントリを割り付けないようにする。割り付
けたエントリは，図１８に示すようにテーブルの形で，
記憶しておきこのテーブルを参照してハッシュ関数を構
成する。すなわち，ＳＣＯＤＥが（０９５Ｆ）Ｈの文字
‘検’は，エントリ番号２３１であることが分かる。Next, as a seventh embodiment, a control method is described for a hierarchical pre-search that uses a frequency-information hashing character component table, which further improves the narrowing-down rate of the character component table search and reduces the amount of condensed text to be scanned. To create a frequency-information hashing character component table, the usage frequency of the characters in the documents registered in the database is examined, and the hash function is determined from this frequency information. The hash function is adjusted so that a frequently used character shares its entry with as few other characters as possible, while infrequent characters share entries with several others. This gives a narrowing-down rate in the character component table search that is stable on average. Specifically, as shown in FIG. 17, for each SCODE obtained by formula (4-1), the number of documents in the database that use the corresponding character is counted, and the characters are sorted by frequency. Next, the Nt most frequent characters are taken, where Nt is the number of entries in the character component table.
Then, leaving untouched only the entry with the highest frequency among these Nt, the characters ranked beyond Nt are assigned in turn to the other entries. In this assignment, the (Nt+1)-th character is assigned to entry Nt, the (Nt+2)-th character to entry (Nt-1), and so on, working from entry Nt toward the entries with higher frequency. Throughout the assignment, no other character is ever assigned to the entry holding the highest-frequency character. The assignments are stored in the form of a table, as shown in FIG. 18,
and the hash function is implemented by consulting this table. For example, the table shows that the character ‘検’, whose SCODE is (095F)H, has entry number 231.
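
One possible reading of this assignment procedure is sketched below. Several points are assumptions rather than statements of the patent: entries are numbered from 0 here, ties in frequency are broken by character order, and once the fold reaches the entry just below the top one it is assumed to turn around and sweep back (the text does not say what happens at that point). The names `build_frequency_table` and `fold_pattern` are hypothetical, and characters are used as keys in place of SCODE values for readability.

```python
from collections import Counter
from itertools import cycle

def fold_pattern(nt):
    """Entries visited by the fold: nt-1 down to 1, then back up, never entry 0."""
    down = list(range(nt - 1, 0, -1))
    return down + down[::-1][1:-1]

def build_frequency_table(docs, nt):
    """Map each character to an entry number using document frequency (nt >= 2)."""
    # Document frequency: the number of documents that use each character.
    df = Counter(ch for text in docs for ch in set(text))
    ranked = [ch for ch, _ in sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))]
    # The nt most frequent characters each get their own entry, in rank order.
    table = {ch: rank for rank, ch in enumerate(ranked[:nt])}
    # Remaining characters are folded back onto entries nt-1, nt-2, ...
    for ch, entry in zip(ranked[nt:], cycle(fold_pattern(nt))):
        table[ch] = entry
    return table

docs = ["はな", "はし", "はこ", "なし"]
table = build_frequency_table(docs, nt=3)
print(table)  # {'は': 0, 'し': 1, 'な': 2, 'こ': 2}
```

The most frequent character keeps entry 0 to itself, while the rare character ‘こ’ shares an entry with a less frequent one, which is the property the paragraph aims for.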

【００３４】階層型プリサーチの制御手順は，第五の実
施例と同じである。すなわち，図１４の制御手順をその
まま使用し，第１に検索ターム中の文字を用いて文字成
分表サーチを行い，第２に検索タームを用いて凝縮本文
サーチを行う。文脈条件等が指定されていない場合に
は，ここで検索を終了するが，そうでない場合には，第
３に本文サーチを行い結果を出力する。このように，本
実施例によれば，データベース中で実際に用いられる文
字の頻度分布をもとに文字成分表を作成することによっ
て，文字成分表サーチで常に安定して高い絞り込み率が
得られるため，検索タームに依存せず安定して短時間の
検索処理時間を得ることができる。The control procedure of the hierarchical pre-search is the same as in the fifth embodiment. That is, the control procedure of FIG. 14 is used as is: first, a character component table search is performed using the characters in the search terms; second, a condensed text search is performed using the search terms. If no context condition or the like is specified, the search ends here; otherwise a text search is performed third and its result is output. According to this embodiment, building the character component table from the frequency distribution of the characters actually used in the database gives a consistently high narrowing-down rate in the character component table search, so a short search time is obtained consistently, regardless of the search terms.

【００３５】以上，文字成分表の異なる実施例について
五つの実施例を説明した。これより凝縮本文の異なる実
施例についての説明をする。第一の実施例で用いた凝縮
本文は作成の処理が簡単であるが，図４でも分かるよう
に“のための”というような本来検索に使われないよう
な文字列まで凝縮本文に残ることになる。このことは凝
縮本文の圧縮率低下を招く。つまり，検索時にスキャン
する凝縮本文の量が増えるため，検索処理時間が増加し
てしまう。このような，凝縮本文の圧縮率を低下させる
主な要因は，“のための”というような付属語の連なっ
たそれ自体では意味を持たない文字列を凝縮本文に登録
してしまうところにある。Five different embodiments of the character component table have been described above. Different embodiments of the condensed text are described next. The condensed text used in the first embodiment is simple to create, but, as FIG. 4 shows, strings that are never actually used in searches, such as “のための” ("for the sake of"), remain in it. This lowers the compression ratio of the condensed text: more condensed text must be scanned at search time, so the search takes longer. The main cause of the lower compression ratio is that strings of consecutive adjunct words, such as “のための”, which carry no meaning by themselves, are registered in the condensed text.

【００３６】そこで，第八の実施例として，この検索に
不要な付属語の連なりを除去した凝縮本文を用いる階層
型プリサーチを説明する。この凝縮本文を文字種分割・
重複排除・付属語除去型凝縮本文と呼ぶ。この凝縮本文
の作成方法は図１９に示すように，本文のテキスト文字
列から文字種分割して部分文字列に分け，それから重複
語を排除した後，付属語の除去を行う。文字種分割と重
複排除は第一の実施例と変わらない。付属語除去は，重
複排除の済んだひらがな文字列に対して行う。この付属
語除去のための解析は，図２０に示すように基本単語辞
書と単語間の接続規則を基に行う。基本単語辞書には，
図２１のようにひらがなのみから構成される動詞，指示
代名詞，形容詞，形容動詞，副詞，接続詞，助詞，助動
詞，またこれらの品詞の活用語尾が品詞情報とともに登
録されている。本図の例では，動詞として＜ある＞，＜
なる＞，＜もつ＞等がそれらの活用語尾とともに登録さ
れている。接続規則には基本単語辞書に登録された各語
が他のどの語と接続し得るかを登録する。例えば図２２
に示すように，＜動詞−もつ連体形＞に＜名詞−こと＞
が接続し，さらに＜名詞−こと＞には＜助詞−が＞が接
続し得ることが登録されている。このような基本単語辞
書及び接続規則を用いてひらがなの部分文字列が付属語
から構成されているか否かを判定し，凝縮本文へその文
字列を登録するか否かを決定する。例えば，“のため
の”という部分文字列は＜助詞−の＞＜名詞−ため＞＜
助詞−の＞というように接続した文字列と解析できるた
め，付属語のみから構成された文字列と判定し排除す
る。一方，“あいまい”という文字列は，付属語と解析
ができないため排除せずにそのまま凝縮本文へ登録す
る。Therefore, as an eighth embodiment, a hierarchical pre-search is described that uses a condensed text from which sequences of adjunct words (function words) unnecessary for search have been removed. This condensed text is called a
character-type-division, duplicate-elimination, adjunct-word-removal condensed text. As shown in FIG. 19, it is created by dividing the text string of the document body into substrings by character type, eliminating duplicate words, and then removing adjunct words. Character type division and duplicate elimination are the same as in the first embodiment. Adjunct word removal is applied to the hiragana strings that remain after duplicate elimination. The analysis for adjunct word removal is based on a basic word dictionary and connection rules between words, as shown in FIG. 20.
As shown in FIG. 21, the basic word dictionary registers verbs, demonstrative pronouns, adjectives, adjectival verbs, adverbs, conjunctions, particles, and auxiliary verbs written only in hiragana, together with the conjugation endings of these parts of speech and their part-of-speech information. In the example of the figure, the verbs <ある>,
<なる>, <もつ>, and so on are registered together with their conjugation endings. The connection rules register which other words each word in the basic word dictionary can connect to. For example,
as shown in FIG. 22, it is registered that <noun: こと> can follow <verb: attributive form of もつ>,
and that <particle: が> can in turn follow <noun: こと>. Using the basic word dictionary and the connection rules, it is determined whether a hiragana substring consists of adjunct words, which decides whether the string is registered in the condensed text. For example, the substring “のための” can be analyzed as the connected sequence <particle: の><noun: ため><particle: の>,
so it is judged to consist only of adjunct words and is removed. The string “あいまい” (ambiguous), by contrast, cannot be analyzed as adjunct words, so it is registered in the condensed text as is.
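
The adjunct word analysis can be illustrated with a tiny dictionary and a recursive segmentation check. This is a deliberately reduced sketch: the dictionary holds only the few words named in the text, the connection rules are expressed between parts of speech rather than between individual words and conjugated forms (an assumed simplification), and the names `DICTIONARY`, `CONNECTS`, and `is_adjunct_only` are hypothetical.

```python
# Miniature basic word dictionary: hiragana word -> part of speech.
DICTIONARY = {
    "の": "particle",
    "が": "particle",
    "ため": "noun",
    "こと": "noun",
    "もつ": "verb",
}

# Connection rules: (part of speech, part of speech that may follow it).
CONNECTS = {
    ("particle", "noun"),
    ("noun", "particle"),
    ("verb", "noun"),
}

def is_adjunct_only(s, prev=None):
    """True if s splits entirely into dictionary words whose connections are allowed."""
    if not s:
        return True
    for end in range(1, len(s) + 1):
        pos = DICTIONARY.get(s[:end])
        if pos is None:
            continue
        if prev is not None and (prev, pos) not in CONNECTS:
            continue
        if is_adjunct_only(s[end:], pos):
            return True
    return False

def register_in_condensed_text(hiragana_strings):
    """Keep only the strings that cannot be analyzed as adjunct word sequences."""
    return [s for s in hiragana_strings if not is_adjunct_only(s)]

print(is_adjunct_only("のための"))  # True, so it is removed
print(is_adjunct_only("あいまい"))  # False, so it is kept
print(register_in_condensed_text(["のための", "あいまい", "もつことが"]))  # ['あいまい']
```

Because a string is removed only when the whole of it can be segmented, a hiragana word that is missing from the dictionary is always kept.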

【００３７】このように，付属語を解析してひらがな文
字列を排除し，検索に使われることのない無用の情報を
削除することによって，凝縮本文の圧縮率を高めること
が可能となる。また解析に用いる基本単語辞書と接続規
則は，時代とともに登録語数が増えていく従来のキーワ
ード辞書とは基本的に異り，普遍的なもので一度作成し
てしまえば更新していく必要がないという利点がある。
付属語として解析できるものだけを排除するために，辞
書に存在しないひらがなから構成される新語が現れても
必ず凝縮本文に残るということになる。In this way, analyzing adjunct words to eliminate hiragana strings removes useless information that is never used in searches and raises the compression ratio of the condensed text. Moreover, unlike a conventional keyword dictionary, whose vocabulary keeps growing over time, the basic word dictionary and connection rules used for the analysis are essentially universal and, once created, need not be updated.
Because only strings that can be analyzed as adjunct words are removed, a new word written in hiragana that is not in the dictionary always remains in the condensed text.

【００３８】次に，文字種分割・重複排除・付属語除去
型凝縮本文を用いた階層型プリサーチ方式の制御につい
て説明する。文字種分割・重複排除・付属語除去型凝縮
本文では，ひらがな文字列を付属語解析して凝縮本文に
登録しない場合がある。そのため，特定のひらがな文字
列で検索しようとした場合，凝縮本文サーチで検索もれ
となる場合がある。例えば“めまい”という文字列は，
動詞の未然形活用語尾“め”と助動詞“まい”の終止形
が接続したものと解析できる。具体例としては，“認め
まい”があげられる。ところが“めまい”は，名詞とし
て使われている場合でも，付属語除去処理の結果凝縮本
文からは削除されてしまう。したがってこのような場
合，“めまい”で凝縮本文を検索すると検索もれが生じ
る可能性がでてくる。そのため，検索タームが凝縮本文
中にもともとない言葉なのか，あるいは凝縮本文作成過
程で除去された可能性のある言葉なのかをチェックして
から検索する必要が生じる。検索タームが凝縮本文に登
録されるべき語か否かというチェックは，凝縮本文を作
成したときに用いた付属語除去のアルゴリズムをそのま
ま適用する。この例では，“めまい”という検索ターム
が与えられたときは，これが付属語の連なりと判定する
ことができる。Next, control of the hierarchical pre-search method using the condensed text of the character-type division / duplicate elimination / attached-word removal type will be described. With this type of condensed text, a hiragana character string may be analyzed as attached words and therefore not registered in the condensed text. As a result, a search for a particular hiragana character string may produce search omissions in the condensed text search. For example, the character string “めまい” (memai, “dizziness”)
can be analyzed as the irrealis-form conjugation ending “め” of a verb followed by the terminal form of the auxiliary verb “まい”. A concrete example is “認めまい” (mitomemai, “will not admit”). Consequently, even where “めまい” is used as a noun, it is deleted from the condensed text by the attached-word removal processing, and searching the condensed text for “めまい” may cause search omissions. It is therefore necessary, before searching, to check whether the search term is a word that simply does not occur in the condensed text or a word that may have been removed while the condensed text was created. This check of whether the search term is a word that would be registered in the condensed text applies, unchanged, the attached-word removal algorithm used when the condensed text was created. In this example, when the search term “めまい” is given, it can be judged to be a sequence of attached words.
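The check described above, which reuses the attached-word analysis on the search term itself, can be sketched as follows. This is a minimal illustration, not the dictionary or rule set of the patent: the entries in `BASIC_WORDS`, the tag names, the pairs in `CONNECTIONS`, and the function name `is_attached_word_sequence` are all hypothetical.

```python
# Toy basic-word dictionary: surface form -> part-of-speech tag.
# The entries and tag names are illustrative assumptions only.
BASIC_WORDS = {
    "め": "verb-ending-irrealis",
    "まい": "auxiliary",
    "の": "particle",
    "ため": "noun-formal",
}

# Toy connection rules: (tag of previous word, tag of next word) pairs
# that are allowed to follow one another. Also an assumption.
CONNECTIONS = {
    ("verb-ending-irrealis", "auxiliary"),
    ("particle", "noun-formal"),
    ("noun-formal", "particle"),
}


def is_attached_word_sequence(text, prev_tag=None):
    """Return True if `text` can be split entirely into dictionary words
    whose tags are linked by the connection rules."""
    if not text:
        # Succeeds only if at least one word has been consumed.
        return prev_tag is not None
    for end in range(1, len(text) + 1):
        tag = BASIC_WORDS.get(text[:end])
        if tag is None:
            continue
        if prev_tag is not None and (prev_tag, tag) not in CONNECTIONS:
            continue
        if is_attached_word_sequence(text[end:], tag):
            return True
    return False
```

With this toy data, the search term “めまい” is judged to be a sequence of attached words, so a condensed text built with the same rules could not be trusted to contain it, whereas a string with no dictionary analysis is treated as registered.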

【００３９】以上の検索制御の手順を図２３で説明す
る。まず文字成分表サーチを行う。結果件数が０件であ
れば，０件を検索結果として出力して検索処理を終了す
る。第一の実施例でも述べたが，ハッシュ関数を用いな
い方式では検索タームが一文字の場合にかぎり，文字成
分表のサーチ結果を最終検索結果として出力できる。す
なわち，第一及び第四の実施例で説明した文字成分表を
用いる場合には，検索タームが一文字であるか否かを調
べ，一文字であれば文字成分表サーチの結果を検索結果
として出力し，処理を終了する。第五，第六，第七の実
施例で述べたハッシュ関数による文字成分表を用いる場
合には，この検索タームが一文字か否かというチェック
は行わず，常に次の凝縮本文サーチを行う。この後，第
一の実施例と同様に，分割検索タームを生成する。The search control procedure described above will be explained with reference to FIG. 23. First, a character component table search is performed. If the number of results is zero, zero hits is output as the search result and the search process ends. As noted in the first embodiment, in methods that do not use a hash function, the result of the character component table search can be output as the final search result only when the search term is a single character. That is, when the character component table described in the first and fourth embodiments is used, the search term is checked to see whether it is a single character; if it is, the result of the character component table search is output as the search result and processing ends. When the hash-function-based character component table described in the fifth, sixth, and seventh embodiments is used, this single-character check is not performed and the next step, the condensed text search, is always carried out. After this, divided search terms are generated as in the first embodiment.

【００４０】次に，分割検索タームのそれぞれについて
付属語解析を行う。分割検索タームのうち一つでも付属
語と判定された場合，その分割検索タームは凝縮本文か
ら削除されている可能性があるので，凝縮本文サーチを
行わず，文字成分表サーチの結果に基づいて本文を直接
サーチする。一方，付属語解析の結果，分割検索ターム
が全て付属語でないと判定されたならば，第一の実施例
と同様に凝縮本文サーチを行う。近傍条件あるいは，文
脈条件の指定がない場合，あるいは分割検索タームがも
との検索タームと同じ場合には，この凝縮本文サーチの
結果を最終検索結果として出力し，検索を終了する。も
し，近傍条件ないし文脈条件が指定されている場合，あ
るいは分割検索タームと元の検索タームが異なる場合に
は，次に本文サーチを実行し，その結果を最終的な検索
結果出力とする。このように，本実施例によれば，ひら
がな文字列を解析し，不要な付属語の連なりを凝縮本文
から除去した文字種分割・重複排除・付属語除去型凝縮
本文を用いることにより，凝縮本文の圧縮率を向上さ
せ，検索処理時間を短縮することができる。Next, attached-word analysis is performed on each of the divided search terms. If even one of the divided search terms is judged to be an attached-word string, that term may have been deleted from the condensed text, so the condensed text search is skipped and the text is searched directly on the basis of the result of the character component table search. If, on the other hand, the analysis finds that none of the divided search terms is an attached-word string, a condensed text search is performed as in the first embodiment. When no proximity or context condition is specified and the divided search terms are identical to the original search term, the result of this condensed text search is output as the final search result and the search ends. If a proximity or context condition is specified, or if the divided search terms differ from the original search term, a text search is then executed and its result is output as the final search result. Thus, according to this embodiment, using a condensed text of the character-type division / duplicate elimination / attached-word removal type, in which hiragana character strings are analyzed and unnecessary sequences of attached words are removed, improves the compression ratio of the condensed text and shortens the search processing time.
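The branching in this control procedure can be summarized in a few lines. The sketch below only decides which search stages run. It leaves out the early exit on zero hits and the single-character shortcut. The stage labels, the function name `plan_stages`, and the `is_attached_word` callback are illustrative assumptions rather than names from the patent.

```python
def plan_stages(term, sub_terms, has_position_condition, is_attached_word):
    """Return the ordered list of search stages for one search term.

    term                   -- the original search term
    sub_terms              -- the divided search terms generated from it
    has_position_condition -- True if a proximity or context condition is given
    is_attached_word       -- predicate: could this sub-term have been removed
                              from the condensed text as attached words?
    """
    stages = ["component_table"]

    # A sub-term that may be missing from the condensed text makes the
    # condensed text search unreliable, so go straight to the text.
    if any(is_attached_word(t) for t in sub_terms):
        stages.append("full_text")
        return stages

    stages.append("condensed_text")

    # The condensed text cannot confirm positions or the undivided term.
    if has_position_condition or sub_terms != [term]:
        stages.append("full_text")
    return stages
```

Passing the predicate in as a parameter keeps the sketch independent of any particular dictionary; a test that only checks whether a sub-term is written in hiragana would fit the same slot.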

【００４１】次に，第九の実施例として，ひらがな文字
列を全て排除した，文字種分割・重複排除・ひらがな文
字列除去型凝縮本文を用いる階層型プリサーチを説明す
る。第八の実施例で説明した凝縮本文は，確かに圧縮率
が上がるものの付属語解析の際に誤った解析をする可能
性がある。例えば第八の実施例でも用いた“めまい”と
いう文字列の例の外にも，付属語解析だけでは本質的に
どれが付属語か正しく判定できない場合がまれにある。
例えば，“動作してこの応用で．．．”という文書の中
の“してこの”という部分文字列は，，“〜して，この
〜”という意味で用いられているのか，“〜し，てこの
〜”のように機械のてこを意味しているのかが判定する
のが難しい。後者の意味で用いられている場合には，
“てこ”という検索タームを指定した際に，“てこ”は
付属語と判定されないため，凝縮本文をサーチしにいく
ことになる。一方，凝縮本文作成では，“してこの”が
付属語と解析され凝縮本文から削除されているため凝縮
本文サーチで検索もれとなってしまう。この付属語解析
の不完全さを補正するために，ひらがな文字列か否かと
いう単純な判定方法で階層型プリサーチを実現するの
が，本第九の実施例である。この凝縮本文の作成方法
を，図２４に示す。本方法では文字種分割の後，ひらが
なを除去して重複登録排除を行う。Next, as a ninth embodiment, a hierarchical pre-search will be described that uses a condensed text of the character-type division / duplicate elimination / hiragana-string removal type, from which all hiragana character strings are excluded. The condensed text described in the eighth embodiment does raise the compression ratio, but the attached-word analysis may produce a wrong analysis. Besides the “めまい” example used in the eighth embodiment, there are rare cases in which attached-word analysis alone is inherently unable to determine correctly which part is an attached word.
For example, in a document containing “動作してこの応用で...”, it is difficult to determine whether the substring “してこの” is used in the sense “〜して，この〜” (“... does, and this ...”) or in the sense “〜し，てこの〜”, where “てこ” means a mechanical lever. When it is used in the latter sense,
specifying the search term “てこ” leads to a condensed text search, because “てこ” is not judged to be an attached word. During creation of the condensed text, however, “してこの” was analyzed as attached words and deleted from the condensed text, so the condensed text search misses the document. The ninth embodiment compensates for this imperfection of attached-word analysis by implementing the hierarchical pre-search with the simple test of whether or not a string is a hiragana character string. FIG. 24 shows the method of creating this condensed text. In this method, after character-type division, hiragana strings are removed and duplicate registrations are eliminated.

【００４２】この文字種分割・重複排除・ひらがな文字
列除去型凝縮本文を用いた階層型プリサーチの制御手順
について図２５を用いて説明する。まず第八の実施例と
同様に文字成分表サーチを行う。この後，分割検索ター
ムを生成する。次に，分割検索タームのそれぞれについ
てひらがな文字列か否かチェックを行う。分割検索ター
ムのうち一つでもひらがな文字列がある場合，凝縮本文
サーチを行わず，文字成分表サーチの結果に基づいて本
文を直接サーチする。一方，分割検索ターム中にひらが
な文字列がない場合，第一の実施例と同様に凝縮本文サ
ーチを行い，近傍，文脈条件の指定がある場合，あるい
は分割検索タームが元の検索タームと異なる場合には，
本文サーチまで検索処理を続行する。このように，本実
施例によれば，ひらがな文字列を全て排除した凝縮本文
を用いることによって，ひらがな文字列についても検索
もれのない正確なフルテキストサーチが実現できる。The control procedure of the hierarchical pre-search using this condensed text of the character-type division / duplicate elimination / hiragana-string removal type will be described with reference to FIG. 25. First, a character component table search is performed as in the eighth embodiment. Divided search terms are then generated. Next, each divided search term is checked to see whether it is a hiragana character string. If even one of the divided search terms is a hiragana character string, the condensed text search is skipped and the text is searched directly on the basis of the result of the character component table search. If none of the divided search terms is a hiragana character string, a condensed text search is performed as in the first embodiment, and if a proximity or context condition is specified, or if the divided search terms differ from the original search term,
the search process continues through to the text search. Thus, according to this embodiment, using a condensed text from which all hiragana character strings are excluded makes possible an accurate full-text search, free of search omissions, even for hiragana character strings.
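As a rough sketch of the ninth embodiment, the following code divides a string into runs of one character type, drops the hiragana runs, removes duplicates, and applies the same hiragana test to a search term. The Unicode ranges used to recognize each character type, the treatment of everything else as "other", and all function names are assumptions made for illustration.

```python
from itertools import groupby


def char_type(ch):
    """Classify one character by Unicode block (a simplifying assumption)."""
    if "\u3041" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "alpha"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"


def split_by_char_type(text):
    """Character-type division: split text into maximal same-type runs."""
    return [(kind, "".join(run)) for kind, run in groupby(text, key=char_type)]


def build_condensed_text(text):
    """Drop hiragana runs, then remove duplicate runs (first occurrence wins)."""
    runs = [run for kind, run in split_by_char_type(text) if kind != "hiragana"]
    return list(dict.fromkeys(runs))


def needs_direct_text_search(term):
    """True if any divided search term is a hiragana string, in which case
    the condensed text search is skipped."""
    return any(kind == "hiragana" for kind, _ in split_by_char_type(term))
```

For instance, `build_condensed_text("文書の検索と文書の登録")` keeps one copy each of the three kanji runs, and `needs_direct_text_search("てこの原理")` reports that the term contains a hiragana part.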

【００４３】次に，本発明の第十の実施例について，説
明する。上記第九の実施例では，ひらがなの検索ターム
が与えられた場合，本文を直接参照する必要がある。し
たがって検索時間がより多く掛かることになる。そこ
で，ひらがなの検索タームが与えられた場合でも高速に
フルテキストサーチできる方法として，第十の実施例の
説明をする。本実施例では，第九の実施例で用いた凝縮
本文の外に第九の実施例では除去したひらがな文字列を
登録した凝縮本文を別に作成する。図２６に示すよう
に，文字種分割，重複登録排除の後，残った部分文字列
がひらがな文字列か否かを判定し，ひらがな文字列以外
を凝縮本文Ａとして登録し，ひらがな文字列を凝縮本文
Ｂとして登録する。こうすれば，ひらがなだけの検索タ
ームが与えられた際，凝縮本文Ｂを探索することができ
るようになるため，検索時間を短縮することが可能とな
る。実際の階層型プリサーチの検索制御の手順を図２７
に示す。まず第八の実施例と同様に文字成分表サーチを
行う。もし，検索結果が０件なら，ここで検索を終了す
る。この後，分割検索タームを生成する。次に，分割検
索タームをひらがな文字列のタームとそれ以外の文字列
からなるタームに分類する。その後，ひらがな以外の文
字列からなる分割検索タームがある場合には，凝縮Ａを
サーチする。次にひらがなの分割検索タームがある場合
には，凝縮Ｂをサーチする。その後は，第一の実施例と
同様に，近傍，文脈条件の指定がある場合，あるいは分
割検索タームがもとの検索タームと異なる場合には，本
文サーチまで検索処理を続行する。このように，ひらが
なのみの凝縮本文と，ひらがな以外の凝縮本文と分けて
格納することにより，どんな文字種の検索タームが入力
されても，凝縮本文を有効に活用でき，常に高速なフル
テキストサーチが実現できる。Next, a tenth embodiment of the present invention will be described. In the ninth embodiment, when a hiragana search term is given, the text must be referenced directly, so the search takes longer. The tenth embodiment is therefore described as a method that allows a fast full-text search even when a hiragana search term is given. In this embodiment, in addition to the condensed text used in the ninth embodiment, a separate condensed text is created in which the hiragana character strings removed in the ninth embodiment are registered. As shown in FIG. 26, after character-type division and duplicate elimination, each remaining substring is checked to see whether it is a hiragana character string; substrings other than hiragana strings are registered as condensed text A, and hiragana strings are registered as condensed text B. In this way, when a search term consisting only of hiragana is given, condensed text B can be searched, so the search time can be shortened. FIG. 27 shows the actual search control procedure of the hierarchical pre-search.
First, a character component table search is performed as in the eighth embodiment. If the search result is zero hits, the search ends here. Divided search terms are then generated. Next, the divided search terms are classified into terms consisting of hiragana strings and terms consisting of other character strings. If there are divided search terms consisting of non-hiragana strings, condensed text A is searched. Then, if there are hiragana divided search terms, condensed text B is searched. After that, as in the first embodiment, if a proximity or context condition is specified, or if the divided search terms differ from the original search term, the search process continues through to the text search. By storing the hiragana-only condensed text separately from the non-hiragana condensed text in this way, the condensed text can be used effectively whatever character type the input search term has, and a fast full-text search is always achieved.

【００４４】次に，第十一の実施例について説明する。
本実施例は，凝縮本文の圧縮率を上げるために，文字種
毎に独立した凝縮本文を用いる方法に基づいたものであ
る。本実施例で用いる凝縮本文を文字種分割・重複排除
・文字種別登録型凝縮本文と呼ぶ。この文字種分割・重
複排除・文字種別登録型凝縮本文を作成するには，図２
８に示すように，文字種分割，重複登録排除を行った
後，残った部分文字列の文字種を判定してひらがな凝縮
本文Ｈ，カタカナ凝縮本文Ｉ，漢字凝縮本文Ｊ，英字凝
縮本文Ｋ，数字凝縮本文Ｌ，記号その他の文字種凝縮本
文Ｍに分類して登録する。こうすることにより，例えば
漢字の検索タームで検索する場合には，漢字文字種の凝
縮本文Ｊのみをサーチすればよいことになるため，検索
時間をさらに短縮することができる。具体的な階層型プ
リサーチの制御手順を図２９を用いて説明する。まず，
第八の実施例と同様に文字成分表サーチを行う。検索結
果件数が０件なら，ここで検索を終了する。この後，分
割検索タームを生成する。次に，分割検索タームを文字
種毎に分類する。その後，ひらがなの分割検索タームが
ある場合には凝縮Ｈを，カタカナの分割検索タームがあ
る場合には凝縮Ｉを，というように分解検索タームの文
字種にしたがってサーチする凝縮本文を選択する。その
後は，第一の実施例と同様に，近傍，文脈条件の指定が
ある場合，あるいは分割検索タームがもとの検索ターム
と異なる場合には，本文サーチまで検索処理を続行す
る。このように，文字種ごとに凝縮本文ファイルを分離
し個々の凝縮本文の容量を小さくすることにより，漢字
のみ，カタカナのみ，あるいはひらがなのみ，といった
単一文字種の検索タームでのフルテキストサーチが高速
に行えるという効果が得られる。Next, an eleventh embodiment will be described.
This embodiment is based on a method that uses an independent condensed text for each character type in order to raise the compression ratio of the condensed text. The condensed text used in this embodiment is called the character-type division / duplicate elimination / per-character-type registration condensed text. To create it, as shown in FIG. 28,
after character-type division and duplicate elimination, the character type of each remaining substring is determined and the substring is registered in one of the following: hiragana condensed text H, katakana condensed text I, kanji condensed text J, alphabetic condensed text K, numeric condensed text L, or condensed text M for symbols and other character types. As a result, when searching with a kanji search term, for example, only the kanji condensed text J needs to be searched, so the search time can be shortened further. The specific control procedure of the hierarchical pre-search will be described with reference to FIG. 29. First,
a character component table search is performed as in the eighth embodiment. If the number of search results is zero, the search ends here. Divided search terms are then generated. Next, the divided search terms are classified by character type. The condensed text to be searched is then selected according to the character type of each divided search term: condensed text H if there is a hiragana divided search term, condensed text I if there is a katakana divided search term, and so on. After that, as in the first embodiment, if a proximity or context condition is specified, or if the divided search terms differ from the original search term, the search process continues through to the text search. Separating the condensed text files by character type and reducing the size of each condensed text in this way has the effect that a full-text search with a search term of a single character type (kanji only, katakana only, or hiragana only) can be performed at high speed.
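The per-character-type registration and the selection of which condensed text to search can be sketched as below. The labels H to M follow the text, but classifying characters by their Unicode character names, and the function names themselves, are assumptions made for this illustration.

```python
import unicodedata
from itertools import groupby


def condensed_file_for(ch):
    """Map one character to a condensed-text label (H, I, J, K, L or M).
    Uses Unicode character names; this rule is an illustrative assumption."""
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "H"  # hiragana
    if "KATAKANA" in name:
        return "I"  # katakana, including the prolonged sound mark
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "J"  # kanji
    if "LATIN" in name and ch.isalpha():
        return "K"  # alphabetic
    if ch.isdecimal():
        return "L"  # numeric
    return "M"      # symbols and other character types


def build_condensed_texts(text):
    """Divide by character type and register each distinct run once
    in the condensed text for its type."""
    files = {label: [] for label in "HIJKLM"}
    for label, run in groupby(text, key=condensed_file_for):
        substring = "".join(run)
        if substring not in files[label]:
            files[label].append(substring)
    return files


def files_to_search(term):
    """Labels of the condensed texts that a search term needs to consult."""
    return {condensed_file_for(ch) for ch in term}
```

A kanji-only term such as “文書” then selects only condensed text J, which is the saving the embodiment aims for.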

【００４５】次に第十二の実施例について，図３０およ
び図３１を用いて説明する。本実施例は，特願平０２−
１９３０１５で提案した文書検索装置を用い，本発明を
実現したものである。本装置の主な構成は，キーボート
３００１，検索式解析プログラム３００２，ビットサー
チプロセッサ３００７ａ，ストリングサーチエンジン３
００６，複合条件判定用マイクロプロセッサ３０４５
ａ，検索結果格納メモリ３０４６，ディスプレイ３０２
０，半導体メモリ装置３０１０ａ，ＲＡＭディスク装置
３０１０ｂ，集合型磁気ディスク３０１０ｃ，及び検索
実行制御プログラム３００８よりなる。半導体メモリ装
置３０１０ａには文字成分表が，ＲＡＭディスク装置３
０１０ｂには凝縮本文，集合型磁気ディスク装置３０１
０ｃには本文がそれぞれ格納されている。但し，文字成
分表及び凝縮本文は，集合型磁気ディスク３０１０ｃに
格納されていて，本装置の運用開始時点でそれぞれ半導
体メモリ装置３０１０ａ及びＲＡＭディスク装置３０１
０ｂへローディングされる。Next, a twelfth embodiment will be described with reference to FIGS. 30 and 31. This embodiment implements the present invention using the document search apparatus proposed in Japanese Patent Application No. 02-193015. The main components of the apparatus are a keyboard 3001, a search expression analysis program 3002, a bit search processor 3007a, a string search engine 3006, a microprocessor 3045a for compound condition determination, a search result storage memory 3046, a display 3020, a semiconductor memory device 3010a, a RAM disk device 3010b, a collective magnetic disk device 3010c, and a search execution control program 3008. The character component table is stored in the semiconductor memory device 3010a, the condensed text in the RAM disk device 3010b, and the text in the collective magnetic disk device 3010c. Note, however, that the character component table and the condensed text are kept on the collective magnetic disk device 3010c and are loaded into the semiconductor memory device 3010a and the RAM disk device 3010b, respectively, when the apparatus starts operation.

【００４６】階層プリサーチ制御の手順は，いままで実
施例で説明してきたものと変わらない。いままでの実施
例との相違点は，文字成分表を半導体メモリ，凝縮本文
をＲＡＭディスク，本文を集合型磁気ディスクに格納し
たところと，文字成分表サーチ専用のマイクロプロセッ
サ，凝縮本文サーチ及び本文サーチ専用のストリングサ
ーチエンジンを用いていることである。検索処理の手順
を以下に説明する。The hierarchical pre-search control procedure is the same as that described in the preceding embodiments. The differences from the preceding embodiments are that the character component table is stored in semiconductor memory, the condensed text on a RAM disk, and the text on a collective magnetic disk, and that a microprocessor dedicated to the character component table search and a string search engine dedicated to the condensed text search and the text search are used. The search processing procedure is described below.

【００４７】キーボード３００１から入力した検索条件
式はサーチマシン制御用マイクロプロセッサＭＰＵ０３
０５０上の検索式解析プログラム３００２により解析さ
れる。すなわち、検索式解析プログラム３００２では検
索条件式を構成するキーワード部分とそれらの包含条件
及び配置条件を記述した複号条件記述部に分離する。包
含条件は論理条件として記述され、配置条件は近傍条件
や文脈条件として記述されたものである。分離抽出後、
キーワード部分は同じくＭＰＵ０３０５０上の同義語展
開プログラム３００３に渡され、複号条件記述部は複号
条件解析プログラム３０４１に渡される。同義語展開プ
ログラム３００３では、ここに内蔵された同義語辞書を
参照して、入力されたキーワードの同義語が求められ
る。そして、ここで同義語展開されたキーワード群は異
表記展開プログラム３００４へ渡される。本図の例の場
合、“計算機”から、“電算機”、“コンピュータ”、
“ＣＯＭＰＵＴＥＲ”などが生成される。異表記展開プ
ログラム３００４では、ここに入力されてきたキーワー
ド群に対して異表記展開処理が施される。本図の例の場
合、“コンピュータ”から“コンピューター”が、“Ｃ
ＯＭＰＵＴＥＲ”から“Ｃｏｍｐｕｔｅｒ”などが生成
される。こうして同義語及び異表記展開されたキーワー
ド群は、次にオートマトン生成用マイクロプロセッサＭ
ＰＵ１３００５ａ上のオートマトン生成用プログラム３
００５に送られる。オートマトン生成用プログラム３０
０５では、異表記展開プログラム３００４から送られて
きたキーワード群に対して、これらを一括照合するオー
トマトンを生成し、状態遷移テーブルと照合すべきキー
ワードの識別コード情報として、サーチエンジン３００
６に設定する。サーチエンジン３００６は有限オートマ
トン方式に基づく高速多重文字照合回路である。また、
異表記展開プログラム３００４で異表記展開されたキー
ワード群は、該当キーワードと共に、ビットサーチ用マ
イクロプロセッサＭＰＵ３３００７ａ上のビットサーチ
プログラム３００７へ渡される。The search condition expression entered from the keyboard 3001 is analyzed by the search expression analysis program 3002 running on the search machine control microprocessor MPU0 3050. Specifically, the search expression analysis program 3002 separates the search condition expression into the keyword portions that make it up and a compound condition description portion that describes their inclusion conditions and placement conditions. Inclusion conditions are described as logical conditions, and placement conditions as proximity conditions or context conditions. After this separation, the keyword portions are passed to the synonym expansion program 3003, which also runs on MPU0 3050, and the compound condition description portion is passed to the compound condition analysis program 3041. The synonym expansion program 3003 refers to its built-in synonym dictionary to obtain synonyms of the input keywords, and the keyword group produced by synonym expansion is passed to the variant notation expansion program 3004. In the example in the figure, “電算機”, “コンピュータ”, “COMPUTER”, and so on are generated from “計算機” (“computer”). The variant notation expansion program 3004 applies variant notation expansion to the keyword group it receives. In the example in the figure, “コンピューター” is generated from “コンピュータ”, “Computer” from “COMPUTER”, and so on. The keyword group expanded with synonyms and variant notations is then sent to the automaton generation program 3005 running on the automaton generation microprocessor MPU1 3005a. The automaton generation program 3005 generates an automaton that matches all the keywords received from the variant notation expansion program 3004 in a single pass, and sets it in the search engine 3006 as a state transition table together with the identification codes of the keywords to be matched. The search engine 3006 is a high-speed multiple-string matching circuit based on the finite automaton method. The keyword group expanded by the variant notation expansion program 3004 is also passed, together with the corresponding keywords, to the bit search program 3007 running on the bit search microprocessor MPU3 3007a.
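To illustrate what matching a whole keyword group with one automaton can look like in software, the sketch below uses the well-known Aho-Corasick construction. The text only states that the search engine is based on the finite automaton method, so the choice of Aho-Corasick, the table layout, and the function names are assumptions for illustration and not a description of the circuit.

```python
from collections import deque


def build_automaton(keywords):
    """Build a state transition table, failure links, and per-state sets of
    keyword identification codes (here, the index of each keyword)."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fallback state when no transition matches
    out = [set()]  # identification codes reported on entering a state

    # Phase 1: insert every keyword into a trie.
    for code, word in enumerate(keywords):
        state = 0
        for ch in word:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append(set())
            state = nxt
        out[state].add(code)

    # Phase 2: compute failure links breadth-first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def scan(text, automaton):
    """Scan text once and return the codes of all keywords found in it."""
    goto, fail, out = automaton
    state, hits = 0, set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits |= out[state]
    return hits
```

Built from the expanded keyword group of the example (“計算機”, “電算機”, “コンピュータ”, “コンピューター”, “COMPUTER”, “Computer”), a single pass over a document reports every variant that occurs, with no need to rescan the text per keyword.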

【００４８】一方，近傍条件，文脈条件や，ＡＮＤ，Ｏ
Ｒ等の論理条件は検索式解析プログラム３００２から，
複合条件解析プログラム３０４１，近傍条件解析プログ
ラム３０４２，文脈条件解析プログラム３０４３，論理
条件解析プログラム３０４４を経て複合条件判定プログ
ラム３０４５へと送られる。必要な検索情報がビットサ
ーチプログラム３００７，ストリングサーチエンジン３
００６，複合条件判定プログラム３０４５へ送られた
後，検索制御実行プログラム３００８は，まずビットサ
ーチプログラム３００７に起動を掛ける。ビットサーチ
プログラム３００７は，半導体メモリ装置３０１０ａに
格納してある文字成分表を読み出し，文字成分表サーチ
を行う。文字成分表サーチの結果は，検索結果格納メモ
リ３０４６へ格納する。Meanwhile, the proximity conditions, context conditions, and logical conditions such as AND and OR are sent from the search expression analysis program 3002, via the compound condition analysis program 3041, the proximity condition analysis program 3042, the context condition analysis program 3043, and the logical condition analysis program 3044, to the compound condition determination program 3045. Once the necessary search information has been sent to the bit search program 3007, the string search engine 3006, and the compound condition determination program 3045, the search execution control program 3008 first starts the bit search program 3007. The bit search program 3007 reads the character component table stored in the semiconductor memory device 3010a and performs the character component table search. The result of the character component table search is stored in the search result storage memory 3046.

【００４９】文字成分表サーチが終った後，検索実行制
御プログラム３００８は，検索結果格納メモリ３０４６
を参照し，検索結果が０件であれば，０件を検索結果と
して出力し検索処理を中断する。検索結果が０件でなけ
れば，ストリングサーチエンジン３００６へ起動をかけ
ると同時に検索結果格納メモリ３０４６に格納されてい
る文字成分表サーチの結果でヒットした文書の凝縮本文
をＲＡＭディスク装置２９１０ｂから読み出し，ストリ
ングサーチエンジン３００６へ送り，凝縮本文サーチを
実行させる。この結果件数が０件であるか否かの条件判
定は検索実行制御プログラム３００８で行う。ストリン
グサーチエンジン３００６では，ＲＡＭディスク装置３
０１０ｂより読み出された，凝縮本文を分割検索ターム
でサーチする。照合結果は複合条件判定プログラム３０
４５に順次送られる。複合条件判定プログラム３０４５
では，検索ターム間に付与された論理条件を判定し，条
件に適合する文書の文書番号を検索結果格納メモリ３０
４６へ順次格納する。After the character component table search has finished, the search execution control program 3008 refers to the search result storage memory 3046; if there are zero hits, it outputs zero hits as the search result and stops the search process. If there are hits, it starts the string search engine 3006 and at the same time reads from the RAM disk device 3010b the condensed texts of the documents hit by the character component table search, whose results are stored in the search result storage memory 3046, sends them to the string search engine 3006, and has it execute the condensed text search. The search execution control program 3008 makes the determination of whether the number of results is zero. The string search engine 3006 searches the condensed text read from the RAM disk device 3010b using the divided search terms. The matching results are sent one after another to the compound condition determination program 3045, which evaluates the logical conditions given between search terms and stores the document numbers of the documents that satisfy them one after another in the search result storage memory 3046.

【００５０】凝縮本文サーチが終了した後，検索実行制
御プログラム３００８は，もう一度検索結果格納メモリ
３０４６を参照し，結果件数が０件であれば，０件を検
索結果として出力し，検索を終了する。０件でない場合
で，近傍，文脈条件が設定されているか，もしくは分割
検索タームが検索タームと異なっている場合にかぎり検
索結果格納メモリから，検索結果文書番号を読み取り，
これに対応する本文を集合型磁気ディスク装置３０１０
ｃから読み出し，ストリングサーチエンジン３００６へ
送り，今度は本文サーチを実行させる。近傍，文脈条件
が設定されてなく，かつ分割検索タームが検索タームと
等しい場合には，検索結果格納メモリに格納されている
検索結果件数を出力し，検索を終了する。After the condensed text search has finished, the search execution control program 3008 refers to the search result storage memory 3046 once more; if there are zero hits, it outputs zero hits as the search result and ends the search. If there are hits, then only when a proximity or context condition is set, or when the divided search terms differ from the search term, does it read the result document numbers from the search result storage memory, read the corresponding texts from the collective magnetic disk device 3010c, send them to the string search engine 3006, and have it execute the text search. If no proximity or context condition is set and the divided search terms are equal to the search term, it outputs the number of search results stored in the search result storage memory and ends the search.

【００５１】ストリングサーチエンジン３００６では，
集合型磁気ディスク装置３０１０ｃから読み出された本
文をスキャンして本文サーチを行う。結果は複合条件判
定プログラム３０４５に順次送られる。複合条件判定プ
ログラム３０４５では，検索ターム間に付与された論理
条件のほか近傍，文脈条件を判定し，条件に適合する文
書の文書番号を順次検索結果格納メモリ３０４６へ格納
する。本文サーチまで実行した場合は，本文サーチの終
了後，検索実行制御プログラム３００８は，検索結果格
納メモリ３０４６を参照し検索結果件数を出力して検索
を終了する。このように，容量の大きな本文データを磁
気ディスクに，容量の小さな文字成分表や凝縮本文を，
半導体メモリやＲＡＭディスクに格納することにより，
大規模なデータベースに対しても高速なフルテキストサ
ーチを実現することが可能となる。The string search engine 3006 scans the text read from the collective magnetic disk device 3010c to perform the text search. The results are sent one after another to the compound condition determination program 3045, which evaluates the proximity and context conditions in addition to the logical conditions given between search terms and stores the document numbers of the documents that satisfy the conditions one after another in the search result storage memory 3046. When processing has proceeded as far as the text search, the search execution control program 3008, after the text search finishes, refers to the search result storage memory 3046, outputs the number of search results, and ends the search. By storing the large-volume text data on magnetic disk and the small-volume character component table and condensed text in semiconductor memory and on a RAM disk in this way, a fast full-text search can be achieved even for a large database.

【００５２】次に凝縮本文を磁気ディスクに格納した第
十三の実施例について説明する。凝縮本文を磁気ディス
クに格納する場合，階層型プリサーチの制御の手順を最
適化することによって，同一の構成を用いた通常の階層
型プリサーチを実行するよりも高速に処理することがで
きる。以下，この制御の手順について説明する。磁気デ
ィスクは通常，機械的に動く磁気ヘッドを持っている。
このため，ディスク上の情報を飛び飛びに読み出す（ス
キップアクセスと呼ぶ）よりも，まとまった情報を一括
して読み出す（シーケンシャルアクセスと呼ぶ）方が速
いという特徴がある。いま，スキップアクセスの読み出
し速度をＶｓｋｉｐＭＢ／ｓ，シーケンシャルアクセ
スの読み出し速度をＶｓｅｑＭＢ／ｓとすると，デー
タベース全件の文書数をＮａ件，文字成分表サーチの結
果件数をＮｃ件とし，文書の容量が均一であるとした場
合，Ｎｃ＞（Ｖｓｋｉｐ／Ｖｓｅｑ）・Ｎａ ……（１２−１）式のとき，シーケンシャルアクセスにより凝縮本文を全件
サーチした方が，文字成分表サーチの結果に基づいてス
キップアクセスするよりも処理時間が短くなる。したが
って，図３２に示すように文字成分表サーチの後，階層
プリサーチ制御プログラムにおいて結果件数を判定し，
（１２−１）式を満たすヒット件数に達した場合には，
文字成分表サーチの結果を無視して，凝縮本文をデータ
ベース全件分サーチする。以上の方法を用いると，磁気
ディスクに凝縮本文を格納するために，大容量のＲＡＭ
ディスクを使用しなくともすみ，比較的高速なフルテキ
ストサーチを低価格の文書検索装置で実現できることに
なる。Next, a thirteenth embodiment, in which the condensed text is stored on a magnetic disk, will be described. When the condensed text is stored on a magnetic disk, optimizing the control procedure of the hierarchical pre-search allows faster processing than executing the ordinary hierarchical pre-search on the same configuration. This control procedure is described below. A magnetic disk normally has a mechanically moving magnetic head.
For this reason, reading a block of information in one pass (called sequential access) is faster than reading information from scattered positions on the disk (called skip access). Let the read speed of skip access be Vskip MB/s and the read speed of sequential access be Vseq MB/s, let the number of documents in the whole database be Na and the number of results of the character component table search be Nc, and assume that all documents are the same size. Then, when Nc > (Vskip / Vseq) · Na ... (12-1) holds, searching all of the condensed text by sequential access takes less processing time than skip access based on the result of the character component table search. Therefore, as shown in FIG. 32, after the character component table search the hierarchical pre-search control program checks the number of results,
and when the number of hits satisfies expression (12-1),
it ignores the result of the character component table search and searches the condensed text for the entire database. With this method, because the condensed text is stored on a magnetic disk, a large-capacity RAM disk is not needed, and a comparatively fast full-text search can be realized with a low-cost document search apparatus.
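Expression (12-1) can be read as a comparison of two read times. With a uniform condensed-text size Q per document, skip access to the Nc hit documents takes Nc·Q/Vskip, while sequential access to all Na documents takes Na·Q/Vseq; the sequential scan is faster exactly when Nc > (Vskip/Vseq)·Na. The sketch below encodes that test. The function name and the numbers in the usage note are illustrative assumptions, not measurements from the patent.

```python
def should_scan_all_condensed_text(nc, na, v_skip, v_seq):
    """Expression (12-1): True if reading the whole condensed text
    sequentially is expected to beat skip access to the nc hit documents.

    nc     -- number of hits from the character component table search
    na     -- number of documents in the whole database
    v_skip -- read speed of skip access (MB/s)
    v_seq  -- read speed of sequential access (MB/s)
    """
    return nc > (v_skip / v_seq) * na
```

For example, with a hypothetical disk where skip access reaches 0.5 MB/s and sequential access 2.0 MB/s, the break-even point for a database of 100,000 documents is 25,000 hits: above that, the component table result is ignored and the whole condensed text is scanned.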

【００５３】次に凝縮本文を磁気ディスクに格納した第
十四の実施例について説明する。近傍，文脈条件が指定
されている場合には，文字成分表サーチ結果が非常に少
ない場合，凝縮本文サーチを行わずに，文字成分表サー
チ結果をもとに本文を直接サーチするほうが検索時間が
短くなる。今，凝縮本文のサーチ速度をＶｓｒＭＢ／
ｓ，本文のサーチ速度をＶｔｘＭＢ／ｓとし，文字成
分表の結果件数をＮｃ，凝縮本文の結果件数をＮｓ
ｒ，凝縮本文の１件当たりのデータ容量をＱｓｒ，本
文の１件当たりのデータ容量をＱｔｘとすると，ＮｃＱｓｒ／Ｖｓｒ＋ＮｓｒＱｔｘ／Ｖｔｘ＞ＮｃＱｔｘ／Ｖｔｘ …………（１３−１）式のとき，凝縮本文サーチをせずに，本文サーチを直接行
ったほうが検索時間が短くなる。Ｎｓｒは凝縮本文を
実際にサーチするまでわからないが，あらかじめ定数を
設定して凝縮本文サーチを行うか否か決定することにな
る。たとえば，データベース全体の文書数をＮａとし
てＮｓｒ＝αＮａ（０＜α＜１） …………（１３−２）式として，（１３−１）式を変形すると，Ｎｃ＜ αＮａ（Ｑｔｘ／Ｖｔｘ）／（Ｑｔｘ／Ｖｔｘ−Ｑｓｒ／Ｖｓｒ） …………（１３−３）式のとき，本文サーチを直接行うことにする。αをしきい
値として検索前にあらかじめ値を設定しておき，文字成
分表サーチの後（１３−３）式により凝縮本文サーチを
行うか否か決定する。この制御を行うことにより，近
傍，文脈条件の指定の下で高速なフルテキストサーチを
実現することができる。以上，第十二の実施例の廉価版
のシステム構成でフルテキストサーチを実現する第十
三，第十四の実施例について説明した。Next, a fourteenth embodiment, in which the condensed text is stored on a magnetic disk, will be described. When a proximity or context condition is specified and the character component table search returns very few results, the search time is shorter if the condensed text search is skipped and the text is searched directly on the basis of the character component table search result. Let the search speed for the condensed text be Vsr MB/s and the search speed for the text be Vtx MB/s, let the number of results of the character component table search be Nc and the number of results of the condensed text search be Nsr, and let the data size per document be Qsr for the condensed text and Qtx for the text. Then, when Nc·Qsr/Vsr + Nsr·Qtx/Vtx > Nc·Qtx/Vtx ... (13-1) holds, searching the text directly without a condensed text search gives the shorter search time. Nsr is not known until the condensed text has actually been searched, so a constant is set in advance and used to decide whether to perform the condensed text search. For example, letting Na be the number of documents in the whole database and setting Nsr = α·Na (0 < α < 1) ... (13-2), expression (13-1) can be rearranged so that the text is searched directly when Nc < α·Na·(Qtx/Vtx) / (Qtx/Vtx − Qsr/Vsr) ... (13-3) holds. The value of α is set as a threshold before the search, and after the character component table search expression (13-3) is used to decide whether to perform the condensed text search. This control achieves a fast full-text search when proximity or context conditions are specified. The thirteenth and fourteenth embodiments, which realize full-text search with a low-cost version of the system configuration of the twelfth embodiment, have been described above.
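The decision in expression (13-3) can be sketched as a small function. Substituting Nsr = α·Na into (13-1) and collecting the Nc terms gives α·Na·(Qtx/Vtx) > Nc·(Qtx/Vtx − Qsr/Vsr), which yields (13-3) as long as the per-document text search time exceeds the per-document condensed text search time. The function name, the handling of the case where that does not hold, and the numbers in the usage note are assumptions added for illustration.

```python
def skip_condensed_text_search(nc, na, alpha, q_sr, v_sr, q_tx, v_tx):
    """Expression (13-3): True if searching the text directly is expected
    to be faster than searching the condensed text first.

    nc    -- hits from the character component table search
    na    -- documents in the whole database
    alpha -- preset threshold, 0 < alpha < 1 (assumes Nsr = alpha * na)
    q_sr, v_sr -- size per document (MB) and search speed (MB/s), condensed text
    q_tx, v_tx -- size per document (MB) and search speed (MB/s), text
    """
    t_sr = q_sr / v_sr  # time to search one condensed text
    t_tx = q_tx / v_tx  # time to search one text
    if t_tx <= t_sr:
        # Assumption for this sketch: if the condensed text is no cheaper
        # to search than the text, the intermediate stage never pays off.
        return True
    return nc < alpha * na * t_tx / (t_tx - t_sr)
```

With hypothetical values of 100,000 documents, α = 0.01, condensed texts of 0.002 MB and texts of 0.02 MB, both searched at 2.0 MB/s, the threshold comes out at a little over 1,100 hits: below it the text is searched directly, above it the condensed text search runs first.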

【００５４】このほかにも，凝縮本文をまったく使用せ
ず凝縮本文サーチのステップを省いて，文字成分表サー
チから直接本文サーチを実行する制御方法によっても階
層型プリサーチを実現することができる。この方法によ
れば，本文をスキャンする量が増えるため検索時間は多
少掛かるが，高価なＲＡＭディスクを使用しなくとも済
み，また凝縮本文を格納する磁気ディスク容量が不要と
なるため，さらに低価格の文書検索装置を実現できるこ
とになる。また，文字成分表を使用せずに直接ＲＡＭデ
ィスクあるいは磁気ディスク上の凝縮本文を全件サーチ
し，近傍，文脈条件などの検索ターム間の位置関係の検
索条件指定があるときにのみ本文サーチする制御方法に
よっても階層型プリサーチを実現することができる。こ
の方法によれば，凝縮本文の探索量が増えるため検索時
間は多少掛かるが，文字成分表を格納する半導体メモリ
が不要となるため，その分低価格の文書検索装置を実現
できることになる。In addition, a hierarchical pre-search can also be realized by a control method that does not use a condensed text at all: the condensed text search step is omitted and the text search is executed directly after the character component table search. With this method the search takes somewhat longer because more text is scanned, but an expensive RAM disk is not needed and no magnetic disk capacity is required for storing the condensed text, so an even lower-cost document search apparatus can be realized. A hierarchical pre-search can also be realized by a control method that does not use the character component table: all of the condensed text on the RAM disk or magnetic disk is searched directly, and a text search is performed only when a search condition on the positional relationship between search terms, such as a proximity or context condition, is specified. With this method the search takes somewhat longer because more condensed text is searched, but the semiconductor memory for storing the character component table is no longer needed, so the document search apparatus can be made correspondingly cheaper.

[0055] Alternatively, instead of the bit-list character component table used in the embodiments so far, a character component table of the form shown in FIG. 33 can be used, in which the characters appearing in a document are written out in sequence; that is, each character is stored as its character code itself rather than being represented by one bit. In this case, the hash function described in the fifth, sixth, and seventh embodiments can also be used to map a plurality of characters to a single character entry and thereby reduce the size of the character component table. A character component table search over such a table of character codes is carried out, like the condensed text search and the text search, by reading data from the file one character at a time and determining whether the relevant characters are present. Using a character component table that collects only the characters used in the text thus simplifies the data structure and allows the same scan-type search as the condensed text search and the text search to be used, without bit operations, so that the search processing method is simplified.
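
A character component table stored as character codes, and the scan-type search over it, might look like the following Python sketch. All names are hypothetical. The hashed variant uses the code point modulo a fixed number of entries purely as a stand-in; the hash functions of the fifth to seventh embodiments are not reproduced here.

```python
def build_char_table(text):
    """Distinct characters of one document, in order of first appearance."""
    return "".join(dict.fromkeys(text))


def table_contains_all(table, term):
    """Scan the table one character at a time, ticking off the characters
    of the search term; succeed once every character has been seen."""
    wanted = set(term)
    for ch in table:
        wanted.discard(ch)
        if not wanted:
            return True
    return not wanted


def build_hashed_table(text, entries=64):
    """Variant: several characters share one entry (stand-in hash only)."""
    return sorted({ord(ch) % entries for ch in text})


def hashed_table_contains_all(table, term, entries=64):
    """A hit may be a false positive because characters share entries;
    later search stages are expected to reject such documents."""
    return {ord(ch) % entries for ch in term} <= set(table)


table = build_char_table("文書検索の文書")   # "文書検索の"
hit = table_contains_all(table, "検索")      # True
miss = table_contains_all(table, "全文")     # False: "全" is not in the table
```

The hashed table trades some precision for size: it never rejects a document that actually contains the characters, but it can let through documents that do not.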

[0056] Furthermore, hierarchical presearch can also be realized in a configuration in which the character component table, too, is stored on the magnetic disk. In that case, the character component table search reads the bit lists of the characters used in the search term sequentially from the magnetic disk while performing the bit operations. Alternatively, when the character codes themselves are used as the character component table as described above, the table is read sequentially and the documents that contain all of the relevant characters are selected. Storing the character component table on the magnetic disk makes semiconductor memory unnecessary, so an even lower-priced document retrieval apparatus can be realized.
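
Reading bit lists from disk and combining them with bit operations can be sketched as follows in Python. The file layout (one fixed-length bit list per character, in the order of a known character list), the function names, and the sample documents are assumptions made for this illustration; an ordinary temporary file stands in for the magnetic disk.

```python
import os
import tempfile


def write_table(path, texts, alphabet):
    """Write one fixed-length bit list per character of `alphabet`;
    bit i of a list is set when document i contains that character."""
    nbytes = (len(texts) + 7) // 8
    with open(path, "wb") as f:
        for ch in alphabet:
            bits = sum(1 << i for i, t in enumerate(texts) if ch in t)
            f.write(bits.to_bytes(nbytes, "little"))
    return nbytes


def component_table_search(path, term, alphabet, nbytes, ndocs):
    """Read the bit list of each character of the term from the file and
    AND them together; the surviving bits are the candidate documents."""
    result = (1 << ndocs) - 1  # start with every document as a candidate
    with open(path, "rb") as f:
        for ch in dict.fromkeys(term):
            if ch not in alphabet:
                return []  # character occurs in no registered document
            f.seek(alphabet.index(ch) * nbytes)
            result &= int.from_bytes(f.read(nbytes), "little")
    return [i for i in range(ndocs) if result >> i & 1]


texts = ["文書検索", "全文検索", "索引"]
alphabet = "".join(dict.fromkeys("".join(texts)))  # "文書検索全引"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "table.bin")
    nbytes = write_table(path, texts, alphabet)
    both = component_table_search(path, "検索", alphabet, nbytes, len(texts))
    first = component_table_search(path, "文書", alphabet, nbytes, len(texts))
```

Here `both` is `[0, 1]` and `first` is `[0]`: only the bit lists of the characters in the search term are read, so the amount of data transferred from disk stays small even though the table is not held in memory.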

[0057]

[Effects of the Invention] According to the present invention, the character component table and the condensed text are used to screen out, hierarchically at the character level and then at the word level, documents that are unrelated to the input search term. Because needless text searches are thereby avoided, the invention provides an equivalently high-speed full-text search, making full-text search possible at a practical response speed even on a large-scale document database.

[Brief description of the drawings]

FIG. 1 is a diagram showing the configuration of a first embodiment of the present invention.

FIG. 2 is a diagram showing the registration process for hierarchical presearch, which is a feature of the present invention.

FIG. 3 is a diagram showing the search process of hierarchical presearch, which is a feature of the present invention.

FIG. 4 is a diagram showing an example of creating a condensed text.

FIG. 5 is a diagram showing the storage form of the condensed text.

FIG. 6 is a diagram showing an outline of the character component table.

FIG. 7 is a diagram showing an outline of the character component table search.

FIG. 8 is a diagram showing the processing procedure of hierarchical presearch.

FIG. 9 is a diagram showing the character component table search process in the third embodiment.

FIG. 10 is a PAD diagram showing the code conversion process for the character component table used in the fourth embodiment.

FIG. 11 is a diagram showing an outline of the code conversion for the character component table used in the fourth embodiment.

FIG. 12 is a diagram showing an outline of the character component table used in the fourth embodiment.

FIG. 13 is a diagram showing an outline of the character component table used in the fifth embodiment.

FIG. 14 is a diagram showing the processing procedure of the hierarchical presearch used in the fifth embodiment.

FIG. 15 is a diagram showing an outline of the character component table used in the sixth embodiment.

FIG. 16 is a diagram showing the processing procedure of the hierarchical presearch used in the sixth embodiment.

FIG. 17 is a diagram showing an outline of the method of creating the character component table used in the seventh embodiment.

FIG. 18 is a diagram showing an outline of the character code to entry number correspondence table used by the hash function for the character component table in the seventh embodiment.

FIG. 19 is a diagram showing the method of creating the condensed text used in the eighth embodiment.

FIG. 20 is a diagram showing the method of processing hiragana character strings for the condensed text used in the eighth embodiment.

FIG. 21 is a diagram showing the basic word dictionary for attached word analysis used in the eighth embodiment.

FIG. 22 is a diagram showing the connection rules for attached word analysis used in the eighth embodiment.

FIG. 23 is a diagram showing the processing procedure of the hierarchical presearch used in the eighth embodiment.

FIG. 24 is a diagram showing the method of creating the condensed text used in the ninth embodiment.

FIG. 25 is a diagram showing the processing procedure of the hierarchical presearch used in the ninth embodiment.

FIG. 26 is a diagram showing the method of creating the condensed text used in the tenth embodiment.

FIG. 27 is a diagram showing the processing procedure of the hierarchical presearch used in the tenth embodiment.

FIG. 28 is a diagram showing the method of creating the condensed text used in the eleventh embodiment.

FIG. 29 is a diagram showing the processing procedure of the hierarchical presearch used in the eleventh embodiment.

FIG. 30 is a diagram showing part of the configuration of the twelfth embodiment.

FIG. 31 is a diagram showing the remaining part of the configuration of the twelfth embodiment.

FIG. 32 is a diagram showing the processing procedure of the hierarchical presearch used in the twelfth embodiment.

FIG. 33 is a diagram showing an outline of a character component table stored as characters.

Continuation of front page

(72) Inventor: Hisamitsu Kawaguchi, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo
(72) Inventor: 嶺岸直材, c/o Information Systems Factory, Hitachi, Ltd., 890-12 Kashimada, Saiwai-ku, Kawasaki-shi, Kanagawa

(56) References:
JP-A-57-137965 (JP, A)
JP-A-63-244259 (JP, A)
JP-A-2-253474 (JP, A)
JP-A-62-211728 (JP, A)
JP-A-59-112339 (JP, A)
Patent 2986865 (JP, B2)
Kanji Kato et al., "Development of a text search machine for full-text search", IEICE Technical Report (DE89-38), Vol. 89, No. 335, 1989 (Heisei 01-12-14), pp. 17-24
Tadakazu Kikuchi et al., "A method for speeding up full-text search by coding that includes the attributes and character positions of constituent characters", IEICE Technical Report (DE90-24), Vol. 90, No. 362, 1990 (Heisei 02-12-14), pp. 1-7

(58) Field surveyed (Int. Cl.7, DB name): G06F 17/30; JICST file (JOIS)

Claims

(57) [Claims]

1. A full-text search method for retrieving documents that contain a keyword specified by a searcher by referring to the text content of a document database in which document information is stored as character code data, the method comprising, at the time of document registration:
a step of creating a condensed text by dividing the body character string of the document to be registered by character type, such as hiragana, kanji, and alphanumeric characters, removing all hiragana character strings, and then mutually examining the inclusion relations among the partial character strings obtained for each character type, so that the condensed text consists of the set of partial character strings that remain after excluding every character string included in another character string;
a step of creating a character component table in which the characters appearing in the original text are registered without duplication; and
a step of registering in the document database, in addition to the text of the document to be registered, the plurality of condensed texts corresponding to the character types and the character component table;
and, at the time of search:
first, a character component table search step of referring to the character component table and extracting the documents that contain all of the kinds of characters constituting the keyword specified by the searcher; and
determining whether a hiragana character string is present among the partial character strings constituting the keyword, wherein, if no hiragana character string is present, a condensed text search step refers to the condensed texts of the documents extracted by the character component table search and extracts only the documents that contain the specified partial character strings, thereby selecting the documents containing the specified keyword, and a text search step then refers to the text of the documents thus narrowed down and extracts only those satisfying search conditions such as a positional relationship given between keywords, and, otherwise, a text search step refers to the original text of the documents extracted by the character component table search and extracts only those that contain the specified partial character strings and satisfy search conditions such as a positional relationship given between keywords;
whereby an equivalently high-speed full-text search is performed.
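
The registration and search steps recited in claim 1 can be illustrated with the following Python sketch. It is a reading of the claim under stated assumptions, not the patented implementation: all function names are hypothetical, character types are decided by Unicode ranges, characters outside the listed types (punctuation and the like) are treated in the same way as hiragana, and a positional condition is modeled simply as the keyword occurring contiguously in the text.

```python
from itertools import groupby

SKIPPED = {"hiragana", "other"}  # types left out of the condensed text (assumption for "other")


def char_type(ch):
    if "\u3040" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    if ch.isascii() and ch.isalnum():
        return "alnum"
    return "other"


def split_by_type(s):
    """Divide a string into runs of characters of the same type."""
    return [(t, "".join(g)) for t, g in groupby(s, key=char_type)]


def condense(text):
    """Per character type, keep only runs not included in another run."""
    runs_by_type = {}
    for t, run in split_by_type(text):
        if t not in SKIPPED:
            runs_by_type.setdefault(t, set()).add(run)
    return {t: [r for r in sorted(runs)
                if not any(r != o and r in o for o in runs)]
            for t, runs in runs_by_type.items()}


def register(texts):
    """Store the text, the character component table, and the condensed text."""
    return {d: {"text": t, "chars": set(t), "condensed": condense(t)}
            for d, t in texts.items()}


def search(keyword, db):
    parts = split_by_type(keyword)
    # Character component table search: all characters of the keyword must occur.
    hits = [d for d, e in db.items() if set(keyword) <= e["chars"]]
    if not any(t in SKIPPED for t, _ in parts):
        # Condensed text search: every partial string must occur in a kept run.
        hits = [d for d in hits
                if all(any(p in r for r in db[d]["condensed"].get(t, []))
                       for t, p in parts)]
    # Text search on the documents that remain.
    return [d for d in hits if keyword in db[d]["text"]]


db = register({1: "文書検索システムは文書を検索する",
               2: "検索の文書",
               3: "システム文書"})
```

For document 1 the kanji runs are 文書検索, 文書, and 検索, and only 文書検索 is kept because the other two are included in it. Searching for 文書検索, 検索システム, or 文書を検索 each returns `[1]`; the last keyword contains hiragana, so it bypasses the condensed text search and goes from the character component table search straight to the text search.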