JPH06208582A

JPH06208582A - Method and device for retrieving adaptive surrogate type information

Info

Publication number: JPH06208582A
Application number: JP5251090A
Authority: JP
Inventors: Atsushi Hatakeyama (畠山 敦); Satoshi Asakawa (浅川 悟志); Kanji Kato (加藤 寛次)
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1992-09-18
Filing date: 1993-09-13
Publication date: 1994-07-26
Anticipated expiration: 2018-01-14
Also published as: JP3365833B2

Abstract

PURPOSE: To shorten processing time and reduce the number of disks used by efficiently executing full-text search over documents whose text sizes differ widely. CONSTITUTION: Text data are divided or merged according to a predetermined condition to create surrogate data (condensed texts 1 to 7, character components 1 to 7, and tables), and a correspondence table between the divided or merged surrogate data and the original text data is created. At search time, surrogate data candidates that may satisfy the conditional expression are extracted by referring to the surrogate data, the corresponding text data are searched by way of the correspondence table, and finally the text data matching the conditional expression are output. Because the quantity of text data to be scanned is reduced, high-speed full-text search is attained.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to document retrieval in which a desired document is found by searching the full text of the documents in a document database for a specified character string, that is, a search term. In particular, it relates to an information retrieval method and apparatus suited to cases where the text size of the documents registered in the document database varies greatly from document to document.

[0002]

[Prior Art] The inventors previously proposed, in Japanese Patent Laid-Open No. Hei 03-174652, a full-text search method that requires no keyword assignment when documents are registered. That method aims to perform full-text search at a practical speed by using a condensed text, in which a document is compressed word by word, and a character component table, in which the characters used in the document are registered one character at a time. The condensed text and the character component table used in that method are referred to here as surrogate data. The known method examines the surrogate data before scanning the actual text data in order to narrow down the candidate documents to be scanned, and then scans the text data of only those candidates. This reduces the amount of search processing and, as a result, achieves an effectively high-speed full-text search. The known method calls this preliminary narrowing the pre-search. The pre-search method is described in detail in Japanese Patent Application No. Hei 03-058311.

[0003] The surrogate data proposed in the above known method is outlined below. First, the condensed text, one kind of surrogate data, is created by dividing the text data of a document to be registered at the points where the character type changes and then eliminating, from the resulting character strings, those that duplicate another string; the original text data is thereby compressed. In other words, the more often the same words appear in the text data of the original document, the smaller the condensed text becomes relative to the original text data, so that small surrogate data can be created. The smaller the data, the less condensed text has to be scanned overall, and the faster the condensed text search becomes. This was the purpose of creating surrogate data. With this creation method, however, for a concise document that seldom repeats the same words, the difference in data volume between the text data and the surrogate data disappears. In the extreme case, where a single sentence is registered as one document, the two data volumes may be exactly the same. In such cases, searching the text data directly requires less total searching, pre-search included, than extracting scan candidates from the surrogate data and then searching the text data. That is, searching the surrogate data is wasted effort, and omitting it yields a faster full-text search.
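The condensed-text construction outlined above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the cited method: the five character classes, the NFKC normalization of full-width characters, and the names `RUN`, `split_at_type_changes`, and `condensed_text` are all introduced here for illustration.

```python
import re
import unicodedata

# Assumed character classes: hiragana, katakana, kanji, Latin letters, digits.
# Anything else (punctuation, spaces) is simply not matched and is dropped.
RUN = re.compile(
    r"[\u3040-\u309F]+|[\u30A0-\u30FF]+|[\u4E00-\u9FFF]+|[A-Za-z]+|[0-9]+"
)

def split_at_type_changes(text):
    """Split text into runs of a single character type."""
    return RUN.findall(unicodedata.normalize("NFKC", text))

def condensed_text(text):
    """Keep the first occurrence of each run and discard later duplicates."""
    seen, out = set(), []
    for run in split_at_type_changes(text):
        if run not in seen:
            seen.add(run)
            out.append(run)
    return out

# Repeated words shrink the data: nine runs condense to six.
print(condensed_text("文書DBの検索と文書DBの登録"))
# A one-phrase document gains nothing: the condensed text equals the original.
print(condensed_text("全文検索"))
```

The second call illustrates the extreme case in the paragraph above: when nothing repeats, the surrogate is as large as the text it stands for.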

[0004] The character component table, the other kind of surrogate data, describes the characters appearing in the text data of a document as bit information. That is, a bit string of fixed length is provided as surrogate data, and for each character appearing in the text data, occurrence information is recorded at the corresponding bit position, that is, the bit is set to '1'. The bit string thus represents character-level surrogate information for the text data. This method always requires a fixed amount of surrogate data regardless of the document's data volume. Consequently, when the document is large, the surrogate data is very small in comparison, whereas when the document is small, the surrogate data may end up larger than the original data. That is, a small document carries a great deal of unnecessary data. Moreover, when a document is very large, many kinds of characters inevitably appear, so many bits of the character component table are set. As a result, the character component table search returns many documents as candidates and the narrowing is insufficient. In addition, when the candidate documents are large, correspondingly more text data has to be searched and the search time grows. In short, the narrowing by surrogate data loses its effect, and many condensed texts or a large volume of body text must be consulted. As a result, the search time cannot be shortened.
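A fixed-length character component of the kind described above might look like the sketch below. The bit length `NBITS` and the mapping from a character to a bit position (`ord(ch) % nbits`) are assumptions made for illustration; the source does not specify either.

```python
NBITS = 256  # assumed fixed length of one character component, in bits

def char_component(text, nbits=NBITS):
    """Set one bit for every character that occurs in the text."""
    bits = 0
    for ch in text:
        bits |= 1 << (ord(ch) % nbits)  # assumed character-to-bit mapping
    return bits

def may_contain(component, term, nbits=NBITS):
    """True if every character of the term has its bit set.
    A True result is only a candidate; the text itself must still be checked."""
    needed = char_component(term, nbits)
    return (component & needed) == needed

short_doc = "文書DBの検索"
component = char_component(short_doc)
print(may_contain(component, "文書DB"))   # all four characters are present
print(may_contain(component, "登録"))     # neither character is present
print(NBITS // 8, len(short_doc.encode("utf-8")))  # surrogate bytes vs. text bytes
```

With these assumed figures the 32-byte component is larger than the 20-byte text it summarizes, which is the small-document problem. At the other extreme, a document that uses enough distinct characters sets nearly every bit, and `may_contain` then accepts almost any term.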

[0005] To summarize, with the conventional methods, when each document is small, the surrogate data (the character component table and the condensed text) becomes large relative to the document data, and the pre-search processing time increases as a result. Conversely, when each document is large, the preliminary candidate selection using surrogate data, that is, the pre-search, hits many documents, so that a large amount of text data has to be scanned.

[0006] Apart from the volume of surrogate data, there is also a problem with how text data is read. As described in "2.3 Block Buffering System" on p. 51 of "UNIX Device Driver" (ASCII Publishing), when data is read from a storage medium such as a magnetic disk into memory, it is read from the medium into memory in batches of a fixed buffer size. With this way of reading, the read time is almost the same whether the unit read is much smaller than the buffer size or close to it. Consequently, when text data of small size is read from the medium, the read does not become faster even though the data is small.
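The effect of fixed-size buffering on small reads can be shown with a little arithmetic. The buffer size of 4096 bytes and the function name `buffer_reads` are assumptions for illustration; the source gives no figure.

```python
import math

BUFFER_SIZE = 4096  # assumed buffer size in bytes

def buffer_reads(nbytes, buffer_size=BUFFER_SIZE):
    """Number of fixed-size buffer transfers needed to read nbytes,
    assuming the data starts on a buffer boundary."""
    return max(1, math.ceil(nbytes / buffer_size))

for size in (200, 4000, 4097):
    print(size, "bytes ->", buffer_reads(size), "transfer(s)")
```

Under these assumptions a 200-byte document and a 4000-byte document each cost one transfer, so reading the small document is no cheaper.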

[0007] As a method of shortening the overhead of reading from a storage medium such as a magnetic disk, namely head seek and rotational latency, Japanese Patent Application No. Hei 4-46685 is already known. In this technique, a plurality of storage media are connected and operated in parallel to read data that has been divided and stored across them, so that the overhead times of the individual media overlap and read efficiency improves.

[0008]

[Problems to Be Solved by the Invention] An object of the present invention is to achieve a consistently stable system search speed in full-text search, even when document sizes vary, by appropriately adjusting the amount of surrogate data. Another object of the present invention is to increase the system search speed in full-text search.

[0009]

[Means for Solving the Problems] To achieve the above objects, the present invention merges the text data of a plurality of documents under a predetermined condition, creates the surrogate data for the merged text data, and creates a correspondence table between the text data and the created surrogate data. At search time, surrogate data candidates are selected by referring to the surrogate data, the text data corresponding to each candidate is determined from the correspondence table, and documents matching the condition are retrieved by referring only to that text data. Alternatively, a predetermined data capacity Qt is set in advance. When surrogate data is created, text data is merged as long as its size is below Qt, and the surrogate data is created from the merged text. If a single piece of text data is Qt or larger, it is divided into pieces that do not exceed Qt, the surrogate data is created separately for each piece, and a correspondence table between the text data and the created surrogate data is created. At search time, surrogate data candidates are again selected by referring to the surrogate data, the text data corresponding to each candidate is determined from the correspondence table, and documents matching the condition are retrieved by referring only to that text data. When an AND condition between search terms is given as the search condition expression, surrogate data created by merging a plurality of documents is searched with the AND condition between the terms, whereas surrogate data created separately for each piece of a single document that was divided into a plurality of pieces of text data is searched with an OR condition between the terms. In this case, the merged surrogate data and the per-piece surrogate data are stored in separate files. Furthermore, a plurality of external storage devices are provided to store the text data of documents as files. The text data of each document is stored on the external storage devices in turn, and the text data of a document exceeding a predetermined size is divided and stored across a plurality of the devices. At search time, document candidates are selected by referring to the surrogate data, and the text data corresponding to the candidates is read from the plurality of external storage devices in a single batch and consulted, whereby documents matching the condition are retrieved.

[0010]

[Operation] By combining the text data of several documents, or dividing a single document into several pieces, to obtain data of a suitable size before creating the surrogate data, surrogate data with both a high compression ratio and a high narrowing ratio can be created. As a result, an effectively high-speed full-text search is achieved. In addition, because the correspondence table between text data and surrogate data is created when documents are registered, it is possible to identify which text data a candidate corresponds to even when the candidate was extracted from surrogate data created by merging or dividing text data. Hence, even for surrogate data that combines several documents into one, the corresponding original text data can be consulted and an accurate search is possible. Furthermore, the search time remains short even when an AND condition is given, and providing a plurality of external storage devices shortens the search time.

[0011]

[Embodiments] First, an outline of the search processing in the present invention is described. Two kinds of surrogate data are used. The first is the condensed text, which compresses the original data by dividing the text data into character strings at the points where the character type changes, removing the hiragana strings from among them, and eliminating duplicates among the remaining strings. The second is the character component table, which describes the characters appearing in the text data as a bit string. Here, the data in the character component table corresponding to one document is called a character component. It is also assumed that a plurality of documents are registered at once when data is registered.

[0012] The surrogate data registration process is described first, followed by the search process. When surrogate data is created, the data volume of the document to be registered is first checked and compared with a preset threshold Qt. If the data volume is smaller than Qt, the document is merged with the next document. This merging is repeated until the data volume exceeds Qt, and one piece of surrogate data is created from the plurality of documents. If the data volume of a single document exceeds Qt, the document is divided and a plurality of pieces of surrogate data are created. FIG. 1 illustrates this. In FIG. 1, one condensed text (condensed text 1) and one character component (character component 1) are produced as surrogate data from the text data of two documents, document 1 and document 2. Likewise, the surrogate data of condensed text 2 and character component 2 is created from the text data of three documents, documents 3 to 5. Document 6, conversely, is divided into three pieces of surrogate data, condensed texts 3, 4, and 5.
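The merge-or-divide decision at registration can be sketched as below. This is an illustration under assumptions, not the patented procedure itself: the function name `plan_surrogates`, the use of `>=` at the threshold, the document sizes, and division at fixed offsets (rather than at sentence or paragraph breaks) are all hypothetical choices made to keep the example short.

```python
def plan_surrogates(doc_sizes, qt):
    """Group documents into surrogate units.

    doc_sizes: list of (doc_id, size) in registration order.
    Returns a list of units; each unit is a list of (doc_id, start, end)
    slices from which one piece of surrogate data would be created.
    """
    units, pending, pending_size = [], [], 0
    for doc_id, size in doc_sizes:
        if size >= qt:
            if pending:  # close the open merged group first
                units.append(pending)
                pending, pending_size = [], 0
            # Divide the large document into pieces no larger than qt.
            for start in range(0, size, qt):
                units.append([(doc_id, start, min(start + qt, size))])
        else:
            pending.append((doc_id, 0, size))
            pending_size += size
            if pending_size >= qt:  # merged group has reached the threshold
                units.append(pending)
                pending, pending_size = [], 0
    if pending:
        units.append(pending)
    return units

# Hypothetical sizes chosen so that the grouping matches FIG. 1 as described.
sizes = [(1, 600), (2, 500), (3, 300), (4, 300), (5, 450), (6, 2500)]
for number, unit in enumerate(plan_surrogates(sizes, qt=1000), start=1):
    print("surrogate", number, "<-", unit)
```

With these sizes, documents 1 and 2 form surrogate 1, documents 3 to 5 form surrogate 2, and document 6 is divided into surrogates 3, 4, and 5.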

[0013] When surrogate data is created from text data divided or merged in this way, the text data and the surrogate data no longer correspond one to one. Therefore, a correspondence table such as that shown in FIG. 2 is created so that, after candidate documents have been extracted by searching the surrogate data, the documents to which the result originally corresponds can be determined. The table registers text data numbers against surrogate data numbers. For example, surrogate data 1 was created by combining document 1 and document 2, so text data 1 and 2 are registered for it in the correspondence table.
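A correspondence table of this kind can be held as a simple mapping from surrogate number to text data numbers. The sketch below is illustrative only; the names `build_correspondence_table` and `surrogates_of` and the list-of-lists input format are assumptions.

```python
def build_correspondence_table(units):
    """Map surrogate number (1-based, in creation order) to the sorted
    list of text data numbers it was created from."""
    return {number: sorted(set(doc_ids))
            for number, doc_ids in enumerate(units, start=1)}

def surrogates_of(table, doc_id):
    """Reverse lookup: which surrogates cover a given document."""
    return [number for number, docs in table.items() if doc_id in docs]

# Units as described for FIG. 1: two merged groups, then document 6 in three pieces.
table = build_correspondence_table([[1, 2], [3, 4, 5], [6], [6], [6]])
print(table[1])                  # text data registered for surrogate 1
print(surrogates_of(table, 6))   # surrogates created from document 6
```

The table is needed in both directions of mismatch: one surrogate may stand for several documents, and one document may be spread over several surrogates.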

[0014] At search time, candidate documents are first extracted from the surrogate data for the search terms in the given conditional expression, and the text data of the candidates thus obtained is then scanned to retrieve the documents that truly match the conditional expression. Specifically, in the character component table, documents containing all the characters that appear in a search term are extracted as candidates. In the condensed text, the search term is divided into partial character strings by character type, and documents containing all of those partial strings are extracted as candidates. The text data of the resulting candidates is then scanned, only the documents that actually contain the search terms and also satisfy the logical conditions specified between the terms are extracted, and these are output as the final result.
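The two-stage pre-search followed by a text scan can be sketched as follows. Several simplifications are assumed and are not taken from the source: a Python set of characters stands in for the character component bit string, a partial string counts as contained when it is a substring of some condensed-text run, hiragana runs of the search term are skipped because they are absent from the condensed text, and every name (`parts_of`, `make_surrogate`, `presearch`) is hypothetical.

```python
import re
import unicodedata

RUN = re.compile(
    r"[\u3040-\u309F]+|[\u30A0-\u30FF]+|[\u4E00-\u9FFF]+|[A-Za-z]+|[0-9]+"
)
HIRAGANA = re.compile(r"[\u3040-\u309F]+")

def parts_of(text):
    """Runs of a single character type, with hiragana runs dropped."""
    runs = RUN.findall(unicodedata.normalize("NFKC", text))
    return [run for run in runs if not HIRAGANA.fullmatch(run)]

def make_surrogate(text):
    norm = unicodedata.normalize("NFKC", text)
    # "chars" stands in for the character component bit string.
    return {"chars": set(norm), "condensed": set(parts_of(text))}

def presearch(term, surrogates):
    """Return the surrogate numbers that pass both pre-search stages."""
    term = unicodedata.normalize("NFKC", term)
    hits = []
    for number, sur in surrogates.items():
        if not set(term) <= sur["chars"]:
            continue  # stage 1: some character of the term never occurs
        if all(any(p in run for run in sur["condensed"]) for p in parts_of(term)):
            hits.append(number)  # stage 2: all partial strings occur
    return hits

texts = {1: "文書DBを検索する", 2: "DB文書の登録", 3: "画像の検索"}
surrogates = {n: make_surrogate(t) for n, t in texts.items()}
candidates = presearch("文書DB", surrogates)
final = [n for n in candidates if "文書DB" in texts[n]]
print(candidates, final)
```

Text 2 passes both surrogate stages because it contains "文書" and "DB", but not adjacent in that order, so only the final scan of the text removes it. This is why the pre-search yields candidates rather than results.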

[0015] A concrete search is described using the following conditional expression as an example.

search: 文書DB ... (1)

Conditional expression (1) means "find the documents that contain the search term (character string) “文書DB” (“document DB”) within a single document." First, candidate documents are extracted by referring to the surrogate data. In the character component table, documents containing all of the characters (‘文’, ‘書’, ‘D’, ‘B’) are selected as candidates. Suppose, for example, that character components 2, 4, 6, and 7 contain these four characters. Condensed texts 2, 4, 6, and 7 are then scanned, and those containing all of the partial strings (“文書”, “DB”) are extracted as candidates. Suppose that this two-stage pre-search over the surrogate data, the character component table followed by the condensed text, yields 2, 4, and 7 as candidates. As already explained, surrogate data may have been created from several pieces of text data, so the surrogate search result cannot be used directly as the list of candidate documents whose text data is to be searched; the surrogate data and the text data do not correspond one to one. Before the text data is searched, therefore, the correspondence table shown in FIG. 2 must be consulted to select the text data corresponding to the candidates extracted from the surrogate data. In the above example, consulting the correspondence table of FIG. 2 for the surrogate search result (2, 4, 7) shows that the corresponding text data are (3, 4, 5), 6, and (9, 10), respectively. Finally, text data 3, 4, 5, 6, 9, and 10 are searched for documents containing the search term “文書DB”, and the numbers of those documents are output as the final search result.
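The table lookup in this example can be written out directly. The sketch uses only the correspondences stated in the text; the entry for surrogate 6 is not given in this section and is therefore omitted. The names `TABLE` and `texts_to_scan` are hypothetical.

```python
# Correspondence table entries stated in the description (surrogate -> text data).
TABLE = {1: [1, 2], 2: [3, 4, 5], 3: [6], 4: [6], 5: [6], 7: [9, 10]}

def texts_to_scan(candidates, table):
    """Expand candidate surrogate numbers into text data numbers,
    keeping first-seen order and removing repeats."""
    out = []
    for surrogate in candidates:
        for text_no in table[surrogate]:
            if text_no not in out:
                out.append(text_no)
    return out

print(texts_to_scan([2, 4, 7], TABLE))  # [3, 4, 5, 6, 9, 10]
```

Removing repeats matters when several pieces of one divided document are all candidates: the document's text is then scanned once, not once per piece.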

[0016] Next, a search with a slightly more complex condition is described using conditional expression (2) below as an example.

[0017] search: AND(文書DB, 検索システム) ... (2)

This conditional expression means "find the documents that contain both “文書DB” (“document DB”) and “検索システム” (“search system”) within a single document." Because surrogate data may have been created by dividing a single document into several pieces, the specified search terms are not necessarily all contained in the same piece of surrogate data. This problem can arise not only when two search terms are specified but even when only one is. For example, when the search term “文書DB” is specified and the text data happens to be divided in the middle of the word, immediately after ‘文’, the surrogate data can no longer be matched with the partial string “文書”. That is, no hit occurs even though the specified search term is present in the text data, and a retrieval omission results. However, if the division is made at natural breaks in the text, such as paragraphs, chapters, or sections, a word is never split across different pieces of surrogate data. Thus, when a single search term is specified, the problem of division into multiple pieces of surrogate data can be solved by the way the text data is divided. The AND condition of expression (2), however, holds no matter how far apart the two search terms are in the document, so the problem cannot be solved merely by adjusting the division points.

[0018] To solve this problem, for divided surrogate data the AND condition is replaced by an OR condition: the surrogate data containing any one of the specified search terms is extracted first, and the text data of the resulting candidates is then scanned to find the documents containing all of the search terms. An example of such a search follows. First, candidate documents are extracted from the surrogate data. In the character component table, documents that contain all four characters (‘文’, ‘書’, ‘D’, ‘B’) or all six characters (‘検’, ‘索’, ‘シ’, ‘ス’, ‘テ’, ‘ム’) are selected as candidates. Suppose, for example, that character components 2, 4, 6, and 7 satisfy this condition. In the condensed text search that follows, condensed texts 2, 4, 6, and 7 are scanned, and those containing both partial strings (“文書”, “DB”) or both partial strings (“検索”, “システム”) are extracted as candidates. Suppose that this OR-condition search between the specified search terms yields surrogate data 2, 4, and 7 as candidates. As in the previous example, the surrogate-to-text correspondence table of FIG. 2 is consulted to select the text data corresponding to the candidates extracted from the surrogate data. Finally, text data 3, 4, 5, 6, 9, and 10 are searched, this time with the AND condition between the specified search terms, for documents containing both “文書DB” and “検索システム”, and their document numbers are output as the final search result.
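The replacement of AND by OR for divided surrogate data can be sketched as below. The data layout (a `split` flag and a `docs` list on each surrogate), the sample texts, and the function names are hypothetical; the character component stage is left out to keep the example short, and each search term is given already divided into its partial strings.

```python
def term_matches(parts, condensed):
    """All partial strings of one term occur in the condensed text."""
    return all(any(p in run for run in condensed) for p in parts)

def presearch_and(terms, surrogates):
    """AND query at surrogate level: merged surrogates must match every
    term, divided surrogates need to match only one (AND relaxed to OR)."""
    hits = []
    for sur in surrogates:
        matched = [term_matches(parts, sur["condensed"]) for parts in terms]
        if (any(matched) if sur["split"] else all(matched)):
            hits.append(sur["number"])
    return hits

def final_and_scan(words, hits, surrogates, texts):
    """Scan the candidate documents and keep those containing every word."""
    docs = []
    for sur in surrogates:
        if sur["number"] in hits:
            for doc in sur["docs"]:
                if doc not in docs and all(w in texts[doc] for w in words):
                    docs.append(doc)
    return docs

surrogates = [
    {"number": 1, "split": False, "docs": [1, 2], "condensed": {"文書", "DB", "登録"}},
    {"number": 3, "split": True, "docs": [6], "condensed": {"文書", "DB", "概要"}},
    {"number": 4, "split": True, "docs": [6], "condensed": {"画像", "圧縮"}},
    {"number": 5, "split": True, "docs": [6], "condensed": {"検索", "システム", "構成"}},
]
texts = {1: "文書DBの登録", 2: "DBの登録",
         6: "文書DBの概要。画像の圧縮。検索システムの構成。"}
terms = [["文書", "DB"], ["検索", "システム"]]

hits = presearch_and(terms, surrogates)
print(hits, final_and_scan(["文書DB", "検索システム"], hits, surrogates, texts))
```

In this sample, document 6 holds the two terms in different pieces, so a strict AND on each surrogate would match none of them and the document would be missed. The relaxed test keeps pieces 3 and 5 as candidates, and the AND is enforced on the full text afterwards.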

[0019] In this way, when a document is registered in the database its size is checked, and as long as the size falls short of the predetermined size, several documents are combined to create one piece of surrogate data. When a document is too large, it is divided into several pieces that fit within the predetermined size and surrogate data is created for each. Surrogate data that is small relative to the original documents can thus be created, and a high narrowing ratio is obtained in the pre-search. To deal with the problem that the original text data and the surrogate data do not correspond one to one, a correspondence table between the two is created when documents are registered; at search time this table is consulted to select and search the text data corresponding to the candidates obtained from the surrogate search, so that correct results are obtained. When data is divided, the division is made at natural breaks such as sentence boundaries so that no word is split. In this way, for a search with a single search term, the final search result can be output from the surrogate data alone. To retrieve documents whose surrogate data has been divided correctly, when an AND condition is specified, the surrogate data is first searched with an OR condition between the search terms to narrow down the candidates, and the text data is then searched with the AND condition. When an OR condition is specified, the surrogate data is searched with that condition as is. In this way, even with divided surrogate data, a search with high recall and no retrieval omissions is achieved.

[0020] With the method comprising the processing steps above, the text data of several documents is combined, or a single document is divided into several pieces, to obtain data of a suitable size before the surrogate data is created, so that surrogate data with both a high compression ratio and a high narrowing ratio can be created. As a result, an effectively high-speed full-text search is achieved. In addition, because the correspondence table between text data and surrogate data is created when documents are registered, it is possible to identify which text data a candidate corresponds to even when the candidate was extracted from surrogate data created by merging or dividing text data. Hence, even for surrogate data that combines several documents into one, the corresponding original text data can be consulted and an accurate search is possible.

[0021] The above description concerns merging and dividing document text data according to a predetermined data-volume threshold Qt, but methods that do not rely on Qt are also conceivable, for example merging data by number of documents. This offers a simple merging procedure when documents are of uniform size but each consists of a very small amount of data, as when each sentence is treated as one document. Another conceivable method uses the data volume Qt only as the criterion for merging and creates a single surrogate file for any document of size Qt or larger. In this case only the small documents are combined into surrogate data, which is effective for a document database in which a large document occasionally appears among a collection of very small documents. When there is no very long document data, creating the surrogate data by merging alone, without dividing any document data, removes the need to replace an AND condition between search terms with an OR condition when searching the surrogate data. The surrogate search therefore achieves a high narrowing ratio, the amount of text data to be searched decreases, and the system search speed improves.

[0022] The data-size threshold Qt is set to a length appropriate for an ordinary document, that is, an amount sufficient to explain a single line of argument. A newspaper article is an example of the most concise such document, and a paper is an example of a longer one. Both are typical of documents that explain their content along a single line of argument. Since a newspaper article is about 1 KB (500 characters) and a longer paper about 30 KB (15,000 characters), a value of Qt between 1 KB and 30 KB is appropriate.

[0023] Furthermore, as noted in the description of the prior art, data is read from the storage medium holding the text data in blocks of a fixed buffer size. Text data smaller than the buffer size can therefore be read from the medium efficiently if it is read contiguously. Concretely, with an 8 KB buffer and documents of about 500 bytes each, about 16 documents fit within 8 KB and so enter the buffer in a single read operation. Reading about 16 documents together thus takes no longer than reading one. Specifying the read buffer size of the medium as the threshold Qt for merging text data is therefore also an effective approach.
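
The arithmetic behind the figure of about 16 documents per read can be checked with a short sketch. It assumes that 8 KB means 8,192 bytes; the function name `docs_per_read` is introduced here only for illustration.

```python
BUFFER_SIZE = 8 * 1024  # 8 KB read buffer (assumed to mean 8,192 bytes)
DOC_SIZE = 500          # approximate size of one small document, in bytes


def docs_per_read(buffer_size: int, doc_size: int) -> int:
    """Number of whole documents of doc_size bytes that fit in one buffer."""
    return buffer_size // doc_size


print(docs_per_read(BUFFER_SIZE, DOC_SIZE))  # 16
```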

[0024] In a full-text search using surrogate data as described above, candidate documents are narrowed down from the surrogate data before any text data is accessed, so the documents to be read are known in advance. This brings a further benefit. As explained for the prior art, a known technique improves read efficiency by connecting several storage devices and reading data that has been distributed across them. With magnetic disk drives, for example, the overhead of head seek time and rotational latency can be reduced by operating several drives in parallel. The more read commands are issued, the more drives operate in parallel and the larger the gain. Issuing the read commands for the candidate documents obtained from the surrogate data as a single batch therefore lets the storage devices work in parallel and reduces the overhead of reading data from disk. Batch reading of the candidate documents obtained from the surrogate data thus has the major advantage of higher search throughput compared with issuing read commands one document at a time.

[0025] Embodiments of the present invention are described in detail below. FIG. 3 shows the configuration of this embodiment. The embodiment comprises a keyboard 301 for entering a conditional expression, a display 302 for outputting results, a magnetic disk 303 that stores the database, a memory 304 that likewise stores the database, a memory 306 that stores various programs, a CPU 305 that executes and controls these programs, and a memory 307 that holds the data-size threshold Qt, the correspondence tables between surrogate data and text data, and work memory.

[0026] The procedure for creating surrogate data is explained first, following the data registration process; the extraction of candidate documents using the surrogate data is then explained, following the search process. To begin, FIG. 4 is used to describe the text data used in this embodiment and how it is stored. As the figure shows, the text data consists of a data file, which collects the bodies of multiple texts, and a directory file, which records where each document begins in the data file and how long it is. Documents are identified by a document ID number denoted tid, which is called the text ID in this embodiment. In the directory file, offset gives the byte position of the document counted from the start of the data file, and length gives the length of its data. For example, the document with text ID 5 starts at byte 29,987 from the start of the data file and is 4,376 bytes long.
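
As a minimal sketch of this storage layout, the lookup of one document through the directory file might look as follows. The names `DirEntry` and `read_document`, the zero-based offsets, the UTF-8 encoding, and the three sample documents are all assumptions made for illustration and are not taken from FIG. 4.

```python
from typing import NamedTuple


class DirEntry(NamedTuple):
    offset: int  # byte position of the document within the data file
    length: int  # length of the document in bytes


def read_document(data_file: bytes, directory: dict, tid: int) -> bytes:
    """Return the bytes of the document identified by tid."""
    entry = directory[tid]
    return data_file[entry.offset:entry.offset + entry.length]


# Hypothetical data file holding three short documents back to back.
data_file = "対応表。複数個。検索条件。".encode("utf-8")
# Each character above occupies 3 bytes in UTF-8.
directory = {1: DirEntry(0, 12), 2: DirEntry(12, 12), 3: DirEntry(24, 15)}

print(read_document(data_file, directory, 2).decode("utf-8"))  # 複数個。
```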

[0027] Next, the correspondence tables between text data and surrogate data are described with reference to FIGS. 5 and 6. These figures show a concrete implementation of the correspondence table between surrogate data and text data shown in FIGS. 1 and 2. The sid shown in FIG. 5 is an ID that identifies surrogate data; to distinguish it from the text ID it is called the surrogate ID. A nonzero chain indicates that further surrogate data follows, and its value is a record number in the chain table. In other words, when one piece of text data has been divided into several pieces of surrogate data, the chain table stores their surrogate IDs. The surrogate ID table is a conversion table from text ID to surrogate ID, and its record numbers correspond to text IDs (tid). To obtain the surrogate ID for text ID 2, for example, it suffices to look at the second record of the surrogate ID table; in the example of FIG. 5 its sid is 1. When one text is divided to create several pieces of surrogate data, several surrogate IDs correspond to a single text ID, and the chain field of the surrogate ID table exists for this purpose. A chain of 0 means that no further data is linked, that is, there is no further divided surrogate data; a nonzero chain means that the linked surrogate ID can be obtained from the corresponding record of the surrogate ID chain table. For tid 6, for example, the surrogate ID table first gives sid 3; since its chain is 1, the first record of the chain table is consulted and gives sid 4. Since the chain of that record is 2, the second record of the surrogate ID chain table is consulted and gives sid 5. The chain of this record is 0, so no further data is linked, and the text data with tid 6 therefore corresponds to three sids: 3, 4, and 5.
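
The chain walk for tid 6 can be sketched as below. The dictionaries hold only the records mentioned in the example (tid 2 and tid 6); representing the tables as Python dictionaries keyed by record number, and the function name `sids_for_tid`, are assumptions made for illustration.

```python
# record number (= tid) -> (sid, chain)
SURROGATE_ID_TABLE = {2: (1, 0), 6: (3, 1)}
# record number -> (sid, chain); chain 0 terminates the list
SURROGATE_ID_CHAIN_TABLE = {1: (4, 2), 2: (5, 0)}


def sids_for_tid(tid, id_table, chain_table):
    """Collect every surrogate ID linked to one text ID."""
    sid, chain = id_table[tid]
    sids = [sid]
    while chain != 0:  # follow the chain until a record with chain 0
        sid, chain = chain_table[chain]
        sids.append(sid)
    return sids


print(sids_for_tid(6, SURROGATE_ID_TABLE, SURROGATE_ID_CHAIN_TABLE))  # [3, 4, 5]
print(sids_for_tid(2, SURROGATE_ID_TABLE, SURROGATE_ID_CHAIN_TABLE))  # [1]
```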

[0028] The text ID table shown in FIG. 6 is a conversion table from surrogate ID to text ID, and its record numbers correspond to sids. As with the surrogate ID table, when text data is merged so that one piece of surrogate data is created from several pieces of text data, a single surrogate ID corresponds to several text IDs, so the additional text IDs are registered in the text ID chain table. For sid 2, for example, the second record of the text ID table first gives tid 3; since its chain is 2, the second record of the text ID chain table is consulted and gives tid 4, and since the chain of that record is 3, tid 5 is obtained next. The text IDs corresponding to the surrogate data with sid 2 are therefore 3, 4, and 5. For surrogate data created by division, the text ID table also records, as top, the ID of the first piece of surrogate data for the text data in question. Thus, for example, the tid corresponding to sid 5 is 6, and it can be seen that the surrogate data for that text was created in pieces with sids 3, 4, and 5. In the initial state, before any document has been registered in the database, every sid value in the surrogate ID table and every tid value in the text ID table is set to -1.
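
The reverse lookup, from surrogate ID to text IDs, can be sketched in the same way. The records for sids 2 to 5 follow the examples in the description; the value top = 0 for sid 2 is an assumption made by analogy with merged surrogate data, and the function names are hypothetical.

```python
# record number (= sid) -> (tid, chain, top)
TEXT_ID_TABLE = {2: (3, 2, 0), 3: (6, 0, 3), 4: (6, 0, 3), 5: (6, 0, 3)}
# record number -> (tid, chain); chain 0 terminates the list
TEXT_ID_CHAIN_TABLE = {2: (4, 3), 3: (5, 0)}


def tids_for_sid(sid):
    """Collect every text ID merged into one piece of surrogate data."""
    tid, chain, _top = TEXT_ID_TABLE[sid]
    tids = [tid]
    while chain != 0:
        tid, chain = TEXT_ID_CHAIN_TABLE[chain]
        tids.append(tid)
    return tids


def first_piece(sid):
    """Return the sid of the first piece if the text was divided, else None."""
    top = TEXT_ID_TABLE[sid][2]
    return top if top != 0 else None


print(tids_for_sid(2))  # [3, 4, 5]
print(tids_for_sid(5))  # [6]
print(first_piece(5))   # 3
```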

[0029] Processing at data registration time is described below using the PAD diagram of FIG. 7. The data to be registered has the form shown in FIG. 4, and multiple documents are registered in a single batch. First, the data for one document is read, and the tid recorded in the directory file of FIG. 4 is registered in a free record of the text ID table of FIG. 6. The record number sid of that record becomes the surrogate ID of the surrogate data about to be registered. Next, sid is registered in the tid-th record of the surrogate ID table of FIG. 5. The size of the text is then compared with the threshold Qt; if it is Qt or less, subsequent documents are merged in for as long as the total does not exceed Qt. For each merged document, the same sid as for the preceding document is registered in the tid-th record of the surrogate ID table, and its tid is linked into the text ID chain table. A single piece of surrogate data is then created from the text data merged within the limit of Qt, and is registered under that sid.

[0030] Registration of the text data of FIG. 4 is now described as a concrete example, with a threshold Qt of 20,000 bytes. First, the text data with tid 1 is read, and tid 1, chain 0, and top 0 are registered in the first record of the text ID table, which is its free record. In the first record of the surrogate ID table, which corresponds to tid 1, sid 1 (the surrogate ID currently being created) and chain 0 are registered. The length of the text data with tid 1, 6,232 bytes, does not exceed Qt, so the text data with tid 2 is read next and tid 2 is registered in the text ID table, linked to the already registered tid 1. That is, because tid 1 is already registered in the first record of the text ID table, which corresponds to the sid currently being created (sid 1), the chain field of that record is changed to 1, the free record of the text ID chain table, and tid 2 and chain 0 are registered in the first record of the text ID chain table. In the surrogate ID table, sid 1 and chain 0 are registered in the second record, which corresponds to tid 2. The text data with tids 1 and 2 read so far totals 19,792 bytes, and reading the text data with tid 3 (length 3,470) would push the total over Qt, so merging stops here and surrogate data is created from the merged text data of tids 1 and 2. The result is surrogate data that combines the documents with tids 1 and 2.
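
The merging decision in this example can be sketched as a simple greedy grouping. This is an illustration under stated assumptions rather than the procedure of FIG. 7: the function name `merge_groups` is hypothetical, table updates are omitted, and the length of tid 2 (13,560 bytes) is derived here as 19,792 - 6,232.

```python
def merge_groups(docs, qt):
    """Group consecutive documents so that each group totals at most qt bytes.

    docs is a list of (tid, length) pairs in registration order. A document
    longer than qt ends up alone in its group; dividing it is a separate step.
    """
    groups, current, total = [], [], 0
    for tid, length in docs:
        if current and total + length > qt:
            groups.append(current)  # adding this document would exceed qt
            current, total = [], 0
        current.append(tid)
        total += length
    if current:
        groups.append(current)
    return groups


docs = [(1, 6232), (2, 13560), (3, 3470)]
print(merge_groups(docs, 20000))  # [[1, 2], [3]]
```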

[0031] When the size of the text that has been read is Qt or more, the text data is divided and several pieces of surrogate data are created. Each split is made at a sentence boundary, that is, immediately after a '。', within the limit of Qt. Surrogate data is created for each piece obtained in this way, and the surrogate IDs are registered in the surrogate ID table and the surrogate ID chain table. The tid in question is also registered in the corresponding sid-th records of the text ID table.
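
A minimal sketch of dividing a text at sentence boundaries is shown below. It measures size in characters rather than bytes for simplicity, and it falls back to a hard cut when no '。' lies within the limit; both choices, and the function name `split_at_sentences`, are assumptions, since the description does not cover them.

```python
def split_at_sentences(text: str, limit: int, delimiter: str = "。") -> list:
    """Split text into pieces of at most limit characters, cutting after delimiter."""
    pieces = []
    start = 0
    while len(text) - start > limit:
        # Last delimiter that still keeps the piece within the limit.
        cut = text.rfind(delimiter, start, start + limit)
        if cut == -1:
            end = start + limit  # assumed fallback: no boundary in range
        else:
            end = cut + len(delimiter)
        pieces.append(text[start:end])
        start = end
    if start < len(text):
        pieces.append(text[start:])
    return pieces


print(split_at_sentences("あい。うえお。かき。", 5))
# ['あい。', 'うえお。', 'かき。']
```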

[0032] The text data of FIG. 4 is again used as a concrete example. Suppose registration has finished up to the document with tid 5 and the document with tid 6 is to be registered. The text data with tid 6 is 57,432 bytes long, which is larger than Qt (20,000), so it is divided. First, the text is cut at a sentence boundary that keeps the first piece within Qt (20,000 bytes), and surrogate data is created for that piece. Its sid is set to 3, the free record of the text ID table. Accordingly, tid 6, chain 0, and top 3 are registered in the third record of the text ID table. The top value of 3 indicates that the surrogate data with sid 3 for this record was created by dividing text data, and that the chain of pieces begins with the surrogate data with sid 3. In the surrogate ID table, sid 3 and chain 0 are registered in the sixth record, which corresponds to tid 6. Once the surrogate data with sid 3 has been registered, the next portion of the text data with tid 6 is read, again within the limit of Qt. Registration continues with a new surrogate ID, namely free record 4 of the text ID table, as sid. Thus tid 6, chain 0, and top 3 are registered in the fourth record of the text ID table, and sid 4 is registered for tid 6 in the surrogate ID table. Because the sixth record of the surrogate ID table already holds the earlier entry with sid 3, its chain field is changed to 1, the free record of the surrogate ID chain table, and sid 4 and chain 0 are registered in the first record of the surrogate ID chain table. Next, the third piece of tid 6 is read, sid is set to 5, and tid 6, chain 0, and top 3 are registered in the fifth record of the text ID table. In the surrogate ID table, the sids for tid 6 are so far linked as 3 and 4, and the chain ends at the first record of the surrogate ID chain table, whose chain is 0, so sid 5 is appended after it. That is, the chain of the first record is changed to 2, the free record of the surrogate ID chain table, and sid 5 and chain 0 are registered in the second record. In this way the text data with tid 6 is divided into three pieces, each associated with its own surrogate data.

[0033] FIGS. 8 and 9 show examples of surrogate data created by merging or dividing in this way. Both are based on the text data shown in FIG. 4. FIG. 8 is an example of the character component table. The entry with surrogate ID 7, for instance, registers the character components of the merged documents with text IDs 9 and 10, so every character used in either text is registered as 1. Concretely, in the character components for sid 7, the component values of both '条', which is used in tid 9, and '凝', which is used in tid 10, are 1. In the character components for sid 3, on the other hand, the document with tid 6 has been registered in divided form, so the component for '検' is 1, whereas the component for 'デ', a character used in the middle part of the document with tid 6, is 0.
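
The idea of a character component record can be sketched with a set of characters standing in for the bit string of FIG. 8. The two sample texts are invented for illustration (they are not the contents of tids 9 and 10), and the function names are hypothetical.

```python
def char_components(texts):
    """Characters used in any of the given texts (the 1 bits of one record)."""
    components = set()
    for text in texts:
        components.update(text)
    return components


def may_contain(components, term):
    """True if every character of the search term is present in the record."""
    return set(term) <= components


# Hypothetical record built from two merged texts.
sid7 = char_components(["検索条件", "凝縮本文"])

print(may_contain(sid7, "条件"))    # True: a genuine candidate
print(may_contain(sid7, "データ"))  # False: can be excluded without reading the text
print(may_contain(sid7, "本条"))    # True, although neither text contains this string
```

The last call shows why a hit in the character component table only marks a candidate: the characters may come from different places, or from different merged documents.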

[0034] FIG. 9 is an example of how the condensed text is created. Like the original text data, each entry of condensed text is variable-length data, so it is managed by registering its offset from the start of the data file and its data length in a directory file. For example, the surrogate data with sid 3 is stored as 4,768 bytes starting at byte 8,588 from the start of the data file. This directory file is created together with the surrogate data. The symbol ',' in the data file is the delimiter between partial strings. The condensed text with sid 1, for example, is made by merging the text data with tids 1 and 2, so it holds both the partial string "対応表" (correspondence table), which appears in the document with tid 1, and the partial string "複数個" (plurality), which appears in the document with tid 2. The condensed text with sid 3 holds the partial strings "検索" (search) and "条件" (condition) from the first part of the text data with tid 6, which was divided into three, while "安定" (stable) and "検索速度" (search speed) from the last part are registered in the surrogate data with sid 5. In this way, the record number of the character-component bit string in the character component table, and the record number in the directory file for the condensed text, each correspond to the surrogate ID, so the data can be looked up by surrogate ID (sid).
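
One way to picture how a condensed text could be produced is the sketch below: the text is cut into runs of the same character type, hiragana runs and punctuation are dropped, and the remaining partial strings are joined with ','. The Unicode ranges used to classify characters and the removal of duplicate partial strings are assumptions made for illustration; the function names are hypothetical.

```python
from itertools import groupby


def char_type(ch: str) -> str:
    """Rough character-type classification by Unicode range (an assumption)."""
    if "\u3041" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    if ch.isalnum():
        return "alnum"
    return "other"


def condense(text: str) -> str:
    """Keep one copy of each non-hiragana run, separated by ','."""
    seen, parts = set(), []
    for kind, run in groupby(text, key=char_type):
        piece = "".join(run)
        if kind in ("hiragana", "other") or piece in seen:
            continue
        seen.add(piece)
        parts.append(piece)
    return ",".join(parts)


print(condense("文書データを検索する。"))  # 文書,データ,検索
```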

[0035] Next, search processing based on the surrogate data created in this way is described using the PAD diagram of FIG. 10. At search time, the search terms, the characters used in the search terms, and the logical conditions are extracted from the given conditional expression. The characters used in the search terms are used to search the character component table, and the search terms themselves are used to search the condensed text. The logical conditions between search terms are also extracted, and any AND condition is replaced with an OR condition for the surrogate data search. Because the condensed text in particular is produced by deleting all hiragana strings from the text data, it is searched only for search terms made up of character types other than hiragana. Searching the character component table and the condensed text in this way yields the candidate surrogate IDs. If no surrogate ID is obtained, searching the text would necessarily also return no results, so the result is output at this point and search processing ends. If the number of resulting surrogate IDs (sid) is not zero, each sid is converted to tids by looking it up in the text ID table. The resulting tids are used to look up offset and length in the directory file of FIG. 4, the data file is accessed with these values, the text data is searched, and the search result is output. When the text data does not need to be searched, the text IDs (tid) obtained from the text ID table for each sid are output directly.

[0036] The text data must be searched under the following conditions.
(1) The search term contains hiragana characters. Hiragana strings are deleted from the condensed text and are not recorded in it, so the text data must be searched instead.
(2) The search term mixes character types. The condensed text registers a separate partial string for each character type, so searching the condensed text alone may return noise. For the search term "文書データ" (document data), for example, the condensed text registers the two partial strings "文書" and "データ". The condensed-text search therefore looks for data containing both partial strings, but cannot tell whether a candidate document really contained the string "文書データ". The text data must be searched to confirm this.
(3) There is an AND condition. A single document may have been divided into several pieces of surrogate data, so the surrogate data must be searched under an OR condition; since that condition alone may return noise, the text data must then be searched under the AND condition. Concretely, to find documents containing both of the search terms "文書" and "データ", the surrogate data is searched for documents containing either "文書" or "データ", because the surrogate data may be divided, and the text data is then searched to confirm which documents contain both.
(4) The tid is chained when a tid is obtained from a sid. The surrogate data was then created by merging several texts, so it cannot be determined which text actually satisfies the conditional expression, and the original text data must be searched directly to decide.
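
The four conditions can be summarized as a small decision function. This is a sketch under assumptions: the character classification by Unicode range and the function names are introduced here for illustration, and the caller is assumed to know whether the query has an AND condition and whether any candidate sid maps to a chained tid.

```python
def char_type(ch: str) -> str:
    """Rough character-type classification by Unicode range (an assumption)."""
    if "\u3041" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    return "other"


def text_search_reasons(terms, has_and=False, tid_chained=False):
    """Return the numbers of the conditions (1)-(4) that force a text search."""
    reasons = set()
    for term in terms:
        kinds = {char_type(ch) for ch in term}
        if "hiragana" in kinds:
            reasons.add(1)  # hiragana is not recorded in the condensed text
        if len(kinds) > 1:
            reasons.add(2)  # mixed character types are stored as separate strings
    if has_and:
        reasons.add(3)      # AND was relaxed to OR on the surrogate data
    if tid_chained:
        reasons.add(4)      # several texts were merged into one surrogate
    return sorted(reasons)


print(text_search_reasons(["データ"]))                        # []
print(text_search_reasons(["文書データ"]))                    # [2]
print(text_search_reasons(["文書", "データ"], has_and=True))  # [3]
```

An empty result corresponds to the case where the text IDs can be output directly from the surrogate search.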

[0037] FIGS. 11 and 12 show example procedures for searches carried out with this search algorithm. FIG. 11 shows an example of a search with the conditional expression search: データ (data). In the initial surrogate data search, the character component table of FIG. 8 is searched for data containing the three characters ('デ', 'ー', 'タ'), and the condensed text of FIG. 9 is searched for text containing the string "データ". Suppose that this two-stage surrogate data search yields sids 2 and 4 from the character component table and sids 2 and 4 from the condensed text as well. Converting these to tids with the text ID table shown in FIG. 6 gives text IDs 3, 4, and 5 for sid 2 and text ID 6 for sid 4. The surrogate data with sid 2 is chained, that is, one piece of surrogate data was created by merging several pieces of text data, so by condition (4) above the text data is searched, and tids 3, 4, 5, and 6 are finally obtained as the result.

[0038] FIG. 12 shows a search with a different conditional expression. Here the conditional expression search: AND(文書, データ) is used to find documents in which both search terms, "文書" ("document") and "データ" ("data"), appear. Because there is an AND condition between the search terms, the surrogate data is searched for documents containing either of the terms. First, the character component table is searched for documents containing the two characters ('文', '書') or the three characters ('デ', 'ー', 'タ'). In the character component table of FIG. 8, sids 1, 2, 4, 6, and 7 satisfy this. Next, the condensed text is searched for entries containing either of the character strings "文書" or "データ". In the example of FIG. 9, sids 1, 2, 4, 6, and 7 satisfy this. For this conditional expression, conditions (3) and (4) above require the text data to be searched in order to find the documents that actually contain both "文書" and "データ". The candidate surrogate IDs obtained from the surrogate search are converted with the tid table of FIG. 6 into tids 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10, and only those tids are searched. In the text data example of FIG. 4, the documents with tids 3 and 4 are output as the final result.
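The two-stage AND search described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the documents, the `surrogate_chars` dictionary (a stand-in for the character component table), and the `sid_to_tids` mapping are all hypothetical, and the condensed-text stage is omitted for brevity.

```python
# Illustrative data only; these are not the contents of FIG. 4.
texts = {                      # tid -> full text
    1: "文書の登録",
    2: "データの検索",
    3: "文書データベース",
}
surrogate_chars = {            # sid -> characters present (stand-in for the character component table)
    1: set(texts[1]),
    2: set(texts[2]) | set(texts[3]),   # sid 2 integrates tids 2 and 3
}
sid_to_tids = {1: [1], 2: [2, 3]}       # stand-in for the text ID table


def search_and(terms):
    """Find tids whose text contains every term, using the surrogates as a pre-filter."""
    # Stage 1: surrogate search with the AND condition relaxed to OR.
    candidate_sids = [
        sid for sid, chars in surrogate_chars.items()
        if any(set(term) <= chars for term in terms)
    ]
    # Stage 2: convert sids to tids, then verify the real AND condition on the text.
    candidate_tids = sorted({tid for sid in candidate_sids for tid in sid_to_tids[sid]})
    return [tid for tid in candidate_tids if all(term in texts[tid] for term in terms)]


result = search_and(["文書", "データ"])
```

With this sample data both surrogates pass the relaxed filter, so all three texts are scanned, and only tid 3 satisfies the real AND condition.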

[0039] The data registration and search processes have been described above. The procedures for deleting and updating data created in this way are described next. These are carried out by executing the data update program. FIG. 13 is a PAD diagram showing the deletion procedure. First, the text ID (tid) of the document to be deleted is input; this is the tid set in the directory file at registration. Next, the surrogate ID table is consulted and the surrogate ID (sid) corresponding to the specified tid is extracted. At this point the surrogate ID is checked for a chain. If there is a chain, that is, if the text data exceeded the threshold Qt and its surrogate data was divided, the surrogate data of all corresponding surrogate IDs is deleted. The corresponding records in the surrogate ID table and the text ID table are then changed to -1, the chain table is reorganized to close up the deleted entries, and finally the text data is deleted.

[0040] A concrete example follows, again using FIGS. 4, 5, and 6. To delete the document with tid 6, the surrogate ID table is consulted and the surrogate IDs corresponding to tid 6 are extracted. In the example of FIG. 5, the surrogate ID for tid 6 is chained, and sids 3, 4, and 5 correspond to it. All surrogate data with sids 3, 4, and 5 is therefore deleted. Data is deleted from the character component table by setting all character components of the relevant sids to 0. Data is deleted from the condensed text by setting the length of the relevant surrogate IDs in the directory file to 0. Then the sixth record of the surrogate ID table, which corresponds to tid 6, is changed to sid -1 and chain 0, and the first and second records of the surrogate ID chain table, which hold the chained entries, are changed to sid -1 and chain 0. The text ID table is handled in the same way: the third, fourth, and fifth records, corresponding to sids 3, 4, and 5, are changed to tid -1, chain 0, and top 0. Finally, the text data is deleted, as with the condensed text, by setting the length of the corresponding sixth record of the directory file to 0.
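The table updates in this deletion example can be sketched as follows. The dictionary layout, the function name `delete_divided_document`, and the initial `top` value of 3 are assumptions made for illustration; the sketch also leaves out the chain table reorganization and the deletion of the surrogate and text data themselves.

```python
FREE = -1  # marks an unused record, as in the description

# Hypothetical table contents modeled on the tid 6 example.
surrogate_id_table = {6: {"sid": 3, "chain": 1}}    # record number = tid
surrogate_chain_table = {                            # record number = chain pointer
    1: {"sid": 4, "chain": 2},
    2: {"sid": 5, "chain": 0},
}
text_id_table = {sid: {"tid": 6, "chain": 0, "top": 3} for sid in (3, 4, 5)}


def delete_divided_document(tid):
    """Free every table record belonging to a document whose surrogate data was divided."""
    head = surrogate_id_table[tid]
    sids, pointer = [head["sid"]], head["chain"]
    while pointer != 0:                    # follow the chain, freeing records on the way
        record = surrogate_chain_table[pointer]
        sids.append(record["sid"])
        next_pointer = record["chain"]
        surrogate_chain_table[pointer] = {"sid": FREE, "chain": 0}
        pointer = next_pointer
    surrogate_id_table[tid] = {"sid": FREE, "chain": 0}
    for sid in sids:
        text_id_table[sid] = {"tid": FREE, "chain": 0, "top": 0}
    return sids


deleted_sids = delete_divided_document(6)
```

The function returns the freed sids (3, 4, and 5 here) so that a caller could go on to clear the matching character component entries and condensed text.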

[0041] If the surrogate data is not chained, the text ID table is consulted with the obtained sid to get the text IDs corresponding to that surrogate data. This determines whether the surrogate data was created by integration or from a single text data item. If several tids correspond to one sid, that is, if the surrogate data was created by integrating several text data items, the surrogate data is recreated from the data other than the tid being deleted and registered again. In other words, the data derived from the text being deleted is removed from the integrated surrogate data. The corresponding record of the text ID table is then changed to -1 and the actual text data is deleted. In the example of FIGS. 4, 5, and 6, to delete the document with tid 2, the surrogate ID table is consulted first and yields sid 1. The text ID table then shows that tids 1 and 2 are chained. The surrogate data is therefore recreated from the text data other than tid 2, that is, from the text data with tid 1 only, and re-registered as sid 1. The second record of the text ID table is then changed to sid -1 and chain 0, and finally the corresponding text data is deleted. When surrogate data and text data correspond one to one, that is, when one tid corresponds to one sid and the surrogate data was neither divided nor integrated, the sid and tid in the surrogate ID table and the text ID table are reset to -1, and the surrogate data and text data are deleted.

[0042] FIG. 14 is a PAD diagram showing the update procedure. Update data is supplied in the same format as the text shown in FIG. 4. As with deletion, the text ID (tid) of the text to be updated is read first, and the corresponding surrogate ID (sid) is extracted from the surrogate ID table. If several sids correspond to the one tid, that is, if the surrogate data was created by division, the divided surrogate data is updated as follows. The size of the updated text data is compared with the threshold Qt. If it is larger, the text data is divided into pieces no larger than Qt and the surrogate data is registered under the sids already registered in the surrogate ID table. If the number of pieces exceeds the number of sids chained in the surrogate ID chain table, new sids are assigned as registration proceeds. Conversely, if the number of pieces decreases, the surplus sids are changed to -1 and the chain table is reorganized. If the updated text data does not exceed Qt, the chained surrogate data is deleted, the chained sids are changed to -1, the chain table is reorganized, and the surrogate data is updated under the first sid of the chain, the one registered in the surrogate ID table.

[0043] A concrete example follows, again using FIGS. 4, 5, and 6. When the document with tid 6 is updated, the surrogate ID table shows that sids 3, 4, and 5 correspond to tid 6, so its surrogate data was created by division. The size of the new text data for tid 6 is compared with the threshold Qt. If it is larger, the text data is divided and surrogate data is created for sids 3, 4, and so on, replacing the original data. If the new text data is so large that it must be divided into more pieces than the three already registered, a new sid (for example, 11) is assigned and the surrogate data is registered under it. When a new sid is assigned, a free record of the text ID table (for example, the 11th record) becomes the new sid, and the tid (6 in this case) is registered in that record. The new sid is also registered in the surrogate ID chain table. Here, the chain field of the second record (record number 2) of the surrogate ID chain table, in which sid 5 is registered, is changed to 3, the number of a free record, and sid 11 and chain 0 are registered in the third record. In this way, when the text data is large and the number of pieces increases, new sids are assigned and the surrogate ID chain table is modified as the surrogate data is registered.

[0044] Conversely, if the number of pieces decreases, the text ID table records for the sids that were not reused are changed to tid -1, chain 0, and top 0, and those sids are also deleted from the surrogate ID chain table. For example, if updating the text data with tid 6 completes surrogate data creation with sids 3 and 4, the fifth record of the text ID table is changed to tid -1, chain 0, and top 0, and the second record of the surrogate ID chain table is changed to sid -1 and chain 0. The chain field of the first record of the surrogate ID chain table is changed to 0 so that the chain ends at sid 4. If the updated text data for tid 6 does not exceed the threshold Qt, surrogate data is created with sid 3, the sixth record of the surrogate ID table is changed to sid 3 and chain 0, and the first and second records of the surrogate ID chain table are changed to sid -1 and chain 0. At the same time, the fourth and fifth records of the text ID table are changed to tid -1, chain 0, and top 0. The surrogate data with sids 4 and 5 is also deleted.
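The sid bookkeeping for updating a divided document (reuse existing sids, assign new ones when the piece count grows, free the surplus when it shrinks) can be sketched as follows. The helper name `reassign_sids` and the use of `itertools.count(11)` as the source of free sids are assumptions for illustration; the table and chain updates described above are not shown.

```python
from itertools import count


def reassign_sids(old_sids, piece_count, fresh_sids):
    """Return (sids to use for the new pieces, sids to free).

    old_sids    -- sids currently chained to the document, in chain order
    piece_count -- number of pieces the updated text is divided into
    fresh_sids  -- iterator that yields unused sids
    """
    kept = old_sids[:piece_count]                  # reuse existing sids first
    freed = old_sids[piece_count:]                 # surplus sids become free (-1 in the tables)
    added = [next(fresh_sids) for _ in range(piece_count - len(kept))]
    return kept + added, freed


grown, grown_freed = reassign_sids([3, 4, 5], 4, count(11))    # text grew to four pieces
shrunk, shrunk_freed = reassign_sids([3, 4, 5], 2, count(11))  # text shrank to two pieces
single, single_freed = reassign_sids([3, 4, 5], 1, count(11))  # text now fits within Qt
```

The three calls correspond to the three cases in the tid 6 example: a new sid 11 is appended, sid 5 is freed, or sids 4 and 5 are freed and only sid 3 remains.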

[0045] When one sid corresponds to the one tid, the text ID table is consulted with the obtained sid, and whether several tids exist determines whether the surrogate data was integrated. If several tids correspond to the one sid, that is, if integration was performed, the text data must be integrated again and the surrogate data recreated from the updated text together with the other integrated text data. First, the total size of the integrated text data, including the text data being updated, is recalculated. If it does not exceed the threshold Qt, all of the relevant text data is reintegrated and the surrogate data is re-registered. If the updated text data is larger than the previously registered text data, so that integrating all the relevant text data would exceed Qt, the text ID of the document being updated is removed from the chain, the remaining text data is integrated, and the surrogate data is re-registered. Surrogate data for the updated data is created and registered under a separate new sid. In the example of FIGS. 4, 5, and 6, updating the document with tid 1 involves sid 1, which was produced by integration, so the surrogate data is recreated from the updated text data for tid 1 and the already registered text data for tid 2, and re-registered as sid 1. If integrating the text data for tids 1 and 2 would exceed the threshold Qt, the surrogate data for sid 1 is created from tid 2 alone, and the surrogate data for the document with tid 1 is registered under a new sid. In this case, to break the chain between tids 1 and 2, the sid of the first record of the surrogate ID table, corresponding to tid 1, is changed to the new sid; in the first record of the text ID table, corresponding to sid 1, tid is changed to 2 and chain to 0; and tid 1 is registered in the record of the new sid corresponding to the document with tid 1. In addition, the tid of the first record of the text ID chain table is set to -1. When one sid corresponds to one tid, that is, when the surrogate data was neither divided nor integrated, the surrogate data is simply re-registered under the obtained sid. In this case the correspondence tables, namely the surrogate ID table and the text ID table, need no changes at all; only the surrogate data itself, that is, the character component table and the condensed text, needs to be updated. After the surrogate data has been updated as described, updating the text data under the specified tid completes the update process.

[0046] The registration, search, deletion, and update of data using surrogate data created by division and integration have been described above. When the text sizes of the registered documents vary widely, this embodiment prevents the amount of surrogate data from becoming disproportionate to the amount of text data, and as a result reduces the amount of surrogate data.

[0047] The second embodiment is described next. In the first embodiment, when an AND condition between search terms is given, the surrogate data is searched with the condition replaced by OR, so the surrogate data narrows the candidates less effectively and more text data must ultimately be scanned. This embodiment removes that drawback. As shown in FIG. 15, surrogate data created by integrating text data and surrogate data created by dividing text data are kept in separate files. With the files separated, the surrogate data created by integration (integrated surrogate data) can be searched with the AND condition, and the surrogate data created by division (divided surrogate data) can be searched with the OR condition. When an AND condition between terms is given, only the divided surrogate data, created by dividing text data, has to be searched with OR. Searching the integrated surrogate data with AND and only the divided surrogate data with OR therefore reduces the number of candidate documents produced by the surrogate search.
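The difference between the two surrogate files can be sketched as follows: the integrated surrogates are filtered with the AND condition as given, and only the divided surrogates fall back to OR. The sample character sets and the function name `surrogate_candidates` are hypothetical, and the character sets again stand in for the character component table.

```python
# Hypothetical surrogate files: sid -> characters present. The two files number
# their sids independently, so sid 1 means something different in each.
integrated = {
    1: set("文書の登録"),
    2: set("文書とデータ"),
}
divided = {
    1: set("文書の前半"),
    2: set("データの後半"),
    3: set("その他"),
}


def surrogate_candidates(terms, integrated, divided):
    """Return (integrated sids, divided sids) that survive the surrogate search."""
    def may_contain(chars, term):
        return set(term) <= chars

    # Integrated surrogates: AND is kept as is.
    integrated_hits = [sid for sid, chars in integrated.items()
                       if all(may_contain(chars, t) for t in terms)]
    # Divided surrogates: the terms may sit in different pieces, so AND becomes OR.
    divided_hits = [sid for sid, chars in divided.items()
                    if any(may_contain(chars, t) for t in terms)]
    return integrated_hits, divided_hits


integrated_hits, divided_hits = surrogate_candidates(["文書", "データ"], integrated, divided)
```

In this sample, integrated sid 1 is rejected at the surrogate stage because it lacks "データ", which is the narrowing the second embodiment aims for.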

[0048] The management of surrogate data held in separate files, and the data registration and search processes, are described below. FIG. 16 shows the form of the character component table, one kind of surrogate data. In this embodiment the integrated and divided surrogate data are managed separately, and so are their surrogate IDs. For example, sid 1 of the integrated surrogate data and sid 1 of the divided surrogate data refer to entirely different data. FIG. 17 shows an example of the condensed text, the other kind of surrogate data. Here too, the surrogate IDs of the integrated and divided surrogate data are managed separately.

[0049] FIGS. 18 and 19 are the correspondence tables between surrogate IDs (sid) and text IDs (tid) that link the surrogate data created in this way to the text data. FIG. 18 shows an example of the surrogate ID table for converting a tid to a sid. As in the first embodiment, the record number corresponds to the tid. Some documents, such as those with tids 1 and 6, have the same sid, but these refer to different surrogate data. The chain field distinguishes them: a chain of 0 means the sid belongs to the integrated surrogate file, and any other value means it belongs to the divided surrogate data. For example, the documents with tids 1 and 2 both correspond to the integrated surrogate data with sid 1, whereas the document with tid 6 corresponds to the divided surrogate data with sids 1, 2, and 3.
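The way the chain field separates the two sid spaces can be sketched as follows. The dictionary layout and the function name `resolve` are assumptions for illustration; the record values follow the tid 1, 2, and 6 examples in the text.

```python
# Hypothetical contents of the surrogate ID table (record number = tid) and its chain table.
surrogate_id_table = {
    1: {"sid": 1, "chain": 0},   # integrated surrogate, sid 1
    2: {"sid": 1, "chain": 0},   # same integrated surrogate
    6: {"sid": 1, "chain": 1},   # divided surrogate, first of a chain
}
surrogate_chain_table = {
    1: {"sid": 2, "chain": 2},
    2: {"sid": 3, "chain": 0},
}


def resolve(tid):
    """Return ("integrated" | "divided", [sids]) for a document."""
    record = surrogate_id_table[tid]
    if record["chain"] == 0:                 # chain 0: sid refers to the integrated file
        return "integrated", [record["sid"]]
    sids, pointer = [record["sid"]], record["chain"]
    while pointer != 0:                      # otherwise follow the chain of divided sids
        entry = surrogate_chain_table[pointer]
        sids.append(entry["sid"])
        pointer = entry["chain"]
    return "divided", sids


kind_1, sids_1 = resolve(1)
kind_6, sids_6 = resolve(6)
```

Both lookups return sid 1 as the first entry, yet the kind tells the caller which surrogate file that sid belongs to.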

[0050] FIG. 19 shows an example of the reverse tables, for converting a sid to a tid. Again the record number of the tid table corresponds to the sid, but because sids are managed separately for the integrated and divided surrogate data, there are two tables. Text ID table 1 converts a sid of the integrated surrogate data to tids, and text ID table 2 converts a sid of the divided surrogate data to a tid. Because integration makes several tids correspond to one sid, text ID table 1 has a chain table. Its format is the same as in the first embodiment. Text ID table 2, by contrast, has no chained entries and therefore needs no chain table. Instead, it records which part of the text data each piece of surrogate data was created from. As explained later, this allows a search to scan only part of a long text and so shortens the search time. For example, the divided surrogate data with sid 2 was created from the part of the text data with tid 6 that starts at byte 19,230 from the beginning and is 19,107 bytes long. The top field of text ID table 2 points to the first surrogate ID of the chain, since several surrogate IDs are chained to one text ID. When no data has yet been registered, all tids and sids in the text ID tables, the surrogate ID table, and the chain tables are initialized to -1. That is, as in the first embodiment, an ID of -1 denotes a free record.

[0051] The data registration process that creates this surrogate data and these conversion tables is described with the PAD diagram of FIG. 20. First, one document's data is read and its length is compared with the threshold Qt. If the length is no more than Qt, the data is integrated and registered as integrated surrogate data. Each ID is first registered in text ID table 1 and the surrogate ID table, which associate the sid and tids of the integrated surrogate data. Text data is then integrated as long as the total does not exceed Qt, and the tid of each integrated text is registered in the chain table of text ID table 1, the table for integrated surrogates. The sid is also registered in the tid-th record of the surrogate ID table. Finally, surrogate data is created from the integrated text data and registered, and the text data is registered.

[0052] Registration of the text data of FIG. 4 with a threshold Qt of 20,000 bytes serves as a concrete example. First, the text data with tid 1 is read and checked against Qt. If it does not exceed Qt, tid 1 and chain 0 are registered in the first record of text ID table 1, which is free. In the surrogate ID table, sid 1 and chain 0 are registered in the first record, corresponding to tid 1. Next, the text data with tid 2 is read and its size is checked. Here the combined size of the text data with tids 1 and 2 is below Qt, so the data is integrated to create the surrogate data. That is, the chain field of the first record of text ID table 1 is set to 1, tid 2 and chain 0 are registered in the first record of the text ID chain table, and sid 1 and chain 0 are registered in the second record of the surrogate ID table.
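The decision between integration and division during registration can be sketched as a grouping pass over the incoming documents. This is an assumption-laden illustration: the function name `plan_registration`, the document sizes, the choice to group only consecutive documents, and closing the current group when an oversized document arrives are not taken from the description.

```python
def plan_registration(docs, qt):
    """Group documents for surrogate creation.

    docs -- list of (tid, size_in_bytes) in registration order
    qt   -- threshold Qt in bytes
    Returns a list of ("integrated" | "divided", [tids]).
    """
    plan, group, total = [], [], 0
    for tid, size in docs:
        if size > qt:                       # too large: gets divided surrogate data
            if group:                       # assumption: close the open group first
                plan.append(("integrated", group))
                group, total = [], 0
            plan.append(("divided", [tid]))
        elif total + size <= qt:            # still fits: integrate with the open group
            group.append(tid)
            total += size
        else:                               # would exceed Qt: start a new group
            plan.append(("integrated", group))
            group, total = [tid], size
    if group:
        plan.append(("integrated", group))
    return plan


# Illustrative sizes only; they are not the sizes of the documents in FIG. 4.
plan = plan_registration([(1, 8000), (2, 9000), (3, 7000), (4, 57432), (5, 3000)], qt=20000)
```

Each "integrated" group would receive one sid in the integrated surrogate file, while a "divided" document would be split into pieces that each receive their own sid in the divided surrogate file.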

[0053] If the length of the text read is greater than the threshold Qt, it is processed as divided surrogate data. The tid is registered in text ID table 2, the table for divided surrogate data, and the sid in the surrogate ID table, and the text data is divided into pieces that do not exceed Qt as the surrogate data is registered. For each piece, its position and length within the text data are registered in text ID table 2 together with the tid, and its sid is registered in the surrogate ID table. Finally, the text data is registered. For example, when the text data with tid 6 in FIG. 4 is registered, its length exceeds Qt, so tid 6, top 1, offset 0, and length 19,230 are registered in the first record of text ID table 2, which is free, and sid 1 and chain 0 are registered in the sixth record of the surrogate ID table, corresponding to tid 6. The length of 19,230 reflects dividing the text at a sentence boundary so that the piece is no larger than Qt, which gave a piece of 19,230 bytes. Surrogate data is then created for the next piece within Qt. If that piece is 19,107 bytes long, tid 6, top 1, offset 19,230, and length 19,107 are registered in the second record of text ID table 2. In the surrogate ID table, the chain field of the sixth record is changed to 1, the number of a free record in the surrogate ID chain table, and sid 2 and chain 0 are registered in the first record of the surrogate ID chain table. For the third piece, tid 6, top 1, offset 38,333, and length 19,095 are registered in text ID table 2, the chain field of the first record of the surrogate ID chain table is changed to 2, the next free record, and sid 3 and chain 0 are registered in the second record.
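Dividing a long text at sentence boundaries and recording each piece's offset and length can be sketched as follows. The function name `split_at_sentences`, the UTF-8 encoding, the use of "。" as the only sentence delimiter, and the tiny threshold are assumptions for illustration; the description does not specify how boundaries are detected.

```python
def split_at_sentences(data: bytes, qt: int, delimiter: bytes = "。".encode("utf-8")):
    """Return (offset, length) pairs, each at most qt bytes long.

    A piece ends just after the last delimiter that fits within qt bytes.
    If no delimiter fits, the piece is cut at exactly qt bytes, which may
    split a multi-byte character (acceptable for this sketch only).
    """
    pieces, offset = [], 0
    while offset < len(data):
        end = min(offset + qt, len(data))
        if end < len(data):
            cut = data.rfind(delimiter, offset, end)
            if cut != -1:
                end = cut + len(delimiter)
        pieces.append((offset, end - offset))
        offset = end
    return pieces


text = "あいう。えお。かきく。".encode("utf-8")    # 33 bytes in UTF-8
pieces = split_at_sentences(text, qt=24)

# Records in the style of text ID table 2 (record number = divided sid).
text_id_table_2 = {
    sid: {"tid": 6, "top": 1, "offset": offset, "length": length}
    for sid, (offset, length) in enumerate(pieces, start=1)
}
```

Each piece starts where the previous one ended, so the recorded offsets and lengths cover the whole text without gaps or overlap.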

[0054] The search process using surrogate data created in this way is described with the PAD diagram of FIG. 21. First, the search terms, the characters in the search terms, and the logical conditions are extracted from the given conditional expression. The surrogate data is then searched; if the logical conditions include an AND condition, it is replaced with an OR condition for the divided surrogate data only. The results from the integrated surrogate data are called the integrated surrogate IDs, and the results from the divided surrogate data are called the divided surrogate IDs.

[0055] If both the integrated surrogate IDs and the divided surrogate IDs are empty, no document matches and a result of zero hits is output. Otherwise the text data is searched as needed and the resulting tids are output. The handling of the divided surrogate results is described first. If the logical conditions include an AND condition and the divided surrogate IDs are not empty, the undivided text data of each candidate document must be searched. That is, the divided surrogate IDs are converted to tids with text ID table 2, the whole text data of each such tid is searched, and the resulting tids are stored in the buffer tidout. The buffer tidout is subsequently also used for the integrated surrogate results: the integrated surrogate IDs are converted to tids with text ID table 1, the text data of those tids is searched, and the resulting tids are stored in tidout, so that both sets of results are output together.

【００５６】If the logical condition of the conditional expression contains no AND condition, the conditional expression has not been modified, so the result for the divided surrogate data is correct when the search word consists of a single character type. If the search word is not of a compound character type, that is, if it consists of a single character type, the result of converting the divided surrogate ID into a tid using text ID table 2 is held in the buffer tidout as is. If the search word is of a compound character type, the text data must be searched. In this case, however, it is not necessary to search all of the corresponding text data; only the text portion corresponding to the obtained divided surrogate ID need be searched. For example, in the example of FIG. 19, when 3 is obtained as the divided surrogate ID from the surrogate data search, the search range is the 19,095 bytes starting at the 38,333rd byte from the beginning of the text data whose tid is 6. The tids found in this way to match the conditional expression are held in the buffer tidout.
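The restricted search range described above can be sketched as follows. This is only a minimal illustration in Python, assuming that text ID table 2 is held as a dictionary from divided surrogate ID to a (tid, offset, length) triple and that offsets are 0-based byte positions. The names `TEXT_ID_TABLE_2` and `search_divided_hit`, the in-memory `texts` dictionary, and the toy offsets are all hypothetical; they do not reproduce the values of FIG. 19.

```python
# Hypothetical stand-in for text ID table 2:
# divided surrogate ID -> (tid, byte offset, byte length).
TEXT_ID_TABLE_2 = {
    1: (6, 0, 10),
    2: (6, 10, 12),
    3: (6, 22, 8),
}

def search_divided_hit(sid, term, texts, table=TEXT_ID_TABLE_2):
    """Return the tid if `term` occurs in the slice covered by `sid`, else None."""
    tid, offset, length = table[sid]
    segment = texts[tid][offset:offset + length]  # only this part is scanned
    return tid if term in segment else None

# Toy text data for tid 6, laid out to match the three slices above.
texts = {6: b"aaaaaaaaaa" + b"xx data yyyy" + b"zzzzzzzz"}

tidout = []
hit = search_divided_hit(2, b"data", texts)
if hit is not None:
    tidout.append(hit)
```

With this layout only the 12 bytes belonging to divided surrogate ID 2 are scanned, rather than the whole text of tid 6.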

【００５７】Next, the processing of the integrated surrogate ID obtained from the search of the integrated surrogate data will be described. An integrated surrogate ID can be converted into the corresponding tids using text ID table 1. When there is no chain, that is, when the surrogate data was not created by integrating other text data, and the search term consists of a single character type, the text data need not be searched; the tid converted from the integrated surrogate ID is added as is to the previously created buffer tidout and output. When the search term consists of compound character types, or when the tids are chained, the text data of the obtained tids is searched and the results are added to the buffer tidout and output.
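The decision just described, whether a hit on integrated surrogate data can be output directly or needs a scan of the text data, might be sketched as below. This assumes the tids for one integrated surrogate ID have already been looked up, and treats a "chain" simply as more than one tid; the function names and the `texts` dictionary are hypothetical.

```python
def needs_text_scan(tids, term_is_compound):
    """A scan is needed if the sid maps to several chained tids
    or the search term mixes character types."""
    chained = len(tids) > 1
    return chained or term_is_compound

def resolve_integrated(tids, term, term_is_compound, texts):
    """Return the tids to append to tidout for one integrated surrogate hit."""
    if not needs_text_scan(tids, term_is_compound):
        return list(tids)  # the surrogate hit is already exact
    return [t for t in tids if term in texts[t]]

# Toy data: three chained documents, two of which contain the term.
texts = {3: "文書とデータ", 4: "データと文書", 5: "装置"}
tidout = resolve_integrated([3, 4, 5], "データ", False, texts)
```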

【００５８】The extraction of candidate documents by the above search processing will be described with reference to FIGS. 22 and 23. FIG. 22 is an example of searching with the conditional expression search: AND(文書, データ) ("document", "data"), which contains an AND condition. Because there is an AND condition, the integrated surrogate data is searched with the AND condition unchanged, while the divided surrogate data is searched with an OR condition. First, in the character component table search, the integrated surrogate data in the character component table of FIG. 16 is searched under the condition that both “文書” and “データ” are included, that is, that all of the characters ('文', '書', 'デ', 'ー', 'タ') are included. The divided surrogate data, on the other hand, is searched under the condition that either “文書” or “データ” is included; that is, documents containing the two characters ('文', '書') or documents containing the three characters ('デ', 'ー', 'タ') are searched for. In the example of FIG. 16, this character component table search yields sid 2 for the integrated surrogate data and sid 2 for the divided surrogate data. The condensed text search likewise yields integrated surrogate ID 2 and divided surrogate ID 2. Taking the divided surrogate data result first, divided surrogate ID 2 is converted to tid 6 by, for example, text ID table 2 shown in FIG. 19. Because the divided surrogate data was searched with the AND condition replaced by an OR condition, the text data must be searched; the text data whose tid is 6 is searched and the result is stored in the result buffer tidout. According to the data example shown in FIG. 4, the document whose tid is 6 does not satisfy the condition, so the buffer tidout is null. Here, null denotes the state of holding no data. Next, the integrated surrogate ID obtained from the integrated surrogate data is processed. Converting it by referring to text ID table 1 shown in FIG. 19 yields tids 3, 4, and 5. Because the tids are chained, the corresponding text data is searched; with the data of FIG. 4, for example, tids 3 and 4 are obtained. These are added to the buffer tidout, which holds the result of the divided surrogate ID processing, so that tidout finally holds tids 3 and 4, and these are output as the search result.
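The AND/OR treatment of the two kinds of surrogate data in the character component table search can be illustrated with a short Python sketch. The table contents below are invented for illustration and are not those of FIG. 16; `chars_present` and `candidates` are hypothetical names, and each table entry is modeled simply as the set of characters occurring in the corresponding text.

```python
def chars_present(components, term):
    """True if every character of `term` occurs in the component set."""
    return set(term) <= components

def candidates(component_table, terms, mode):
    """Return the sids whose component sets satisfy the terms under AND or OR."""
    combine = all if mode == "AND" else any
    return [sid for sid, comps in sorted(component_table.items())
            if combine(chars_present(comps, t) for t in terms)]

# Invented character component tables (sid -> set of characters).
integrated = {1: set("検索方法"), 2: set("文書データ量")}
divided = {1: set("情報検索"), 2: set("データ量"), 3: set("装置")}

terms = ["文書", "データ"]
integrated_hits = candidates(integrated, terms, "AND")  # AND kept as given
divided_hits = candidates(divided, terms, "OR")         # AND relaxed to OR
```

The OR relaxation is needed for divided surrogate data because the two terms may fall into different pieces of the same document; the follow-up scan of the text data then restores the original AND condition.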

【００５９】Another search example will be described with reference to FIG. 23. This is an example of a search with a compound character type, and the conditional expression is search: データ量 ("data amount"). In this case there is no AND condition between search terms, so the integrated surrogate data and the divided surrogate data can be searched with the same conditional expression. The character component table is searched under the condition that all of the characters ('デ', 'ー', 'タ', '量') are included, and the condensed text is searched under the condition that both of the partial character strings (“データ”, “量”), obtained by splitting the term at its character type change point, are included. As a result, with the surrogate data of FIGS. 16 and 17, the integrated surrogate ID is 2 and the divided surrogate ID is 2. Next, the divided surrogate ID is converted using text ID table 2. With the text ID table shown in FIG. 19, divided surrogate ID 2 is converted to tid 6. If the search word were of a single character type, this tid could be stored in the buffer tidout as is; because it is of a compound character type, the text data is searched to confirm that “データ” and “量” actually appear consecutively in the text. It is not necessary, however, to take all the text data of the document whose tid is 6 as the search range. Only the portion that the surrogate data search confirmed to contain both “データ” and “量” need be searched. According to text ID table 2 of FIG. 19, the text data corresponding to the resulting sid 2 is the portion with tid 6, offset 19,230, and length 19,107, so only this portion is searched and the result is stored in the buffer tidout. In the data example of FIG. 4, the word “データ量” does appear, so 6 is stored in the buffer tidout.

【００６０】Next, the integrated surrogate ID is converted into tids using text ID table 1. In the example of FIG. 19, sid 2 yields tids 3, 4, and 5. Because the tids are chained, the text data of these tids is searched, and as a result the text whose tid is 5 matches the condition. This result is added to the buffer tidout, so that tidout finally holds tids 6 and 5.
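Splitting a search term at its character type change points, as in the “データ量” example, could look like the following sketch. The text does not define the character types precisely, so the classification by Unicode block used here (hiragana, katakana, kanji, other) is an assumption, and the function names are hypothetical.

```python
def char_type(ch):
    """Classify a character by Unicode block (an assumed definition)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:  # includes the long-vowel mark 'ー' (U+30FC)
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"

def split_by_char_type(term):
    """Split `term` into maximal runs of characters of the same type."""
    parts = []
    for ch in term:
        if parts and char_type(parts[-1][-1]) == char_type(ch):
            parts[-1] += ch
        else:
            parts.append(ch)
    return parts

def is_compound(term):
    """A term is of a compound character type if it splits into several runs."""
    return len(split_by_char_type(term)) > 1

parts = split_by_char_type("データ量")
```

Here `parts` holds the two partial strings that the condensed text search must both find, and `is_compound` gives the flag that decides whether the text data has to be scanned to confirm that the parts are adjacent.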

【００６１】The above has described, as the second embodiment, data registration and search processing when the surrogate data is separated into integrated surrogate data and divided surrogate data. According to this embodiment, even when there is an AND condition between search terms, the surrogate data that must be searched with an OR condition can be limited, so a high narrowing-down rate is obtained and a high-speed full text search can be realized. In addition, when surrogate data is created by dividing long text data, the division points of the original text data are recorded, so the search time can be shortened by searching only the necessary text data.

【００６２】Next, a third embodiment will be described. In this embodiment, surrogate data is created by integrating registered documents on the basis of the number of documents. For example, if the integration unit for text data is set to 10 documents, the text data is grouped 10 documents at a time to create the surrogate file. The registration of surrogate data grouped by document count, and the search processing that uses it, are described in detail below. First, registration follows the registration procedure for integrated surrogate data described so far. That is, the text data of the predetermined number of documents is read and integrated to create the surrogate data. In this case, however, the text ID table and surrogate ID table used in the first embodiment are unnecessary, because the number of documents integrated is fixed and the corresponding surrogate ID can be obtained by calculation from the text ID. For example, when surrogate data is created 10 documents at a time, the surrogate data corresponding to the texts with tids 1 to 10 has sid 1. Conversely, the text data corresponding to sid 2 has tids 11 to 20. The corresponding tids can thus be obtained by simple arithmetic such as multiplying by 10. Surrogate data registered in this way can be represented in the same formats as FIGS. 8 and 9 of the first embodiment. In this embodiment, however, the correspondence is in units of 10 documents: sid 1 corresponds to tids 1 to 10, sid 2 to tids 11 to 20, and so on. As in the embodiments described so far, the search is performed by first searching the surrogate data to extract candidate documents and then searching the text data corresponding to those candidates. A specific example is described using the following conditional expression: search: データ ("data"). As before, the character component table is searched under the condition that all of the characters ('デ', 'ー', 'タ') are included, and the condensed text is searched for documents containing the character string “データ”. In the character component table of FIG. 8, the surrogate data satisfying this condition has sids 2 and 4. Accordingly, the subsequent condensed text search examines the condensed texts with sids 2 and 4 and extracts candidate documents containing the character string “データ”. In the example of FIG. 9, sids 2 and 4 qualify. The subsequent text data search targets the text data of the text IDs corresponding to the surrogate IDs obtained. With surrogate data created 10 documents at a time, the text IDs corresponding to sid 2 are tids 11 to 20, and those corresponding to sid 4 are tids 31 to 40. Therefore, the text data of the 20 documents with tids 11 to 20 and 31 to 40 is searched, and the documents matching the conditional expression are output as the search result.
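The table-free correspondence between tids and sids in this embodiment amounts to integer arithmetic. A minimal sketch, assuming 1-based IDs and a fixed group size of 10 as in the example (the function names are hypothetical):

```python
DOCS_PER_SURROGATE = 10  # integration unit used in the example

def tid_to_sid(tid, n=DOCS_PER_SURROGATE):
    """Surrogate ID covering a given text ID (both 1-based)."""
    return (tid - 1) // n + 1

def sid_to_tids(sid, n=DOCS_PER_SURROGATE):
    """Range of text IDs covered by a given surrogate ID."""
    first = (sid - 1) * n + 1
    return range(first, first + n)

# Candidate sids 2 and 4 from the example expand to 20 documents to scan.
to_scan = [tid for sid in (2, 4) for tid in sid_to_tids(sid)]
```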

【００６３】The above has described an embodiment in which surrogate data is created using the number of documents as the criterion for integration. This embodiment is effective when the amount of data per document is relatively uniform and small. It also has the advantage that no table is needed to associate the surrogate data with the text data.

【００６４】A fourth embodiment will now be described. FIG. 24 shows the configuration of this embodiment. Compared with the first embodiment, this embodiment differs in that a multiple-document batch read program is held in memory 2406, a directory table is held in memory 2408, and the text data of the documents is held on four magnetic disk devices 2403-A to 2403-D. Each of the four magnetic disk devices 2403-A to 2403-D contains a magnetic disk device controller and a buffer memory for temporarily holding data read from the disk, and is connected to the main bus via SCSI adapter 2409. The document data is distributed across the four magnetic disks. In this embodiment, the text data of each document is divided in units of 8 KB, the unit of data access to the disks, and stored on the respective disks. The directory table 2408 stores the same information as the directory file shown in FIG. 4. That is, by referring to the directory table 2408, the offset from the beginning of the file and the length can be determined for the text data of each candidate document obtained from the surrogate data. The multiple-document batch read program receives read commands consisting of this offset from the beginning of the file and the length, reads and integrates the text data distributed across the disks, and stores it in work memory 2407.

【００６５】The search processing is described below, focusing on the parts specific to this embodiment. First, for the characters used in the search words of the given conditional expression, documents containing all the characters in the search words are extracted as candidates using the character component table 2404 of the surrogate data. Although FIG. 24 does not show the condensed text as surrogate data, the surrogate ID table, and the like, it goes without saying that the condensed text, the surrogate ID table, and the like may also be used to obtain the candidate documents. Next, the text data of the candidate documents is referred to in order to find the documents that match the conditional expression; in this embodiment, the multiple-document batch read program is used at this point to read the text data from the magnetic disks into work memory 2407, where the search processing is performed.

【００６６】The details of reading this text data will be described with reference to FIG. 25. In this embodiment, as shown in FIG. 25, the text data is divided in units of 8 KB and stored on the disk devices. With the text data lengths of the example shown in FIG. 4, document 1 has a data length of 6232 bytes and is therefore stored on a single disk device, disk device A. Document 2, by contrast, has a data length of 13560 bytes, which is larger than 8 KB, so its text data is stored on two disk devices, B and C.

【００６７】At search time, for the candidate documents obtained from the surrogate data, a sequence of read commands, each consisting of the offset from the beginning of the file and the length of the text data, is first generated using the directory table. In the example of FIG. 25, documents 1, 3, and so on are obtained from the surrogate data as candidates, and for each of them the directory data, offset and length, is extracted from the directory table to form a read command. The multiple-document batch read program receives these read commands, splits each one, based on the offset from the beginning of the file and the length of the text data, into read commands for the data distributed across the disks, and issues the read commands to the disk devices in turn. In doing so, the directory table may also be referred to. For example, document 1 has offset 0, so it is stored from the beginning of the data, that is, from disk device A, and its length of 6232 bytes is less than 8 KB, so its data is stored only on disk device A. In this way, the disk device and storage position of the text data are calculated from the offset and length of the read command, and a read command is issued to each disk device. In the example of FIG. 25, a read command for document 1 is issued to disk device A, and a read command for document 3 is issued to disk device D.
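The text does not spell out how the disk device and storage position are calculated. The following Python sketch shows one layout that is consistent with the placements described (document 1 on A, document 2 on B and C, document 3 on D): it assumes, as a hypothesis, that every document starts on an 8 KB block boundary and that blocks are assigned to the four disks round-robin. The function name, the tuple format, and the use of a starting block index instead of a byte offset are all assumptions.

```python
BLOCK = 8 * 1024            # 8 KB access unit from the text
DISKS = ["A", "B", "C", "D"]

def read_commands(start_block, length):
    """Split one document read into (disk, block_index, nbytes) commands.

    Assumes the document begins at `start_block` and that block i
    lives on disk DISKS[i % 4] (round-robin striping).
    """
    commands = []
    remaining = length
    block = start_block
    while remaining > 0:
        nbytes = min(BLOCK, remaining)
        commands.append((DISKS[block % len(DISKS)], block, nbytes))
        remaining -= nbytes
        block += 1
    return commands

doc1 = read_commands(0, 6232)    # fits in one block on disk A
doc2 = read_commands(1, 13560)   # spans two blocks, on disks B and C
```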

【００６８】If, when such a read command is to be issued, the disk device is processing another command and is busy, the command is not issued to the disk device but is queued. Later, when the disk device becomes ready, the queued commands are sent to the disk device together. For example, in the example of FIG. 25, if the candidate documents obtained from the surrogate data are documents 1, 3, and 4, and the read command could not be sent because disk device A was busy, the read commands for documents 1 and 4 are sent together when the device becomes ready.
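The queueing behavior described above can be modeled with a small Python sketch. The `Dispatcher` class and its attributes are hypothetical, and the model is deliberately simplified: it records which batches would be sent, and does not mark a disk as busy after a command is issued to it.

```python
from collections import defaultdict, deque

class Dispatcher:
    """Toy model of issuing read commands to disks that may be busy."""

    def __init__(self):
        self.busy = set()                   # disks currently busy
        self.pending = defaultdict(deque)   # per-disk queue of held commands
        self.sent = []                      # batches issued: (disk, [commands])

    def issue(self, disk, command):
        if disk in self.busy:
            self.pending[disk].append(command)   # hold until the disk is ready
        else:
            self.sent.append((disk, [command]))

    def on_ready(self, disk):
        """Called when a disk becomes ready: flush its queue as one batch."""
        self.busy.discard(disk)
        if self.pending[disk]:
            self.sent.append((disk, list(self.pending[disk])))
            self.pending[disk].clear()

# Scenario from the text: disk A is busy while documents 1, 3, 4 are requested.
d = Dispatcher()
d.busy.add("A")
d.issue("A", "doc1")
d.issue("D", "doc3")
d.issue("A", "doc4")
d.on_ready("A")
```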

【００６９】Each disk device reads data from the magnetic disk according to the commands sent to it, stores the data in the buffer memory inside the disk device, and reports completion of the read command to the multiple-document batch read program. On receiving the report, the multiple-document batch read program integrates the data from the disk devices as shown in FIG. 26 and transfers it to work memory 2407. The example in the figure shows the text data of document 1, stored on disk device A, and the text data of document 3, stored on disk device D, each being transferred to the work memory. That is, in the read example of FIG. 25, two disk devices, A and D, operate in parallel. In practice, many candidate documents are obtained from the surrogate data search, so all the connected disk devices operate in parallel. The text data read from the disks into the work memory in this way is then referred to in order to find the documents that match the condition, and the search result is displayed on display 2402.

【００７０】The fourth embodiment has been described above. According to this embodiment, because multiple read commands are processed together, the magnetic disks operate in parallel and the read speed from the file improves, so the search response degrades little when the text data file is read. Furthermore, if the buffer of each magnetic disk device is also used as a read-ahead buffer that always reads a specified amount beyond the point of the last read command, the data may already be in the buffer when a read command arrives; in that case the overhead of head seek and rotational latency is eliminated entirely, giving a further performance improvement.

【００７１】[0071]

[Effects of the Invention] According to the present invention, surrogate data can be created from text data of approximately constant size, so the surrogate data has an appropriate size relative to the original document size, and the narrowing-down rate achieved by the surrogate data is also appropriate. As a result, a high-speed full text search can be realized. In addition, by creating the correspondence table between text data and surrogate data when documents are registered, the relevant text data can be located accurately after the hierarchical pre-search using the surrogate data. Likewise, when data is deleted or updated, the database can be kept consistent while the correspondence between text data and surrogate data is maintained. Further, by distinguishing integrated surrogate data from divided surrogate data, the integrated surrogate data can be searched with the AND condition even when an AND condition between terms is given, so the narrowing-down rate can be improved. At the same time, for searches with compound-character-type terms, whose results cannot be determined from the surrogate data alone, only the necessary portion of the text data need be searched on the basis of the divided surrogate data, even for long text data that would otherwise require more data scanning; this reduces the amount searched and shortens the search time. Furthermore, because text data can be searched with a plurality of external storage devices operating in parallel, the search speed can be increased.

[Brief description of drawings]

FIG. 1 is a conceptual diagram of the division and integration of surrogate data according to document size.

FIG. 2 is a conceptual diagram showing the correspondence table between surrogate data and text data.

FIG. 3 is a configuration diagram of the first embodiment.

FIG. 4 is a diagram showing the format of registered data in the first embodiment.

FIG. 5 is a conceptual diagram showing the conversion table from text ID to surrogate ID in the first embodiment.

FIG. 6 is a conceptual diagram showing the conversion table from surrogate ID to text ID in the first embodiment.

FIG. 7 is a PAD diagram showing the processing procedure for data registration in the first embodiment.

FIG. 8 is a diagram showing an example of creating surrogate data (character component table) in the first embodiment.

FIG. 9 is a diagram showing an example of creating surrogate data (condensed text) in the first embodiment.

FIG. 10 is a PAD diagram showing the processing procedure for data search in the first embodiment.

FIG. 11 is a diagram illustrating the procedure for extracting candidate documents at search time in the first embodiment.

FIG. 12 is a diagram illustrating the procedure for extracting candidate documents at search time in the first embodiment.

FIG. 13 is a PAD diagram showing the processing procedure for data deletion in the first embodiment.

FIG. 14 is a PAD diagram showing the processing procedure for data update in the first embodiment.

FIG. 15 is an explanatory diagram showing the method of creating surrogate data in the second embodiment.

FIG. 16 is a diagram showing an example of creating surrogate data (character component table) in the second embodiment.

FIG. 17 is a diagram showing an example of creating surrogate data (condensed text) in the second embodiment.

FIG. 18 is a conceptual diagram showing the conversion table from text ID to surrogate ID in the second embodiment.

FIG. 19 is a conceptual diagram showing the conversion table from surrogate ID to text ID in the second embodiment.

FIG. 20 is a PAD diagram showing the processing procedure for data registration in the second embodiment.

FIG. 21 is a PAD diagram showing the processing procedure for data search in the second embodiment.

FIG. 22 is a diagram illustrating the procedure for extracting candidate documents at search time in the second embodiment.

FIG. 23 is a diagram illustrating the procedure for extracting candidate documents at search time in the second embodiment.

FIG. 24 is a diagram showing the configuration of the fourth embodiment.

FIG. 25 is a diagram for explaining the details of reading text data.

FIG. 26 is a diagram for explaining data transfer from each disk device to the work memory.

[Explanation of symbols]

301, 2401: Keyboard
302, 2402: Display
303, 2403: Magnetic disk
304, 306, 307: Memory
305, 2405: CPU
2404, 2406, 2407, 2408: Memory
2409: SCSI bus adapter

Claims

[Claims]

1. An information retrieval method in which text data of documents is stored as a file; surrogate data, in which document information is compressed by storing the characters or words in the text data, or both, without duplication, is separately created and stored as a file; and, at retrieval time, the surrogate data is first referred to in order to obtain, as candidates, documents containing the characters or words included in the search terms of a specified search condition expression, and the text data of the candidates is then referred to in order to retrieve documents matching the search condition expression, the method being an adaptive surrogate type information retrieval method characterized in that: the text data of a plurality of documents is integrated under a predetermined condition and the surrogate data is created for the integrated text data; a correspondence table between the text data and the surrogate data is created; and, at retrieval time, the surrogate data is referred to in order to obtain candidates of surrogate data that may satisfy the search condition expression, the text data corresponding to the candidates is identified from the correspondence table, and only that text data is referred to in order to retrieve documents matching the search condition expression.

2. The adaptive surrogate-type information search method according to claim 1, wherein a plurality of text data are integrated with a predetermined number of documents as the predetermined condition.

3. The adaptive surrogate type information retrieval method according to claim 1, wherein the predetermined condition is based on a predetermined text data capacity, and a plurality of text data are integrated within a range not exceeding that capacity.

4. An information retrieval method in which text data of documents is stored as a file; surrogate data, in which document information is compressed by storing the characters or words in the text data, or both, without duplication, is separately created for the documents and stored as a file; and, at retrieval time, the surrogate data is first referred to in order to obtain, as candidates, documents containing the characters or words included in the search terms of a specified search condition expression, and the text data of the candidates is then referred to in order to retrieve documents matching the search condition expression, the method being an adaptive surrogate type information retrieval method characterized in that: a predetermined data capacity Qt is set in advance; when the surrogate data is created, text data whose capacity is less than Qt is integrated and the surrogate data is created for the integrated text data, and text data whose capacity for a single document is Qt or more is divided within a range not exceeding Qt and the surrogate data is created separately for each piece of divided text data; a correspondence table between the text data and the surrogate data is created; and, at retrieval time, the surrogate data is referred to in order to obtain candidates of surrogate data that may satisfy the search condition expression, the text data corresponding to the candidates is identified from the correspondence table, and only that text data is referred to in order to retrieve documents matching the search condition expression.

5. The adaptive surrogate type information retrieval method according to claim 4, wherein, when text data exceeding Qt is divided, the text data is divided at a break of a sentence, paragraph, chapter, or section, and the surrogate data is then created.
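
The division of claim 5 could look roughly like the following sketch, offered as an illustration rather than as the claimed implementation. It assumes that Qt is counted in characters, that sentence ends are marked by `.`, `!`, `?` or `。`, and that a newline marks a paragraph break. Chapter and section breaks are not modelled. The function name `split_at_breaks` is hypothetical.

```python
import re

# A unit is a run of text up to and including its terminator(s) and any
# trailing whitespace, or a final run that has no terminator.
_UNIT = re.compile(r"[^.!?。\n]*[.!?。\n]+\s*|[^.!?。\n]+$")


def split_at_breaks(text, qt):
    """Divide text into pieces of at most qt characters, cutting at
    sentence or line ends whenever a unit fits within qt."""
    pieces, current = [], ""
    for unit in _UNIT.findall(text):
        # Close the current piece if adding this unit would exceed qt.
        if current and len(current) + len(unit) > qt:
            pieces.append(current)
            current = ""
        # Fallback (an assumption of this sketch): a single unit longer
        # than qt is cut at a fixed width.
        while len(unit) > qt:
            pieces.append(unit[:qt])
            unit = unit[qt:]
        current += unit
    if current:
        pieces.append(current)
    return pieces
```

Concatenating the returned pieces reproduces the input, so offsets into the pieces can be used directly as rows of the correspondence table.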

6. The adaptive surrogate type information retrieval method according to claim 4, wherein the value of Qt is from 1 KB to 30 KB.

7. The adaptive surrogate type information retrieval method according to claim 4, wherein the value of Qt is one read unit of the text storage medium, that is, the read buffer size.

8. The adaptive surrogate type information retrieval method according to claim 4, wherein, when an AND condition between search terms, that is, a condition for retrieving documents that include all of the specified search terms, is given as the search condition expression, surrogate data created by integrating a plurality of documents is searched under an AND condition between the terms, surrogate data created separately for each of the text data obtained by dividing one document is searched under an OR condition between the terms, and candidates of surrogate data are thereby given.
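
The reason claim 8 treats the two kinds of surrogate data differently can be seen in a small sketch. When one document is divided, its search terms may fall into different pieces, so no single piece needs to contain all of them. An integrated surrogate, by contrast, contains every word of each document it covers. The code below is a hypothetical illustration: the tuple layout `(kind, words, doc_ids)` and the name `and_search` are assumptions, and words are whitespace-separated tokens.

```python
def and_search(docs, surrogates, terms):
    """docs: {doc_id: text}.
    surrogates: list of (kind, words, doc_ids), where kind is
    'integrated' or 'divided' and words is a set of distinct words."""
    terms = set(terms)
    candidate_docs = []
    for kind, words, doc_ids in surrogates:
        if kind == "integrated":
            # AND between terms: every term must occur in the surrogate.
            matched = terms <= words
        else:
            # OR between terms: one term is enough, because the other
            # terms may sit in another piece of the same document.
            matched = not terms.isdisjoint(words)
        if matched:
            for d in doc_ids:
                if d not in candidate_docs:
                    candidate_docs.append(d)
    # Final check against the text itself removes the false candidates.
    return [d for d in candidate_docs if terms <= set(docs[d].split())]
```

For example, if documents "a" and "b" share one integrated surrogate that happens to contain both "red" and "blue", both become candidates and both are rejected by the final text check unless one of them contains both words itself. A divided document whose first piece holds "red" and whose second piece holds "blue" is still found.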

9. The adaptive surrogate type information retrieval method according to claim 8, wherein the file that stores the surrogate data created by integrating the plurality of documents and the file that stores the surrogate data created separately for each of the text data obtained by dividing the one document are separate files.

10. An information retrieval method in which text data of a document is stored as a file, and surrogate data in which the document information is compressed is created separately for the document by storing the characters or the words in the text data, or both, in a file without duplication, and in which, at the time of information retrieval, the surrogate data is first referred to in order to give, as candidates, documents that contain the characters or words included in the search terms of a specified search condition expression, and the text data is then referred to in order to retrieve, from among the candidates, the documents that actually match the search condition expression, the method being an adaptive surrogate type information retrieval method characterized in that: a plurality of external storage devices are provided for storing the text data of documents as files; the text data of each document is stored sequentially in the plurality of external storage devices, and the text data of a document exceeding a predetermined capacity is divided and stored across the plurality of external storage devices; and, at the time of retrieval, document candidates are given by referring to the surrogate data, and the text data corresponding to the document candidates is read collectively from the plurality of external storage devices and referred to, whereby documents that match the condition are retrieved.
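
The storage arrangement of claim 10 could be pictured with the following minimal sketch, given as an illustration only. It makes several assumptions: each external storage device is modelled as an in-memory list, documents are placed round-robin, capacity is counted in characters, and the class `StripedStore` with its methods `put` and `read_many` is hypothetical. The per-device loop in `read_many` runs sequentially here, whereas separate physical devices could serve their requests concurrently.

```python
class StripedStore:
    def __init__(self, n_devices, capacity):
        self.devices = [[] for _ in range(n_devices)]  # one list per device
        self.capacity = capacity                       # the predetermined capacity
        self.directory = {}  # doc_id -> [(device_index, position), ...]
        self._next = 0       # device that receives the next piece

    def put(self, doc_id, text):
        # A document above the capacity is divided; each piece goes to
        # the next device in turn. Smaller documents are stored whole.
        step = self.capacity
        pieces = [text[i:i + step] for i in range(0, len(text), step)] or [""]
        locations = []
        for piece in pieces:
            dev = self._next
            self.devices[dev].append(piece)
            locations.append((dev, len(self.devices[dev]) - 1))
            self._next = (dev + 1) % len(self.devices)
        self.directory[doc_id] = locations

    def read_many(self, doc_ids):
        # Gather all requests for the candidate documents, grouped by device.
        wanted = {}
        for doc_id in doc_ids:
            for order, (dev, pos) in enumerate(self.directory[doc_id]):
                wanted.setdefault(dev, []).append((pos, doc_id, order))
        # Serve each device's requests in position order, then reassemble.
        parts = {doc_id: {} for doc_id in doc_ids}
        for dev, requests in wanted.items():
            for pos, doc_id, order in sorted(requests):
                parts[doc_id][order] = self.devices[dev][pos]
        return {d: "".join(p[i] for i in range(len(p))) for d, p in parts.items()}
```

The `directory` plays the role of the directory data that names the device and storage position of each piece, so a batch of candidates can be fetched with one pass over each device.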

11. An information retrieval device comprising an input device for inputting a conditional expression, an output device for outputting a result, a storage device, and a processing device, in which text data of a document is stored as a file, and surrogate data in which the document information is compressed is created separately for the document by storing the characters or the words in the text data, or both, in a file without duplication, and in which, at the time of information retrieval, the surrogate data is first referred to in order to give, as candidates, documents that contain the characters or words included in the search terms of a specified search condition expression, and the text data is then referred to in order to retrieve, from among the candidates, the documents that actually match the search condition expression, the device being an adaptive surrogate type information retrieval device characterized by comprising: means for integrating the text data of a plurality of documents under a predetermined condition and creating the surrogate data for the integrated text data; means for creating a correspondence table between the text data and the created surrogate data; means for referring to the surrogate data at the time of retrieval and giving candidates of surrogate data that may satisfy the search condition expression; and means for finding the text data corresponding to the candidates on the basis of the correspondence table and retrieving documents that match the search condition expression by referring only to the corresponding text data.

12. The adaptive surrogate type information retrieval device according to claim 11, wherein the predetermined condition is a predetermined number of documents, and the text data of that number of documents are integrated.

13. The adaptive surrogate type information retrieval device according to claim 11, wherein the predetermined condition is a predetermined text data capacity, and a plurality of text data are integrated within a range not exceeding that capacity.

14. An information retrieval device comprising an input device for inputting a conditional expression, an output device for outputting a result, a storage device, and a processing device, in which text data of a document is stored as a file, and surrogate data in which the document information is compressed is created separately for the document by storing the characters or the words in the text data, or both, in a file without duplication, and in which, at the time of information retrieval, the surrogate data is first referred to in order to give, as candidates, documents that contain the characters or words included in the search terms of a specified search condition expression, and the text data is then referred to in order to retrieve, from among the candidates, the documents that actually match the search condition expression, the device being an adaptive surrogate type information retrieval device characterized in that a predetermined data capacity Qt is set in advance, and the processing device comprises: means for integrating, when the surrogate data is created, text data whose capacity is less than Qt and creating the surrogate data for the integrated text data; means for dividing, when the capacity of one text data is Qt or more, the text data within a range not exceeding Qt and creating the surrogate data separately for each of the divided text data; means for creating a correspondence table between the text data and the created surrogate data; means for referring to the surrogate data at the time of retrieval and giving candidates of surrogate data that may satisfy the search condition expression; and means for finding the text data corresponding to the candidates on the basis of the correspondence table and retrieving documents that match the search condition expression by referring only to the corresponding text data.

15. The adaptive surrogate type information retrieval device according to claim 14, wherein, when dividing text data exceeding Qt, the processing device divides the text data at a break of a sentence, paragraph, chapter, or section, and then creates the surrogate data.

16. The adaptive surrogate type information retrieval device according to claim 14, wherein the value of Qt is from 1 KB to 30 KB.

17. The adaptive surrogate type information retrieval device according to claim 14, wherein the value of Qt is one read unit of the text storage medium, that is, the read buffer size.

18. The adaptive surrogate type information retrieval device according to claim 14, wherein, when an AND condition between search terms, that is, a condition for retrieving documents that include all of the specified search terms, is given as the search condition expression, the processing device searches surrogate data created by integrating a plurality of documents under an AND condition between the terms, searches surrogate data created separately for each of the text data obtained by dividing one document under an OR condition between the terms, and thereby gives candidates of surrogate data.

19. The adaptive surrogate type information retrieval device according to claim 18, wherein the file that stores the surrogate data created by integrating the plurality of documents and the file that stores the surrogate data created separately for each of the text data obtained by dividing the one document are separate files.

20. An information retrieval device comprising an input device for inputting a conditional expression, an output device for outputting a result, a storage device, and a processing device, in which text data of a document is stored as a file, and surrogate data in which the document information is compressed is created separately for the document by storing the characters or the words in the text data, or both, in a file without duplication, and in which, at the time of information retrieval, the surrogate data is first referred to in order to give, as candidates, documents that contain the characters or words included in the search terms of a specified search condition expression, and the text data is then referred to in order to retrieve, from among the candidates, the documents that actually match the search condition expression, the device being an adaptive surrogate type information retrieval device characterized by comprising: a plurality of external storage devices for storing the text data of documents as files; and means for storing the text data of each document sequentially in the plurality of external storage devices and for dividing the text data of a document exceeding a predetermined capacity and storing it across the plurality of external storage devices; wherein the processing device comprises: means for referring to the surrogate data at the time of retrieval and extracting document candidates; means for designating the external storage devices to be accessed and the storage positions on the basis of directory data of the one or more extracted document candidates, and for collectively reading the text data corresponding to the document candidates from the plurality of external storage devices; and means for retrieving documents that match the condition by referring to the read text data.