JP3665112B2

JP3665112B2 - Character string search method and apparatus

Info

Publication number: JP3665112B2
Application number: JP24732795A
Authority: JP
Inventors: 卓哉市川; 良文坂井
Original assignee: NS Solutions Corp
Current assignee: NS Solutions Corp
Priority date: 1995-09-26
Filing date: 1995-09-26
Publication date: 2005-06-29
Anticipated expiration: 2015-09-26
Also published as: JPH0991297A

Description

【０００１】
【発明の属する技術分野】
本発明は、与えられた検索キーワードに応じて検索を行う情報検索における文字列検索方法及び装置に関する。
【０００２】
【従来の技術】
国語辞書や英和辞書、百科事典類などはこれまで紙媒体によって刊行されてきたが、近年、コンピュータ可読型の記憶媒体、特にＣＤ−ＲＯＭなどの読み出し専用記憶媒体に格納された形態でこれら辞書、事典類が流通するようになってきている。こういったＣＤ−ＲＯＭ版の辞書・事典（電子化された辞書・事典）では、検索時間の短縮を目的として、インデックスファイルを設けるのが一般的である。インデックスファイルは、検索対象となる語（見出し語ないし索引語）ごとに、その語に対応する物件（辞書などであれば説明文）がＣＤ−ＲＯＭ中のどこに所在するかの情報（いわゆるポインタ）を記述したファイルであり、インデックスファイルに対して文字列検索を行うことにより、すなわち利用者の入力した検索キーワードに一致する見出し語ないし索引語がインデックスファイル中にあるかを調べることによって、検索対象の物件に短時間でアクセスすることが可能になる。
【０００３】
なお、国語辞書の場合には、見出し語とその見出し語に対する物件（説明文）が１対１で対応すると考えることができるが、百科事典などの場合には、１つの索引語に複数の物件（説明文）が対応することがありうる。また、特許文献などの全文データベースを格納したＣＤ−ＲＯＭにおいても、例えば統制語方式により、検索に使用されるキーワードに基づいてインデックスファイルを予め構成しておくことにより、インデックスファイルに登録されているキーワードについては短時間で全文検索を行うことが可能になる。
【０００４】
ところで、ファイル中に検索キーワードと一致する文字列があるかどうかを検索する文字列検索方法として、検索キーワードを分割して一群の連語を生成し、ファイル中の文字列と一群の連語との一致度を求めことにより、文字列を検索する方法があり、この方法は広く用いられている。連語とは、検索キーワード中で隣接する文字の組み合わせで構成された１あるいは数文字の長さの文字の並びのことである。検索キーワードのままであるとその長さが一定しないので処理が複雑になるが、このように連語に分割して検索することにより、大量のデータに対して高速での検索処理が可能になる。
【０００５】
ここで、この連語を用いた文字列検索方法について、図１３のフローチャートを用いて説明する。ここでは、連語の文字長が２文字であり、検索キーワードとして「あいうえお」が選ばれるものとする。
【０００６】
まず、利用者によって検索キーワード（ここでは「あいうえお」）が入力され（ステップ９１）、入力された検索キーワードが連語長２文字の連語「あい」、「いう」、「うえ」、「えお」に分割される（ステップ９２）。続いて、各連語に関して対象とするファイルを検索してファイル中の各項目にその連語が含まれているかを調べ、連語と一致した文字列をカウントする（ステップ９３）。全ての連語についての検索が終ったかを判断し（ステップ９４）、未検索の連語があればステップ９３に戻り、全ての連語についての検索が終っていれば、文字列ごとにカウント数を合計して一致度を算出し（ステップ９５）、一致度が１００％である文字列を出力し（ステップ９６）、処理を終了する。
【０００７】
一致度は、検索キーワードと文字列との一致の度合を示す尺度であって、各文字列ごとに、
一致度（％）＝［（カウント数の合計）／（連語の種類）］×１００
なる式で算出される。
【０００８】
ここでは、連語長が２文字で検索キーワードが「あいうえお」であるので、連語の種類は「あい」、「いう」、「うえ」、「えお」の４種類である。表１は、各種の文字列に対する一致度を示した表であり、表中の○印はその連語がその文字列に含まれていることを示している。文字列に対する一致度が１００％である場合に、その文字列が検索キーワードと同一の文字列であることが多いので、検索者に対しては一致度が１００％である文字列が出力される。
【０００９】
【表１】

ところで、実際の文字列検索の局面では、検索キーワードと完全に一致する文字列のみを検索（完全一致検索）したのでは、利用者の検索要求に対して不十分であることがある。例えば、辞書の見出し項目の検索を例に挙げれば、表記のゆれなどがある場合には利用者の入力した検索キーワードと辞書での見出し項目が一致しないことがあり、あるいは、類似の単語を網羅的に検索したい場合もあり、これらの場合には、完全一致の項目のみを検索したのでは目的とする項目に達することはできず、曖昧検索を行う必要がある。また、ある部分文字列で始まる全ての単語、ある部分文字列で終る全ての単語、ある部分文字列を含む全ての単語を検索したい場合には、それぞれ、先頭一致検索、後方一致検索、部分一致検索を行う必要がある。なお、以下の説明において、完全一致検索、先頭一致検索、後方一致検索、部分一致検索を総称して一致検索とする。また、完全一致検索、先頭一致検索、後方一致検索、部分一致検索、曖昧検索、一致検索などの別を検索種別という。
【００１０】
上述した連語を利用した文字列検索方法では、一致度があるしきい値以上であれば１００％未満であっても検索されたとすることにより、検索キーワードに類似した文字列を検索することができ、曖昧検索を実行することができる。
【００１１】
【発明が解決しようとする課題】
しかしながら、上述した連語に基づく文字列検索方法には、完全一致検索、先頭一致検索、後方一致検索、部分一致検索、曖昧検索などを含む多様な検索種別に的確に対応するのには不十分であるという問題点がある。連語による方法では、一致検索において検索キーワードと一致しないものも検出すること（過検出）が起こり得るが、過検出を少なくして高速で文字列検索処理を行うために、まだ改善の余地がある。
【００１２】
また、日本語の場合、表記用文字としてかな文字と漢字とが併存するので、同一項目に対して利用者が入力する検索キーワードも多種類にわたることがある。そこで、辞書における索引語として、辞書単語のほかにその読みを登録する（索引語「富士山」に対して、読み「ふじさん」を登録する）ことが考えられるが、その場合であっても、例えば項目「富士山」に対する検索キーワードとして、「ふじ山」、「富士山」、「ふじさん」、「ふ士山」などの入力が考えられる。連語を用いた従来の方法では、「ふじ山」や「ふ士山」の入力に対して、目的とする文字列を検索することは容易ではない。
【００１３】
本発明の目的は、完全一致検索や曖昧検索などの多様な検索種別での検索を過検索が少なくて高速で実行でき、かつ任意に漢字とかなが混じりあったような検索キーワードでの検索も可能な文字列検索方法及び装置を提供することにある。
【００１４】
【課題を解決するための手段】
本発明の文字列検索方法は、入力した検索キーワードと指定された検索種別に基づいてファイル中から検索キーワードに対応する項目を探索する文字列検索方法において、利用者が入力した検索種別を判別し、利用者が入力した検索キーワードを構成する文字の字種について、検索キーワードがかな文字のみからなりかつ検索種別が一致検索である場合には、文字長が２である連語を検索キーワードから順次抽出し、それ以外の場合には、文字長が１である連語を検索キーワードから順次抽出し、ファイル中の各項目の文字列と前記一群の連語とを比較して当該文字列に対する一致度を算出し、一致度がしきい値以上である項目を検索された項目とすることを特徴とする。
【００１５】
本発明の文字列検索方法は、検索キーワードの文字の字種と検索種別とに応じて異なる生成規則による連語を検索キーワードから抽出しようとするものである。ここで字種とは、漢字、かな文字などの種類の別を指す。このように生成規則を変化させることにより、多様な検索種別での検索を過検索が少なくて高速で実行でき、かつ任意に漢字とかなが混じりあったような検索キーワードでの検索も可能になる。
【００１６】
具体的には、例えば、検索キーワードがかな文字のみからなりかつ検索種別が一致検索である場合には、文字長が２である連語を検索キーワードから順次抽出し、それ以外の場合には、文字長が１である連語を検索キーワードから順次抽出する。一致検索の際に検索キーワードがかな文字のみで構成されている場合に連語長を２文字とすることにより、過検出が抑止され、また、その他の場合に連語長を１文字とすることにより、曖昧検索などを的確に行うことが可能になる。
【００１７】
さらに、任意に漢字とかな文字が混っているような検索キーワードに対応するため、ファイル中の各項目には、それぞれ、当該項目の読みに対応するかな文字列が付加するようにすることが望ましい。上述のように、漢字かな混じりの検索キーワードに対しては連語の文字長を短く、例えば１とすることによって、任意に漢字かな混じりとなっている検索キーワードに対しても有効に文字列検索を行うことが可能になる。
【００１８】
本発明の文字列検索装置は、入力した検索キーワードと指定された検索種別に基づいてファイル中から前記検索キーワードに対応する項目を探索する文字列検索装置において、検索キーワードと検索種別としきい値を入力する入力手段と、利用者が入力した検索種別を判別するとともに利用者が入力した検索キーワードを構成する文字の字種について、検索キーワードがかな文字のみからなりかつ検索種別が一致検索である場合には、文字長が２である連語を検索キーワードから順次抽出し、それ以外の場合には、文字長が１である連語を生成する連語生成手段と、ファイル中の各項目の文字列と連語とを比較して当該文字列に対する一致度を算出する処理手段とを有し、一致度が入力されたしきい値以上である項目を検索された項目とすることを特徴とする。
【００１９】
【発明の実施の形態】
次に、本発明の望ましい実施の形態について、図面を参照して説明する。図１は、本発明の実施の一形態の情報検索システムを説明するブロック図である。
【００２０】
この情報検索システムは、辞書や事典類を内容とするＣＤ−ＲＯＭ２０と、利用者の入力した検索キーワードに応じてＣＤ−ＲＯＭ２０を検索し検索結果を表示する処理装置１０とによって構成されている。後述するように、ＣＤ−ＲＯＭ２０の検索に際しては、インデックスデータファイル３０中の項目に対して本発明の方法によって文字列検索が行われており、処理装置１０は、本発明の文字列検索装置としても機能する。
【００２１】
処理装置１０には、ＣＤ−ＲＯＭ２０を装着して必要なデータを読み出すためのＣＤ−ＲＯＭドライブ１１と、ＣＰＵなどで構成され検索処理やＣＤ−ＲＯＭドライブ１１の動作の制御などを行うための処理部１２と、検索処理に必要なファイルを一時的に格納するためのファイル格納用メモリ１３と、タッチパネルやキーボードなどからなり利用者からの検索要求、検索キーワード、検索種別、しきい値などが入力する入力部１４と、液晶パネルなどからなり検索結果を利用者に対して表示するための表示部１５とが設けられている。処理部１２には、ＣＤ−ＲＯＭ２０中あるいはファイル格納用メモリ１３内のファイルに対して連語による検索を行う検索部１６と、入力した検索キーワードから検索条件に応じて連語を生成する連語生成部１７と、一致度を算出してしきい値と比較する比較部１８が設けられている。また、表示部１５は、外部のテレビジョン受像機に対し、検索結果をテレビジョン画像として表示するための映像信号を出力するものであってもよい。
【００２２】
ＣＤ−ＲＯＭ２０の記憶領域の構成が図２に示されている。ここでは、ＣＤ−ＲＯＭ２０がＣＤ−ＲＯＭ版の辞書である例が示されているが、別に辞書に限定される必要はなく、百科事典類、写真集、旅行ガイドブック、各種ハンドブック・規格書、論文集、特許公報類など、検索を行って所望のデータにアクセスすることを目的とするものであれば、どのようなものであってもよい。
【００２３】
ＣＤ−ＲＯＭ２０の格納領域は、検索処理プログラムが格納される処理プログラム格納部２１と、インデックスファイル類が格納されるインデックスファイル格納部２２と、辞書の説明文（物件）が格納される辞書データ本体格納部２３とに分けられている。本実施の形態では、処理装置１０の処理部１２で走らせるための検索処理プログラム自体を検索対象のＣＤ−ＲＯＭ２０内に格納し、ＣＤ−ＲＯＭ２０がＣＤ−ＲＯＭドライブ１１に装着された時点で、検索処理プログラムが処理装置１０の処理部１２に読み込まれるようにしている。
【００２４】
本実施の形態では、図３に示すように、インデックスファイルとしてインデックスデータファイル３０を使用するとともに、検索の高速化のために、検索用指示ファイル３１と検索用倒置ファイル３２を使用している。検索用指示ファイル３１と検索用倒置ファイル３２は、インデックスデータファイル３０から学習工程を経て生成されるファイルである。これらインデックスデータファイル３０、検索用指示ファイル３１及び検索用倒置ファイル３２はいずれもインデックスファイル格納部２２内に格納され、このうち、検索用指示ファイル３１は検索時には処理装置１０のファイル格納用メモリ１３内に読み込まれるようになっている。また、説明文ごとに連続番号でインデックス番号が付与されており、索引語からインデックス番号を知ることによって、ＣＤ−ＲＯＭ２０中での対応する説明文の格納場所に対して即座にアクセスすることができるようになっている。以下、各ファイル３０〜３２について説明する。
【００２５】
インデックスデータファイル３０は、図４に示すように、ＣＤ−ＲＯＭ２０内の説明文（物件）にアクセスするため基本となるファイルであって、説明文ごとに、その説明文に対するインデックス番号と索引語（見出し語）とＣＤ−ＲＯＭ２０内での格納位置とを記述したものである。説明文は索引語の読みの五十音順で配置されており、各説明文に対して０から始まる連続番号であるインデックス番号が、重複しないように付与されている。各索引語は「読み」と「実体」とに分かれており、「読み」にはその索引語の読みが格納され、「実体」にはその索引語の実際の表記（漢字やアルファベット）が格納されている。なお、この実施の形態ではひらがなとかたかなの区別、清音と濁音、半濁音の区別は行っておらず、また、ひらがなのみで表記される索引語については、「実体」には何も格納していない。
【００２６】
検索用倒置ファイル３２は、いわゆる倒置（インバーテッド）ファイルとして構成されており、曖昧検索などを実現するために、索引語（キーワード）を１文字あるいは２文字の連語（例えば、「あ」,「い」,「ああ」,「山」）に分解し、連語をキーとしてその連語を含む項目のインデックス番号が参照できるように構成されている。連語とは本来は２文字以上の文字列集団を指すが、本明細書においては、１文字のものも連語と呼ぶことにする。索引語を連語に分解しているので、１索引語に１つの説明文しか対応しない場合（国語辞書などの場合）であっても１つの連語には複数のインデックス番号が対応し、したがって、連語ごとにレコードを構成するとすれば、検索用倒置ファイル３２は可変長レコードのファイルであるといえる。以下、検索用倒置ファイルにおける連語ごとのインデックス番号の並びを連語のレコードと呼ぶ。なお、検索用指示ファイル３１が設けられているので、検索用倒置ファイル３２には、連語そのものを格納しておく必要はない。一方、検索用指示ファイル３１は、連語をキーとして、検索用倒置ファイルにおいてその連語のレコードがどこにあるかを指示するファイルである。したがって、連語をごとにレコードを構成するとするすれば、検索用指示ファイルは固定長のファイルであるといえる。後述するように、実際に検索を行う場合には、それに先立って検索用指示ファイル３１がＣＤ−ＲＯＭ２０から処理装置１０側に読み出される。
【００２７】
次に、インデックスデータファイル３０から検索用指示ファイル３１及び検索用倒置ファイル３２を生成する学習工程について、図５を用いて説明する。まず、各索引語から１文字の連語としての構成文字を抽出する。「読み」の部分については、２文字の連語（構成文字列）も抽出する。例えば、見出し語「（読み）あそさん、（実体）阿蘇山」からは、「あ」,「そ」,「さ」,「ん」,「あそ」,「そさ」,「さん」,「阿」,「蘇」,「山」が抽出される。そして、これら各構成文字がどのインデックス番号の見出し語に含まれているかを求め、そのインデックス番号を保存する。つまり、構成文字（列）をキーとしインデックス番号を並びとするインバーテッドファイルを生成する。そして、ページング処理を実行し、インデックス番号の代りにページング後のインデックス番号が記録されるようにする。ページングとは、検索速度の向上を目的として、一連のインデックス番号を複数のページに分けることである。例えば、インデックス番号を６５５３６（＝２¹⁶）で除算したとして、商をページの番号、余りをページングのインデックス番号とする。このようにページングを定義すると、ページングの結果、インデックス番号２３２１０は第０ページの２３２１０と、６５５３７は第１ページの１と表わされることになる。
【００２８】
なお、インデックス番号は索引語の読みの五十音順で付与されているから、索引語の読みの先頭文字が指定されれば、対応するインデックス番号の値の取り得る範囲やどのページに属しているかを知ることができる。本実施の形態では、そのことを利用して、完全一致検索と先頭一致検索の高速化を図っている。場合によっては、１ページに含まれるインデックス番号の数を可変にしてページ境界と先頭文字の境目が一致するようにしてもよく、そうすることにより、先頭文字が指定されれば検索すべきページが１つに定まることになる。また、補助ファイルとして先頭文字位置ファイルを設け、「読み」の部分に関して先頭文字ごとにその先頭文字が始まるインデックス番号を格納するようにしてもよい。これにより、例えば、「読み」において先頭文字が「う」であるものは、インデックス番号が２３６９から３９５５の範囲にあるものと即座に分かり、検索対象を絞り込むのに役立つ。
【００２９】
図６は検索用指示ファイル３１の構成例を示している。ここでは、各構成文字の各ページごとに、その構成文字が出現した索引語の数（該当するインデックス番号の数）が格納されている。検索用指示ファイル３１での構成文字の順は検索用倒置ファイル３２での構成文字の順と同じとなっており、検索用指示ファイル３１において注目する構成文字の直前の構成文字までに出現回数として格納された数の総和を求めれば、その総和は、検索用倒置ファイル３２でのその注目する構成文字に対するポインタとして扱うことができる。あるいは、検索用指示ファイル３１には、各構成文字の各ページごとに、検索用倒置ファイル３２における当該構成文字の当該ページの先頭のアドレスを直接記録するようにしてもよく、このように構成すれば、検索用指示ファイル３１での値を検索用倒置ファイル３２のレコードに対するポインタとしてそのまま使用することが可能になる。
【００３０】
図７は検索用倒置ファイル３２の構成例を示している。この検索用倒置ファイル３２では、各構成文字の各ページを単位としてレコードが構成され、各レコードは、可変長であって、該当する構成文字の該当するページに出現するインデックス番号を並びとして格納している。各レコードには、構成文字やページを表わすデータは格納されていない。インデックス番号自体は、所定の整数型データとして表わされている。検索用指示ファイル３１に格納されているデータが図６に示すようであれば、各レコードの要素数（格納されているインデックス番号の数）は、図７において要素数として表わされた数となる。
【００３１】
次に、情報検索処理について説明する。まず、情報検索の処理手順の概要について、図８及び図９を用いて説明する。
【００３２】
ＣＤ−ＲＯＭ２０が処理装置１０に装着されると、まず、検索処理プログラムがＣＤ−ＲＯＭ２０から読み出されて処理装置１０の処理部１２にロードされ、この検索処理プログラムの実行が開始する（ステップ１０１）。続いて、ＣＤ−ＲＯＭ２０から検索用指示ファイル３１が読み出され、処理装置１０のファイル格納用メモリ１３に格納される（ステップ１０２）。
【００３３】
利用者が検索キーワードを入力すると（ステップ１０３）、入力した検索キーワードに応じてファイル格納用メモリ１３内の検索用指示ファイル３１が検索され、その検索結果によってＣＤ−ＲＯＭ２０内の検索用倒置ファイル３２が検索される（ステップ１０４）。すなわち、図９に示すように、検索キーワードが連語に分解され、連語によって検索用指示ファイル３１が検索され、検索用倒置ファイル３２における検索すべきレコードの位置が求められる。そして、該当する連語のレコードが検索用倒置ファイル３２から検索されて処理装置１０側に読み込まれる。読み込まれた連語のレコードの数に対するあるインデックス番号が出現するレコードの数の割合すなわち一致度を求め、一致検索であればこの一致度が１００％であり、曖昧検索であればこの一致度が所定のしきい値を上回っているときに、そのインデックス番号に基づいて説明文を読み込むようにする。そして、上述のように読み込まれた説明文すなわち検索結果の説明文を表示部１５に表示し、利用者に対して次の検索を行うかどうかを問い合わせる（ステップ１０５）。次の検索を行う場合にはステップ１０３に戻って次の検索キーワードの入力を受け付け、次の検索を行わない場合にはそのまま処理を終了する。
【００３４】
この実施の形態では、データ量の大きなインデックスデータファイル３０や検索用倒置ファイル３２をＣＤ−ＲＯＭ２０内に残しておき、データ量が小さくかつ検索用倒置ファイル３２に対するポインタとして使用される検索用指示ファイル３１を処理装置１０内のファイル格納用メモリ１３にロードし、検索キーワードに基づく検索をまず検索用指示ファイル３１に対して実行することにより、十分なメモリを備えていないような場合であっても、高速で検索を行うことが可能になる。すなわち、最終的には検索用倒置ファイル３２からの処理装置１０へのデータの読み込みが必要になるが、検索用指示ファイル３１を用いて対象となる連語のレコードを絞っているので、検索用倒置ファイル３２から読み込まれるレコードの数を必要最小限にし、ＣＤ−ＲＯＭ２０からの読み込みに要する時間を縮減することが可能になっている。検索用指示ファイル３１はファイル格納用メモリ１３に常駐させておくことが可能なので、繰り返して検索を行う場合に大幅に検索時間を減らすことが可能である。
【００３５】
以下、上述のステップ１０３及び１０４すなわち文字列検索処理を含む情報検索処理の詳細について、図１０及び図１１を用いて説明する。
【００３６】
利用者によって検索種別（完全一致検索、部分一致検索、先頭一致検索、後方一致検索あるいは曖昧検索の別）と検索キーワードが入力されると（ステップ１１１）、まず、曖昧検索かそうでないかの判断がなされる（ステップ１１２）。曖昧検索の場合には、利用者から一致度に対するしきい値ｘの入力を受け（ステップ１１３）、入力された検索キーワードから、漢字１文字で構成された連語とひらがな１文字で構成された連語を順次抽出する（ステップ１１４）。本実施の態様では、上述したように、１文字あるいは２文字からなる連語に検索キーワードを分解し、分解して得た連語に基づいて検索を行う。例えば検索キーワード「あそ山」からは「あ」,「そ」,「山」が連語として抽出される。なお、同一の連語が重複しては抽出されないようにする。そして、抽出された連語により、ファイル格納用メモリ１３に既に格納されている検索用指示ファイル３１を検索する（ステップ１１５）。検索キーワード「あそ山」の例でいえば、検索用指示ファイル３１での構成文字「あ」,「そ」,「山」の内容がそれぞれ読み出され、「あ」,「そ」,「山」に関する検索用倒置ファイル３２へのポインタがそれぞれ算出される。そして、ステップ１２５に移行する。
【００３７】
一方、ステップ１１２で曖昧検索でない場合、すなわち一致検索の場合には、しきい値ｘを自動的に１００％に設定し（ステップ１１６）、入力された検索キーワードが全てかな文字からなるあるいは全て漢字からなるかどうかを判定する（ステップ１１７）。全てかな文字あるいは全て漢字ではない場合（典型的にはかなと漢字が混在する場合）には、上述のステップ１１４とステップ１２５を順次実行してステップ１２５に移行し、全てかな文字あるいは全て漢字の場合には、検索キーワードが全てかなであるかを判定する（ステップ１１８）。ステップ１１８で全てかなの場合には、検索キーワードから、ひらがな２文字で構成された連語を順次抽出する（ステップ１１９）。例えば、検索キーワードが「あそさん」であれば、連語として「あそ」,「そさ」,「さん」が抽出される。一方、ステップ１１８で全てかなでない場合、すなわち全て漢字の場合には、検索キーワードから、漢字１文字で構成された連語を順次抽出する（ステップ１２０）。例えば、検索キーワード「阿蘇山」からは連語として「阿」,「蘇」,「山」が抽出される。そして、ステップ１１９を実行した場合もステップ１２０を実行した場合も、このようにして抽出された連語により、上述と同様に、ファイル格納用メモリ１３に既に格納されている検索用指示ファイル３１を検索する（ステップ１２１）。
【００３８】
ところで、後述するように検索実行文字に基づいて最終的にはＣＤ−ＲＯＭ２０内の検索用倒置ファイル３２が検索されることになっており、その際、連語が多数あると、それだけＣＤ−ＲＯＭ２０へのアクセス回数が増えることになる。そこで、ステップ１２１の実行後、連語がＮ個以上見つかったかどうかを判断し、連語がＮ個以上であれば、出現回数が多い方の連語から削って連語の数をＮ−１にする（ステップ１２２）。連語の出現回数は検索用指示ファイル３１に記述されている。Ｎは例えば７に設定する。ここで出現回数の多い方から削るのは、出現回数の多い連語は多くの見出し語に含まれていて、入力された検索キーワードを特定するのに余り役立たないと考えられるからである。ステップ１２２の実行後、▲１▼検索種別が完全一致検索あるいは先頭一致検索であって、かつ、▲２▼先頭文字がかなである、が満たされているかどうかを判断する（ステップ１２３）。満たされていない場合にはそのままステップ１２５に移行し、満たされている場合には、上述のように構成文字の先頭文字が特定のページに対応していることから、検索キーワードの先頭のかな文字に基づいて、検索すべき対象のページを決定し（ステップ１２４）、その後、ステップ１２６に移行する。
【００３９】
ステップ１２５では、検索種別が曖昧検索であるかを判定し、曖昧検索であればそのままステップ１２６に移行し、曖昧検索でない場合にはステップ１２３に移行する。
【００４０】
ステップ１２６では、ステップ１１５あるいはステップ１２１での検索用指示ファイル３１の検索結果に応じ、ＣＤ−ＲＯＭ２０内の検索用倒置ファイル３２から未処理の１ページ分のレコードを読み込む。検索キーワード「あそ山」の例では、「あ」,「そ」,「山」のそれぞれについてのレコードが読み出される。後述するように、ステップ１２４で対象ページが設定されている場合を除いてステップ１２５は繰り返して実行されるが、例えばまず、第０ページに属するレコードが読み出され、次にステップ１２５が実行されるときに第１ページに属するレコードが読み出される。また、ステップ１２４で対象ページが設定されている場合には、その対象ページに属するレコードが読み出される。上述したようにステップ１１５あるいはステップ１２１では、各連語ごとに検索用倒置ファイル３２でのその連語のレコードへのポインタ（格納位置に関する情報）が求められているから、このポインタを用いて検索用倒置ファイル３２にアクセスし、その連語のレコードを読み出せばよい。すなわち、検索用倒置ファイル３２の全体を走査する必要はなく、検索用倒置ファイル３２の必要な場所に直接アクセスすることが可能になっている。
【００４１】
そして、検索キーワードから生成した一群の連語に対する各インデックス番号の一致度を求める（ステップ１２７）。図１２は一致度の集計を説明する図である。すなわち、検索用倒置ファイル３２から読み出されたレコードについて、各インデックス番号ごとに出現回数をカウントする。図１５において○印はそのレコードにおいてそのインデックス番号が記録されていたことを示している。この例では、検索キーワード「あそ山」から抽出された各連語「あ」,「そ」,「山」のレコードについて、それぞれどのインデックス番号が出現したかが示されており、例えば連語「あ」のレコードには、インデックス番号０,３,８,９,１３,１５が記録されていることが示されている。そして、連語の数（この例では３）で出現回数を除算することにより、各インデックス文字ごとに一致度が求められている。この例では、連語の各レコードに共通にインデックス番号１３が含まれ（出現回数が３）、インデックス番号１３に対する一致度が１００％であることが示されている。
【００４２】
一致度の集計が終了したら、▲１▼検索種別が完全一致か先頭一致検索であり、かつ、▲２▼検索文字列の先頭がかなである、という条件を満足するかどうかを判定する（ステップ１２８）。この条件を満足しない場合にはそのままステップ１３０に移行し、満足する場合には、検索キーワードの先頭文字に応じて評価対象となるインデックス番号の範囲を求め（ステップ１２９）、以後の処理ではその範囲内のインデックス番号のみを対象とするようにして、ステップ１３０に移行する。このように先頭文字に応じてインデックス番号の範囲を絞るのは、インデックス番号の一致度のみに着目すると検索キーワード「あそ山」に対して見出し語「山あそ」もヒットすることになるので、このような検索ノイズの発生を防ぎ、ＣＤ−ＲＯＭ２０への不要なアクセスを減らすためである。先頭文字「あ」で範囲を限定すれば、検索キーワード「あそ山」に対し、「あ山そ」はヒットするが、「山あそ」などのヒットは防ぐことができる。
【００４３】
ステップ１３０では、一致度がしきい値ｘ以上となっているインデックス番号を求める。一致検索に対してはステップ１１６でｘ＝１００％としているので、一致度が１００％のインデックス番号のみが求められる。一方、曖昧検索の場合には、ステップ１１３で入力したしきい値ｘに応じてインデックス番号が求められる。そして、求められたインデックス番号に基づいてＣＤ−ＲＯＭ２０内のインデックスデータファイル３０を参照し、それらのインデックス番号に対応する見出し語を求める（ステップ１３１）。その際、それらのインデックス番号に対応する説明文の辞書データ本体格納部２３での格納位置も求めておく。
【００４４】
続いて、検索種別が曖昧検索であるかどうかを判断し（ステップ１３２）、曖昧検索であればそのままステップ１３４に移行し、曖昧検索でない場合すなわち一致検索である場合には、求められた見出し語が検索条件と合致しているかを判定する（ステップ１３３）。ステップ１３３において検索条件と合致している場合にはステップ１３４に移行し、検索条件に合致していない場合にはステップ１３５に移行する。ここで検索条件と合致しているかを判断するのは、本実施の形態の手順によれば、検索キーワード「あそ山」に対して「あそ山」と「あ山そ」の両方が見出し語として検出されるので、ノイズである「あ山そ」を排除するためである。なお、曖昧検索の場合には、利用者の意図する検索対象に「あ山そ」も含まれている可能性があるので、検索条件に合致しているかどうかのステップ１３３でのチェックは行わない。
【００４５】
曖昧検索である場合とステップ１３３で検索条件に合致している場合にはステップ１３４に移行するが、ステップ１３４では、該当するインデックス番号に対応する説明文をＣＤ−ＲＯＭ２０の辞書データ本体格納部２３から読み出し、検索された見出し語と対応する説明文とを表示部１５に表示し、ステップ１３５に移行する。辞書データ本体格納部２３にアクセスする場合には、ステップ１３１においてインデックスデータファイル３０にアクセスした際に既に求めてある格納位置の情報を使用する。
【００４６】
ステップ１３５では、全ページの処理が終了したかどうかを判断し、未処理のページが残っているのであればステップ１２６に戻り、全ページの処理が終了しているのであれば、入力された検索キーワードに対する情報検索処理を終了する。ステップ１２４で対象ページが定められている場合には、未処理のページが存在しないので、そのまま処理を終了する。
【００４７】
【発明の効果】
以上説明したように本発明は、検索キーワードの文字種と検索種別とに応じて異なる生成規則による連語を検索キーワードから抽出しようとするものである。このように生成規則を変化させることにより、多様な検索種別での検索を過検索が少なくて高速で実行でき、かつ任意に漢字とかなが混じりあったような検索キーワードでの検索も可能になるという効果がある。
【００４８】
例えば、一致検索の際に検索キーワードがかな文字のみで構成されている場合に連語長を２文字とすることにより、かな文字のみの一致検索での過検出が抑止され、また、その他の場合に連語長を１文字とすることにより、曖昧検索などを的確に行うことが可能になる。
【００４９】
ファイル中の各項目には、それぞれ、当該項目の読みに対応するかな文字列が付加するようにすることにより、漢字かな混じりの検索キーワードに対しても有効に文字列検索を行うことが可能になる。
【図面の簡単な説明】
【図１】本発明の実施の一形態の情報検索システムを説明するブロック図である。
【図２】ＣＤ−ＲＯＭ内でのデータの配置を示す図である。
【図３】情報検索処理に使用される各種ファイル間の関係を示す図である。
【図４】インデックスデータファイルの内容の一例を示す図である。
【図５】インデックスデータファイルから各種ファイルを生成するための学習過程を示す図である。
【図６】検索用指示ファイルの内容の一例を示す図である。
【図７】検索用倒置ファイルの内容の一例を示す図である。
【図８】図１の情報検索システムにおける情報検索処理の概要を示すフローチャートである。
【図９】図１の情報検索システムにおける情報検索処理時のデータの流れの概略を示す図である。
【図１０】情報検索処理の具体的処理手順を示すフローチャートである。
【図１１】情報検索処理の具体的処理手順を示すフローチャートである。
【図１２】一致度の集計を説明する図である。
【図１３】従来の文字列検索方法の処理手順の一例を示すフローチャートである。
【符号の説明】
１０処理装置
１１ＣＤ−ＲＯＭドライブ
１２処理部
１３ファイル格納用メモリ
１４入力部
１５表示部
１６検索部
１７連語生成部
１８比較部
２０ＣＤ−ＲＯＭ
２１処理プログラム格納部
２２インデックスファイル格納部
２３辞書データ本体格納部
３０インデックスデータファイル
３１検索用指示ファイル
３２検索用倒置ファイル
１０１〜１０５，１１１〜１３５ステップ[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a character string search method and apparatus in information search for searching according to a given search keyword.
[0002]
[Prior art]
Japanese language dictionaries, English-Japanese dictionaries, encyclopedias, and the like have been published in the past, but in recent years these dictionaries are stored in a computer-readable storage medium, particularly a read-only storage medium such as a CD-ROM, Encyclopedias are beginning to circulate. In such a CD-ROM version dictionary / encyclopedia (electronic dictionary / encyclopedia), an index file is generally provided for the purpose of shortening search time. The index file is information (what is called a pointer) indicating where in the CD-ROM the property (descriptive text if a dictionary or the like) corresponding to the word is located for each word (keyword or index word) to be searched. The file to be searched by performing a character string search on the index file, that is, by checking whether there is a headword or index word in the index file that matches the search keyword entered by the user. Can be accessed in a short time.
[0003]
In the case of a national language dictionary, it can be considered that a headword and a property (descriptive text) corresponding to the headword correspond to each other one by one, but in the case of an encyclopedia or the like, a plurality of properties per index word. (Description) may correspond. Further, even in a CD-ROM storing a full-text database such as a patent document, the index file is registered in the index file by pre-configuring the index file based on the keyword used for the search by, for example, the controlled word method. It is possible to perform a full text search for keywords in a short time.
[0004]
By the way, as a character string search method for searching whether there is a character string that matches the search keyword in the file, the search keyword is divided to generate a group of collocations, and the string in the file matches the group of collocations There is a method of searching for a character string by obtaining the degree, and this method is widely used. A collocation is a sequence of one or several characters long composed of a combination of adjacent characters in a search keyword. If the search keyword is left as it is, the length of the search keyword is not constant, and the processing becomes complicated. However, by dividing the search into collocations as described above, a high-speed search process can be performed for a large amount of data.
[0005]
Here, a character string search method using this collocation will be described with reference to the flowchart of FIG. Here, it is assumed that the collocation character length is 2 characters and “Aiueo” is selected as the search keyword.
[0006]
First, a search keyword (here, “Aiueo”) is input by the user (Step 91), and the input search keyword is a combination of two words “Ai”, “say”, “Ue”, “Eo”. (Step 92). Subsequently, the target file for each collocation is searched to check whether the collocation is included in each item in the file, and the character string that matches the collocation is counted (step 93). It is judged whether or not the search for all the collocations has been completed (step 94). If there are unsearched collocations, the process returns to step 93. If the search for all collocations has been completed, the count is summed for each character string. The degree of coincidence is calculated (step 95), a character string having a degree of coincidence of 100% is output (step 96), and the process ends.
[0007]
The degree of match is a measure of the degree of match between the search keyword and the string, and for each string,
Match (%) = [(total number of counts) / (type of collocation)] × 100
It is calculated by the following formula.
[0008]
Here, since the collocation length is 2 characters and the search keyword is “Aiueo”, there are four types of collocations: “Ai”, “say”, “Ue”, “Eo”. Table 1 is a table showing the degree of coincidence for various character strings, and a circle in the table indicates that the collocation is included in the character string. When the matching degree for a character string is 100%, the character string is often the same character string as the search keyword, so that a character string with a matching degree of 100% is output to the searcher. .
[0009]
[Table 1]

By the way, in an actual character string search aspect, searching only a character string that completely matches a search keyword (complete match search) may be insufficient for a user's search request. For example, in the case of searching for a dictionary entry item, the search keyword entered by the user may not match the entry item in the dictionary if there is a variation in the notation, or similar words are covered. In these cases, it is necessary to perform an ambiguous search because it is not possible to reach the target item by searching only an exact match item. If you want to search for all words that start with a partial character string, all words that end with a partial character string, or all words that include a partial character string, you can search for a first match, a backward match, and a partial match, respectively. Need to do a search. In the following description, a complete match search, a head match search, a back match search, and a partial match search are collectively referred to as a match search. Further, search types such as complete match search, head match search, backward match search, partial match search, fuzzy search, and match search are called search types.
[0010]
In the above-described character string search method using collocations, it is possible to search for a character string similar to a search keyword by assuming that a search is performed even if the degree of matching is equal to or higher than a certain threshold value and less than 100%. A fuzzy search can be performed.
[0011]
[Problems to be solved by the invention]
However, the above-described character string search method based on collocation is not sufficient to accurately cope with various search types including complete match search, head match search, backward match search, partial match search, fuzzy search, and the like. There is a problem that there is. In the method based on collocation, it is possible to detect even a search keyword that does not match the search keyword (overdetection), but there is still room for improvement in order to reduce overdetection and perform high-speed character string search processing. .
[0012]
In the case of Japanese, since Kana and Kanji coexist as notation characters, there may be a wide variety of search keywords input by the user for the same item. Therefore, as an index word in the dictionary, it is possible to register the reading in addition to the dictionary word (register the reading “Fujisan” for the index word “Mt. Fuji”), but even in that case, For example, as a search keyword for the item “Mt. Fuji”, inputs such as “Mt. Fuji”, “Mt. Fuji”, “Mr. Fuji”, “Mt. Fuji” can be considered. In the conventional method using collocation, it is not easy to search for a target character string in response to input of “Fujiyama” or “Fujiyama”.
[0013]
The object of the present invention is to perform a search with various search types such as exact match search and fuzzy search with a low over-search and high-speed search, and also with a search keyword in which kanji and kana are arbitrarily mixed. It is an object of the present invention to provide a possible character string search method and apparatus.
[0014]
[Means for Solving the Problems]
The character string search method of the present invention is a character string search method for searching an item corresponding to a search keyword from a file based on an input search keyword and a specified search type. When the search type entered by the user is determined, and the search keyword is composed of only kana characters and the search type is a match search, the character length is 2 are extracted sequentially from the search keyword, otherwise, a combination of characters with a character length of 1 is sequentially extracted from the search keyword, A character string of each item in the file is compared with the group of collocations to calculate a degree of matching with the character string, and an item having a degree of matching equal to or higher than a threshold is set as a searched item. .
[0015]
The character string search method of the present invention is intended to extract, from a search keyword, collocations according to different generation rules depending on the character type and search type of the character of the search keyword. Here, the character type refers to different types such as kanji and kana characters. By changing the generation rule in this way, it is possible to execute a search with various search types at a high speed with few over-searches, and it is also possible to perform a search with a search keyword in which kanji and kana are arbitrarily mixed. .
[0016]
Specifically, for example, when the search keyword is composed of only kana characters and the search type is a match search, consecutive words having a character length of 2 are sequentially extracted from the search keyword. Conjunctions having a length of 1 are sequentially extracted from the search keyword. When the search keyword is composed of only kana characters in the match search, the over-detection is suppressed by setting the collocation length to 2 characters, and in other cases, the collocation length is set to 1 character. It becomes possible to perform fuzzy searches accurately.
[0017]
Furthermore, in order to correspond to a search keyword in which kanji and kana characters are arbitrarily mixed, a kana character string corresponding to the reading of the item may be added to each item in the file. desirable. As described above, for a search keyword mixed with kanji and kana, the string length of the collocation is short, for example, 1, so that a character string search can be effectively performed even for a search keyword arbitrarily mixed with kanji. It becomes possible to do.
[0018]
A character string search device according to the present invention includes a search keyword and a search type in a character string search device that searches an item corresponding to the search keyword from a file based on an input search keyword and a specified search type. age An input means for inputting a threshold value; The search type entered by the user is discriminated and the character type of the search keyword entered by the user is the character length of the search keyword consisting of only kana characters and the search type is a match search. A collocation generating unit that sequentially extracts collocations that are 2 from the search keyword, and otherwise generates a collocation with a character length of 1; A string for each item in the file Communicating Processing means for comparing words and calculating the degree of coincidence for the character string. Entered An item that is equal to or greater than a threshold value is set as a searched item.
[0019]
DETAILED DESCRIPTION OF THE INVENTION
Next, preferred embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram illustrating an information search system according to an embodiment of the present invention.
[0020]
This information retrieval system includes a CD-ROM 20 having a dictionary and encyclopedia as contents, and a processing device 10 that retrieves the CD-ROM 20 according to a retrieval keyword input by a user and displays a retrieval result. As will be described later, when searching the CD-ROM 20, the character string search is performed on the items in the index data file 30 by the method of the present invention, and the processing device 10 is used as the character string search device of the present invention. Also works.
[0021]
The processing device 10 includes a CD-ROM drive 11 for loading a CD-ROM 20 and reading necessary data, and a CPU and the like, and a process for performing a search process and controlling the operation of the CD-ROM drive 11. Unit 12, a file storage memory 13 for temporarily storing files necessary for the search process, and a search request from the user, a search keyword, a search type, a threshold value, and the like, which are input from a touch panel or a keyboard, are input. And an input unit 14 including a liquid crystal panel and the like, and a display unit 15 for displaying a search result to the user. The processing unit 12 includes a search unit 16 that searches for a file in the CD-ROM 20 or in the file storage memory 13 by a collocation, and a collocation generation unit 17 that generates a collocation from the input search keyword according to a search condition. And a comparison unit 18 that calculates the degree of coincidence and compares it with a threshold value. The display unit 15 may output a video signal for displaying the search result as a television image to an external television receiver.
[0022]
The configuration of the storage area of the CD-ROM 20 is shown in FIG. Here, an example is shown in which the CD-ROM 20 is a CD-ROM version dictionary, but it is not necessarily limited to a dictionary. Encyclopedias, photo collections, travel guide books, various handbooks / standards, Any material such as a collection of papers or patent publications may be used as long as the purpose is to search and access desired data.
[0023]
The storage area of the CD-ROM 20 includes a processing program storage unit 21 in which a search processing program is stored, an index file storage unit 22 in which index files are stored, and a dictionary data main body in which a description (property) of the dictionary is stored. It is divided into a storage unit 23. In the present embodiment, the search processing program itself to be run by the processing unit 12 of the processing device 10 is stored in the CD-ROM 20 to be searched, and when the CD-ROM 20 is mounted on the CD-ROM drive 11, The search processing program is read by the processing unit 12 of the processing device 10.
[0024]
In this embodiment, as shown in FIG. 3, an index data file 30 is used as an index file, and a search instruction file 31 and a search inverted file 32 are used for speeding up the search. The search instruction file 31 and the search inversion file 32 are files generated from the index data file 30 through a learning process. The index data file 30, the search instruction file 31, and the search inverted file 32 are all stored in the index file storage unit 22. Of these, the search instruction file 31 is stored in the file storage memory 13 of the processing apparatus 10 during the search. It is supposed to be read in. In addition, an index number is assigned to each explanatory sentence by a serial number, and by knowing the index number from the index word, the storage location of the corresponding explanatory sentence in the CD-ROM 20 can be immediately accessed. It is like that. Hereinafter, each of the files 30 to 32 will be described.
[0025]
As shown in FIG. 4, the index data file 30 is a basic file for accessing an explanatory note (property) in the CD-ROM 20. For each explanatory note, an index number and an index word ( Headword) and the storage position in the CD-ROM 20 are described. The explanatory texts are arranged in the order of the reading of the index words, and the index numbers, which are consecutive numbers starting from 0, are assigned to the explanatory texts so as not to overlap. Each index word is divided into "reading" and "substance", "reading" stores the reading of the index word, "entity" stores the actual notation (kanji and alphabet) of the index word Has been. In this embodiment, there is no distinction between hiragana and katakana, clear sound and muddy sound, and semi-turbid sound. Not.
[0026]
The search inverted file 32 is configured as a so-called inverted file, and in order to realize an ambiguous search or the like, an index word (keyword) is a one-character or two-character collocation (for example, “a”, “ ”,“ Oh ”,“ Mountain ”), and the index number of the item including the collocation can be referred to using the collocation as a key. A collocation originally refers to a character string group of two or more characters, but in this specification, a one-character string is also called a collocation. Since the index word is decomposed into collocations, even if only one explanatory sentence corresponds to one index word (in the case of a national language dictionary or the like), one collocation corresponds to a plurality of index numbers. If the record is formed for each, the search inverted file 32 can be said to be a variable length record file. Hereinafter, the sequence of index numbers for each collocation in the search inverted file is referred to as a collocation record. Since the search instruction file 31 is provided, it is not necessary to store the collocation itself in the search inverted file 32. On the other hand, the search instruction file 31 is a file that indicates where the record of the collocation is in the inverted file for search using the collocation as a key. Therefore, if a record is formed for each collocation, it can be said that the search instruction file is a fixed-length file. As will be described later, when a search is actually performed, the search instruction file 31 is read from the CD-ROM 20 to the processing device 10 prior to the search.
[0027]
Next, a learning process for generating the search instruction file 31 and the search inverted file 32 from the index data file 30 will be described with reference to FIG. First, constituent characters as one-letter collocation are extracted from each index word. For the “reading” part, a two-letter word (component character string) is also extracted. For example, from the headwords “(reading) Aso-san, (substance) Asoyama”, “a”, “so”, “sa”, “n”, “aso”, “sosa”, “san” , “A”, “Su”, “Mountain” are extracted. Then, it is determined which index number includes each of these constituent characters, and the index number is stored. That is, an inverted file is generated in which the constituent characters (sequences) are used as keys and the index numbers are arranged. Then, paging processing is executed so that the index number after paging is recorded instead of the index number. Paging is to divide a series of index numbers into a plurality of pages for the purpose of improving search speed. For example, the index number is 65536 (= 2 ¹⁶ ), The quotient is the page number and the remainder is the paging index number. When paging is defined in this way, as a result of paging, the index number 23210 is represented as 23210 of the 0th page, and 65537 is represented as 1 of the first page.
[0028]
Since index numbers are assigned in the order of the reading of the index word, if the first character of the index word reading is specified, the range of the corresponding index number value and which page it belongs to I can know. In the present embodiment, this is utilized to speed up the complete match search and the head match search. In some cases, the number of index numbers included in one page may be made variable so that the boundary between the page boundary and the first character matches, so that if the first character is specified, the page to be searched It will be fixed to one. Alternatively, a head character position file may be provided as an auxiliary file, and an index number starting with the head character may be stored for each head character regarding the “read” portion. As a result, for example, in the case of “reading”, if the first character is “u”, it is immediately known that the index number is in the range of 2369 to 3955, which is useful for narrowing down the search target.
[0029]
FIG. 6 shows a configuration example of the search instruction file 31. Here, for each page of each constituent character, the number of index words in which the constituent character appears (the number of corresponding index numbers) is stored. The order of the constituent characters in the search instruction file 31 is the same as the order of the constituent characters in the search inverted file 32. The number of appearances up to the constituent characters immediately before the constituent character of interest in the search instruction file 31 is as follows. If the sum of the stored numbers is obtained, the sum can be handled as a pointer to the constituent character of interest in the search inverted file 32. Alternatively, the search instruction file 31 may directly record the start address of the page of the constituent character in the search inverted file 32 for each page of the constituent character. For example, the value in the search instruction file 31 can be used as it is as a pointer to the record in the search inversion file 32.
[0030]
FIG. 7 shows a configuration example of the search inversion file 32. In this inversion file for search 32, a record is configured with each page of each constituent character as a unit, and each record has a variable length and stores an index number appearing on the corresponding page of the corresponding constituent character as a list. ing. Each record does not store data representing constituent characters or pages. The index number itself is represented as predetermined integer type data. If the data stored in the search instruction file 31 is as shown in FIG. 6, the number of elements of each record (the number of index numbers stored) is the number represented as the number of elements in FIG. Become.
[0031]
Next, the information search process will be described. First, the outline of the information search processing procedure will be described with reference to FIGS.
[0032]
When the CD-ROM 20 is loaded in the processing device 10, first, a search processing program is read from the CD-ROM 20, loaded into the processing unit 12 of the processing device 10, and execution of this search processing program starts (step 101). ). Subsequently, the search instruction file 31 is read from the CD-ROM 20 and stored in the file storage memory 13 of the processing apparatus 10 (step 102).
[0033]
When the user inputs a search keyword (step 103), the search instruction file 31 in the file storage memory 13 is searched according to the input search keyword, and the search inverted file 32 in the CD-ROM 20 is searched according to the search result. Is searched (step 104). That is, as shown in FIG. 9, the search keyword is decomposed into collocations, the search instruction file 31 is searched with the collocations, and the position of the record to be searched in the search inverted file 32 is obtained. Then, the corresponding collocation record is retrieved from the search inversion file 32 and read into the processing apparatus 10 side. The ratio of the number of records in which a certain index number appears with respect to the number of collocation records read, that is, the degree of coincidence is obtained. This degree of coincidence is 100% for a coincidence search, and this coincidence is predetermined for an ambiguous search. When the threshold is exceeded, the description is read based on the index number. Then, the explanatory text read as described above, that is, the explanatory text of the search result is displayed on the display unit 15, and the user is inquired whether or not to perform the next search (step 105). If the next search is to be performed, the process returns to step 103 to accept the input of the next search keyword. If the next search is not to be performed, the process is terminated.
[0034]
In this embodiment, the index data file 30 and the search inverted file 32 having a large data amount are left in the CD-ROM 20, and the search instruction file having a small data amount and used as a pointer to the search inverted file 32 is used. Even if a sufficient memory is not provided by loading 31 to the file storage memory 13 in the processing device 10 and executing a search based on the search keyword on the search instruction file 31 first. It becomes possible to search at high speed. That is, finally, it is necessary to read data from the search inversion file 32 to the processing device 10, but since the search target file 31 is narrowed down using the search instruction file 31, the search inversion is inverted. It is possible to minimize the number of records read from the file 32 and reduce the time required for reading from the CD-ROM 20. Since the search instruction file 31 can be resident in the file storage memory 13, the search time can be greatly reduced when the search is repeated.
[0035]
Hereinafter, the details of the above-described

steps

103 and 104, that is, the information search process including the character string search process will be described with reference to FIGS.
[0036]
When the user inputs a search type (exact search, partial match search, head match search, backward match search or fuzzy search) and a search keyword (step 111), first, it is determined whether the search is fuzzy or not. (Step 112). In the case of fuzzy search, a threshold x for the degree of coincidence is input from the user (step 113), and a collocation composed of one kanji character and one hiragana character from the inputted search keyword. Are sequentially extracted (step 114). In this embodiment, as described above, the search keyword is decomposed into one or two consecutive words, and the search is performed based on the combined words obtained by the decomposition. For example, from the search keyword “Asoyama”, “Aso”, “So”, and “Mountain” are extracted as collocations. Note that the same collocation is not extracted twice. Then, the search instruction file 31 already stored in the file storage memory 13 is searched with the extracted consecutive words (step 115). In the example of the search keyword “Asoyama”, the contents of the constituent characters “a”, “so”, “mountain” in the search instruction file 31 are read out, and “a”, “so”, “ The pointers to the search inversion file 32 relating to “mountains” are respectively calculated. Then, the process proceeds to step 125.
[0037]
On the other hand, if it is not an ambiguous search in step 112, that is, if it is a match search, the threshold value x is automatically set to 100% (step 116), and the input search keyword consists of all kana characters or all kanji characters. It is determined whether or not (step 117). When not all kana characters or all kanji characters (typically when kana and kanji characters are mixed), the

above steps

114 and 125 are sequentially executed, and the process proceeds to step 125. In this case, it is determined whether all the search keywords are kana (step 118). If all of them are kana in step 118, consecutive words composed of two hiragana characters are sequentially extracted from the search keyword (step 119). For example, if the search keyword is “Asosan”, “Aso”, “Sosa”, and “san” are extracted as collocations. On the other hand, if it is not all kana in step 118, that is, if all kanji are used, consecutive words composed of one kanji character are sequentially extracted from the search keyword (step 120). For example, “a”, “so”, and “mountain” are extracted from the search keyword “Mt. Aso” as collocation words. In both cases where step 119 is executed and step 120 is executed, the search instruction file 31 already stored in the file storage memory 13 is searched using the extracted consecutive words as described above. (Step 121).
[0038]
By the way, as will be described later, the search inverted file 32 in the CD-ROM 20 is finally searched based on the search execution character, and if there are a large number of collocations at that time, it is transferred to the CD-ROM 20 accordingly. Will increase the number of accesses. Therefore, after execution of step 121, it is determined whether or not N or more collocations have been found. If there are N or more collocations, the number of collocations is reduced to N-1 by cutting from the collocation with the highest number of occurrences (step S1). 122). The number of appearances of the collocation is described in the search instruction file 31. N is set to 7, for example. Here, the reason why the number of appearances is larger is that the collocations with the larger number of appearances are included in many headwords, and it is considered that they are not very useful for specifying the input search keyword. After execution of step 122, it is determined whether (1) the search type is an exact match search or a head match search, and (2) the first character is kana (step 123). If it is not satisfied, the process proceeds to step 125 as it is. If it is satisfied, the first character of the constituent character corresponds to a specific page as described above. Based on the above, a target page to be searched is determined (step 124), and then the process proceeds to step 126.
[0039]
In step 125, it is determined whether the search type is fuzzy search. If it is fuzzy search, the process proceeds to step 126 as it is. If it is not fuzzy search, the process proceeds to step 123.
[0040]
In step 126, an unprocessed record for one page is read from the search inverted file 32 in the CD-ROM 20 in accordance with the search result of the search instruction file 31 in

step

115 or 121. In the example of the search keyword “Mt. Aso”, records for each of “A”, “So”, and “Mountain” are read. As will be described later, step 125 is repeatedly executed except when the target page is set in step 124. For example, first, a record belonging to the 0th page is read, and then step 125 is executed. Records belonging to the first page are read. If a target page is set in step 124, a record belonging to the target page is read. As described above, in step 115 or step 121, a pointer (information on the storage position) to the record of the collocation in the search inversion file 32 is obtained for each collocation, and the search inversion is performed using this pointer. What is necessary is just to access the file 32 and read the record of the collocation. That is, it is not necessary to scan the entire search inverted file 32, and it is possible to directly access a necessary place of the search inverted file 32.
[0041]
Then, the degree of coincidence of each index number with respect to a group of consecutive words generated from the search keyword is obtained (step 127). FIG. 12 is a diagram for explaining the calculation of the degree of coincidence. That is, the number of appearances is counted for each index number for the record read from the search inverted file 32. In FIG. 15, a circle indicates that the index number is recorded in the record. In this example, each index “a”, “so”, “mountain” extracted from the search keyword “Asoyama” is shown for each index number. It is indicated that the

index numbers

0, 3, 8, 9, 13, and 15 are recorded in the record "." Then, the degree of coincidence is obtained for each index character by dividing the number of appearances by the number of collocations (3 in this example). In this example, the index number 13 is commonly included in each record of the collocation (the number of appearances is 3), and the degree of coincidence with the index number 13 is 100%.
[0042]
When the aggregation of the degree of coincidence is completed, it is determined whether or not the condition that (1) the search type is complete match or head match search and (2) the head of the search character string is kana is satisfied (step 128). If this condition is not satisfied, the process proceeds to step 130 as it is. If satisfied, the range of index numbers to be evaluated is obtained according to the first character of the search keyword (step 129). The process proceeds to step 130 so that only the index number is targeted. In this way, narrowing down the range of index numbers according to the first character is because the headword “Yama Aso” will also be hit against the search keyword “Asoyama” when focusing only on the matching degree of the index numbers. This is to prevent the occurrence of such search noise and reduce unnecessary access to the CD-ROM 20. If the range is limited by the first character “A”, “Ayama So” will be hit for the search keyword “Asoyama”, but hits such as “Yama Aso” can be prevented.
[0043]
In step 130, an index number having a matching degree equal to or greater than a threshold value x is obtained. Since x = 100% at step 116 for the matching search, only the index number with a matching degree of 100% is obtained. On the other hand, in the case of fuzzy search, an index number is obtained according to the threshold value x input in step 113. Then, based on the obtained index numbers, the index data file 30 in the CD-ROM 20 is referred to and headwords corresponding to those index numbers are obtained (step 131). At this time, the storage position of the explanatory text corresponding to these index numbers in the dictionary data main body storage unit 23 is also obtained.
[0044]
Subsequently, it is determined whether or not the search type is fuzzy search (step 132). If it is fuzzy search, the process proceeds to step 134 as it is. If it is not fuzzy search, that is, if it is a match search, the obtained headword Is matched with the search condition (step 133). If the search condition is met in step 133, the process proceeds to step 134, and if the search condition is not met, the process proceeds to step 135. Here, according to the procedure of the present embodiment, whether or not the search condition is met is determined by finding both “Asoyama” and “Asoyamaso” for the search keyword “Asoyama”. Because it is detected as a word, it is to eliminate the noise “Ayamaso”. In the case of an ambiguous search, there is a possibility that “Ayamaso” is also included in the search target intended by the user, so that it is not checked in step 133 whether the search condition is met. .
[0045]
If the search is ambiguous and if the search condition is met in step 133, the process proceeds to step 134. In step 134, the explanatory text corresponding to the corresponding index number is stored in the dictionary data body storage unit 23 of the CD-ROM 20. The headwords retrieved and retrieved and the corresponding explanatory text are displayed on the display unit 15, and the process proceeds to step 135. When accessing the dictionary data main body storage unit 23, the storage position information already obtained when the index data file 30 is accessed in step 131 is used.
[0046]
In step 135, it is determined whether or not all pages have been processed. If there are unprocessed pages, the process returns to step 126. If all pages have been processed, the input search is performed. The information retrieval process for the keyword is terminated. If the target page is determined in step 124, there is no unprocessed page, and the process is terminated as it is.
[0047]
【The invention's effect】
As described above, the present invention intends to extract from a search keyword a collocation word based on a different generation rule depending on the character type and search type of the search keyword. By changing the generation rule in this way, it is possible to execute a search with various search types at a high speed with few over-searches, and it is also possible to perform a search with a search keyword in which kanji and kana are arbitrarily mixed. There is an effect.
[0048]
For example, when the search keyword is composed of only kana characters in the match search, the over-detection in the match search of only the kana characters is suppressed by setting the collocation length to 2 characters, and in other cases By setting the collocation length to one character, it is possible to accurately perform an ambiguous search or the like.
[0049]
By adding a kana character string that corresponds to the reading of the item to each item in the file, it is possible to perform a character string search effectively even for search keywords that contain kanji and kana. Become.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating an information search system according to an embodiment of this invention.
FIG. 2 is a diagram showing an arrangement of data in a CD-ROM.
FIG. 3 is a diagram illustrating a relationship between various files used for information search processing;
FIG. 4 is a diagram illustrating an example of the contents of an index data file.
FIG. 5 is a diagram illustrating a learning process for generating various files from an index data file.
FIG. 6 is a diagram showing an example of the contents of a search instruction file.
FIG. 7 is a diagram illustrating an example of the contents of a search inverted file.
FIG. 8 is a flowchart showing an outline of information search processing in the information search system of FIG. 1;
FIG. 9 is a diagram showing an outline of a data flow during information search processing in the information search system of FIG. 1;
FIG. 10 is a flowchart showing a specific processing procedure of information search processing;
FIG. 11 is a flowchart showing a specific processing procedure of information search processing;
FIG. 12 is a diagram for explaining the calculation of the degree of coincidence.
FIG. 13 is a flowchart showing an example of a processing procedure of a conventional character string search method.
[Explanation of symbols]
10 Processing device
11 CD-ROM drive
12 Processing unit
13 File storage memory
14 Input section
15 Display section
16 Search part
17 collocation generator
18 Comparison part
20 CD-ROM
21 Processing program storage
22 Index file storage
23 Dictionary data body storage
30 Index data file
31 Instruction file for search
32 Inverted file for search
101-105, 111-135 steps

Claims

In a character string search method for searching for an item corresponding to the search keyword from a file based on an input search keyword and a specified search type,
Determine the search type entered by the user,
As for the character types of the characters constituting the search keyword input by the user, if the search keyword consists only of kana characters and the search type is a match search, a collocation with a character length of 2 is selected as the search keyword. Sequentially, otherwise, the collocation with a character length of 1 is extracted sequentially from the search keyword,
Comparing the character string of each item in the file with the group of collocations to calculate the degree of coincidence for the character string,
A method for searching for a character string, wherein an item whose matching degree is equal to or greater than a threshold value is set as a searched item.

The character string search method according to claim 1, wherein a kana character string corresponding to reading of the item is added to each item in the file.

In the character string search device for searching for an item corresponding to the search keyword from the file based on the input search keyword and the specified search type,
Input means for inputting the threshold and the search type and the search keyword,
For the character type of the character constituting the search keyword input by the user while determining the search type input by the user, when the search keyword consists only of kana characters and the search type is a match search, Collocation generating means for sequentially extracting collocations having a character length of 2 from the search keyword, and otherwise generating collocations having a character length of 1;
And a processing means for calculating the degree of coincidence with respect to the character string is compared with the previous Killen words and strings of each item in the file,
A character string search device characterized in that an item whose matching degree is equal to or greater than the input threshold value is a searched item.