JP2004318766A - Information retrieval device, program and storage medium

Info

Publication number: JP2004318766A
Application number: JP2003125402A
Authority: JP
Inventors: Fumihiro Hasegawa (長谷川史裕); Toshifumi Yamaai (山合敏文); Shinobu Yamamoto (山本忍); Toshio Miyazawa (宮澤利夫)
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2003-02-26
Filing date: 2003-04-30
Publication date: 2004-11-11
Anticipated expiration: 2023-04-30
Also published as: JP4278134B2

Abstract

PROBLEM TO BE SOLVED: To speed up information retrieval, regardless of the number of hits, in a device, a program, and a storage medium that use a component of a document as a retrieval key and retrieve, from information files, information having a component that matches the key.

SOLUTION: File type identifying means 8 identifies the type of an information file and, if it is a word processor document file, file type converting means 9 converts it to an image file and feeds it to component extracting means 2. The component extracting means 2 extracts components from the image file, adds tag information representing their attributes to create component data, and stores the data in component storing means 3. Component retrieving means 5 accesses the component storing means 3 with a retrieval key supplied by retrieval key acquiring means 4, retrieves the components that match the key, and passes their component data to arrival key information creating means 6. The arrival key information creating means 6 creates screen information that displays the components as retrieval result images and links each image to its source information file. Information file displaying means 7 interprets the screen information and displays either the retrieval result images together or, one at a time, the source information files linked from the images.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、文書を構成する構成要素を検索キーとして情報ファイルから検索キーに合致する構成要素を有する情報を検索する装置及びプログラム並びに記憶媒体に関する。
【０００２】
【従来の技術】
従来、文書画像をスキャナで読取り、読取った画像を蓄積することにより電子ファイルを作成することが行われている。このファイルから所望の文書画像を取り出すために、入力された文書画像から抽出される文書の外観的な特徴を表す情報を、入力された文書画像と関連付けて記憶しておき、文書の外観的な特徴を表す情報が検索キーとして入力されたとき、入力された検索キーと前記記憶された情報とを照合し、この照合結果に従い関連する文書画像を出力する。
【０００３】
前記照合結果の文書画像を出力したとき、各ページが全て同程度の表示濃度で表示されると、表示されたページ中から検索者が要求する特徴部分に合致する特徴部分を探し出すのに手間がかかるということがあった。これを回避するために、例えば「図形」を検索キーとし、所望の文書画像が得られたとき、文書中の図形領域のみが１００％濃度で表示され、他の領域は１０％濃度で表示して、検索者は、図形領域のみに集中してその内容を把握する文書検索装置が公知である（特許文献１参照）。
【０００４】
【特許文献１】特開平７−２８２０８６号公報（段落（００２３）、図５）
【０００５】
上述の文書検索装置によれば、文書検索の結果、検索対象となる文書の各ページが表示され、そのページ中で、検索された特徴領域が異なる濃度で表示されるので、ユーザーは、表示されているページが検索したいページであることを容易に判別することができる。従って、検索件数を少数に絞って検索を行うときは、出力された文書画像を１ページずつ視認することによって、所望のページを検索することが可能になる。
【０００６】
【発明が解決しようとする課題】
しかしながら、検索漏れのない検索を目標とする場合であって、検索件数がある程度多くなることを前提とする検索の場合には、所望のページを検索するまでに時間がかかり検索を効率的に行えないという問題があった。また、特徴部分以外の部分は、特徴部分と異なる濃度で表示されるため、特徴部分が検索語で検索される場合、検索語が文章中でどのような前後関係のもとに使用されているかを見極めようとすると、濃度が低下している文字行部分を読み取らなければならなくなる。しかし読み取り難いため、濃度を再調整したり元の文書を再表示させて読み直さなければならないという問題があった。
【０００７】
そこで本発明の第１の目的は、検索件数の多少に拘わらず情報検索を迅速に行うようにすることであり、第２の目的は、検索された特徴部分が元情報ファイルの中でどのように使用されているかの状況を、情報ファイル画像の表示濃度を再調整したり元情報ファイルを再表示することなく判るようにすることである。また第３の目的は、検索のための操作及び検索件数が多くなるときの検索処理を迅速に行えるようにすることである。
【０００８】
【課題を解決するための手段】
請求項１の発明は、情報ファイルから抽出された情報ファイルの構成要素に、該構成要素の属性を表現するタグ情報を付与して構成要素データを作成する手段と、タグ情報が付与された構成要素データを格納する手段と、格納された構成要素データを参照して検索キー情報に合致する構成要素を検索する手段と、検索された構成要素をタグ情報に基いて情報ファイルから切り出す手段と、切り出された構成要素から情報ファイルに結合可能にする手段と、を備えたことを特徴とする情報検索装置である。
【０００９】
請求項２の発明は、請求項１記載の情報検索装置において、前記検索された構成要素をタグ情報に基いて情報ファイルから切り出す手段は、検索件数が所定値を越えるとき、情報ファイルから切り出す構成要素の数を制限することを特徴とする情報検索装置である。
【００１０】
請求項３の発明は、請求項１又は２記載の情報検索装置において、前記切り出された構成要素を表示する画面情報を作成する手段を備えたことを特徴とする情報検索装置である。
【００１１】
請求項４の発明は、請求項２又は３記載の情報検索装置において、前記切り出された構成要素を情報ファイル上で識別可能にする画面情報を作成する手段を備えたことを特徴とする情報検索装置である。
【００１２】
請求項５の発明は、請求項３又は４記載の情報検索装置において、前記切り出された構成要素を表示する画面情報を作成する手段は、切り出された構成要素を含む周辺領域を表示する画面情報を作成することを特徴とする情報検索装置である。
【００１３】
請求項６の発明は、請求項５記載の情報検索装置において、表示された情報ファイルから切り出された構成要素を検索キーとして入力する手段を備えたことを特徴とする情報検索装置である。
【００１４】
請求項７の発明は、コンピュータを、情報ファイルから抽出された情報ファイルの構成要素に、該構成要素の属性を表現するタグ情報を付与して構成要素データを作成する手段、タグ情報が付与された構成要素データを格納手段に格納させる手段、格納された構成要素データを参照して検索キー情報に合致する構成要素を検索する手段、検索された構成要素をタグ情報に基いて情報ファイルから切り出す手段、切り出された構成要素から情報ファイルに結合する結合情報を作成する手段、切り出された構成要素を表示する画面情報を作成する手段、切り出された構成要素を情報ファイル上で識別可能にする画面情報を作成する手段、として機能させるためのプログラムである。
【００１５】
請求項８の発明は、請求項７記載のプログラムをコンピュータ読取り可能に記録した記憶媒体である。
【００１６】
【発明の実施の形態】
以下、本発明の実施形態を図面を参照して説明する。
初めに情報検索装置の構成について説明する。
図１は、本発明の一実施形態に係る情報検索装置の構成を示す図であり、本実施形態に係る情報検索装置は、情報ファイルを構成する構成要素の登録機能と検索機能を有する情報検索装置として構成される。
図１において、情報ファイル格納手段１は、ハードディスクなどの記憶手段であり、情報ファイルを保存する。ここで、情報ファイルとは、スキャナ等により文書を読み取ることにより得られた画像ファイル及びワードプロセッサ等により作成された文書ファイル（文字データ及び／又は図形データを含む）をいう。ファイル種識別手段８は、構成要素の登録が指示されたとき、情報ファイル格納手段１から情報ファイルを読み出し、読み出した情報ファイルが、スキャナ等により走査されて得られた画像ファイルか、ワードプロセッサ等により作成された文書ファイルかを識別し、ワードプロセッサ等により作成された文書ファイルの場合、これをファイル種変換手段９に渡す。ファイル種変換手段９は、受取った文書データのファイルを画像ファイルに変換し、構成要素抽出手段２に渡す。スキャナ等により読み取られた画像ファイルの場合は、受取られた画像ファイルがそのまま構成要素抽出手段２に渡される。
【００１７】
構成要素抽出手段２は、受取った画像ファイルから文書を構成する構成要素、例えばテキスト、図、表、それらの位置情報、言語種、罫線、背景色等を抽出し、抽出した構成要素に該構成要素の属性を表現するタグ情報を付与して構成要素データを作成し、構成要素格納手段３に渡す。構成要素格納手段３は、渡された構成要素データをページ単位の構成要素ファイルとして格納、登録する。なお、情報ファイルの構成要素とその登録手法については後述する。
【００１８】
検索キー取得手段４は、ユーザーが検索操作パネルから入力した検索キーを取得し、構成要素検索手段５に渡す。構成要素検索手段５は、構成要素格納手段３にアクセスし、検索キーに合致する構成要素を検索、取得する。そして検索した構成要素を到達キー情報作成手段６に渡す。
【００１９】
到達キー情報作成手段６は、渡された構成要素に基いてその構成要素が存在していた情報ファイル（以下、元情報ファイルという）に到達するための到達キー情報（ＨＴＭＬファイル）を作成する。なお、到達キー情報については後述する。情報ファイル表示手段７は、到達キー情報を解釈して検索結果の構成要素及び／又は元情報ファイルを表示する。
【００２０】
ここで、情報ファイルの構成要素の具体例を示す。
【００２１】
ア）文書を構成する言語の種類：異なる言語を対象とする複数の文字認識処理を行い、最も確からしさが高いものを抽出する。この処理で原稿が主に何語で記述されているかが判定できる。
【００２２】
イ）罫線、罫線の種類：非常に細長い連結成分を抽出すればこれは実線とみなす。やや細長く、並んで存在すれば点線とみなす。
【００２３】
ウ）周縁部に存在するノイズ領域：コピー機で複写原稿を作成する際、画像の周縁部が黒くなることがある。特に天板をあけて撮像すると、オリジナルの紙面の周りに黒い枠状の領域が発生する。これは、画像の周縁部のサイズの大きな連結成分として抽出し、これをノイズ領域とみなす。
【００２４】
エ）情報ファイルの発信元・送付先：情報ファイルとしてファックス画像を選んだ場合、画像中のファックスの発信元や送付先が記述されていることがある。これらは画像の端にあることが多いので、この部分の文字を抽出する。
【００２５】
オ）セパレーター：文書のコラムを区切る線状の要素である。罫線であることもあるし単に空白であることもある。罫線であれば前述した罫線の抽出方法で対処する。空白であれば、ある程度以上の長さや大きさを持つ白画素の連結成分として抽出する。
【００２６】
カ）仮想罫線：表は通常、表の要素を区切る罫線があるが、ないものも多い。その代わりに背景色が変化して表の要素を分離していることもある。この場合は背景色の変化が罫線の代わりをしている。この、実際には存在しないが表の要素を分割するものを仮想罫線と呼ぶことにする。仮想罫線の抽出は、背景色が急激に変化している部分を直線的につないで罫線とみなすことで抽出する。
【００２７】
キ）文書方向：画像を情報ファイルとして入力する際、撮像時に紙の置き方を変えると画像が９０度単位で回転したものが登録される。このように、文書が９０度単位でどちら側に向いているかを文書方向と呼ぶこととする。文書方向の求め方は、画像を９０度単位で回転させた上で文字認識処理を行い、最も確信度の高い結果を残した方向を、文書方向と定義すればよい。
【００２８】
ク）手書き文字：同一の文字行に対し、手書き用の文字認識処理と活字用の文字認識処理を施し、手書き用の結果がより確からしいと判断されたら手書き文字と認識できる。
【００２９】
ケ）追記された情報：ある印刷原稿に対し、印刷の色とは別の色で手書きでメモを書き入れ、この原稿をスキャナで画像入力する。この条件であれば色の違いを利用して追記された手書きメモ部分を抽出する。
【００３０】
コ）パンチ穴領域：パンチ穴のあいた紙原稿をスキャナで画像入力すると穴の部分が黒い丸い領域として画像再現される。したがって、これを抽出するには画像の端にある黒い丸を抽出すればよい。
【００３１】
サ）タイミングマーク：マークシートなどで利用される、位置合わせのための手がかりとなるマークをタイミングマークという。タイミングマークは独特の形状（塗りつぶしの正方形など）をしているので、これを手がかりに抽出できる。
【００３２】
シ）文字のフォント情報：文字とわかっている画像に対し、ストロークの変化が小さければゴシック、そうでなければ明朝と判断することでこれら２種類の区別をつけることができる。
【００３３】
ス）構成要素の相対的な位置関係：構成要素が存在する領域が記録されていれば、これをもとに各要素間の相対的な位置が定義できる。「表の左側にある写真」などの検索ができる。
【００３４】
この他にも、図、表、写真、文字列、タイトル、背景のドットパターンがある領域、背景に網掛けがある領域、及びこれらの位置等を構成要素として抽出することができる。
【００３５】
次に、情報ファイルの構成要素の登録について説明する。
本発明の実施形態は、抽出した構成要素にタグを付し、タグ付き構成要素データとして登録する。
【００３６】
タグとは、コンピュータのデータの一部に付けられた目印のことである。本実施形態では、タグの一例として、多くのページ記述言語で使用される”＜”や”＞”を使用するタグ形式を例に説明する。
【００３７】
次は、タグ付き構成要素データの例である。

【００３８】
このタグ付き構成要素データは、＜ｔｉｔｌｅ＞タグにより抽出された構成要素の元情報ファイルが「かくりつ」つまり確率というタイトルの文書ファイルであることを示し、＜ｉｍａｇｅ＞タグにより抽出された構成要素（この場合、「図」）の元情報ファイルへのリンクを示し、＜ｒｅｇｉｏｎ＞タグにより前記構成要素が元情報ファイル中の画像領域１に存在することを示し、＜ｋｉｎｄ＞タグにより構成要素の種類が図であることを示し、＜ａｒｅａ＞タグにより図の位置情報を示し、＜ｃｏｌｏｒ＞タグでその色を示す。同じようにして、画像領域２の位置に他の構成要素であるテキストデータが存在し、横書き、文字色、背景色、日本語等であることが示される。タグ付き構成要素データは、エディタにより自動的に作成する。
【００３９】
図２は、構成要素の登録処理のフロー図であり、図２を参照して説明すると、ユーザーは不図示の登録操作パネルから構成要素の登録を指示すると、ファイル種識別手段８は情報ファイル格納手段１から情報ファイルを取得する（Ｓ１）。そして取得した情報ファイルが、スキャナ等により走査されて得られた画像ファイルか、ワードプロセッサ等により作成された文書ファイルかを識別し、スキャナ等により読み取られた画像ファイルの場合は、取得した画像ファイルをそのまま構成要素抽出手段２に渡し、ワードプロセッサ等により作成された文書ファイルの場合、ファイル種変換手段９に渡す（Ｓ２）。ファイル種変換手段９は、受取った文書ファイルを画像ファイルに変換し、構成要素抽出手段２に渡す（Ｓ３）。構成要素抽出手段２は、渡された画像ファイルから構成要素の抽出を行い、抽出した構成要素にタグを付し、タグ付き構成要素データを作成する（Ｓ４）。そして、このタグ付き構成要素データを構成要素格納手段３に格納する（Ｓ５）。
【００４０】
タグ付き構成要素データを作成し登録しておくことにより、情報ファイルの検索や元情報ファイルから検索された構成要素の切り出しを容易に行うことができるようになる。また構成要素の図や写真を登録するとき、元情報ファイルから図形部分や写真部分を切り出して保管する必要がないので、格納領域を節約することができる。なお、タグ付き構成要素データは冗長性があるので、記憶容量に制限がある場合には圧縮して登録するなどの方策をとる。圧縮して登録したときは検索時に伸張して用いることになる。
【００４１】
更に、情報ファイルの検索手法について説明する。
本発明では、検索キーに相当する構成要素を検索し、検索された構成要素に基いて元情報ファイルへの結合要素を含むＨＴＭＬファイルを作成、このファイルにより検索した構成要素を表示し、表示された構成要素から元情報ファイルにリンクするようにする。
【００４２】
ユーザーは、情報検索をスタートさせると、検索キーの入力画面が表示されるので、入力画面に表示された検索キーをマウス操作のポインタによりポイントすることによって検索キーを入力する。
【００４３】
図３は、検索キーの入力画面を示す図であり、図中、検索キーとして、写真、図、検索語、文字色（赤、青、黄、緑、白、黒）、言語種（日、英、独、仏、伊、西）が表示されている。ユーザーは、これらをポイントすることにより検索キーを入力することができる。このとき、検索キーの表示マーク（白丸印）の表示色が反転するので、検索キーの入力を確認することができる。表示された検索キー以外の検索キーは、他の検索キー欄をポイントすることにより次ページ画面を表示させ、入力することができる。例えば具体的には、「画像の真ん中あたりの表の中の白い文字で行方向が縦で日本語」などの組合わせ検索キーを使用することが可能である。検索キーがポイントされると、検索キー取得手段４は、これを取得し構成要素検索手段５に渡す。
【００４４】
構成要素検索手段５は、構成要素格納手段３の構成要素ファイルにアクセスし、格納されている構成要素データを参照して渡された検索キーに合致する構成要素を検索する。なお、図１の情報検索装置の構成は、構成要素格納手段は１つだけ有しているが、例えば１つは所定の場所、他はＬＡＮやインターネットで接続された遠隔の場所、のように複数備えるようにしてもよい。このときは、各格納手段に対して順次検索を行う。
【００４５】
格納された構成要素ファイルの全てについて検索が終了したとき、検索された構成要素の構成要素データを到達キー情報作成手段６に渡す。
【００４６】
到達キー情報作成手段６は、渡された構成要素データに基いてＨＴＭＬファイルを作成する。
【００４７】
このときＨＴＭＬファイルは、ア）検索された構成要素を表示し、該構成要素から元情報ファイルへの結合を可能にする、イ）検索された構成要素を元情報ファイル上で区別可能（例えば点線枠で囲む）にする、ウ）検索された構成要素の元情報ファイルの識別番号を表示し、該識別番号から構成要素が抽出された元情報ファイルへの結合を可能にする、エ）検索された構成要素が存在する元情報ファイルの格納ファイルを識別可能にする、オ）検索された複数の構成要素を一覧表示する、カ）構成要素の識別情報を一覧表示する、等のように作成することができる。作成されたＨＴＭＬファイルは、情報ファイル表示手段７に渡される。
【００４８】
次は、検索された構成要素を表示し、該構成要素から元情報ファイルへの結合を可能にするＨＴＭＬファイルの例である。

【００４９】
このＨＴＭＬファイルは、＜ａｈｒｅｆ＝”００１．ｈｔｍｌ”＞で、００１．ＨＴＭＬへアクセス（リンク）できるようになる。この００１．ＨＴＭＬは、後述する図６に示される元情報ファイルを指す。＜ａ＞〜＜／ａ＞で囲まれた部分に書かれたものをクリックすることにより、ｈｒｅｆ＝に書かれた先にリンクできるようになる。また＜ｉｍｇｓｒｃ＝”００１．ｊｐｇ”＞で、検索キーである「図」を元情報ファイルから切り出し表示できるようになる。＜ａｈｒｅｆ＝”００１．ｈｔｍｌ”＞の後に書くことにより、表示された「図」（図５の図形１１）をクリックすることにより００１．ｈｔｍｌが開くようにすることができる。＜ｂｒ＞は改行タグである。
【００５０】
前記ＨＴＭＬファイルの作成は、構成要素ファイル単位即ちページ単位の構成要素の検索が終了する毎にＨＴＭＬ作成ソフトウエアにより自動的に作成される。
【００５１】
情報ファイル表示手段７は、取得したＨＴＭＬファイルをＷＷＷブラウザにより解釈し、検索された構成要素に係るウエブ形式の画面を構成、表示する。
【００５２】
なお、ＨＴＭＬファイルを作成する代わりにＸＭＬファイルを作成してもよい。
【００５３】
図４は、以上述べた検索処理のフロー図であり、図４を参照して検索キーを「図」とする場合の処理手順を説明する。ユーザーは図３の入力画面から検索キーとして「図」をポイントすると、ポイントされた「図」は、検索キー取得手段４に取得され（Ｓ１１）、そして構成要素検索手段５に渡される。構成要素検索手段５は、構成要素格納手段３にアクセスし、構成要素ファイルの１つを選択し（Ｓ１２）、構成要素ファイル中のタグ付き構成要素データを１つずつ参照する（Ｓ１３）。そして、この構成要素が検索キーと合致するか否かを判断する（Ｓ１４）。合致した場合（Ｓ１４，ＹＥＳ）、例えばタグ＜ａｒｅａ＞と＜／ａｒｅａ＞が付与された位置情報、タグ＜ｉｍａｇｅ＞と＜／ｉｍａｇｅ＞が付与された図等のタグ付き構成要素データを、一旦ＲＡＭにセーブする（Ｓ１５）。
【００５４】
到達キー情報作成手段６は、セーブされたタグ付き構成要素データに基いてＨＴＭＬファイルを作成する。ここで作成されるＨＴＭＬファイルは２種類あり、１つは、元情報ファイルから検索された図が切り取られて表示され、この図をクリックすることにより元情報ファイル全体が表示されるように作成されるＨＴＭＬファイル（以下、Ａファイルと略称）であり、他の１つは、元情報ファイル中で検索された図であることが区別できるように、図に例えば点線枠を付して表示するように作成されるＨＴＭＬファイル（以下、Ｂファイルと略称）である。
【００５５】
即ち、到達キー情報作成手段６は、まず元情報ファイル中の検索された図に、この図を囲む点線枠を重ね合わせて表示するようにするＢファイルを作成する。従って、Ｂファイルには、元情報ファイル名を指定して元情報ファイル全体を表示させる命令と検索された図を囲む点線枠を描画する命令を書くことになる（Ｓ１６）。
【００５６】
次に到達キー情報作成手段６は、検索された図を切り取り表示し、この図をクリックすると元情報ファイル全体が表示されるようにするＡファイルを作成する。従って、Ａファイルには、切り取る図名を指定して検索された図を表示する命令と、この図をクリックすることによりＢファイルを表示させる命令を書くことになる（Ｓ１７）。Ａファイル及びＢファイルが作成されると、到達キー情報作成手段６は、構成要素全てのチェックが終了したか否かを判断し（Ｓ１８）、全てのチェックが終了していないとき（Ｓ１８，ＮＯ）、ステップＳ１３にリターンし再度上述の処理を行う。全てのチェックが終了したとき（Ｓ１８，ＹＥＳ）、構成要素ファイル全てのチェックが終了したか否かを判断し（Ｓ１９）、終了していないとき（Ｓ１９，ＮＯ）、ステップＳ１２にリターンし、次ぎの構成要素ファイルを選択し、上述の処理を行う。構成要素ファイル全てのチェックが終了したとき（Ｓ１９，ＹＥＳ）、情報ファイル表示手段７にＡファイルを表示する（Ｓ２０）。
【００５７】
なお、Ａファイルは、検索キーに合致した条件を持つ構成要素があれば、複数の構成要素が貼り付けられることになるが、ファイル自体は１つだけ作成される。一方、Ｂファイルは、複数合致した場合、合致した構成要素の数だけのファイルが作成される。
【００５８】
図５は、検索結果の画面を示す図である。図５において、検索キーを図（ｆｉｇｕｒｅ）としたとき、検索された構成要素である４つの図１１，１２，１３，１４が貼り付けられたＡファイルによる表示画面を示す。これらの図は、リンク要素として構成されているので、マウスでクリックすることによりこの図を含む元情報ファイルを表示させることができる。
【００５９】
図６は、Ｂファイルにより、図５の「図」からリンクして表示される元情報ファイルの画面を示す図である。図５の「図」１１をクリックしたとき、その元情報ファイルと共にこのファイル中に存在する「図」１１を示す。２１は検索された「図」１１であることを区別するための点線枠である。
【００６０】
ここで点線枠２１は、点線枠とする代わりに、画像領域データ（座標値）に基いて、ア）点線枠に相当する位置に矢印や三角形を表示させる、イ）図の周囲を囲む別の図を表示する、ウ）これらを点滅させる、エ）元情報ファイルがカラー画像の場合、図だけをカラー表示し他を白黒表示する、等により区別するようにしてもよい。
【００６１】
図７は、検索された４つの図（図７（ａ））からスクロールにより順次表示される元情報ファイルの画面を示す図であり、図中、最初のクリックで図１１を含む元情報ファイルが表示され（図７（ｂ））、スクロール１で図１２を含む元情報ファイルが表示され（図７（ｃ））、スクロール２で図１３を含む元情報ファイルが表示される（図７（ｄ））。またスクロール３で図１４を含む元情報ファイルが表示される（図７（ｅ））。
【００６２】
本検索手法によれば、元情報ファイルへの到達キー情報をＨＴＭＬファイルにより作成し、検索された構成要素の図を表示し、その図からのリンクにより、その図が検索された図であることを示す点線枠を表示すると共に、点線枠で囲まれた図の元情報ファイルを表示するので、検索件数が多数の場合においても検索結果の情報ファイルへの到達が早くなる。
【００６３】
次に、検索語を検索キーとして、検索語を含む文字行を検索する場合について述べる。
【００６４】
この場合、図３で示した検索操作パネルの検索語入力欄に例えば「チェンジ」を入力し、検索開始欄をクリックする。このクリックに基いて、図４で説明した検索処理によりテキスト中の「チェンジ」が検索され、「チェンジ」を含む文字行が検索結果として表示される。
【００６５】
図８は、この検索結果の画面を示す図であり、図８（Ａ）は、チェンジを含む文字行「ルチェンジしました。」を表示している。しかしながら、ユーザーはこの検索結果を見ただけでは、チェンジという検索語がどのような文章の中で使用されているかは判らない。そこで、到達キー情報作成手段６は、検索した文字行とその前後の文字行を表示させ、この文字行をクリックするとこれら文字行を含む元情報ファイル全体が表示されるようにするＡファイルを作成する。従って、このＡファイルには、文字行名を指定して検索された文字行及びその前後の文字行を表示する命令と、これら文字行をクリックすることによりＢファイルを表示させる命令を書くことになる。
【００６６】
図８（Ｂ）は、検索結果を当該文字行とその前後の文字行と共に表示する例を示す図であり、ユーザーは、検索された文字行の前の文字行「い先月Ａ１２３がモデ」と後の文字行「赤いラインが好評」を続けて読むことによりチェンジの使用形態が判るようになる。従って、「チェンジ」をクリックして元情報ファイルを表示させるまでもなく必要とする情報ファイルを検索することができる。また必要に応じ元情報ファイルの表示も可能になる。
【００６７】
更に、検索結果の件数が多数になった場合の処理について述べる。
【００６８】
このような場合は、元情報ファイルから位置情報に相当する部分の構成要素を切り出すとき、ある程度の時間を要する。検索キーに合致した構成要素の数（検索件数）が多くなり、切り出す構成要素の数が多くなった場合には、検索要求を行ってから検索完了までの時間が長時間に及ぶことになる。
【００６９】
そこで、到達キー情報作成手段６が作成するＨＴＭＬファイルは、ア）構成要素の検索が終了した時点で、検索件数をカウントし、総検索件数が所定数以上に達したときは、警告を発して構成要素の切り出し処理を中止する。切り出し処理の中止を解除するときは、検索条件を追加して検索件数を減らすようにする。
【００７０】
また、イ）所定の検索件数を越えたとき、複数の構成要素の切り出し処理は行わず、１つの構成要素から元情報ファイルへのリンクを可能にするのみで多数の構成要素を表示しないようにする。
【００７１】
更に、ウ）切り出した構成要素を間引き処理する、或いはエ）貼り付ける構成要素のサイズを縮小する。更にまた、オ）所定数までは構成要素を表示し、所定数を越えたときは、元情報ファイルへのリンク情報のみにする。
【００７２】
更にまた、カ）構成要素を表示することなく、構成要素の元情報ファイルの識別情報、例えばドキュメント番号を表示させる。
【００７３】
図９は、検索結果をドキュメント番号で表示する例を示す図であり、図中、検索結果は、検索キーを表としたとき、検索結果が５８件であることを表示し、それらをドキュメント番号３１で表示している。このドキュメント番号もリンク要素として構成されているので、ドキュメント番号をクリックすることにより、検索結果の表を含む元情報ファイルにリンクして表示することができる。
【００７４】
ドキュメント番号を表示することにより、多数ヒットした場合、狭い範囲内に多数の情報を一覧で表示できるので、表を表示するのに比較して視認性が低下することはない。
【００７５】
更に次に、情報ファイルの検索したい領域をドラッグすることによって検索キーを入力する手法について述べる。
【００７６】
図１０は、ドラッグ入力により検索を行う情報検索装置の構成を示す図であり、図中、位置情報取得手段１０は、情報ファイル画面４１の所定の領域がドラッグされたとき、その領域の位置情報（座標値）を取得し、該位置情報を構成要素抽出手段２に渡す。
【００７７】
構成要素抽出手段２は、取得された位置情報で規定される領域内の情報を取得する。図１０の例では、「さいころ」という文字情報がドラッグされ、ユーザーは文字コードがほしいので、ドラッグにより取得された情報を不図示の文字認識手段により認識処理を施し、「さいころ」という文字を取得する。取得した「さいころ」は検索キーとして検索キー取得手段４に渡される。以後の検索処理は、図４の処理フローで説明した処理と同じである。
【００７８】
ドラッグにより取得する検索キーは、検索語に限定されることなく、文字色、背景色などであってもよい。また、文字領域以外をドラッグすることにより写真、図、表等を検索キーとすることもできる。
【００７９】
本入力手法によれば、ユーザーは、さいころという文字をキー操作により１文字ずつ入力する必要がないので、入力ミスがなくなり、また入力が容易に行える。
【００８０】
続いて、本発明の他の実施形態に係る情報検索システムについて説明する。
図１１は、本発明の他の実施形態に係る情報検索システムの構成を示す図であり、図中、図１の構成部品と同じ参照番号が付された構成部品は図１の構成部品と同じ動作を行う。
【００８１】
図１１において、サーバコンピュータ２０とクライアントコンピュータ３０は、ＬＡＮ、インターネット等の電気通信回線４０を介して接続されている。
【００８２】
サーバコンピュータ２０は、情報ファイル格納手段１、構成要素格納手段３、構成要素検索手段５、到達キー情報作成手段６を備える。またサーバコンピュータ２０は、必要に応じ構成要素抽出手段（図１の構成要素抽出手段２と同じもの）を備え、クライアントコンピュータ３０からの指示により、情報ファイル格納手段１から情報ファイルを読み出し、構成要素を抽出し、タグ情報を付与して構成要素格納手段３に格納する。クライアントコンピュータ３０は、情報ファイル表示手段（図１の情報ファイル表示手段７と同じもの）を備える。
【００８３】
ユーザーは、情報検索を行うとき、クライアントコンピュータ３０から検索キーを電気通信回線４０を介してサーバコンピュータ２０に送信する。検索キーを受信したサーバコンピュータ２０の構成要素検索手段５は、構成要素格納手段３にアクセスして検索キーに合致する構成要素を検索する。そして合致した構成要素を到達キー情報作成手段６に渡す。到達キー情報作成手段６は、上述のＨＴＭＬファイル（Ａファイル、Ｂファイル）を作成する。サーバコンピュータ２０は、検索結果として、作成されたＨＴＭＬファイルを電気通信回線４０を介してクライアントコンピュータ３０に送信する。クライアントコンピュータ３０の情報ファイル表示手段は、ＷＷＷブラウザでＨＴＭＬファイルを解釈して検索された構成要素及び／又はその元情報ファイルをウエブ形式の画面でディスプレイに表示する。
【００８４】
本実施形態によれば、クライアントコンピュータは汎用のパーソナルコンピュータを使用し、ＷＷＷブラウザを搭載するだけで、情報検索を行うことができる。
【００８５】
前記実施形態では、ワードプロセッサ等により作成された文書データは、画像データに変換し、この画像データに基いて登録及び検索を行う処理について記載しているが、情報検索装置の構成を変更することによりワードプロセッサ等により作成された文書データを画像データに変換することなく構成要素の登録及び情報検索を行うことができる。
【００８６】
以上、本発明の実施形態に係る情報ファイルの構成要素の登録手法及び検索手法について説明したが、これらの手法をコンピュータにおいて実行させるために、プログラム化し、このプログラムをＣＤ−ＲＯＭ、ＤＶＤ−ＲＯＭ、ＭＯ等の任意の記録媒体に記録し、これをコンピュータに読み取らせることで情報検索装置を構成する。これにより任意のコンピュータを容易に情報検索装置として機能させることができる。
【００８７】
【発明の効果】
請求項１，２，３，４に対応する効果：検索された構成要素をタグ情報に基いて切り出し、切り出された構成要素から該構成要素の情報ファイルに結合し、また検索された構成要素を情報ファイルの表示画面上で特定するので、検索された結果の件数が多くなる場合においても情報ファイルの検索を迅速に行うことができる。
請求項５に対応する効果：検索された構成要素をその周辺領域まで広めて表示するので、元情報ファイルを再表示することなく検索結果の内容を把握することができる。
請求項６に対応する効果：検索キーの入力が容易になり検索を迅速に行うことができる。
請求項７，８に対応する効果：任意のコンピュータを容易に情報検索装置として使用することができる。
【図面の簡単な説明】
【図１】本発明の一実施形態に係る情報検索装置の構成を示す図である。
【図２】構成要素の登録処理のフロー図である。
【図３】検索キーの入力画面を示す図である。
【図４】情報ファイルの検索処理のフロー図である。
【図５】検索結果の図を表示する画面を示す図である。
【図６】検索結果の図とリンクして表示される元情報ファイルの画面を示す図である。
【図７】検索結果の図からスクロールにより順次表示される元情報ファイルの画面を示す図である。
【図８】検索語による検索結果を表示する画面を示す図である。
【図９】検索結果をドキュメント番号で表示する画面を示す図である。
【図１０】ドラッグにより検索キー入力を行う情報検索装置の構成を示す図である。
【図１１】本発明の他の実施形態に係る情報検索システムの構成を示す図である。
【符号の説明】
１…情報ファイル格納手段２…構成要素抽出手段
３…構成要素格納手段４…検索キー取得手段
５…構成要素検索手段６…到達キー情報作成手段
７…情報ファイル表示手段８…ファイル種識別手段
９…ファイル種変換手段１０…位置情報取得手段。

[0001]
[Technical Field of the Invention]
The present invention relates to an apparatus, a program, and a storage medium that use a component of a document as a search key and retrieve, from information files, information having a component that matches the search key.
[0002]
[Prior art]
Conventionally, electronic files have been created by reading document images with a scanner and storing the read images. To retrieve a desired document image from such a file, information representing the visual features of the document, extracted from the input document image, is stored in association with that image. When information representing a visual feature of a document is input as a search key, the search key is collated with the stored information, and the related document image is output according to the collation result.
[0003]
When the document images resulting from the collation are output, if every page is displayed at roughly the same display density, it takes the searcher time and effort to find, within the displayed pages, the characteristic part that matches the one requested. To avoid this, a document search device is known in which, for example, when "figure" is used as the search key and the desired document image is obtained, only the figure region in the document is displayed at 100% density and the other regions at 10% density, so that the searcher can concentrate on the figure region alone and grasp its content (see Patent Document 1).
[0004]
[Patent Document 1] Japanese Patent Application Laid-Open No. Hei 7-282086 (paragraph [0023], FIG. 5)
[0005]
According to the above-described document search device, each page of the documents found by the search is displayed, and within each page the retrieved characteristic region is displayed at a different density, so the user can easily tell that the displayed page is the page being sought. Therefore, when the search is narrowed to a small number of hits, the desired page can be found by viewing the output document images one page at a time.
[0006]
[Problems to be solved by the invention]
However, when the goal is a search without omissions, so that a fairly large number of hits must be expected, it takes time to find the desired page and the search cannot be performed efficiently. In addition, parts other than the characteristic part are displayed at a density different from that of the characteristic part. When the characteristic part is retrieved by a search word and the user wants to see the context in which that word is used in the text, the user has to read character lines whose density has been reduced. Because these lines are hard to read, the user must readjust the density or redisplay the original document and read it again.
[0007]
Accordingly, a first object of the present invention is to perform information retrieval quickly regardless of the number of hits. A second object is to let the user see how a retrieved characteristic part is used within the original information file without readjusting the display density of the information file image or redisplaying the original information file. A third object is to speed up the search operation and the search processing when the number of hits becomes large.
[0008]
[Means for Solving the Problems]
According to a first aspect of the present invention, there is provided an information retrieval apparatus comprising: means for creating component data by adding, to a component of an information file extracted from the information file, tag information expressing an attribute of the component; means for storing the component data to which the tag information has been added; means for searching for a component that matches search key information by referring to the stored component data; means for cutting out the retrieved component from the information file based on the tag information; and means for enabling a link from the cut-out component to the information file.
[0009]
According to a second aspect of the present invention, in the information retrieval apparatus according to the first aspect, the means for cutting out the retrieved component from the information file based on the tag information limits the number of components cut out from the information file when the number of hits exceeds a predetermined value.
[0010]
According to a third aspect of the present invention, there is provided the information retrieval apparatus according to the first or second aspect, further comprising means for creating screen information for displaying the cut-out component.
[0011]
According to a fourth aspect of the present invention, the information retrieval apparatus according to the second or third aspect further comprises means for creating screen information that makes the cut-out component identifiable on the information file.
[0012]
According to a fifth aspect of the present invention, in the information retrieval apparatus according to the third or fourth aspect, the means for creating screen information for displaying the cut-out component creates screen information for displaying a peripheral area that includes the cut-out component.
[0013]
According to a sixth aspect of the present invention, there is provided the information retrieval apparatus according to the fifth aspect, further comprising means for inputting a component cut out from the displayed information file as a retrieval key.
[0014]
According to a seventh aspect of the present invention, there is provided a program for causing a computer to function as: means for creating component data by adding, to a component of an information file extracted from the information file, tag information expressing an attribute of the component; means for causing storage means to store the component data to which the tag information has been added; means for searching for a component that matches search key information by referring to the stored component data; means for cutting out the retrieved component from the information file based on the tag information; means for creating link information that links the cut-out component to the information file; means for creating screen information for displaying the cut-out component; and means for creating screen information that makes the cut-out component identifiable on the information file.
[0015]
According to an eighth aspect of the present invention, there is provided a computer-readable storage medium storing the program according to the seventh aspect.
[0016]
[Embodiments of the Invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
First, the configuration of the information search device will be described.
FIG. 1 is a diagram showing the configuration of an information search device according to an embodiment of the present invention. The device of this embodiment is configured as an information search device having a function for registering the components that make up an information file and a function for searching them.
In FIG. 1, the information file storage means 1 is a storage device such as a hard disk and stores information files. Here, an information file is either an image file obtained by reading a document with a scanner or the like, or a document file (containing character data and/or graphic data) created with a word processor or the like. When registration of components is instructed, the file type identification means 8 reads an information file from the information file storage means 1 and identifies whether it is an image file obtained by scanning or a document file created with a word processor or the like. If it is a document file, the file is passed to the file type conversion means 9, which converts the received document file into an image file and passes it to the component extraction means 2. If it is an image file read by a scanner or the like, it is passed to the component extraction means 2 as it is.
[0017]
The component extraction means 2 extracts, from the received image file, the components that make up the document, for example text, figures, tables, their position information, language type, ruled lines, and background color. It adds to each extracted component tag information expressing the attributes of that component to create component data, and passes the data to the component storage means 3. The component storage means 3 stores and registers the received component data as a component file for each page. The components of an information file and the method of registering them are described later.
[0018]
The search key obtaining means 4 obtains the search key input by the user from the search operation panel, and passes it to the component search means 5. The component search means 5 accesses the component storage means 3 to search for and obtain a component that matches the search key. Then, the retrieved component is passed to the arrival key information creating means 6.
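The matching step performed by the component search means can be pictured roughly as follows. This is only a minimal sketch: the `Component` class, the flat attribute dictionary, and the exact-match rule are assumptions made for illustration, not the structure the description specifies.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One stored component, reduced to a flat attribute dictionary (assumed layout)."""
    source_file: str                      # link to the original information file
    attrs: dict = field(default_factory=dict)


def search_components(store, key):
    """Return the components whose attributes contain every key/value pair in `key`."""
    return [c for c in store
            if all(c.attrs.get(k) == v for k, v in key.items())]


# Hypothetical stored components standing in for the component storage means.
store = [
    Component("001.jpg", {"kind": "figure", "color": "black"}),
    Component("001.jpg", {"kind": "text", "color": "white", "language": "ja"}),
    Component("002.jpg", {"kind": "figure", "color": "red"}),
]

hits = search_components(store, {"kind": "figure"})
print([c.source_file for c in hits])
```

A combined key such as `{"kind": "text", "language": "ja"}` narrows the result in the same way, which mirrors the idea of entering several search keys at once.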
[0019]
The arrival key information creating means 6 creates arrival key information (HTML file) for reaching the information file (hereinafter, referred to as the original information file) in which the component existed based on the passed component. Note that the arrival key information will be described later. The information file display means 7 interprets the arrival key information and displays the components of the search result and / or the original information file.
[0020]
Here, specific examples of the components of the information file will be described.
[0021]
A) Kind of language constituting the document: A plurality of character recognition processes for different languages are performed, and the one with the highest probability is extracted. In this process, it is possible to determine in what language the original is mainly described.
[0022]
B) Ruled line, type of ruled line: A very elongated connected component is regarded as a solid line. Somewhat elongated components lined up in a row are regarded as a dotted line.
[0023]
C) Noise area at the periphery: When a copy is made on a copier, the periphery of the image may turn black. In particular, if the image is captured with the platen cover open, a black frame-like region appears around the original sheet. This is extracted as a large connected component at the periphery of the image and treated as a noise region.
[0024]
D) Sender and recipient of the information file: When a fax image is chosen as the information file, the sender and recipient of the fax may be written in the image. These usually appear at the edge of the image, so the characters in that part are extracted.
[0025]
E) Separator: A linear element that separates document columns. It can be a ruled line or just blank. If it is a ruled line, the above-described ruled line extraction method is used. If it is blank, it is extracted as a connected component of white pixels having a certain length or size.
[0026]
F) Virtual ruled lines: Tables usually have ruled lines separating their cells, but many do not. Instead, a change in background color may separate the cells, taking the place of a ruled line. Such a boundary, which does not physically exist but divides the cells of the table, is called a virtual ruled line. Virtual ruled lines are extracted by connecting, in a straight line, the points where the background color changes abruptly and treating the result as a ruled line.
[0027]
G) Document direction: When an image is input as an information file, placing the paper differently at capture time results in an image rotated in units of 90 degrees being registered. The side toward which the document faces, in 90-degree units, is called the document direction. To determine it, character recognition is performed on the image rotated in 90-degree steps, and the direction that yields the result with the highest confidence is taken as the document direction.
[0028]
H) Handwritten characters: The same character line is subjected to character recognition for handwriting and character recognition for printed type. If the handwriting result is judged more reliable, the line is recognized as handwritten.
[0029]
I) Added information: Suppose a memo is handwritten on a printed original in a color different from the print color, and the original is then input as an image with a scanner. Under this condition, the added handwritten memo can be extracted using the difference in color.
[0030]
J) Punch hole area: When a paper original with punch holes is input as an image by a scanner, each hole is reproduced as a black round region. The holes can therefore be extracted by finding black circles at the edge of the image.
[0031]
K) Timing mark: A mark that serves as a positioning cue on mark sheets and the like is called a timing mark. Because timing marks have a distinctive shape (such as a filled square), they can be extracted using this shape as a clue.
[0032]
L) Font information of characters: For an image known to contain characters, the font is judged to be Gothic if the variation in the strokes is small and Mincho otherwise, which distinguishes these two types.
[0033]
M) Relative positional relationship between components: If the region in which each component lies is recorded, the relative positions of the components can be defined from it. This enables searches such as "a photograph to the left of a table".
[0034]
In addition, a figure, a table, a photograph, a character string, a title, a region having a dot pattern of the background, a region having a hatched background, and positions thereof can be extracted as constituent elements.
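Two of the heuristics listed above lend themselves to a short sketch: choosing the document direction from recognition confidence, and testing a relative position such as "a photograph to the left of a table". The `recognize` callback, the bounding-box layout, and the confidence values below are hypothetical; the description does not say how confidence is computed.

```python
from typing import Callable, Tuple

Box = Tuple[int, int, int, int]  # assumed layout: (left, top, right, bottom) in pixels


def pick_document_direction(recognize: Callable[[int], float]) -> int:
    """Return the rotation (0, 90, 180 or 270) with the highest recognition confidence.

    `recognize(angle)` is a hypothetical callback that runs character
    recognition on the image rotated by `angle` and returns a confidence.
    """
    return max((0, 90, 180, 270), key=recognize)


def is_left_of(a: Box, b: Box) -> bool:
    """True if region `a` lies entirely to the left of region `b`."""
    return a[2] <= b[0]


# Made-up confidences standing in for real recognition results.
fake_confidence = {0: 0.41, 90: 0.93, 180: 0.38, 270: 0.52}
direction = pick_document_direction(fake_confidence.__getitem__)
print(direction)

photo, table = (10, 50, 200, 300), (220, 40, 600, 320)
print(is_left_of(photo, table))
```

The same "run several recognizers and keep the most confident result" pattern also fits the language-type and handwriting items in the list.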
[0035]
Next, registration of components of the information file will be described.
In the embodiment of the present invention, a tag is attached to the extracted component, and the extracted component is registered as tagged component data.
[0036]
A tag is a mark attached to a part of computer data. In the present embodiment, a tag format using "<" or ">" used in many page description languages will be described as an example of a tag.
[0037]
The following is an example of tagged component data.

[0038]
In this tagged component data, the <title> tag indicates that the original information file of the extracted component is a document file titled "kakuritsu" (probability). The <image> tag gives the link to the original information file of the extracted component (in this case, a figure). The <region> tag indicates that the component lies in image region 1 of the original information file, the <kind> tag indicates that the component is a figure, the <area> tag gives the position information of the figure, and the <color> tag gives its color. In the same way, the data indicates that text data, another component, lies at the position of image region 2, along with its horizontal writing direction, character color, background color, language (Japanese), and so on. Tagged component data is created automatically by an editor.
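The example data itself is not reproduced in this text, so the following is a hypothetical reconstruction only. The tag names `title`, `image`, `region`, `kind`, `area`, and `color` come from the description; the nesting, the `document` wrapper, the `id` attribute, the `direction` and `language` tags, and every value are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged component data; structure and values are illustrative.
TAGGED = """
<document>
  <title>kakuritsu</title>
  <image>001.jpg</image>
  <region id="1">
    <kind>figure</kind>
    <area>120,80,420,360</area>
    <color>black</color>
  </region>
  <region id="2">
    <kind>text</kind>
    <area>60,400,540,700</area>
    <color>black</color>
    <direction>horizontal</direction>
    <language>ja</language>
  </region>
</document>
"""


def parse_components(xml_text):
    """Turn each <region> into a dictionary of its tagged attributes."""
    root = ET.fromstring(xml_text)
    source = root.findtext("image")
    components = []
    for region in root.findall("region"):
        attrs = {child.tag: child.text for child in region}
        # Position information as (left, top, right, bottom); the format is assumed.
        attrs["area"] = tuple(int(v) for v in attrs["area"].split(","))
        attrs["region"] = int(region.get("id"))
        attrs["source"] = source
        components.append(attrs)
    return components


components = parse_components(TAGGED)
for c in components:
    print(c["region"], c["kind"], c["area"])
```

Keeping the position in the tag data is what allows a component to be cut out of the original file later without storing a separate copy of the figure.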
[0039]
FIG. 2 is a flowchart of the component registration process. Referring to FIG. 2, when the user instructs component registration from a registration operation panel (not shown), the file type identification means 8 obtains an information file from the information file storage means 1 (S1). It then identifies whether the acquired information file is an image file obtained by scanning with a scanner or the like, or a document file created by a word processor or the like. An image file read by a scanner or the like is passed as it is to the component extraction means 2, whereas a document file created by a word processor or the like is passed to the file type conversion means 9 (S2). The file type conversion means 9 converts the received document file into an image file and passes it to the component extraction means 2 (S3). The component extraction means 2 extracts components from the image file it receives, attaches tags to the extracted components, and creates tagged component data (S4). This tagged component data is then stored in the component storage means 3 (S5).
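The registration flow S1 to S5 can be sketched as below. The function names, the suffix-based file type check, and the stand-in conversion and extraction routines are hypothetical; the specification does not say how the file type is identified or how components are extracted.

```python
from pathlib import Path

# Assumption: scanned images are recognised by file suffix.
IMAGE_SUFFIXES = {".jpg", ".png", ".tif"}


def identify_file_type(path: str) -> str:
    """S2: decide whether the information file is an image or a document."""
    return "image" if Path(path).suffix.lower() in IMAGE_SUFFIXES else "document"


def convert_to_image(path: str) -> str:
    """S3 stand-in: pretend to render the document and return the image name."""
    return str(Path(path).with_suffix(".png"))


def extract_components(image_path: str) -> list:
    """S4 stand-in: return tagged component data for one image file."""
    return [{"image": image_path, "kind": "figure", "area": (0, 0, 100, 100)}]


def register(path: str, store: list) -> list:
    """Run S2 to S5 for one information file obtained in S1."""
    if identify_file_type(path) == "image":
        image_path = path                      # passed on as it is
    else:
        image_path = convert_to_image(path)    # via file type conversion
    components = extract_components(image_path)
    store.extend(components)                   # S5: store tagged data
    return components


store = []
register("scan_001.tif", store)
register("report.doc", store)
```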
[0040]
Creating and registering tagged component data makes it easy to search the information file and to cut out a retrieved component from the original information file. In addition, when a figure or a photograph is registered as a component, the figure or photograph portion need not be cut out of the original information file and stored separately, which saves storage space. Because tagged component data is redundant, measures such as compressing it before registration are taken when storage capacity is limited. Data registered in compressed form is expanded at retrieval time before use.
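The compress-on-registration, expand-on-retrieval idea can be illustrated with a short sketch. The specification does not name a compression method; `zlib` from the Python standard library is used here only as a convenient stand-in, and the dictionary-based store and function names are hypothetical.

```python
import zlib


def register_compressed(store: dict, key: str, tagged_data: str) -> None:
    """Compress tagged component data before registering it."""
    store[key] = zlib.compress(tagged_data.encode("utf-8"))


def load_for_search(store: dict, key: str) -> str:
    """Expand registered data at retrieval time."""
    return zlib.decompress(store[key]).decode("utf-8")


store = {}
# Repetitive tag text, standing in for redundant tagged component data.
data = "<kind>figure</kind><area>40,120,360,420</area>" * 50
register_compressed(store, "001", data)
restored = load_for_search(store, "001")
```

Because the tag names repeat in every record, data of this kind tends to compress well, which is the redundancy the paragraph above refers to.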
[0041]
Next, the information file search technique will be described.
According to the present invention, a component matching the search key is retrieved, an HTML file containing the retrieved component linked to its original information file is created, the retrieved component is displayed by means of that file, and a link is provided from the component to the original information file.
[0042]
When the user starts an information search, a search key input screen is displayed. The user inputs a search key by pointing at a search key displayed on the input screen with the mouse pointer.
[0043]
FIG. 3 is a diagram showing the search key input screen. The screen displays photograph, figure, search word, character colors (red, blue, yellow, green, white, black), and language types (Japanese, English, German, French, Italian, Spanish), and the user inputs a search key by pointing at one of these. The display color of the mark (white circle) beside the selected search key is then inverted, so that the input can be confirmed. Search keys other than those displayed can be input by pointing at the field for other search keys to display the next page. A combined search key can also be used, for example "white characters in a table near the center of the image, with vertical line direction, in Japanese". When a search key is pointed at, the search key acquisition means 4 acquires it and passes it to the component search means 5.
[0044]
The component search means 5 accesses the component files in the component storage means 3 and, referring to the stored component data, searches for components that match the search key it was given. Although the information retrieval apparatus of FIG. 1 has only one component storage means, a plurality may be provided, for example one at a predetermined location and another at a remote location connected via a LAN or the Internet. In that case, each storage means is searched in turn.
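A combined search key and the sequential search over several storage means might look like the sketch below. Representing a component and a search key as dictionaries, and the attribute names `direction`, `language`, and `file`, are assumptions for illustration only.

```python
def matches(component: dict, search_key: dict) -> bool:
    """A component matches when every condition in the combined key holds."""
    return all(component.get(name) == value for name, value in search_key.items())


def search_all(storages, search_key: dict) -> list:
    """Search each component storage in turn and collect the matches."""
    hits = []
    for storage in storages:
        for component in storage:
            if matches(component, search_key):
                hits.append(component)
    return hits


# Hypothetical contents of a local and a remote component storage.
local = [
    {"kind": "text", "color": "white", "direction": "vertical",
     "language": "Japanese", "file": "001"},
    {"kind": "figure", "file": "002"},
]
remote = [
    {"kind": "text", "color": "white", "direction": "horizontal",
     "language": "English", "file": "101"},
]

key = {"kind": "text", "color": "white",
       "direction": "vertical", "language": "Japanese"}
hits = search_all([local, remote], key)
```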
[0045]
When the search is completed for all of the stored component files, the component data of the searched component is passed to the arrival key information creating means 6.
[0046]
The arrival key information creating means 6 creates an HTML file based on the passed component data.
[0047]
At this time, the HTML file can: a) display the retrieved components and allow each component to be linked to its original information file; b) make the retrieved components distinguishable on the original information file (for example, by a dotted line); c) display the identification number of the original information file of each retrieved component and allow a link from that identification number to the original information file from which the component was extracted; d) indicate the stored file of the original information file in which the retrieved component exists; e) list a plurality of retrieved components; and f) list the identification information of the components. The created HTML file is passed to the information file display means 7.
[0048]
The following is an example of an HTML file that displays retrieved components and allows the components to be linked to the original information file.

[0049]
In this HTML file, <a href=“001.html”> makes it possible to access (link to) 001.html. This 001.html represents the original information file shown in FIG. 6, described later. Clicking on whatever is written between <a> and </a> links to the destination written after href=. In addition, <img src=“001.jpg”> allows the figure matching the search key to be cut out from the original information file and displayed. Because it is written after <a href=“001.html”>, clicking the displayed “figure” (figure 11 in FIG. 5) opens 001.html. <br> is a line-break tag.
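A generator for the kind of HTML described here can be sketched in a few lines. Only the <a href>, <img src>, and <br> elements and the file names 001.html and 001.jpg come from the text; the function names and the surrounding <html> and <body> wrapper are assumptions, and this is not the HTML creation software of the specification.

```python
from html import escape


def result_entry(link_target: str, image_src: str) -> str:
    """One clickable cut-out image followed by a line break."""
    href = escape(link_target, quote=True)
    src = escape(image_src, quote=True)
    return f'<a href="{href}"><img src="{src}"></a><br>'


def build_result_page(entries) -> str:
    """Assemble a page that lists every retrieved component."""
    body = "\n".join(result_entry(target, src) for target, src in entries)
    return f"<html>\n<body>\n{body}\n</body>\n</html>"


page = build_result_page([("001.html", "001.jpg")])
```

Placing the <img> element inside the <a> element is what makes the displayed figure itself the clickable link.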
[0050]
The HTML file is created automatically by HTML creation software each time the search of components is completed for one component file, that is, for one page.
[0051]
The information file display means 7 interprets the acquired HTML file with a WWW browser and composes and displays a web-format screen for the retrieved component.
[0052]
Note that an XML file may be created instead of creating an HTML file.
[0053]
FIG. 4 is a flowchart of the search processing described above; the processing procedure when the search key is “figure” will be described with reference to it. When the user points at “figure” as the search key on the input screen of FIG. 3, the search key acquisition means 4 acquires it (S11) and passes it to the component search means 5. The component search means 5 accesses the component storage means 3, selects one of the component files (S12), and refers to the tagged component data in that component file one item at a time (S13). It then determines whether the component matches the search key (S14). If it matches (S14, YES), the tagged component data, for example the position information enclosed by the <area> and </area> tags and the figure enclosed by the <image> and </image> tags, is temporarily saved in RAM (S15).
[0054]
The arrival key information creating means 6 creates HTML files based on the saved tagged component data. Two types of HTML file are created here. One (hereinafter, the A file) cuts out the retrieved figure from the original information file and displays it, and displays the entire original information file when the figure is clicked. The other (hereinafter, the B file) displays the original information file with the retrieved figure marked, for example by a dotted frame, so that it can be distinguished.
[0055]
That is, the arrival key information creating means 6 first creates a B file that displays the original information file with a dotted frame overlaid around the retrieved figure. The B file therefore contains an instruction that specifies the original information file name in order to display the entire file, and an instruction to draw a dotted frame surrounding the retrieved figure (S16).
[0056]
Next, the arrival key information creating means 6 creates the A file, which cuts out and displays the retrieved figure and displays the entire original information file when the figure is clicked. The A file therefore contains an instruction that designates the figure to be cut out and displays it, and an instruction to display the B file when the figure is clicked (S17). When the A file and the B file have been created, the arrival key information creating means 6 determines whether all the components have been checked (S18). If not (S18, NO), the process returns to step S13 and the above processing is repeated. If all have been checked (S18, YES), it is determined whether all the component files have been checked (S19). If not (S19, NO), the process returns to step S12, the next component file is selected, and the above processing is performed. When all the component files have been checked (S19, YES), the A file is displayed on the information file display means 7 (S20).
[0057]
Note that only one A file is created: if several components match the search key, all of them are pasted into it. B files, by contrast, are created one per matching component.
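The loop of FIG. 4 and the one-A-file, many-B-files rule can be sketched as follows. The data layout, the B file naming scheme, and the dictionaries standing in for HTML instructions are hypothetical; the sketch only shows the control flow of steps S12 to S19.

```python
def create_result_files(component_files: dict, search_key: str):
    """Collect entries for a single A file and one B file per matching component."""
    a_entries = []   # everything here ends up in the one A file
    b_files = {}     # one B file per matching component
    for file_id, components in component_files.items():       # S12 / S19
        for index, component in enumerate(components):        # S13 / S18
            if component["kind"] != search_key:                # S14
                continue
            b_name = f"{file_id}_{index}_b.html"               # assumed naming
            # S16: show the whole original file with a frame around the hit.
            b_files[b_name] = {"source": file_id, "frame": component["area"]}
            # S17: show the cut-out figure and link it to the B file.
            a_entries.append({"cut": (file_id, component["area"]), "link": b_name})
    return a_entries, b_files


component_files = {
    "001": [{"kind": "figure", "area": (40, 120, 360, 420)},
            {"kind": "text", "area": (40, 450, 560, 700)}],
    "002": [{"kind": "figure", "area": (10, 10, 200, 200)},
            {"kind": "figure", "area": (220, 10, 400, 200)}],
}
a_entries, b_files = create_result_files(component_files, "figure")
```

With three matching figures, the single A file receives three entries and three separate B files are produced.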
[0058]
FIG. 5 is a diagram showing a search result screen. It shows the display screen of the A file for the search key “figure”, onto which the four retrieved figures 11, 12, 13, and 14 are pasted. Since these figures are configured as link elements, clicking one with the mouse displays the original information file that includes it.
[0059]
FIG. 6 is a diagram showing the screen of the original information file displayed, by means of the B file, through the link from the “figure” in FIG. 5. When “figure” 11 in FIG. 5 is clicked, the original information file is displayed together with the “figure” 11 contained in it. Reference numeral 21 denotes a dotted frame for distinguishing the retrieved “figure” 11.
[0060]
Here, the marker is not limited to the dotted frame 21. Based on the image area data (coordinate values), a) a figure such as an arrow or a triangle may be displayed at the position corresponding to the dotted frame, b) these may be made to blink, or c) when the original information file is a color image, only the retrieved figure may be displayed in color and the rest in black and white.
[0061]
FIG. 7 is a diagram showing screens of the original information files displayed one after another by scrolling from the four retrieved figures (FIG. 7(a)). The original information file including figure 11 is displayed first (FIG. 7(b)); scroll 1 displays the original information file including figure 12 (FIG. 7(c)), scroll 2 displays the one including figure 13 (FIG. 7(d)), and scroll 3 displays the one including figure 14 (FIG. 7(e)).
[0062]
According to this search method, key information for reaching the original information file is created as an HTML file, the retrieved figure is displayed, and a link from that figure displays the original information file with the retrieved figure surrounded by a dotted frame. Therefore, even when the number of hits is large, the user can get from the search results to the information file quickly.
[0063]
Next, a case where a character line including a search word is searched using the search word as a search key will be described.
[0064]
In this case, for example, "change" is entered in the search term input field of the search operation panel shown in FIG. 3, and the search start field is clicked. Based on this click, "change" in the text is searched by the search processing described in FIG. 4, and a character line containing "change" is displayed as a search result.
[0065]
FIG. 8 is a diagram showing the screen of this search result; FIG. 8(A) shows the character line “changed”, which contains “change”. From this search result alone, however, the user cannot tell in what kind of sentence the search word “change” is used. The arrival key information creating means 6 therefore creates the A file so that the retrieved character line and the character lines before and after it are displayed, and so that clicking these character lines displays the entire original information file containing them. Accordingly, the A file contains an instruction that designates the character line names and displays the retrieved character line together with the lines before and after it, and an instruction to display the B file when these character lines are clicked.
[0066]
FIG. 8(B) is a diagram illustrating an example in which the search result is displayed as the retrieved character line together with the character lines before and after it. By reading the character line “last month A123 is model” preceding the retrieved character line and the character line “Red line is popular” following it, the user can understand how “change” is used. The required information file can therefore be found without clicking “change” to display the original information file. The original information file can still be displayed when needed.
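Displaying a hit together with its neighbouring character lines can be sketched as below. The three sample lines are the ones quoted in the text; the function name and the tuple layout of the result are assumptions for illustration.

```python
def hits_with_context(lines, word):
    """Return (line before, matching line, line after) for every hit."""
    results = []
    for i, line in enumerate(lines):
        if word in line:
            before = lines[i - 1] if i > 0 else None
            after = lines[i + 1] if i + 1 < len(lines) else None
            results.append((before, line, after))
    return results


lines = ["last month A123 is model", "changed", "Red line is popular"]
context = hits_with_context(lines, "change")
```

A hit on the first or last line of a page simply has no preceding or following line, which the sketch represents as `None`.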
[0067]
Further, processing when the number of search results becomes large will be described.
[0068]
Cutting out the components corresponding to the position information from the original information file takes a certain amount of time. As the number of components matching the search key (the number of hits) grows and more components have to be cut out, the time from the search request to the completion of the search becomes long.
[0069]
Therefore, the arrival key information creating means 6 creates the HTML file as follows. a) When the search for components is completed, the number of hits is counted; when the total reaches a predetermined number or more, a warning is issued and the component cut-out processing is suspended. To release the suspension, search conditions are added to reduce the number of hits.
[0070]
In addition, b) when the predetermined number of hits is exceeded, the plurality of components is not cut out; only one component is made linkable to the original information file, so that a large number of components is not displayed.
[0071]
Further, c) the cut-out components are thinned out, or d) the size of the components to be pasted is reduced. Further, e) components are displayed up to a predetermined number, and beyond that number only link information to the original information file is displayed.
[0072]
Furthermore, f) instead of the component itself, identification information of its original information file, for example the document number, is displayed.
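One of the measures listed above, falling back to document numbers when there are too many hits (measure f), can be sketched as follows. The threshold value, the `document` field, and the returned mode strings are assumptions; the specification only speaks of a "predetermined number".

```python
def choose_display(hits, max_cutouts):
    """Cut out components when there are few hits, else list document numbers."""
    if len(hits) <= max_cutouts:
        return "cutouts", hits
    # Too many hits: skip the costly cut-out step and list identifiers only.
    numbers = sorted({hit["document"] for hit in hits})
    return "document_numbers", numbers


# 58 hypothetical hits, one per document, against an assumed limit of 20.
hits = [{"document": f"{n:03d}", "kind": "table"} for n in range(1, 59)]
mode, shown = choose_display(hits, max_cutouts=20)
small_mode, small_shown = choose_display(hits[:3], max_cutouts=20)
```

Because no image has to be cut out in the fallback case, the time to produce the result screen no longer grows with the cost of cutting out each component.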
[0073]
FIG. 9 is a diagram showing an example in which search results are displayed by document number. The figure shows that, for the search key “table”, there are 58 hits, which are displayed as document numbers, indicated by reference numeral 31. Since each document number is also configured as a link element, clicking a document number links to and displays the original information file containing the retrieved table.
[0074]
Displaying document numbers allows a large amount of information to be listed compactly when there are many hits, so visibility does not suffer as it would if the tables themselves were displayed.
[0075]
Next, a method of inputting a search key by dragging an area to be searched in the information file will be described.
[0076]
FIG. 10 is a diagram showing the configuration of an information retrieval apparatus that performs a search by drag input. In the figure, when a predetermined area of the information file screen 41 is dragged, the position information acquisition means 10 reads the position information (coordinate values) of the area and passes it to the component extraction means 2.
[0077]
The component extraction means 2 acquires the information in the area defined by the acquired position information. In the example of FIG. 10, the character string “Dice” is dragged. Because what the user wants is the character codes, the dragged information is subjected to recognition processing by a character recognition unit (not shown) to obtain the characters “Dice”. The obtained “Dice” is passed to the search key acquisition means 4 as a search key. The subsequent search processing is the same as the search processing flow described above.
[0078]
The search key obtained by dragging is not limited to a search word, but may be a character color, a background color, or the like. Also, by dragging the area other than the character area, a photograph, a figure, a table, or the like can be used as a search key.
[0079]
According to this input method, the user does not need to type the characters “Dice” one by one on the keyboard, so input errors are avoided and input is easy.
[0080]
Subsequently, an information search system according to another embodiment of the present invention will be described.
FIG. 11 is a diagram showing the configuration of an information retrieval system according to another embodiment of the present invention. In the drawing, components denoted by the same reference numerals as those in FIG. 1 are identical to them and perform the same operations.
[0081]
In FIG. 11, a server computer 20 and a client computer 30 are connected via an electric communication line 40 such as a LAN and the Internet.
[0082]
The server computer 20 includes the information file storage means 1, the component storage means 3, the component search means 5, and the arrival key information creating means 6. The server computer 20 also includes, as needed, component extraction means (the same as the component extraction means 2 in FIG. 1), which reads an information file from the information file storage means 1 according to an instruction from the client computer 30, extracts components, adds tag information to them, and stores them in the component storage means 3. The client computer 30 includes information file display means (the same as the information file display means 7 in FIG. 1).
[0083]
When performing an information search, the user transmits a search key from the client computer 30 to the server computer 20 via the electric communication line 40. On receiving the search key, the component search means 5 of the server computer 20 accesses the component storage means 3 and searches for components that match the search key. The matching components are then passed to the arrival key information creating means 6, which creates the HTML files described above (A file and B file). The server computer 20 transmits the created HTML files to the client computer 30 via the electric communication line 40 as the search result. The information file display means of the client computer 30 interprets the HTML files with the WWW browser and displays the retrieved component and/or its original information file on a display as a web-format screen.
[0084]
According to the present embodiment, a general-purpose personal computer can be used as the client computer, and information can be searched simply by installing a WWW browser on it.
[0085]
The above-described embodiment converts document data created by a word processor or the like into image data and performs registration and search based on the image data. By changing the configuration of the information retrieval apparatus, however, components can be registered and information can be searched without converting such document data into image data.
[0086]
The registration method and the search method for components of an information file according to the embodiment of the present invention have been described above. By recording these methods as a program on an arbitrary recording medium such as an MO and having a computer read it, an information retrieval device is configured. This allows any computer to function easily as an information retrieval device.
[0087]
[Effects of the invention]
Effects corresponding to claims 1, 2, 3, and 4: The retrieved component is cut out based on the tag information, the cut-out component is linked to its information file, and the retrieved component is identified on the display screen of the information file, so the information file can be reached quickly even when the number of hits increases.
Effect corresponding to claim 5: Since the retrieved component is displayed together with its surrounding area, the content of the search result can be grasped without displaying the original information file again.
Effect corresponding to claim 6: Input of the search key is made easier, and the search can be performed quickly.
Effects corresponding to claims 7 and 8: Any computer can easily be used as an information retrieval device.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration of an information search device according to an embodiment of the present invention.
FIG. 2 is a flowchart of a component registration process.
FIG. 3 is a diagram showing a search key input screen.
FIG. 4 is a flowchart of an information file search process.
FIG. 5 is a diagram showing a screen displaying a diagram of a search result.
FIG. 6 is a diagram showing a screen of an original information file displayed in link with a diagram of a search result.
FIG. 7 is a diagram showing a screen of an original information file sequentially displayed by scrolling from a diagram of a search result.
FIG. 8 is a diagram showing a screen displaying a search result based on a search word.
FIG. 9 is a diagram showing a screen displaying a search result by a document number.
FIG. 10 is a diagram showing a configuration of an information search device for inputting a search key by dragging.
FIG. 11 is a diagram showing a configuration of an information search system according to another embodiment of the present invention.
[Explanation of symbols]
1 ... Information file storage means 2 ... Component extraction means
3 ... Component storage means 4 ... Search key acquisition means
5 ... Component search means 6 ... Arrival key information creation means
7 ... Information file display means 8 ... File type identification means
9 ... File type conversion means 10 ... Position information acquisition means

Claims

An information retrieval device comprising:
means for creating component data by adding, to components extracted from an information file, tag information representing attributes of the components;
means for storing the component data to which the tag information has been added;
means for searching for a component that matches search key information with reference to the stored component data;
means for extracting the searched component from the information file based on the tag information; and
means for allowing the extracted component to be combined with the information file.

The information retrieval device according to claim 1,
wherein the means for extracting the searched component from the information file based on the tag information limits the number of components cut out from the information file when the number of search hits exceeds a predetermined value.

The information retrieval device according to claim 1 or 2,
further comprising means for creating screen information for displaying the cut-out component.

The information retrieval device according to claim 2 or 3,
further comprising means for creating screen information that allows the cut-out component to be identified on the information file.

The information retrieval device according to claim 3 or 4,
wherein the means for creating screen information for displaying the cut-out component creates screen information for displaying a peripheral area including the cut-out component.

6. The information retrieval apparatus according to claim 5, further comprising means for inputting, as a retrieval key, a component cut out from the displayed information file.

A program for causing a computer to function as:
means for creating component data by adding, to components extracted from an information file, tag information representing attributes of the components;
means for storing, in storage means, the component data to which the tag information has been added;
means for searching for a component that matches search key information with reference to the stored component data;
means for extracting the searched component from the information file based on the tag information;
means for creating, from the cut-out component, combined information to be combined with the information file;
means for creating screen information for displaying the extracted component; and
means for creating screen information that allows the cut-out component to be identified on the information file.

A storage medium storing the program according to claim 7 in a computer-readable manner.