JP2004240488A - Document managing device

Info

Publication number: JP2004240488A
Application number: JP2003026144A
Authority: JP
Inventors: Shuichi Morisawa; 秀一森澤
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2003-02-03
Filing date: 2003-02-03
Publication date: 2004-08-26

Abstract

PROBLEM TO BE SOLVED: To provide an integrated paper/electronic file document management system improved so that an operator can retrieve a document not by the date on which it was scanned but by the date on which it was actually created, or by a date close to it that reflects the content of the text.

SOLUTION: A document management device for storing and managing document images read by a scanner or the like and electronic documents is provided with: word output means for performing character recognition on character strings in the document image using a recognition dictionary, performing morphological analysis using a word dictionary, and outputting the words that are recognition candidates; extraction means for extracting portions expressing dates or times from the word string output by the word output means; estimation means for estimating the period in which the document image was created from the date or time data extracted by the extraction means; and assignment means for assigning the period estimated by the estimation means as an attribute of the document image.

COPYRIGHT: (C) 2004, JPO & NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、コンピュータ上で大量の文書を保存、管理する文書管理サーバに関するものであり、オフィスその他の様々な場面で旧来の紙媒体によるドキュメントと電子化文書とが入り混じる状況の中で両者をシームレスに統合管理し、分類、加工、検索等のニーズに対して使い勝手が向上するように工夫されたシステムに関するものである。
【０００２】
【従来の技術】
文書管理サーバにおいて、旧来の紙媒体によるドキュメントを新たにコンピュータ上で保管、管理するために、スキャナで原稿を読み取り、文字や図表、写真等、画像の種類ごとに領域を分割し、文字部分はＯＣＲ等により文字コード列に変換してテキスト化し、その他の部分は画像ファイルとして別々に管理することが行われている。一般にドキュメントがコンピュータ上でファイルとして管理される場合には、ファイルの内容自体とともに、そのドキュメントの属性として幾つかの付帯情報が記憶される。例えば、ファイルのサイズや形式、又そのファイルが作成された（或は更新された）日時や作成者等である。
【０００３】
ところで、紙文書が一旦電子化された後には、既にキーボードからの入力等により作成され文書保管装置に記憶されていた文書群の各ドキュメントとの区別は特に意識しないで、シームレスに検索を行いたいという要求が、或る段階でオペレータには発生するものと思われる。検索にあたっては、ドキュメントに書かれた内容やその書式で探すことが最も多いと思われるが、一方でそのドキュメントが作成された年月日を足掛かりに探すことも良く見られる傾向である。ドキュメントによっては、その内容と書かれた時期とが密接に関連している場合があるためである。
【０００４】
然るに、従来のシステムにおいては、紙媒体のドキュメントを先に説明したやり方で電子化した場合、そのときの日時、即ち電子化という作業を完了した時点の日時がそのファイルの属性として付与されてしまうのが普通であった。紙ドキュメントは、オフィスにおいて電子化文書が主流となり、大量の電子化文書を管理するための文書管理サーバが稼動し始める以前の昔から存在しているのが普通であるから、システム稼動後に電子化され、その読込み作業時点での日付がファイルの作成日時として属性付与されても、検索時には本来の作成された年月日を目安に探したいというのが背景にあると思われる。
【０００５】
文書の日付管理や日付データ自動生成に関する過去の技術を見ると、特開平０９−０６２６６５号「文書管理装置」に記載された発明においては、非定型の文書の関連する日付情報は、文書内容を実際に読むことでしか得られず、文書に関連する日付情報の或る期間ごとの分布を容易に把握することができなかったという問題点の解決を図る目的で、電子化された文書群における日付に関連する文書の頻度を、年／月／曜日／日といった複数の時間的観点から整理して、分かり易く提示することが提案されている。本文中から日付データを抽出する点では本発明と重複しており、又、ＧＵＩ（ＧｒａｐｈｉｃａｌＵｓｅｒＩｎｔｅｒｆａｃｅ）の観点から見れば優れた発明であるが、紙で存在しているドキュメントを電子化ファイルと共存させるに当たり、ファイル属性というコンピュータ上で文書を管理する上で本質的なデータを文書管理サーバが操作するという視点は見られない。
【０００６】
又、特開平０６−１０３２６８号「文書管理装置」における発明は、文書の記載漏れを検出して自動的に生成可能な情報を自動的に文書に埋め込む技術に関し、例えば作成日の情報の場合、それの意味するところは文書を登録する日付のことであるから、システムが持っているタイマーの値から一意に決めることができるので、それを取得して文書中に挿入する。
【０００７】
しかしながら、過去に作成された紙ドキュメントの実際の作成日を本文中から推論することは提案されていない。そのため、古い紙ドキュメントと電子化文書とを共存させ、検索に適した日付管理は不可能である。
【０００８】
【発明が解決しようとする課題】
上で述べたように、文書管理システム稼動後にスキャンして電子化した紙ドキュメントと、最初から電子化データとして作成されたドキュメントとが混在した中から、作成日付を頼りに検索しようと思った場合、紙ドキュメントでは実際に作成された年月日ではなく、スキャンして電子化した日時がファイル属性として与えられているために、既に電子化されていたドキュメントとで日付管理の基準が異なっており、両者を一律に探すことが困難であると言え、又、スキャンの日時では書かれている内容との関連性がなく、検索時のあてにすることができないという問題点が存在する。
【０００９】
そこで、本発明は、紙ドキュメントをスキャンして電子化する場合に、文字認識の結果をサーチして実際にそのドキュメントが作成された日付が書かれていると思われる文字列部分を見出し、それをファイル属性として付与することで、オペレータが検索時に、スキャンした日時ではなく、実際に作成された日付或は本文の内容を反映した、それに近い日付で検索できるように改良した、紙・電子ファイル統合文書管理システムを提供することを目的とする。
【００１０】
【課題を解決するための手段】
上記目的を達成するため、本発明は、スキャナ等で読み込んだ原稿画像や電子化文書を保管、管理する文書管理装置を、前記原稿画像中の文字列を認識用辞書を用いて文字認識を行い、単語辞書を用いて形態素解析を行い、その認識候補とする単語を出力する単語出力手段と、前記単語出力手段が出力した単語列から年月日や時刻を表す部分を抽出する抽出装置と、前記抽出手段が抽出した年月日や時刻に関するデータから前記原稿画像が作成された時期を推定する推定手段と、前記推定手段が推定した時期を前記原稿画像の属性として付与する付与手段とを含んで構成したことを特徴とする。
【００１１】
【発明の実施の形態】
以下に本発明の実施の形態を添付図面に基づいて説明する。
【００１２】
図１は本発明に係る文書管理装置の全体構成を示すブロック図である。
【００１３】
図１において、１０１はＯＣＲ（光学的文字読取装置）で、原稿画像中のテキスト部分を読み取り、認識候補文字の文字コードを出力するものである。１０２は前記ＯＣＲ１０１の出力である認識文字候補に対して単語辞書１０３を参照しながら形態素に分割する形態素解析装置である。１０３は前記１０２の形態素解析装置が候補文字列から単語列を切り出す際に参照する単語辞書である。１０４は前記１０２の形態素解析装置により分割された形態素列の中から、日付データに相当するものを探す日付データ探索装置である。
【００１４】
１０５は読み取った原稿画像を電子化ファイルとして保存する際に、内容以外にその原稿の属性として幾つかの付帯情報を書き込むための属性付与装置である。属性としては、例えば作成したファイルのサイズやフォーマット、又そのファイルが作成された（或は更新された）日時や作成者（この場合は画像の読込み作業者）等である。１０６は文書記憶装置であり、電子化されたドキュメントを記憶しておくためのハードディスク等の大規模記憶装置で構成される。
【００１５】
以下、（１）〜（３）の３段階に分けて処理の流れを説明する。
（１）文字認識処理
今、図３に示した原稿画像が入力されたものとし、図２のフローチャートに従って処理の流れを説明する。
【００１６】
先ず、ＯＣＲ装置１０１により、原稿画像のテキスト部分に関して文字認識を行う（ステップ２０１）。その結果、読み取られた各文字について複数個の文字候補から成る文字コード列が出力される。図４は図３の原稿画像のうち、…開催日時：６月２０日（…の１１文字の部分について文字認識を行った結果出力された文字候補を示したものである。
【００１７】
それぞれの認識文字に対して第四文字候補までの文字を示してあるが、一般には何位の候補文字まで出力されるかは決まっていない。形態素解析装置１０２はこれらの候補文字の全ての組み合わせを考え、単語辞書１０３を参照して可能な形態素列への分割を行う（ステップ２０２）。図５は図４で示された部分の形態素解析を行って得た分割の一例である。
（２）ジャンル推定処理
続いてステップ２１２のジャンル推定処理について説明する。
【００１８】
ドキュメントの作成された時期を推定するためには、原稿のおおまかなジャンルを決定できていることが前提となる。例えば、オフィスで作成される報告書や企画書等の書類の場合、フォーマットが一定なことが多く、作成日付を見出すことも容易である。一方で新聞記事や紀行文、物語文等の一般のソースの場合を考えると、明示的に作成日付は記載されていないことが多く、本文中から日付部分をサーチし、作成された時期を推測する必要がある。ここでは、文書のフォーマットによりジャンルやカテゴリを決定できなかった場合に、本文に書かれた内容からそれらを推定する処理について説明する。
【００１９】
そのためには過去に蓄積された大量の電子化文書が、既にジャンルやカテゴリに分類されていることが重要となる。そこで、先ず、文書管理サーバに蓄積された文書群をユーザの意図に沿って、予め決められたカテゴリ群の何れかに分類する方法の１つとして、ベクトル空間モデルを利用した方法を説明する。このベクトル空間モデルでは、分類に有用な語や文書、カテゴリをベクトルで表現し、ベクトルの方向を目安としてその文書が属するカテゴリを決定する。このベクトル空間モデルを利用した文書自動分類の処理は、学習フェーズ及び分類フェーズ２つのフェーズに分けられる。
【００２０】
学習フェーズでは、予め正しく分類された学習用文書から分類に有用な語（以下、有効語と言う）を選出し、各有効語をベクトル表現する。このベクトルは有効語ベクトルと呼ばれ、この有効語ベクトルの成分は、出現頻度や単語共起確率等により求められる。各有効語を見出しとし、その有効語ベクトルを格納したものを有効語辞書と呼ぶ。又、学習用文書に含まれる各有効語ベクトルの重み付き平均を計算することにより、文書の内容をベクトル表現でき、各カテゴリごとにそれに属する文書ベクトルの平均を求めることで、カテゴリの特徴を表すフォルダベクトルの算出が行われる。
【００２１】
分類フェーズでは、学習フェーズで得られた有効語辞書を用いて分類対象文書をベクトルで表現し（以下、文書ベクトルと言う）、この文書ベクトルとフォルダベクトルとを比較し、該比較結果に応じて分類対象文書が属するカテゴリを決定する。
【００２２】
ジャンル推定装置は、図６に示すように、学習用文書を保持する学習用文書データベース６０１と、分類対象文書を保持する分類対象文書保持部６０２と、学習用文書から有効語を抽出する有効語抽出手段６０３と、抽出された有効語を各カテゴリへの帰属度と共に保持する有効語辞書６０４と、有効語の重要度を評価する評価項目別に重要度の値が記述された評価項目テーブル６１８と、学習用文書と有効語辞書とを参照して各文書に含まれている有効語の数を求める有効語数計算部６０５と、求められた各文書内の有効語数を保持する有効語数保持部６０６と、有効語の数を基に各有効語の組の共起頻度を求める共起頻度計算手段６０７と、共起頻度を参照して各有効語の有効語ベクトルを求める有効語ベクトル計算手段６０９と、学習用文書と分類対象文書とのそれぞれについて有効語ベクトルを参照して文書ベクトルを求める文書ベクトル計算手段６１１と、学習用文書について求められた文書ベクトルを用いて各カテゴリのフォルダベクトルを求めるフォルダベクトル計算手段６１３と、分類対象文書について求められた文書ベクトルと各カテゴリのフォルダベクトルとを比較し、比較結果に応じて分類対象文書が属するカテゴリを推定するカテゴリ推定手段６１５と、カテゴリ推定部がカテゴリごとに分類した分類対象文書の一覧を表示する分類結果表示手段６１７と、カテゴリ推定手段による推定結果がユーザの意図に反する場合に推定結果表示部上で適当と思われるカテゴリを複数個選択して指示することにより評価項目テーブル６１８内の重要度の値を修正して学習する学習手段６１９とから構成される。
【００２３】
有効語数保持部６０６に保持された文書内の有効語数は共起頻度計算手段６０７に与えられ、共起頻度計算手段６０７は有効語数を用いて各有効語の組の共起頻度を求める。この求められた共起頻度は、共起頻度保持部６０８に保持された後に、有効語ベクトル計算手段６０９に与えられる。有効語ベクトル計算手段６０９は、共起頻度を用いて各有効語の有効語ベクトルを求める。ここで、有効語Ｔｉと有効語Ｔｊの共起確率をｃｉ，ｊ、有効語数をＮとすると、有効語Ｔｉの有効語ベクトルＴｉは、次の（１）式により、
Ｔｉ＝（ｃｉ，１，ｃｉ，２，…，ｃｉ，Ｎ） …（１）
となる。
【００２４】
又、共起確率ｃｉ，ｊは次の（２）式により定義される。
【００２５】
ｃｉ，ｊ＝（ＴｉとＴｊの両方を含む文書数）／（Ｔｉを含む文書数）…（２）
有効語ベクトル計算手段６０９により求められた有効語ベクトルは、有効語ベクトル保持部６１０に保持された後に文書ベクトル計算手段６１１に与えられる。文書ベクトル計算手段６１１は、学習用文書と分類対象文書のそれぞれについて、有効語ベクトルを参照して文書ベクトルを求め、学習用文書と分類対象文書のそれぞれについて求められた文書ベクトルは文書ベクトル保持部６１２に保持される。文書ベクトル保持部６１２に保持された学習用文書の文書ベクトルは、フォルダベクトル計算手段６１３に与えられ、フォルダベクトル計算手段６１３は、学習用文書の文書ベクトルを用いて各カテゴリのフォルダベクトルを求める。求められた各カテゴリのフォルダベクトルは、フォルダベクトル保持部６１４に保持される。
【００２６】
フォルダベクトル保持部６１４に保持された各カテゴリのフォルダベクトルは、文書ベクトル保持部６１２に保持された分類対象文書の文書ベクトルと共にカテゴリ推定手段６１５に与えられ、カテゴリ推定手段６１５は、分類対象文書の文書ベクトルと各カテゴリのフォルダベクトルとを比較し、該比較結果に応じて分類対象文書が属するカテゴリを決定する。この決定された分類対象文書のカテゴリは、分類結果保持部６１６に保持される。
【００２７】
次に、ジャンル推定装置における学習フェーズの処理手順について図７を参照しながら説明する。
【００２８】
先ず、ステップ７０１において学習用文書を形態素解析し、それに含まれる語の中から、分類に有用な語を有効語として選定し、続くステップ７０２で、各文書内に含まれている選定した有効語の数を求める。有効語は普通名詞、固有名詞、サ変名詞及び未知語を対象に、特定のカテゴリに偏って出現する単語を選定する。
【００２９】
次に、各有効語の重み付けを行う。重みの評価は２つの観点から行う。即ち、先ず、▲１▼その有効語自体が分類という行為に対してどの程度有効かという点。▲２▼その有効語が各文書の中でどの程度重要な位置を示しているかという点。
【００３０】
▲１▼の重みｗ１は、各カテゴリへの帰属度の度合いを表すもので、特定のカテゴリを特徴付ける度合いの高い有効語ほど重みを重くするという考えであり、次の要領で算出する。先ず、カテゴリＣｋに属する学習用文書の中で、有効語Ｗｉを含む文書の割合Ｐｉｋを求める。
【００３１】
Ｐｉｋ＝（カテゴリＣｋに属し有効語Ｗｉを含む文書の数）／（カテゴリＣｋに属する文書の数）
但し、
ΣＰｉｋ（全てのカテゴリに亘る和）＝１
となるよう正規化する。
【００３２】
Ｗ１＝１−Ｈ（Ｗｉ）、但し、Ｈ（Ｗｉ）はＰｉｋのエントロピー
と定義する。
【００３３】
▲２▼の重みｗ２は、対象とする文書の中でその有効語がどのように使われているか、文書の内容とどのように関わっているのか、という側面を評価する。例えば、『…テレビの発達したマスメディア国家アメリカでは、サーカスみたいな政治になっていて、優れた学生は政治家になりたいとは思わないだろう。』という文章を含む、『政治』カテゴリに属すべき新聞記事があった場合、“サーカス”という単語はアメリカの“政治”に対する比喩として用いられたに過ぎず、文章の内容とは直接関係は少ない。
【００３４】
従って、“サーカス”という単語そのものが『娯楽』という特定のカテゴリを特徴付ける度合いが高いからといって、この有効語に高い重みを付けてしまうと、文書ベクトルが誤った方向に引っ張られてしまう。そこで、▲１▼と合わせて▲２▼のような重みの評価も必要となる（▲２▼の重みを文書内重要度と呼ぶことにする）。
【００３５】
文書中での重要性に関連ある要素として、（１）その有効語の出現位置及び（２）その有効語の格役割、修飾タイプ等の言語的役割、に注目して評価項目を予め作成しておき、有効語が各評価項目の条件を満たした場合に与える重みの値を学習によって求める。
【００３６】
先ず、（１）の重みについて説明する。
【００３７】
文書中での有効語の出現位置は、その重要度と相関が強いと考えられる。例えば、新聞記事では先頭段落に大意を表現するような重要な単語が現れることは周知の事実である。そこで、文書全体を文を単位としてｎ個のブロックに等分し、各ブロックに対する重みを求める。
【００３８】
次に、（２）の重みについて説明する。
【００３９】
言語的役割と有効語の重要度との相関としては、例えば、『“…が”、“…は”等の主語の形で使われている単語は重要』『連体修飾する用言は余り重要ではない』等の法則が考えられる。そこで、言語的役割として次のような評価項目を用意し、それぞれの重みを学習によって求める。
「が」格
「を」格
「に」格
「へ」格
「は」格
「も」格
その他の連用修飾する体言
連体修飾する体言
連体修飾するサ変
文末のサ変
読点付きのサ変
文末の体言
本実施の形態では、評価項目としては、図８に示したように、□有効語が「段落の先頭文に含まれているか」、□有効語の「が」格、「を」格等の「格役割」、□有効語が「文末のサ変名詞」であるか、を採用しているが、評価項目としては、その有効語の文書内での出現位置やその単語の係り受けの役割等を採用することも可能である。尚、図８は評価項目テーブルの初期状態を示しており、各文書内重要度の値は全て“１．０”となっている。
【００４０】
次いで、ステップ７０３に進み、有効語数から各有効語の組の共起頻度を求め、続くステップ７０４で、共起頻度から有効語ベクトルを算出する。そして、有効語ベクトルを参照して学習用文書から有効語を取り出し、続くステップ７０５で、取り出した有効語の有効語ベクトルの平均を取って学習用文書の文書ベクトルを求める。
【００４１】
次いで、ステップ７０６に進み、各カテゴリごとにそれに属する全ての学習用文書の文書ベクトルの平均を求め、これをそのカテゴリの代表ベクトルとして本処理を終了する。
【００４２】
次に、文書内重要度の値の学習アルゴリズムについて図９を参照しながら説明する。
【００４３】
先ず、全ての評価項目に対する文書内重要度の値を１に初期化する。
【００４４】
次に、学習に用いた学習用文書の数をカウントするカウンタ変数ｎを“０”に初期化する（ステップ９０１）。次に、正しい分類カテゴリ（＝Ｃ＊とする）が付与された学習用文書を読出し、カウンタ変数ｎを“１”だけインクリメントする（ステップ９０２）。読み出した文書の例として、カテゴリ『事件』に分類されている図１０のような文書を想定する。そして、読み出した学習用文書の中から有効語辞書６０４に記載された有効語を抽出し、評価項目テーブル６１８の評価項目に従って抽出に係る有効語の属性テーブルを作成する（ステップ９０３）。ここで、図１０の文書において有効語辞書６０４に従って抽出された有効語と、その有効語に係る属性テーブルの例を図１１に示す。
【００４５】
次に、抽出した各有効語に係る属性テーブルと評価項目テーブル６１８に記述された重要度の値に基づいて各有効語の文書内重要度を計算する（ステップ９０４）。そして、計算した文書内重要度、有効語辞書６０４に保持された各有効語のカテゴリへの帰属度データ等を用いて、その文書の各カテゴリへの帰属度を計算し、最も帰属度の高いカテゴリを分類結果（＝Ｃ）とする（ステップ９０５）。
【００４６】
そして、この分類結果（Ｃ）をステップ９０２にて取得した正しい分類カテゴリＣ＊と比較して、その分類結果Ｃが正しいか否かを調べ（ステップ９０６）、正しくなければ、評価項目テーブル６１８の重要度の値を修正する（ステップ９０７）。
【００４７】
ここで、同じく図８及び図１０〜１１を用いて重要度の値の更新方法を説明する。
【００４８】
今、図１０の文書がカテゴリ『科学』に誤分類されたとすると、先ず、誤分類であるカテゴリ『科学』への帰属度の大きい有効語「工学部」、「研究室」、「化学実験」に注目する。これら有効語は誤分類の原因であると考えられるので、その文書内重要度が小さくなるように、図８の評価項目テーブルの重要度の値を修正する。
【００４９】
即ち、「工学部」に着目した際には、「工学部」に係る図１１の属性テーブルの「段落先頭文にあるか否か」（図８の評価項目１）の属性値は“ＴＲＵＥ”であり、「格役割」（図８の評価項目２）は“「の」の連体”であるため、評価項目テーブル＊＊の「段落先頭文にあるか否か」の“ＴＲＵＥ”及び「格役割」の“「の」の連体”の重要度の値を微小量だけ減らす。「研究室」、「化学実験」に着目した際にも、「工学部」と同様の処理を行う。
【００５０】
次に、正しい分類カテゴリである『事件』への帰属度の大きい有効語「火災」、「負傷」に注目する。正しい分類結果を出すには、これら有効語の文書内重要度を大きく評価しなければならない。
【００５１】
そこで、「火災」に着目した際には、「火災」に係る属性テーブルの「段落先頭文にあるか否か」（評価項目１）の属性値は“ＴＲＵＥ”であり、「格役割」（評価項目２）は“「が」格”であるため、評価項目テーブル６１８の「段落先頭文にあるか否か」の“ＴＲＵＥ”及び「格役割」の“「が格」”の重要度の値を微小量だけ増やす。「負傷」に着目した際にも、「火災」と同様の処理を行う。
【００５２】
このようにして重要度の値を更新した後は、ステップ９０８に進み、過去Ｎ個（Ｎ≦ｎ）の学習用文書に対する分類の正解率ｒを計算する。尚、ステップ９０６にて分類結果が正しいと判別されたときは、ステップ９０７での重要度の値の更新処理をスキップしてステップ９０８に進む。
【００５３】
次に、正解率ｒが所定の値Ｔｈを越えているか、又は学習に用いた学習用文書数が所定の値Ｍを越えているかを調べ（ステップ９０９）、何れかが満たされていたら終了し、何れも満たされていなければ、ステップ９０２に戻り、次の学習用文書に基づいて同様の処理を行う。
【００５４】
このような処理を行うことにより、各評価項目の重要度が適切に修正された評価項目テーブル６１８が実現されることとなる。
【００５５】
このように、単語の出現位置、格役割、修飾タイプ等、文書内重要度の評価に有用であると思われる評価項目に対する具体的な重要度の値を、カテゴリごとに別けて保存された複数の学習用文書によって学習により求めている。即ち、最初は、各評価項目の重要度の初期値を適当に与えておき、学習用文書を分類させてみて、その分類結果が正しいカテゴリと異なっており、誤分類が発生した場合には、その誤分類に大きな影響を与えた有効語を抽出し、文書内重要度の評価項目でそのケースに当て嵌まるものに付与された重要度の値を微少量だけ修正する。このような処理を大量の学習用文書に対して行って、分類の正解率が極力高くなるような文書内重要度の値を自動的に求める。
【００５６】
次に、実際に入力されたカテゴリの不明な分類対象文書を自動分類させる分類フェーズの処理手順について図１２を参照しながら説明する。
【００５７】
分類フェーズでは、先ず、ステップ１２０１において上記ステップ７０４で求めた有効語ベクトルを参照して分類対象文書から有効語を取り出し、続くステップ１２０２で取り出した有効語のベクトル（上記ステップ７０４で求めた有効語ベクトル）の平均を取り、このベクトルの平均から分類対象文書の文書ベクトルを求める。
【００５８】
次いで、ステップ１２０３に進み、分類対象文書の文書ベクトルと学習フェーズで求められたフォルダベクトルとを比較し、該比較結果に応じて分類対象文書が属するカテゴリを決定し、本処理を終了する。
【００５９】
以上説明したような方法で文書管理サーバ上に蓄積された電子化文書をカテゴリに分類しておけば、新たに読み込んだ紙ドキュメントに対しても同様にカテゴリを推定できる。
（３）日付探索処理
再び図２のフローチャートに戻って、日付探索処理について説明する。
【００６０】
先ず、形態素列の中から、日付を表していると考えられるものを探す。その際、日付部分の前後に改行コード若しくは複数個のスペースやタブが入っていれば、それは日付のみの単独データが存在する部分としてピックアップする（図２のステップ２０３）。もし、ピックアップされた日付データが存在すれば、それをファイル属性としての作成日付とする（ステップ２０４）。原稿画像が図３の場合、先頭行から数えて３行目に、『（スペース又はタブ）２００１年５月３０日（改行）』という行が存在するので、これにより作成日付のファイル属性が「西暦２００１年５月３０日＊＊時＊＊分＊＊秒」として付与される。ここで、時分秒に相当する部分は形態素解析により抽出できなかったので、システムによりデフォルトの時間が割り当てられる。
【００６１】
ステップ２０３において、もしピックアップされた日付データがなければ、単独の日付行以外に日付データがないかを分割された形態素列の中からサーチする（ステップ２０５）。
【００６２】
次に、サーチされた日付データの中から、現在又は未来の日付を表しているものは捨て、過去の日付データのみを残す（ステップ２０６〜２０８）。図３の例でもし仮に日付行と見なされる第３行目が存在していなければ、本文中から『６月２０日（日）９：００〜１５：００』及び『６月１０日（木）』の部分が抽出される。
【００６３】
全ての形態素列に対して探索された後、抽出された日付データの中から、実際に当該原稿が作成された日付、ないしはなるだけそれに近い年月日をヒューリスティックに推定することが行われる（ステップ２０９）。その際、ジャンル推定処理により推定されたドキュメントのジャンルが考慮される。例えば、オフィスにおいて作成された企画書や見積書、報告書等では、過去の年月日が複数個出現していれば、その中で最も過去に遡る日付が、より作成日付に近いと考えられる。
【００６４】
一方、新聞や雑誌の切り抜き等に見られるニュース文や報道記事では、過去の事件を扱ったものであるか、或はそれが書かれた瞬間よりも未来に起こるであろう出来事を扱っているのかを内容や抽出された日付データ等から判断し、前者の場合には抽出された日付データのうち最も新しい日付が、又、後者の場合には最も古い日付が、そのドキュメントが作成された日付に最も近いものと結論されるため、このデータをファイル属性として付与する。
【００６５】
又、このようにして抽出された日付からその原稿の作成日付を推定した場合には、キーボード入力等により作成された最初から電子化データである場合に付与される作成日付とは性質が異なり、不確実性を持った情報であるという識別子としてフラグを立てる（ステップ２１０）。
【００６６】
尚、日付データが全く存在しなかった場合には、コンピュータシステムが付与するファイル属性をそのまま付与する（ステップ２１１）。
【００６７】
＜他の実施の形態＞
以上の実施の形態では、原稿ドキュメントに対し、既に蓄積され自動分類された電子化文書群を利用してドキュメントに書かれた本文の内容によりジャンルの推定を行ったが、本文の言語的な解析による分類ではなく、レイアウト情報や書式パターンを予め登録しておき、画像認識技術により原稿の書式パターンを認識してそれに応じたジャンルに決定する方法が考えられる。
【００６８】
又、原稿に対して、ジャンルの推定処理及び日付データ抽出処理を行うタイミングは、実施の形態にて説明した通りである必要はなく、日付データの抽出度合いに応じてジャンル推定処理の処理量やそのタイミングを調節し、システムへの負荷を軽減することも行われる。
【００６９】
【発明の効果】
以上説明したように、本発明によれば、既に作成されている電子化ドキュメントと、紙による文書をスキャナにより読み込ませ文字認識させたＯＣＲドキュメントとが共存して管理される文書管理サーバにおいて、紙ドキュメントをスキャンして電子化する際に、文字認識の結果をサーチして実際にそのドキュメントが作成された日付が書かれていると思われる文字列部分を見出し、それをファイル属性として付与することで、オペレータが検索時に、スキャンした日時ではなく、実際に作成された日付、ないしは本文の内容を反映した、それに近い日付で検索できるように改良した、紙・電子ファイル統合文書管理システムを提供することができる。
【図面の簡単な説明】
【図１】本発明におけるシステムの全体構成を示す図である。
【図２】本発明における処理の流れを示すフローチャートである。
【図３】原稿画像の一例を示す図である。
【図４】認識文字候補の例を示す図である。
【図５】辞書を検索して抽出された単語候補の例を示す図である。
【図６】ジャンル推定装置の構成を示す図である。
【図７】ジャンル推定装置における学習フェーズの処理手順を示すフローチャートである。
【図８】評価項目テーブルの例を示す図である。
【図９】文書内重要度の学習アルゴリズムについて説明したフローチャートである。
【図１０】分類対象文書の一例を示す図である。
【図１１】属性テーブルの例を示す図である。
【図１２】ジャンル推定装置における分類フェーズの処理手順を示したフローチャートである。
【符号の説明】
１０１ＯＣＲ
１０２形態素解析装置
１０３単語辞書
１０４ジャンル推定装置
１０５日付データ探索装置
１０６属性付与装置
１０７文書記憶装置

[0001]
[Technical field of the invention]
The present invention relates to a document management server that stores and manages a large number of documents on a computer. More specifically, it relates to a system that seamlessly integrates the management of conventional paper documents and digitized documents, which are intermingled in offices and many other settings, and that is devised to improve usability for needs such as classification, processing, and search.
[0002]
[Prior art]
In a document management server, in order to store and manage conventional paper documents on a computer, the original is read with a scanner, the image is divided into regions by type (text, charts, photographs, and so on), the text regions are converted into character code strings by OCR or the like, and the remaining regions are managed separately as image files. In general, when a document is managed as a file on a computer, several items of supplementary information are stored as attributes of the document along with the file content itself: for example, the size and format of the file, the date and time when the file was created (or updated), and its creator.
[0003]
Once a paper document has been digitized, the operator is at some stage likely to want to search seamlessly, without any particular awareness of the distinction between it and the documents that were already created by keyboard input or the like and stored in the document storage device. Searches are most often based on the content or format of a document, but searching by the date on which the document was created is also common, because for some documents the content is closely related to when it was written.
[0004]
In conventional systems, however, when a paper document is digitized in the manner described above, the date and time at that moment, that is, the date and time at which the digitization work was completed, is normally assigned as an attribute of the file. Paper documents usually date from long before digitized documents became mainstream in the office and before a document management server for handling them in bulk went into operation. Even though such documents are digitized after the system goes into operation and the date of the scanning work is assigned as the file's creation date and time, the operator presumably still wants to search by the date on which the document was originally created.
[0005]
Turning to prior art on document date management and automatic generation of date data, the invention described in Japanese Patent Application Laid-Open No. 09-062665, "Document Management Apparatus", addresses the problem that date information related to an unstructured document could be obtained only by actually reading the document, so that the distribution of such date information over a given period could not be grasped easily. It proposes organizing the frequency of date-related documents in a digitized document group from several temporal viewpoints, such as year, month, day of the week, and day, and presenting the result in an easily understood form. That invention overlaps with the present invention in extracting date data from the body text, and it is an excellent invention from the GUI (Graphical User Interface) standpoint. However, it lacks the viewpoint that, in making paper documents coexist with electronic files, the document management server manipulates the file attributes, which are data essential to managing documents on a computer.
[0006]
The invention in Japanese Patent Application Laid-Open No. 06-103268, "Document Management Apparatus", relates to a technique for detecting omissions in a document and automatically embedding information that can be generated automatically. In the case of the creation date, for example, this means the date on which the document is registered, so it can be determined uniquely from the value of the system timer; that value is obtained and inserted into the document.
[0007]
However, inferring from the body text the actual creation date of a paper document created in the past has not been proposed. Consequently, date management that lets old paper documents and digitized documents coexist and that is suited to searching has not been possible.
[0008]
[Problems to be solved by the invention]
As described above, consider searching by creation date in a collection that mixes paper documents scanned and digitized after the document management system went into operation with documents created as electronic data from the start. For the paper documents, the file attribute holds the date and time of scanning rather than the date on which the document was actually created. The basis for date management therefore differs from that of documents that were already electronic, which makes it difficult to search both uniformly. In addition, the scan date has no relation to the written content and cannot be relied on when searching.
[0009]
Accordingly, an object of the present invention is to provide an integrated paper/electronic file document management system improved so that, when a paper document is scanned and digitized, the character recognition result is searched to find the character string portion believed to state the date on which the document was actually created, and that date is assigned as a file attribute. The operator can then search not by the date and time of scanning but by the actual creation date, or by a date close to it that reflects the content of the text.
[0010]
[Means for Solving the Problems]
To achieve the above object, the present invention is a document management device that stores and manages document images read by a scanner or the like and digitized documents, comprising: word output means that performs character recognition on character strings in the document image using a recognition dictionary, performs morphological analysis using a word dictionary, and outputs the words taken as recognition candidates; extraction means that extracts portions representing dates or times from the word string output by the word output means; estimation means that estimates when the document image was created from the date and time data extracted by the extraction means; and assignment means that assigns the period estimated by the estimation means as an attribute of the document image.
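The extraction, estimation, and assignment means can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes the recognized text is already available as a plain string, substitutes a regular expression for morphological analysis, handles only full year-month-day expressions, and always applies the office-document rule from the later description of step 209 (the oldest past date is taken as closest to the creation date). The names `DATE_RE`, `extract_dates`, and `estimate_creation_date` are hypothetical.

```python
import re
from datetime import date
from typing import List, Optional, Tuple

# Assumption: dates are written as <year>年<month>月<day>日.
DATE_RE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def _to_date(y: str, m: str, d: str) -> Optional[date]:
    """Convert matched digit strings to a date, or None if the values are invalid."""
    try:
        return date(int(y), int(m), int(d))
    except ValueError:
        return None


def extract_dates(text: str) -> List[date]:
    """Extraction step: collect every valid date expression in the text."""
    found = (_to_date(*groups) for groups in DATE_RE.findall(text))
    return [d for d in found if d is not None]


def estimate_creation_date(text: str, today: date) -> Tuple[Optional[date], bool]:
    """Return (creation date, is_estimated).

    A line that consists only of a date is used as-is (steps 203 and 204).
    Otherwise only past dates are kept and the oldest one is chosen, and the
    result is flagged as uncertain (steps 205 to 210). (None, False) means
    the caller should fall back to the system timestamp (step 211).
    """
    for line in text.splitlines():
        match = DATE_RE.fullmatch(line.strip())
        if match:
            standalone = _to_date(*match.groups())
            if standalone is not None:
                return standalone, False
    past = [d for d in extract_dates(text) if d < today]
    if past:
        return min(past), True
    return None, False
```

The boolean in the result corresponds to the uncertainty flag that the description attaches to estimated dates, so a search front end could treat flagged dates more loosely than dates assigned by the file system.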
[0011]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings.
[0012]
FIG. 1 is a block diagram showing the overall configuration of a document management apparatus according to the present invention.
[0013]
In FIG. 1, reference numeral 101 denotes an OCR (optical character reader), which reads the text portion of a document image and outputs the character codes of the recognition candidate characters. Reference numeral 102 denotes a morphological analyzer that divides the candidate characters output by the OCR 101 into morphemes while referring to the word dictionary 103. Reference numeral 103 denotes the word dictionary that the morphological analyzer 102 refers to when cutting word strings out of the candidate character strings. Reference numeral 104 denotes a date data search device that searches the morpheme string produced by the morphological analyzer 102 for items corresponding to date data.
[0014]
Reference numeral 105 denotes an attribute assigning device that, when a read document image is saved as an electronic file, writes several items of supplementary information as attributes of the document in addition to its content. The attributes include, for example, the size and format of the created file, the date and time when the file was created (or updated), and the creator (in this case, the operator who scanned the image). Reference numeral 106 denotes a document storage device for storing digitized documents, implemented as a large-capacity storage device such as a hard disk.
[0015]
Hereinafter, the processing flow will be described in three stages (1) to (3).
(1) Character recognition processing
Now, it is assumed that the document image shown in FIG. 3 has been input, and the flow of processing will be described according to the flowchart of FIG.
[0016]
First, the OCR device 101 performs character recognition on the text portion of the document image (step 201). As a result, a character code string consisting of several character candidates is output for each character read. FIG. 4 shows the character candidates output when character recognition is performed on the 11-character portion "…開催日時：６月２０日（…" ("… Date and time: June 20 (…") of the document image in FIG. 3.
[0017]
Candidates up to the fourth are shown for each recognized character; in general, however, the number of candidates output per character is not fixed. The morphological analyzer 102 considers all combinations of these candidate characters and, referring to the word dictionary 103, divides them into the possible morpheme strings (step 202). FIG. 5 shows an example of the division obtained by morphological analysis of the portion shown in FIG. 4.
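The idea of matching every combination of OCR candidates against a word dictionary can be sketched as follows. This is only an illustration: the candidate characters and the dictionary below are made up (they are not the contents of FIG. 4 or FIG. 5), the function name `find_words` and the `max_len` limit are assumptions, and a real morphological analyzer would also score and chain the words it finds.

```python
from typing import List, Set, Tuple


def find_words(
    candidates: List[List[str]], dictionary: Set[str], max_len: int = 4
) -> List[Tuple[int, str]]:
    """Return (start position, word) for every dictionary word that can be
    spelled by picking one candidate character per position."""
    hits: List[Tuple[int, str]] = []
    for start in range(len(candidates)):
        partials = [""]  # strings spellable from `start` up to the current position
        for pos in range(start, min(start + max_len, len(candidates))):
            partials = [p + ch for p in partials for ch in candidates[pos]]
            hits.extend((start, p) for p in partials if p in dictionary)
    return hits


# Hypothetical OCR output: two candidates for each of four characters.
ocr_candidates = [["開", "閉"], ["催", "健"], ["日", "目"], ["時", "特"]]
word_dictionary = {"開催", "日", "日時"}
words = find_words(ocr_candidates, word_dictionary)
```

The number of combinations grows quickly with the word length, which is why the sketch caps it with `max_len`; the source does not say how the analyzer bounds this search.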
(2) Genre estimation processing
Next, the genre estimating process in step 212 will be described.
[0018]
To estimate when a document was created, its approximate genre must first have been determined. For documents created in an office, such as reports and proposals, the format is often fixed and the creation date is easy to find. For general sources such as newspaper articles, travelogues, and narratives, on the other hand, the creation date is often not stated explicitly, so the date expressions in the body text must be searched and the time of creation inferred from them. This section describes the process of estimating the genre or category from the content of the body text when it could not be determined from the document format.
[0019]
For that purpose, it is important that a large number of digitized documents accumulated in the past have already been classified into genres and categories. Therefore, first, a method using a vector space model will be described as one of methods for classifying a document group stored in the document management server into one of predetermined categories according to a user's intention. In this vector space model, words, documents, and categories useful for classification are represented by vectors, and the category to which the documents belong is determined using the direction of the vectors as a guide. The processing of automatic document classification using this vector space model can be divided into a learning phase and a classification phase.
[0020]
In the learning phase, words useful for classification (hereinafter called effective words) are selected from learning documents that have been correctly classified in advance, and each effective word is expressed as a vector. This vector is called an effective word vector, and its components are obtained from appearance frequencies, word co-occurrence probabilities, and the like. A store that holds the effective word vectors with each effective word as the entry key is called an effective word dictionary. The content of a document can be expressed as a vector by computing the weighted average of the effective word vectors of the words it contains, and a folder vector representing the characteristics of each category is calculated by averaging the document vectors belonging to that category.
[0021]
In the classification phase, the document to be classified is expressed as a vector (hereinafter called a document vector) using the effective word dictionary obtained in the learning phase. This document vector is compared with the folder vectors, and the category to which the document belongs is determined according to the comparison result.
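The comparison of a document vector with the folder vectors can be sketched in Python. The source says only that the direction of the vectors serves as a guide, so the use of cosine similarity here is an assumption, as are the function names and the three-dimensional example vectors.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(doc_vec: Sequence[float], folder_vecs: Dict[str, Sequence[float]]) -> str:
    """Pick the category whose folder vector points in the direction closest to doc_vec."""
    return max(folder_vecs, key=lambda cat: cosine(doc_vec, folder_vecs[cat]))


# Hypothetical folder vectors for two categories.
folders = {"politics": [1.0, 0.1, 0.0], "entertainment": [0.0, 0.2, 1.0]}
category = classify([0.9, 0.3, 0.1], folders)
```

Comparing directions rather than distances makes the result independent of document length, which is one common reason for choosing cosine similarity in vector space models.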
[0022]
As shown in FIG. 6, the genre estimating apparatus comprises the following components:
a learning document database 601 that holds learning documents;
a classification target document holding unit 602 that holds documents to be classified;
effective word extraction means 603 that extracts effective words from the learning documents;
an effective word dictionary 604 that holds the extracted effective words together with their degree of belonging to each category;
an evaluation item table 618 in which an importance value is recorded for each evaluation item used to evaluate the importance of effective words;
an effective word count calculation unit 605 that refers to the learning documents and the effective word dictionary to obtain the number of effective words contained in each document;
an effective word count holding unit 606 that holds the obtained counts;
co-occurrence frequency calculation means 607 that obtains the co-occurrence frequency of each pair of effective words from those counts;
effective word vector calculation means 609 that obtains the effective word vector of each effective word by referring to the co-occurrence frequencies;
document vector calculation means 611 that obtains a document vector for each learning document and each classification target document by referring to the effective word vectors;
folder vector calculation means 613 that obtains the folder vector of each category from the document vectors of the learning documents;
category estimating means 615 that compares the document vector of a classification target document with the folder vector of each category and estimates the category to which the document belongs according to the comparison result;
classification result display means 617 that displays a list of the classification target documents sorted by category by the category estimating means; and
learning means 619 that, when an estimation result of the category estimating means is contrary to the user's intention, learns by correcting the importance values in the evaluation item table 618 in response to the user selecting and indicating several categories considered appropriate on the estimation result display.
[0023]
The number of effective words in each document, held in the effective word count holding unit 606, is passed to the co-occurrence frequency calculation means 607, which uses it to obtain the co-occurrence frequency of each pair of effective words. The obtained co-occurrence frequencies are held in the co-occurrence frequency holding unit 608 and then passed to the effective word vector calculation means 609, which uses them to obtain the effective word vector of each effective word. Letting $c_{i,j}$ be the co-occurrence probability of effective words $T_i$ and $T_j$, and $N$ the number of effective words, the effective word vector $\mathbf{T}_i$ of effective word $T_i$ is given by equation (1):
$$\mathbf{T}_i = (c_{i,1}, c_{i,2}, \ldots, c_{i,N}) \quad (1)$$
[0024]
The co-occurrence probability $c_{i,j}$ is defined by the following equation (2).
[0025]
$$c_{i,j} = \frac{\text{number of documents containing both } T_i \text{ and } T_j}{\text{number of documents containing } T_i} \quad (2)$$
The effective word vectors obtained by the effective word vector calculation means 609 are held in the effective word vector holding unit 610 and then passed to the document vector calculation means 611. The document vector calculation means 611 obtains a document vector for each learning document and each classification target document by referring to the effective word vectors, and the resulting document vectors are held in the document vector holding unit 612. The document vectors of the learning documents held in the document vector holding unit 612 are passed to the folder vector calculation means 613, which uses them to obtain the folder vector of each category. The folder vectors obtained for the categories are held in the folder vector holding unit 614.
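Equations (1) and (2) can be worked through on a toy corpus. In the Python sketch below, each document is reduced to the set of effective words it contains; the function name `cooccurrence_vectors` and the three sample documents are assumptions made for illustration only.

```python
from typing import Dict, List, Set


def cooccurrence_vectors(docs: List[Set[str]], words: List[str]) -> Dict[str, List[float]]:
    """Build the effective word vector T_i = (c_i1, ..., c_iN) for every word,
    where c_ij = (#docs containing both T_i and T_j) / (#docs containing T_i)."""
    vectors: Dict[str, List[float]] = {}
    for ti in words:
        docs_with_ti = [d for d in docs if ti in d]
        vectors[ti] = [
            sum(1 for d in docs_with_ti if tj in d) / len(docs_with_ti)
            if docs_with_ti
            else 0.0  # word never occurs: define every c_ij as 0
            for tj in words
        ]
    return vectors


# Hypothetical corpus of three documents.
corpus = [{"fire", "injury"}, {"fire", "laboratory"}, {"laboratory"}]
vocab = ["fire", "injury", "laboratory"]
vecs = cooccurrence_vectors(corpus, vocab)
```

Note that $c_{i,j}$ is not symmetric: here every document containing "injury" also contains "fire", but only half of the documents containing "fire" contain "injury".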
[0026]
The folder vector of each category held in the folder vector holding unit 614 is passed to the category estimating means 615 together with the document vector of the classification target document held in the document vector holding unit 612. The category estimating means 615 compares the document vector of the classification target document with the folder vector of each category and determines the category to which the document belongs according to the comparison result. The category so determined is held in the classification result holding unit 616.
[0027]
Next, the processing procedure of the learning phase in the genre estimation device will be described with reference to FIG.
[0028]
First, in step 701, the learning documents are morphologically analyzed and, from the words they contain, words useful for classification are selected as effective words. Then, in step 702, the number of selected effective words contained in each document is obtained. Effective words are chosen from among common nouns, proper nouns, sa-hen (verbal) nouns, and unknown words, selecting those whose occurrences are concentrated in particular categories.
[0029]
Next, each effective word is weighted. The weight is evaluated from two viewpoints: (1) how effective the word itself is for classification, and (2) how important the word is within each document.
[0030]
The weight w1 of (1) represents the degree of belonging to each category, and is based on the idea that an effective word having a higher degree of characterizing a specific category has a higher weight, and is calculated in the following manner. First, the ratio Pik of the document containing the valid word Wi in the learning documents belonging to the category Ck is obtained.
[0031]
Pik = (number of documents belonging to category Ck and including valid word Wi) / (number of documents belonging to category Ck)
where Pik is normalized so that ΣPik (sum over all categories k) = 1.
[0032]
The weight is then defined as
w1 = 1 − H(Wi), where H(Wi) is the entropy of Pik.
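The definition of w1 above can be sketched as follows. This is a minimal illustration, not the method as implemented in the embodiment: the function name `category_weight` and its arguments are hypothetical, and the text does not state the base of the logarithm, so the sketch assumes base K (the number of categories) in order to keep H(Wi), and therefore w1, within the range 0 to 1.

```python
import math

def category_weight(docs_with_word, docs_in_category):
    """Return w1 = 1 - H(Wi) for one effective word Wi.

    docs_with_word[k]   : learning documents in category k that contain Wi
    docs_in_category[k] : learning documents in category k
    """
    # Raw ratio per category, following the definition of Pik.
    ratios = [c / n if n else 0.0
              for c, n in zip(docs_with_word, docs_in_category)]
    total = sum(ratios)
    if total == 0 or len(ratios) < 2:
        # Assumption: a word never observed (or a single category) gets weight 0.
        return 0.0
    # Normalize so that the Pik sum to 1 over all categories.
    p = [r / total for r in ratios]
    # Assumption: logarithm base K, so that the entropy lies in [0, 1].
    k = len(p)
    entropy = -sum(x * math.log(x, k) for x in p if x > 0)
    return 1.0 - entropy
```

Under these assumptions, a word that occurs in only one category receives the maximum weight, while a word spread evenly over all categories receives a weight close to zero.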
[0033]
The weight w2 of (2) evaluates how the effective word is used in the target document and how closely it relates to the contents of the document. For example, in the passage "... In the United States, a nation of television mass media, politics is like a circus, and good students do not want to become politicians ...", the word "circus" is used only as a metaphor for politics in the United States and has little direct relevance to the content of the text.
[0034]
Therefore, if the word "circus" itself highly characterizes the specific category "entertainment" and is given a high weight for that reason alone, the document vector will be pulled in the wrong direction. It is thus necessary to evaluate the weight of (2) in addition to that of (1) (the weight of (2) will be referred to as the in-document importance).
[0035]
Evaluation items are created in advance by focusing on (1) the appearance position of the valid word and (2) the linguistic role of the valid word, such as its case role and modification type, as elements relevant to importance in the document. The value of the weight given when an effective word satisfies the condition of each evaluation item is obtained in advance by learning.
[0036]
First, the weight for item (1), the appearance position, will be described.
[0037]
The appearance position of an effective word in the document is considered to correlate strongly with its importance. For example, it is well known that in newspaper articles the important words that express the gist appear in the first paragraph. Therefore, the entire document is divided equally into n blocks in units of sentences, and a weight is obtained for each block.
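The block division described above can be sketched as follows. The function names and the example weights are hypothetical, and the rounding rule used when the number of sentences is not a multiple of n is an assumption, since the text only says that the document is divided equally.

```python
def block_of_sentence(sentence_index, num_sentences, n_blocks):
    """Map a 0-based sentence index to a 0-based block index (0 .. n_blocks - 1)."""
    if not 0 <= sentence_index < num_sentences:
        raise ValueError("sentence_index out of range")
    # Assumption: integer scaling is used to split the sentences as evenly as possible.
    return sentence_index * n_blocks // num_sentences

def position_weight(sentence_index, num_sentences, block_weights):
    """Look up the learned weight of the block that contains the sentence."""
    block = block_of_sentence(sentence_index, num_sentences, len(block_weights))
    return block_weights[block]
```

For instance, with ten sentences and five blocks, the first two sentences fall into block 0 and the last two into block 4, so a word in the opening sentences would take the weight learned for the first block.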
[0038]
Next, the weight for item (2), the linguistic role, will be described.
[0039]
As a correlation between the linguistic role and the importance of the effective word, a rule such as "words used in the form of a subject, as in "... ga" or "... ha", are important" can be considered, for example. Therefore, the following evaluation items are prepared as linguistic roles, and the respective weights are obtained by learning.
"Ga" case
"Wo" case
"Ni" case
"He" case
"Ha" case
"Mo" case
Other nominative nomenclature
Nominal modifier
Modification of the noun
End of sentence
Sentence with reading point
End of sentence
In the present embodiment, as shown in FIG. 8, the evaluation items are: whether the valid word is included in the first sentence of the paragraph; the case role of the valid word, such as the "ga" case or the "wo" case; and whether the valid word is a noun ending the sentence. Other evaluation items relating to the position of the effective word in the document or the dependency role of the word may also be employed. FIG. 8 shows the initial state of the evaluation item table, in which the values of the in-document importance are all "1.0".
[0040]
Next, in step 703, the co-occurrence frequency of each set of valid words is obtained from the number of valid words, and in step 704, an effective word vector is calculated from the co-occurrence frequency. Then, an effective word is extracted from the learning document by referring to the effective word vector, and in a succeeding step 705, a document vector of the learning document is obtained by averaging the effective word vectors of the extracted effective words.
[0041]
Next, the process proceeds to step 706, where the average of the document vectors of all the learning documents belonging to each category is calculated, and the average is set as the representative vector of the category, and the present process ends.
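Steps 705 and 706 both reduce to taking an element-wise average, which can be sketched as follows. The effective word vectors are taken as given here, because the text does not specify how they are derived from the co-occurrence frequencies; all function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    vectors = list(vectors)
    if not vectors:
        raise ValueError("cannot average an empty set of vectors")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def document_vector(words, word_vectors):
    """Step 705: average the vectors of the effective words found in a document."""
    found = [word_vectors[w] for w in words if w in word_vectors]
    return mean_vector(found)

def folder_vectors(labeled_doc_vectors):
    """Step 706: average the document vectors of each category.

    labeled_doc_vectors is an iterable of (category, document_vector) pairs.
    """
    by_category = {}
    for category, vec in labeled_doc_vectors:
        by_category.setdefault(category, []).append(vec)
    return {c: mean_vector(vs) for c, vs in by_category.items()}
```

Words that are not in the effective word vector table are simply skipped, which mirrors the extraction of effective words by reference to the effective word vector.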
[0042]
Next, a learning algorithm for the value of the in-document importance will be described with reference to FIG. 9.
[0043]
First, the values of the in-document importance for all the evaluation items are initialized to one.
[0044]
Next, a counter variable n for counting the number of learning documents used for learning is initialized to "0" (step 901). Next, the learning document to which the correct classification category (= C *) is added is read, and the counter variable n is incremented by "1" (step 902). As an example of the read document, a document as shown in FIG. 10 which is classified into the category “incident” is assumed. Then, an effective word described in the effective word dictionary 604 is extracted from the read learning document, and an attribute table of the extracted effective words is created according to the evaluation items of the evaluation item table 618 (step 903). Here, FIG. 11 shows an example of valid words extracted from the document of FIG. 10 according to the valid word dictionary 604 and an attribute table relating to the valid words.
[0045]
Next, the in-document importance of each effective word is calculated based on the created attribute table and the importance values described in the evaluation item table 618 (step 904). Then, the degree of belonging of the document to each category is calculated using the calculated in-document importance, the degree of belonging of each valid word to each category held in the valid word dictionary 604, and the like, and the category with the highest degree of belonging is set as the classification result (= C) (step 905).
[0046]
Then, the classification result C is compared with the correct classification category C * acquired in step 902 to check whether or not the classification result C is correct (step 906). If it is not correct, the value of the importance is corrected (step 907).
[0047]
Here, a method of updating the importance value will be described with reference to FIG. 8, FIG. 10 and FIG. 11.
[0048]
Now, if the document in FIG. 10 is misclassified into the category "science", attention is first paid to the effective words "Faculty of Engineering", "Laboratory", and "Chemical Experiment", which have a high degree of belonging to the misclassified category "science". Since these valid words are considered to be the cause of the misclassification, the importance values in the evaluation item table of FIG. 8 are corrected so that their in-document importance is reduced.
[0049]
That is, when attention is paid to "Faculty of Engineering", the attribute value of "whether or not in the first sentence of the paragraph" (evaluation item 1 in FIG. 8) in the attribute table of FIG. 11 relating to "Faculty of Engineering" is "TRUE", and the "case role" (evaluation item 2 in FIG. 8) is the "no" union. Therefore, in the evaluation item table 618, the importance value of "TRUE" under "whether or not in the first sentence of the paragraph" and the importance value of the "no" union under "case role" are each decreased by a small amount. When focusing on "Laboratory" and "Chemical Experiment", the same processing as for "Faculty of Engineering" is performed.
[0050]
Next, attention is paid to the valid words "fire" and "injury", which have a high degree of belonging to the correct classification category "incident". To produce the correct classification result, the in-document importance of these effective words must be rated more highly.
[0051]
Therefore, when focusing on "fire", the attribute value of "whether or not in the first sentence of the paragraph" (evaluation item 1) in the attribute table relating to "fire" is "TRUE", and the "case role" (evaluation item 2) is the "ga" case. Accordingly, in the evaluation item table 618, the importance value of "TRUE" under "whether or not in the first sentence of the paragraph" and the importance value of the "ga" case under "case role" are each increased by a small amount. When focusing on "injury", the same processing as for "fire" is performed.
[0052]
After the importance value is updated in this way, the process proceeds to step 908, where the correct answer rate r of the classification for the past N (N ≦ n) learning documents is calculated. If it is determined in step 906 that the classification result is correct, the process of updating the value of importance in step 907 is skipped, and the process proceeds to step 908.
[0053]
Next, it is checked whether the correct answer rate r has exceeded a predetermined value Th or the number of learning documents used for learning has exceeded a predetermined value M (step 909). If none of the above is satisfied, the process returns to step 902, and the same processing is performed based on the next learning document.
[0054]
By performing such processing, the evaluation item table 618 in which the importance of each evaluation item is appropriately corrected is realized.
[0055]
In this way, specific importance values for evaluation items that are considered useful for evaluating importance in a document, such as the appearance position of a word, its case role, and its modification type, are obtained by learning using learning documents that are classified and stored by category. That is, an appropriate initial value is first given to the importance of each evaluation item, and the learning documents are classified. If the classification result differs from the correct category, that is, if a misclassification occurs, the effective words that had a great influence on the misclassification are extracted, and the importance values assigned to the evaluation items that apply to those words are corrected by a small amount. By performing such processing on a large number of learning documents, values of the in-document importance that maximize the classification accuracy are obtained automatically.
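The learning procedure of steps 901 to 909 can be sketched roughly as follows. Several details are assumptions because the text leaves them open: the in-document importance is taken as the product of the table values of a word's attributes, the category score as the sum of importance times degree of belonging, an "influential" word as one whose strongest category is the misclassified or the correct one, and the accuracy check is only applied once a full window of N documents has been seen. All names (`learn`, `classify`, `delta`, `window`) are hypothetical.

```python
from collections import deque

def in_doc_importance(attrs, table):
    # Assumption: combine the evaluation items by multiplication.
    importance = 1.0
    for item, value in attrs.items():
        importance *= table[item][value]
    return importance

def classify(doc_words, table, belonging):
    """doc_words: list of (word, attrs); belonging[word][category] -> degree."""
    scores = {}
    for word, attrs in doc_words:
        importance = in_doc_importance(attrs, table)           # step 904
        for category, degree in belonging[word].items():
            scores[category] = scores.get(category, 0.0) + importance * degree
    return max(scores, key=scores.get)                         # step 905

def learn(docs, table, belonging, delta=0.05, th=0.95, window=100, max_docs=10000):
    recent = deque(maxlen=window)                              # last N results
    for n, (doc_words, correct) in enumerate(docs, start=1):   # step 902
        predicted = classify(doc_words, table, belonging)
        is_correct = predicted == correct                      # step 906
        if not is_correct:                                     # step 907
            for word, attrs in doc_words:
                strongest = max(belonging[word], key=belonging[word].get)
                if strongest == predicted:
                    step = -delta    # word pulled toward the wrong category
                elif strongest == correct:
                    step = delta     # word supports the correct category
                else:
                    continue
                for item, value in attrs.items():
                    table[item][value] += step
        recent.append(is_correct)
        rate = sum(recent) / len(recent)                       # step 908
        if (len(recent) == window and rate > th) or n > max_docs:  # step 909
            break
    return table
```

The update is deliberately small and additive, in line with the description that the importance values are corrected "by a small amount" after each misclassification.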
[0056]
Next, referring to FIG. 12, a description will be given of a processing procedure in a classification phase for automatically classifying a classification target document whose category is actually unknown.
[0057]
In the classification phase, first, in step 1201, effective words are extracted from the document to be classified by referring to the effective word vector obtained in step 704. In step 1202, the effective word vectors of the extracted effective words (the effective word vectors obtained in step 704) are averaged to obtain the document vector of the document to be classified.
[0058]
Next, the process proceeds to step 1203, where the document vector of the document to be classified is compared with the folder vector obtained in the learning phase, the category to which the document to be classified belongs is determined according to the comparison result, and the process ends.
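The comparison in step 1203 can be sketched as a nearest-folder search. The text does not name the comparison measure, so cosine similarity is used here purely as an assumption; the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def estimate_category(doc_vec, folder_vecs):
    """Return the category whose folder vector is most similar to doc_vec.

    folder_vecs maps each category name to its folder vector.
    """
    return max(folder_vecs, key=lambda c: cosine(doc_vec, folder_vecs[c]))
```

Any other similarity or distance measure could be substituted in `estimate_category` without changing the surrounding procedure.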
[0059]
If the digitized documents stored on the document management server are classified into categories by the method described above, the category can be similarly estimated for newly read paper documents.
(3) Date search processing
Returning to the flowchart of FIG. 2 again, the date search process will be described.
[0060]
First, the morpheme sequence is searched for a sequence that is considered to represent a date. At this time, if there is a line feed code or a plurality of spaces or tabs before and after the date portion, it is picked up as a portion in which only date data exists (step 203 in FIG. 2). If such date data has been picked up, it is set as the creation date in the file attributes (step 204). In the case of the original image shown in FIG. 3, the line "(space or tab) May 30, 2001 (line feed)" exists in the third line counted from the first line, so the creation date is set to "May 30, 2001, ** hours ** minutes ** seconds". Here, since no portion corresponding to the hour, minute, and second could be extracted by morphological analysis, a default time is assigned by the system.
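Steps 203 and 204 can be sketched as follows. This illustration works on plain text lines rather than on a morpheme sequence, matches only the English rendering "Month day, year" used in the example above, treats only whole lines as date-only portions, and assumes midnight as the system default time; the pattern, the function name, and the default are all hypothetical.

```python
import re
from datetime import datetime

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# A line holding nothing but a date, optionally surrounded by spaces or tabs.
DATE_ONLY_LINE = re.compile(r"^\s*([A-Z][a-z]+) (\d{1,2}), (\d{4})\s*$")

def find_date_only_line(text, default_time=(0, 0, 0)):
    """Return the first standalone date as a datetime, or None if there is none."""
    for line in text.splitlines():
        match = DATE_ONLY_LINE.match(line)
        if not match or match.group(1) not in MONTHS:
            continue
        try:
            return datetime(int(match.group(3)), MONTHS[match.group(1)],
                            int(match.group(2)), *default_time)
        except ValueError:
            continue  # e.g. "February 30, 2001" is not a real date
    return None
```

A date embedded in running text is deliberately not matched here; such dates are left to the later search of step 205.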
[0061]
In step 203, if there is no date data picked up, a search is made from the divided morpheme strings for date data other than a single date row (step 205).
[0062]
Next, of the date data found, those representing the present or a future date are discarded, and only past date data is kept (steps 206 to 208). In the example of FIG. 3, if the third line regarded as a date line did not exist, "June 20 (Sun) 9:00 to 15:00" and "June 10 (Thu)" would be extracted.
[0063]
After all the morpheme strings have been searched, the date on which the manuscript was actually created, or a date as close to it as possible, is estimated heuristically from the extracted date data (step 209). At this time, the genre of the document estimated by the genre estimation processing is taken into account. For example, in a proposal, quotation, report, or the like created in an office, if multiple past dates appear, the date further in the past is considered closer to the creation date.
[0064]
On the other hand, news stories and articles found in clippings of newspapers and magazines more often deal with past events, or with events that will occur in the future, than with the moment at which they were written. Which of the two applies is determined from the contents, the extracted date data, and the like. In the former case the latest of the extracted dates, and in the latter case the oldest, is concluded to be closest to the date on which the document was created, and that date is assigned as a file attribute.
[0065]
Also, a creation date estimated from dates extracted in this way differs in nature from the creation date given to data that was digitized from the beginning, for example by keyboard input. A flag is therefore set as an identifier indicating that the information is uncertain (step 210).
[0066]
If no date data exists, the file attribute assigned by the computer system is assigned as it is (step 211).
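Steps 206 to 211 can be sketched as a single selection function. The genre labels ("office", "news_past", "news_future") are hypothetical, and the rule per genre follows one reading of the description above: the oldest past date for office documents and for articles about then-future events, the latest past date for articles about past events.

```python
from datetime import date

def estimate_creation_date(dates, today, genre, system_date):
    """Return (creation_date, is_estimated).

    dates       : dates extracted from the text
    today       : date on which the document is being scanned
    genre       : hypothetical label from the genre estimation
    system_date : date assigned by the computer system
    """
    # Steps 206 to 208: discard present and future dates.
    past = [d for d in dates if d < today]
    if not past:
        # Step 211: keep the attribute assigned by the system, no flag.
        return system_date, False
    # Step 209: heuristic choice depending on the genre (assumed mapping).
    if genre == "news_past":
        chosen = max(past)
    else:  # "office" or "news_future"
        chosen = min(past)
    # Step 210: mark the value as an estimate.
    return chosen, True
```

Returning the flag together with the date keeps the estimated value distinguishable from a creation date recorded when a document was created electronically.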
[0067]
<Other embodiments>
In the above-described embodiment, the genre of the original document is estimated from the content of the text written in the document, using the group of electronic documents that have already been stored and automatically classified. Instead of classifying documents in this way, layout information and format patterns may be registered in advance, the format pattern of the document may be recognized by an image recognition technique, and the genre may be determined according to the format pattern.
[0068]
The timing of performing the genre estimation processing and the date data extraction processing on the document need not be as described in the embodiment. The timing may be adjusted, in view of the processing amount of the genre estimation processing and the like, so as to reduce the load on the system.
[0069]
[Effect of the Invention]
As described above, according to the present invention, in a document management server in which digitized documents that were created as such and OCR documents obtained by reading paper documents with a scanner and performing character recognition coexist and are managed together, when a paper document is scanned and digitized, the character recognition result is searched for a character string that appears to contain the date on which the document was actually created, and that date is assigned as a file attribute. This makes it possible to provide a paper / electronic file integrated document management system in which the operator can search not by the date and time of scanning but by the date of actual creation, or by a date close to it that reflects the contents of the text.
[Brief description of the drawings]
FIG. 1 is a diagram showing an overall configuration of a system according to the present invention.
FIG. 2 is a flowchart showing a flow of processing in the present invention.
FIG. 3 is a diagram illustrating an example of a document image.
FIG. 4 is a diagram illustrating an example of a recognized character candidate.
FIG. 5 is a diagram showing an example of word candidates extracted by searching a dictionary.
FIG. 6 is a diagram showing a configuration of a genre estimation device.
FIG. 7 is a flowchart illustrating a processing procedure of a learning phase in the genre estimating apparatus.
FIG. 8 is a diagram showing an example of an evaluation item table.
FIG. 9 is a flowchart illustrating a learning algorithm of a degree of importance in a document.
FIG. 10 is a diagram illustrating an example of a classification target document.
FIG. 11 is a diagram illustrating an example of an attribute table.
FIG. 12 is a flowchart showing a processing procedure of a classification phase in the genre estimating apparatus.
[Explanation of symbols]
101 OCR
102 Morphological analyzer
103 Word Dictionary
104 Genre Estimation Device
105 Date data search device
106 Attribute Assignment Device
107 Document storage device

Claims

A document management device that stores and manages document images and digitized documents read by a scanner or the like, comprising: word output means for performing character recognition on a character string in the document image using a recognition dictionary, performing morphological analysis using a word dictionary, and outputting words as recognition candidates; extracting means for extracting a portion representing a date or time from the word string output by the word output means; estimating means for estimating the time when the document image was created from the data on the date or time extracted by the extracting means; and providing means for providing the time estimated by the estimating means as an attribute of the document image.

The document management device according to claim 1, further comprising: valid word extracting means for extracting characteristic words as valid words for automatically classifying documents into predetermined categories; category estimating means for estimating the category to which the document belongs from the degree of appearance of the valid words extracted by the valid word extracting means; and category adjusting means for changing the time estimated by the estimating means according to the category estimated by the category estimating means.

The document management device according to claim 1, further comprising: format discriminating means for discriminating the format of the document image from layout information and the like; and format adjusting means for changing the time estimated by the estimating means according to the format discriminated by the format discriminating means.

The document management device according to claim 1, wherein, when the extracting means fails in the extraction, the date and time at which the document image was read is provided as the attribute by the providing means.

The document management device according to claim 1, wherein a case where the time extracted and estimated from the text is provided as the attribute of the document image and a case where the date and time when the document image was read is provided as the attribute of the document image can be distinguished and displayed.