JP4030624B2 - Document processing apparatus, storage medium storing document processing program, and document processing method

Info

Publication number: JP4030624B2
Application number: JP21930197A
Authority: JP
Inventors: 直之野村; 信二藤澤
Original assignee: 株式会社ジャストシステム
Priority date: 1997-07-29
Filing date: 1997-07-29
Publication date: 2008-01-09
Anticipated expiration: 2017-07-29
Also published as: JPH1153396A

Description

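[0001]
[Technical Field of the Invention]
The present invention relates to a document processing apparatus, a document processing method, and a storage medium storing a document processing program, and more particularly to a technique for creating a more readable summary by referring to and making use of stored pairs of past original texts and their summaries when creating a new summary.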
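[0002]
[Prior Art]
Conventionally, computers have been used to perform various kinds of processing on documents such as books, papers, and reports, including automatic creation of summaries (including abstracts) and association with other documents.
Automatic summarization of documents has been proposed, for example, in "Extraction and Processing of Semantic Information from Full-Text Information" (Proceedings of the 38th National Convention of the Information Processing Society of Japan, p. 222; 1989). In this method, important words in a document are first extracted based on information such as character type and verbs, and the most important words are then determined from the frequency of appearance of the important words. Next, important sentences are determined according to whether the important words and the most important words appear in them, which makes it possible to create a summary automatically. A method described in Japanese Unexamined Patent Publication No. H3-191475, which creates a more accurate summary by reflecting the nature of the paragraphs of a text, has also been proposed.
Meanwhile, association with other data is performed by means such as hyperlinks on the Internet and associations in knowledge processing (expert systems and the like) using frame systems.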
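[0003]
[Problems to Be Solved by the Invention]
Such conventional document processing apparatuses perform summarization from scratch each time and do not refer to summaries made in the past. When creating a document such as a greeting card or a legal warning letter, however, it is common to refer to similar documents created in the past as samples. Likewise, when creating a summary, referring to summaries created in the past is considered worthwhile. In particular, referring to summaries with a proven record of use by many people can be expected to be beneficial.
[0004]
Accordingly, the present invention has been made to solve this conventional problem. Its first object is to provide a document processing apparatus and a document processing method that store sets of a past summary, its original text, and the condition setting parameters used, and that can create a new summary by referring to the condition setting parameters.
A second object of the present invention is to provide a storage medium storing a computer-readable document processing program that stores sets of a past summary, its original text, and the condition setting parameters used, and that can create a new summary by referring to the condition setting parameters.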
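[0005]
[Means for Solving the Problems]
In the invention according to claim 1, the first object is achieved by providing a document processing apparatus with: a database that stores documents together with a summary of each document and the condition setting parameters used at the time of that summarization; document acquisition means for acquiring a document in a predetermined format; similar document search means for searching the database for a document similar to the document acquired by the document acquisition means; condition setting parameter acquisition means for acquiring from the database the condition setting parameters corresponding to the document found by the similar document search means; and summarizing means for creating a summary of the document acquired by the document acquisition means, based on the condition setting parameters acquired by the condition setting parameter acquisition means.
[0007]
In the invention according to claim 2, the document processing apparatus according to claim 1 further comprises document vector determination means for determining a document vector that characterizes the document acquired by the document acquisition means, and the similar document search means performs its search by judging the similarity between documents from the document vectors of the respective documents determined by the document vector determination means.
[0008]
In the invention according to claim 3, in the document processing apparatus according to claim 1 or claim 2, the conditions for judging similarity when the similar document search means searches for documents can be set by the user.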
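[0010]
In the invention according to claim 4, the second object is achieved by storing on a storage medium a computer-readable document processing program that causes a document processing apparatus, which has a database storing documents together with a summary of each document and the condition setting parameters used at the time of that summarization, to realize: a document acquisition function for acquiring a document in a predetermined format; a similar document search function for searching the database for a document similar to the document acquired by the document acquisition function; a condition setting parameter acquisition function for acquiring from the database the condition setting parameters corresponding to the document found by the similar document search function; and a summarizing function for creating a summary of the document acquired by the document acquisition function, based on the condition setting parameters acquired by the condition setting parameter acquisition function.
[0012]
In the invention according to claim 5, the second object is achieved by storing on the storage medium according to claim 4 a computer-readable document processing program that further causes a computer to realize a document vector determination function for determining a document vector that characterizes the document acquired by the document acquisition function, the similar document search function performing its search by judging the similarity between documents from the document vectors of the respective documents determined by the document vector determination function.
[0013]
In the invention according to claim 6, the second object is achieved by storing on the storage medium according to claim 4 or claim 5 a computer-readable document processing program that causes a computer to realize a function by which the conditions for judging similarity when the similar document search function searches for documents can be set by the user.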
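[0015]
In the invention according to claim 7, the first object is achieved by providing a document processing method used when document processing is performed in a document processing apparatus comprising a database that stores documents together with a summary of each document and the condition setting parameters used at the time of that summarization, document acquisition means, similar document search means, condition setting parameter acquisition means, and summarizing means, the method comprising: a first step in which the document acquisition means acquires a document in a predetermined format; a second step in which the similar document search means searches the database for a document similar to the document acquired in the first step; a third step in which the condition setting parameter acquisition means acquires from the database the condition setting parameters corresponding to the document found in the second step; and a fourth step in which the summarizing means creates a summary of the document acquired in the first step, based on the condition setting parameters acquired in the third step.
[0016]
In the invention according to claim 8, the first object is achieved by providing the document processing method of claim 7 as a method used in a document processing apparatus that further comprises document vector determination means, the method further comprising a fifth step in which the document vector determination means determines a document vector that characterizes the document acquired in the first step, wherein the second step performs the search by judging the similarity between documents from the document vectors of the respective documents determined in the fifth step.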
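[0017]
[Embodiments of the Invention]
Preferred embodiments of the document processing apparatus, the document processing method, and the storage medium storing the document processing program according to the present invention are described in detail below with reference to FIGS. 1 to 6.
(1) Outline of the embodiment
In this embodiment, the document vector of the target document is obtained and compared with the document vector of each document to be searched. The similarity of content is judged by whether the cosine value between the two document vectors is high or low. The summary and condition setting parameters of a document judged to be highly similar (that is, one with a high cosine value) are acquired, and a new summary is generated with reference to them.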
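[0018]
(2) Details of the embodiment
FIG. 1 is a block diagram showing the configuration of the document processing apparatus.
The document processing apparatus of this embodiment is configured as a computer system including a personal computer, a word processor, or the like, and can also be configured as a server on a LAN (local area network) or as a host for computer (PC) communications including the Internet.
As shown in FIG. 1, the document processing apparatus includes a control unit 11 that controls the entire apparatus. Connected to the control unit 11 via a bus line 21 such as a data bus are a keyboard 12 and a mouse 13 as input devices, a display device 14, a printing device 15, a storage device 16, a storage medium drive device 17, a communication control device 18, an input/output I/F 19, and a character recognition device 20.
The control unit 11 includes a CPU 111, a ROM 112, and a RAM 113.
The ROM 112 is a read-only memory in which the various programs and data that the CPU 111 needs for control and computation are stored in advance.
[0019]
The RAM 113 is a random access memory used by the CPU 111 as working memory. In the RAM 113, the following areas are reserved for the summarization processing of this embodiment: a summary target document storage area 1131, a summary parameter storage area 1132, a retrieved original text and summary storage area 1133, a document vector storage area 1134, a summary storage area 1135, and various other areas.
The document vector storage area 1134 stores the document vector for the summary target document and the document vector for each sub-document described later.
The summary storage area 1135 stores the sub-summary for each sub-document group containing each topic discovered by this embodiment, and the summary for the summary target document as a whole.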
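[0020]
The keyboard 12 is provided with various keys, such as kana keys for entering kana characters, a numeric keypad, function keys for executing various functions, and cursor keys.
The mouse 13 is a pointing device, an input device with which the user designates a function by left-clicking the corresponding key, icon, or the like displayed on the display device 14.
A CRT, a liquid crystal display, or the like is used as the display device 14. The display device shows the contents of the summary target document, the contents of the summary automatically generated by this embodiment, and so on.
The printing device 15 prints the text displayed on the display device 14, documents stored in the document database 164 of the storage device 16, and the like. Various printing devices can be used, such as laser printers, dot printers, inkjet printers, page printers, thermal printers, and thermal transfer printers.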
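[0021]
The storage device 16 consists of a readable and writable storage medium and a drive device for reading and writing programs, data, and other information on that medium. A hard disk is mainly used as the storage medium of the storage device 16, but any readable and writable medium among the various storage media used with the storage medium drive device 17 described later may be used instead.
The storage device 16 has a kana-kanji conversion dictionary 161, a program storage unit 162, a data storage unit 163, a document database 164, a summary database 165, a document vector database 166, and other storage units not shown (for example, a storage unit for backing up the programs and data stored in the storage device 16).
The program storage unit 162 stores the programs of this embodiment, such as the automatic summarization program, the document vector creation program, and the summary creation program, as well as other programs such as a kana-kanji conversion program that uses the kana-kanji conversion dictionary 161 to convert an entered kana character string into text containing kanji.
The data storage unit 163 stores various data such as the default values of the summary parameters. The stored defaults are, for example: ratio of the summary to the whole document = "25%"; emphasis on quantities such as dates and times, price information, and physical quantities (size, weight, temperature, etc.) = "no"; emphasis on URLs (Uniform Resource Locators) = "no"; and selection of sentence-ending style (desu/masu/de aru) = "no".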
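[0022]
The document database 164 stores documents created with the kana-kanji conversion program and documents created on other devices and read in through the storage medium drive device 17 or the communication control device 18. The format of the documents stored in the document database 164 is not particularly limited; documents in various formats, such as text format, HTML (Hyper Text Markup Language) format, and JIS format, can be stored. The document database 164 stores document data in these formats.
The summary database 165 and the document vector database 166 store the summaries and document vectors corresponding to the documents stored in the document database 164.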
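[0023]
FIG. 2 conceptually shows the contents of the document vector database 166.
As shown in FIG. 2, the element value f(x) obtained for each keyword x automatically extracted from a document is stored as an element of the document vector. A document vector is stored for each document (A, B, C, …) and is associated with the corresponding document stored in the document database 164.
The dimension of each document vector is the number of keywords x (important words and phrases) adopted. When the similarity between two documents is calculated from their document vectors, however, the dimension of both vectors is the number of keywords in the union of the two documents' keyword sets. In that case, for a keyword contained in only one of the document vectors, the corresponding element of the other document vector is defined as 0.
[0024]
For example, in FIG. 2, the keywords of document B are "important, important word, importance, …" and the keywords of document C are "important, …, politics, …", and the document vectors of the two documents are as follows.
Document vector of document B = (1, 18, 19, …)
Document vector of document C = (18, …, 21, …)
When the similarity between document B and document C is calculated, on the other hand, the keywords of both documents are taken to be "important, important word, importance, …, politics, …", and the document vectors of the two documents are defined as follows.
Document vector of document B = (1, 18, 19, …, 0, …)
Document vector of document C = (18, 0, 0, …, 21, …)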
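The expansion of two document vectors onto the union of their keywords can be sketched as follows. This is only an illustration under stated assumptions: the function name `align_vectors`, the dictionary representation, and the English keyword labels are hypothetical, and only the elements that are visible in the example above are used (the elided "…" elements are left out).

```python
def align_vectors(doc_b, doc_c):
    """Expand two keyword -> element-value mappings onto the union of their keywords.

    A keyword that appears in only one document gets the element value 0
    in the other document's vector.
    """
    # Keep the keyword order of the first document, then append new ones.
    keywords = list(doc_b)
    keywords += [k for k in doc_c if k not in doc_b]
    vec_b = [doc_b.get(k, 0) for k in keywords]
    vec_c = [doc_c.get(k, 0) for k in keywords]
    return keywords, vec_b, vec_c


# Visible elements of documents B and C from the example (assumed labels).
doc_b = {"important": 1, "important word": 18, "importance": 19}
doc_c = {"important": 18, "politics": 21}

keywords, vec_b, vec_c = align_vectors(doc_b, doc_c)
print(keywords)  # ['important', 'important word', 'importance', 'politics']
print(vec_b)     # [1, 18, 19, 0]
print(vec_c)     # [18, 0, 0, 21]
```

Both vectors end up with the same dimension, which is what the similarity calculation requires.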
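[0025]
The storage medium drive device 17 is a drive device with which the CPU 111 reads computer programs and data, including documents, from an external storage medium. The computer programs stored on the storage medium include the programs for the various processes executed by the document processing apparatus of this embodiment, together with the dictionaries and data they use.
Here, a storage medium means any medium on which computer programs, data, and the like are stored. Specifically, it includes magnetic storage media such as floppy disks, hard disks, and magnetic tape; semiconductor storage media such as memory chips and IC cards; optically read storage media such as CD-ROM, MO, and PD (phase-change rewritable optical disc); storage media using paper, such as paper cards and paper tape (and media with functions equivalent to paper); and other storage media on which computer programs and the like are stored by various methods. The storage media mainly used in the document processing apparatus of this embodiment are CD-ROMs and floppy disks.
In addition to reading computer programs from these storage media, the storage medium drive device 17 can write data stored in the RAM 113 or the storage device 16 to a writable storage medium such as a floppy disk.
[0026]
In the document processing apparatus of this embodiment, the CPU 111 of the control unit 11 reads a computer program from an external storage medium set in the storage medium drive device 17 and stores it in the appropriate unit of the storage device 16. When a process such as the automatic summarization processing of this embodiment is to be executed, the corresponding program is read from the storage device 16 into the RAM 113 and executed.
Alternatively, the program can be read directly into the RAM 113 from the external storage medium by the storage medium drive device 17, rather than from the storage device 16, and executed. Depending on the document processing apparatus, the automatic summarization program and the like of this embodiment may also be stored in the ROM 112 in advance and executed by the CPU 111 from there.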
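[0027]
The communication control device 18 can send and receive various data, such as documents in text, HTML, and other formats and bitmap data, to and from other personal computers, word processors, and the like.
The input/output I/F 19 is an interface for connecting various devices such as speakers that output voice, music, and the like.
The character recognition device 20 recognizes characters written on paper or the like and outputs them in various formats such as text or HTML; it consists of an image scanner, a character recognition program, and so on.
[0028]
In this embodiment, various documents can be acquired as the target document (document acquisition means): a document created by input operations on the keyboard 12 (stored in a predetermined storage area of the RAM 113), a document created externally, stored on a storage medium, and read in through the storage medium drive device 17, a document already stored in the document database, a document downloaded through the communication control device 18, and a document recognized by the character recognition device 20.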
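[0029]
The operation of the automatic summarization processing, which creates a summary from a plurality of documents using the document processing apparatus of this embodiment configured as described above, is explained with reference to FIGS. 3 to 6.
FIG. 3 shows the main operation of the automatic summarization processing. The document vectors shown in FIG. 4 are drawn in two dimensions to make them conceptually easier to understand, but they are actually N-dimensional vectors.
The CPU 111 acquires the summary target document A (FIG. 4(A)) for which a summary is to be created, and stores it in the summary target document storage area 1131 of the RAM 113 (step 10). In accordance with the user's instruction, the summary target document is acquired from the RAM 113 (when the document was created on the apparatus itself), the document database 164 of the storage device 16 (when the document has not yet been summarized), the storage medium drive device 17 (when the document was already created on this or another apparatus), or the communication control device 18 (when the document is obtained via PC communications, the Internet, or the like).
[0030]
Next, if the user has entered summary parameters from the keyboard 12 or the like, the CPU 111 acquires the entered values; if the user has entered nothing, it acquires the default values of the summary parameters stored in the data storage unit 163. It stores the values in the summary parameter storage area 1132 (step 11).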
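Step 11 amounts to choosing between user-entered values and stored defaults. A minimal sketch is shown below, assuming the parameters are held in a dictionary; the key names are hypothetical, and merging on a per-parameter basis (rather than all or nothing) is an assumption of this sketch, not something the description specifies.

```python
# Default values as described for the data storage unit 163.
# The key names are hypothetical labels for this sketch.
DEFAULT_SUMMARY_PARAMS = {
    "summary_ratio": 0.25,            # ratio of summary to whole document
    "emphasize_quantities": False,    # dates, prices, physical quantities
    "emphasize_urls": False,          # URL emphasis
    "select_sentence_ending": False,  # desu/masu/de aru selection
}


def resolve_summary_params(user_input=None):
    """Return user-entered values where given, defaults otherwise."""
    params = dict(DEFAULT_SUMMARY_PARAMS)  # copy so defaults stay unchanged
    if user_input:
        params.update(user_input)
    return params


print(resolve_summary_params())
print(resolve_summary_params({"summary_ratio": 0.15}))
```

With no input, the call returns the defaults unchanged; with a partial input, only the entered parameter is overridden.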
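[0031]
Next, the CPU 111 obtains the document vector V (FIG. 4) for the text of the summary target document stored in the summary target document storage area 1131.
FIG. 5 is a flowchart showing the operation of the document vector creation processing.
The CPU 111 performs morphological analysis to extract independent words from the text of the summary target document (step 131), and also extracts candidate words (phrases), including noun phrases and compound noun phrases, from the summary target document A and stores them in a predetermined work area of the RAM 113 (step 132).
It then determines the importance f(x) of each candidate word (phrase) from its frequency of appearance in the summary target document and an evaluation function (step 133). The evaluation function uses, for example, a weight for designated important words when such words are specified in advance, and weights according to the type of candidate word (phrase), such as single word, noun phrase, or compound noun phrase.
The CPU 111 further determines the keywords a, b, … of the summary target document A from the determined importance values f(x) (step 134). It then stores the document vector V = (f(a), f(b), …), whose elements are the importance values of the keywords, in the document vector storage area 1134 of the RAM 113 (step 135).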
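Steps 133 to 135 can be sketched roughly as below. The sketch assumes the candidate words have already been extracted (morphological analysis is outside its scope), takes importance as frequency multiplied by an optional per-word weight, and keeps the top-ranked words as keywords. The function name, the multiplicative weighting, and the `top_n` cutoff are illustrative assumptions; the description leaves the evaluation function and the keyword selection rule open.

```python
from collections import Counter


def build_document_vector(candidates, weights=None, top_n=3):
    """Build a keyword -> importance mapping from extracted candidate words.

    candidates: list of candidate words (phrases) in order of appearance.
    weights:    optional per-word weights standing in for the evaluation function.
    top_n:      number of keywords kept as vector elements (assumed rule).
    """
    weights = weights or {}
    freq = Counter(candidates)
    # Importance f(x) = frequency of x times its weight (default weight 1.0).
    importance = {x: n * weights.get(x, 1.0) for x, n in freq.items()}
    # Rank by descending importance; break ties alphabetically for determinism.
    ranked = sorted(importance.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:top_n])


candidates = ["summary", "document", "summary", "vector", "document", "summary"]
vector = build_document_vector(candidates, weights={"vector": 4.0}, top_n=2)
print(vector)  # {'vector': 4.0, 'summary': 3.0}
```

Here "vector" appears only once but is pre-designated as an important word with weight 4.0, so it outranks the more frequent "summary".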
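[0032]
Once the document vector V has been obtained for the summary target document, the CPU 111 calculates its similarity to the pairs of original text and summary stored in the database (step 12). This database holds pairs of past original texts and summaries as a history. Summaries are stored regardless of whether they were produced automatically or manually. For automatically produced summaries, the condition setting parameters, such as the compression ratio, are saved as well. In addition, the document vector of each document can be calculated in advance and held as data.
[0033]
The similarity s between a document stored in the database and the summary target document is obtained from the cosine, which depends on the angle between their document vectors b_n and b_{n+1}. That is, letting q be the angle between the two document vectors b_n and b_{n+1}, b_n · b_{n+1} their inner product, and |b_n| and |b_{n+1}| their respective magnitudes, the similarity s of the two document vectors is given by Equation 1 below.
[0034]
[Equation 1]
$$ s = \cos q = \frac{b_n \cdot b_{n+1}}{|b_n| \, |b_{n+1}|} $$
[0035]
The similarity s takes a value in the range −1 ≤ s ≤ 1. The closer it is to 1, the more nearly parallel the two document vectors are, and the more similar the two documents can be considered to be.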
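Equation 1 is the standard cosine similarity, and a minimal sketch of it follows. The function name and the handling of zero-length vectors (returning 0.0) are assumptions of this sketch. The sample vectors reuse only the visible elements of documents B and C from the FIG. 2 example, aligned on the union of their keywords.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (Equation 1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


vec_b = [1, 18, 19, 0]   # visible elements of document B
vec_c = [18, 0, 0, 21]   # visible elements of document C

s = cosine_similarity(vec_b, vec_c)
print(round(s, 3))  # 0.025
```

The two documents share only one keyword, and it carries little weight in document B, so the similarity is close to 0.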
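Next, the condition setting parameters and the summary are extracted from the pair of original text and summary judged to be similar (step 13), and a summary is generated with reference to the extracted condition setting parameters and summary (step 14). In this embodiment, when a document with a similar structure exists, for example a newspaper article, a legal document, or a scientific or technical paper, a summary reflecting that similarity can be created.
Referring to the condition setting parameters also makes a higher-quality summary possible. For example, if the summary compression ratio is saved as a past condition setting parameter, it can be consulted to decide whether 25% or 15% is preferable. Settings such as long-sentence priority, short-sentence priority, and quantity priority can likewise be consulted.
When many documents are available for reference, they can be displayed so that the user can select among them. In particular, a highly accurate summary can be expected when documents that have been frequently reused in the past, documents that were read and received positive comments, and documents that key people in the company have referred to are used preferentially as model summaries.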
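Steps 12 and 13 together form a nearest-neighbour lookup over the stored history. The sketch below assumes each history record is a dictionary with `vector`, `summary`, and `params` fields and that a user-settable `threshold` decides what counts as similar; these names, the record layout, and the sample data are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two keyword -> importance mappings."""
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_reference(target_vector, history, threshold=0.5):
    """Return (params, summary, similarity) of the most similar stored
    document, or None when no document reaches the threshold."""
    best, best_s = None, threshold
    for record in history:
        s = cosine(target_vector, record["vector"])
        if s >= best_s:
            best, best_s = record, s
    if best is None:
        return None
    return best["params"], best["summary"], best_s


# Hypothetical history of past (vector, summary, parameters) sets.
history = [
    {"vector": {"court": 2, "contract": 2, "damages": 1},
     "summary": "past legal summary", "params": {"summary_ratio": 0.15}},
    {"vector": {"election": 4, "policy": 1},
     "summary": "past news summary", "params": {"summary_ratio": 0.25}},
]
target = {"court": 3, "contract": 2}

result = find_reference(target, history)
print(result[0])  # {'summary_ratio': 0.15}
```

The target shares its keywords with the first record only, so that record's compression ratio of 15% is the one returned for reuse.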
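[0036]
FIG. 6 is a flowchart showing the operation of the summary creation processing.
The CPU 111 first performs morphological analysis to extract the independent words contained in each document group (step 221), and also extracts candidate words (phrases), including noun phrases and compound noun phrases, from the summary target document A and stores them in a predetermined work area of the RAM 113 (step 222).
Then, from the summary parameters stored in the summary parameter storage area 1132 of the RAM 113, the frequency of appearance of the extracted candidate words (phrases) in each document group, an evaluation function, and so on, it determines the importance f(y) of each candidate word (phrase) (step 223). The evaluation function uses, for example, a weight for designated important words when such words are specified in advance, and weights according to the type of candidate word (phrase), such as single word, noun phrase, or compound noun phrase.
[0037]
The CPU 111 further determines the importance F(z) of each sentence contained in each document group from the determined importance f(y), the summary parameters stored in the summary parameter storage area 1132, and so on (step 224). It then lists the sentences that, taken in descending order of importance F(z), fall within the summary ratio given by the summary parameters (for example, the top 25% of all sentences in the document group) (step 225).
The CPU 111 then arranges the listed sentences in their order of appearance in the document group to form the summary of the document, stores it in a predetermined part of the summary storage area 1135 of the RAM 113 (step 226), returns to the automatic summarization routine of FIG. 3, and ends the automatic summarization processing of this embodiment.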
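Steps 224 to 226 describe extractive summarization: score each sentence, keep the top fraction, and restore the original order. In the rough sketch below, sentence importance F(z) is assumed to be the sum of the importance values of the words in the sentence, whitespace splitting stands in for morphological analysis, and at least one sentence is always kept. All three are assumptions of this sketch, as are the function and parameter names.

```python
import math


def extract_summary(sentences, word_importance, summary_ratio=0.25):
    """Pick the most important sentences and return them in original order."""
    scored = []
    for index, sentence in enumerate(sentences):
        # F(z): sum of word importance values f(y) over the sentence.
        score = sum(word_importance.get(w, 0.0) for w in sentence.split())
        scored.append((score, index))
    # Number of sentences inside the summary ratio (at least one, by assumption).
    keep = max(1, math.floor(len(sentences) * summary_ratio))
    # Highest score first; earlier sentences win ties.
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:keep]
    # Re-sort the selected sentences by position in the document.
    chosen = sorted(index for _, index in top)
    return [sentences[i] for i in chosen]


sentences = [
    "document summary",
    "nothing here",
    "summary vector similarity",
    "the cat sat",
]
importance = {"summary": 3.0, "vector": 2.0, "similarity": 2.0, "document": 1.0}

print(extract_summary(sentences, importance, summary_ratio=0.5))
# ['document summary', 'summary vector similarity']
```

The third sentence scores highest, yet it still follows the first sentence in the output because the listed sentences are arranged in order of appearance.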
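[0038]
As described above, the automatic summarization processing of this embodiment creates a summary with reference to summaries made in the past, so an accurate and readable summary can be created.
[0039]
When the automatic summarization processing above is finished, the CPU 111 saves the data stored in the RAM 113 in accordance with the user's instruction.
That is, it reads the summary target document from the summary target document storage area 1131 and stores it in the document database 164 of the storage device 16. It also reads the created summary from the summary storage area 1135 and stores it in the summary database 165 of the storage device 16, linked to the summary target document stored in the document database 164. Furthermore, it reads the document vector V obtained in the document vector creation processing from the document vector storage area 1134 and stores it in the document vector database 166 of the storage device 16, likewise linked to the summary target document stored in the document database 164.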
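[0040]
The configuration and automatic summarization processing of this embodiment have been described above, but the present invention is not limited to these forms, and various modifications are possible within the scope of the invention described in the claims.
For example, in the embodiment, morphological analysis and the extraction of candidate words (phrases) are performed independently, in the same way, in both the document vector creation processing (steps 131 and 132 in FIG. 5) and the summary creation processing (steps 221 and 222 in FIG. 6). In the present invention, however, the candidate words (phrases) extracted in the document vector creation processing may be stored in a predetermined area of the RAM 113 and reused in the summary creation processing.
[0041]
In the embodiment described, the save processing after the automatic summarization stores only the summary target document, the summary, and the document vector V in the databases 164, 165, and 166 of the storage device 16. In the present invention, the candidate words (phrases) extracted from the summary target document in step 132 of the document vector creation processing (FIG. 5) and stored in the predetermined work area of the RAM 113 may additionally be stored, linked to the summary target document A, in the document database 164 or in a dedicated candidate word (phrase) database.
The summary parameters may also be read from the summary parameter storage area 1132 and stored, linked to the summary, in the summary database 165 or in a dedicated summary parameter database.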
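[0042]
Furthermore, in the embodiment described, morphological analysis (steps 131 and 221) and the extraction of candidate words (phrases) (steps 132 and 222) are performed in both the document vector creation processing (FIG. 5) and the summary creation processing (step 22, FIG. 6).
Because both operate on the same sentences, however, the extracted candidate words (phrases) are identical. In the present invention, therefore, the candidate words (phrases) extracted in the document vector creation processing may be stored in a predetermined area of the RAM 113 and used in the summary processing, so that steps 221 and 222 can be omitted.
These candidate words (phrases) may also be stored in the document database 164, or in a dedicated candidate word (phrase) database, as the candidate words (phrases) for the summary target document.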
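[0043]
In the embodiment described, the method following the flowchart of FIG. 5 was explained as one example of creating a document vector, but the present invention is not limited to this method. The method of extracting keywords from the summary target document A and the method of determining the importance of the extracted keywords (that is, the element values of the document vector) can be replaced by any of various known methods.
Likewise, the creation of the summary for each sub-document group is not limited to the method shown in the flowchart of FIG. 6; various known summarization and abstracting methods can be used.
Furthermore, although the similarity of two document vectors is calculated by Equation 1, the calculation is not limited to this equation; another equation may be used as long as it can express the similarity relationship between vectors.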
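[0044]
The embodiment described is not limited to documents written in Japanese; documents written in any language can be handled. In that case, only parts that do not affect the configuration of the present invention need to be changed, such as using a morphological analysis algorithm for the language in which the target document is written.
The devices, units, operations, and processes described in the embodiment above can each be expressed as a means ("... means") that serves as a generic concept covering them, and the embodiment can be configured from such means.
For example, a "keyword determination means" may be configured for the description "determines the keywords a, b, … of the summary target document A from the determined importance values f(x) (step 134)", and a "sentence listing means" may be configured for the description "lists the sentences that, taken in descending order of importance F(z), fall within the summary ratio given by the summary parameters (for example, the top 25% of all sentences in the sub-document group) (step 225)".
Similarly, the embodiment may be configured with generic concepts such as "... (operation) means" for the other operations.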
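[0045]
[Effects of the Invention]
According to the present invention, summarization is performed based on the condition setting parameters of summaries made in the past, so a summary can be created that is accurate and makes the content of the target document easy to grasp.
[Brief Description of the Drawings]
[FIG. 1] A block diagram showing the configuration of a document processing apparatus according to one embodiment of the present invention.
[FIG. 2] An explanatory diagram conceptually showing the contents of the document vector database in the same embodiment.
[FIG. 3] A flowchart showing the main operation of the automatic summarization processing in the same embodiment.
[FIG. 4] A diagram showing the document vector obtained for document A in the same embodiment.
[FIG. 5] A flowchart showing the operation of the document vector creation processing in the same embodiment.
[FIG. 6] A flowchart showing the operation of the summary creation processing in the same embodiment.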
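[Explanation of Reference Numerals]
11 Control unit
112 ROM
113 RAM
1131 Summary target document storage area
1132 Summary parameter storage area
1133 Retrieved original text and summary storage area
1134 Document vector storage area
1135 Summary storage area
12 Keyboard
13 Mouse
14 Display device
15 Printing device
16 Storage device
161 Kana-kanji conversion dictionary
162 Program storage unit
163 Data storage unit
164 Document database
165 Summary database
166 Document vector database
17 Storage medium drive device
18 Communication control device
19 Input/output I/F
20 Character recognition device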
BACKGROUND OF THE INVENTION
The present invention relates to a document processing apparatus, a document processing method, and a storage medium storing a document processing program, and more particularly, reference / utilization in newly creating a summary of a pair of past original text and summary that has been accumulated. Thus, the present invention relates to a technique for creating a more readable summary.
[0002]
[Prior art]
Conventionally, various types of processing such as automatic summarization (including abstracts) processing and association processing with other documents have been performed on various types of documents such as books, papers, reports, etc. Yes.
The automatic summarization of documents has been proposed, for example, in “Extraction and Processing of Semantic Information from Full Text Information” (Proceedings of the 38th National Convention of Information Processing Society, page 222; 1989). In this method, an important word in a document is first extracted from information such as character type and verb, and the most important word is determined from the appearance frequency of the important word. Next, it is possible to automatically create a summary by determining an important sentence from whether or not an important word and the most important word appear. In addition, a method described in Japanese Patent Laid-Open No. Hei 3-191475 for creating a summary more accurately by reflecting the properties of paragraphs of sentences has been proposed.
On the other hand, as associations with other data, hyperlinks on the Internet, associations in knowledge processing (expert systems, etc.) using a frame system or the like are performed.
[0003]
[Problems to be solved by the invention]
Such a conventional document processing apparatus does not perform summarization with the apparatus each time and refers to summaries made in the past. By the way, usually, when creating a document, for example, when creating a greeting card, a legal warning letter, etc., a similar document created in the past is often referred to as a sample. Similarly, when creating summaries, it is considered meaningful to refer to summaries created in the past. In particular, it is expected that it is useful to refer to a summary sentence with a track record used by many people when summarizing.
[0004]
Therefore, the present invention has been made to solve such conventional problems, and summaries and original texts made in the past have been developed. And condition setting parameters The set of Condition setting parameters The first object is to provide a document processing apparatus and a document processing method capable of creating a new summary with reference to FIG.
The present invention also provides a summary and original text made in the past. And condition setting parameters The set of Condition setting parameters A second object is to provide a storage medium storing a computer-readable document processing program capable of creating a new summary with reference to FIG.
[0005]
[Means for Solving the Problems]
In invention of Claim 1, A database for storing documents in a document processing apparatus together with a summary of the document and a condition setting parameter at the time of the summary; Document acquisition means for acquiring a document in a predetermined format, and a document similar to the document acquired by the document acquisition means Said Similar document search means for searching from database, and documents searched by this similar document search means Condition setting parameters corresponding to The From the database get Condition setting parameters Acquisition means and this Condition setting parameters Acquired by acquisition means Based on condition setting parameters And a summary means for creating a summary sentence of the document acquired by the document acquisition means to achieve the first object.
[0007]
Claim 2 In the invention described in claim 1 The document processing apparatus according to claim 1, further comprising: a document vector determining unit that determines a document vector that characterizes the document acquired by the document acquiring unit, wherein the similar document search unit is configured to store each document determined by the document vector determining unit. Search is performed by determining the similarity between the documents based on the document vector.
[0008]
Claim 3 In the invention described in claim 1, claim 1 or claim 2 In the document processing apparatus described above, when searching for a document by the similar document search means of Conditions for similarity judgment Is user by Setting Is possible It was decided.
[0010]
Claim 4 In the invention described in the above, the storage medium In a document processing apparatus having a database for storing a document together with a summary of the document and a condition setting parameter at the time of the summary, A document acquisition function for acquiring a document in a predetermined format, and a document similar to the document acquired by the document acquisition function. Said Similar document search function to search from database and documents searched by this similar document search function Condition setting parameters corresponding to The From the database get Condition setting parameters Get this and this Condition setting parameters Acquired with the acquisition function Based on condition setting parameters The second object is achieved by storing a computer readable document processing program for realizing a summary function for creating a summary sentence of a document acquired by the document acquisition function.
[0012]
Claim 5 In the invention described in claim 4 The document storage function includes a document vector determination function for determining a document vector that characterizes the document acquired by the document acquisition function, and the similar document search function includes the document vector of each document determined by the document vector determination function. Thus, the second object is achieved by storing a computer readable document processing program for causing a computer to realize a function of determining similarity between documents and performing a search.
[0013]
Claim 6 In the invention described in claim 4 Or claims 5 When searching for a document with the similar document search function in the described storage medium of Conditions for similarity judgment Is user by Setting Is possible A computer-readable document processing program for causing a computer to realize the function is stored to achieve the second object.
[0015]
Claim 7 In the invention described in In a document processing apparatus comprising: a database for storing a document together with a summary of the document and a condition setting parameter at the time of summarization; a document acquisition unit; a similar document search unit; a condition setting parameter acquisition unit; A document processing method used when performing document processing, wherein the document acquisition means includes: Get a document in a predetermined format And the similar document search means performs the first step by Documents that are similar to the retrieved document Said Search from database A second step in which the condition setting parameter acquisition means acquires a condition setting parameter corresponding to the document searched in the second step from the database, and the summarization means in the third step. Get did Based on condition setting parameters , In the first step Create a summary of the retrieved document A document processing method comprising: a fourth step; Thus, the first object is achieved.
[0016]
Claim 8 In the invention described in claim 7 In the document processing method of Further, a document processing method used when document processing is performed in a document processing apparatus provided with a document vector determination means, wherein the document vector determination means is the first step. A document vector that characterizes the retrieved document A second step of determining each of the documents determined in the fifth step. Search by judging similarity between documents by document vector Provide document processing methods Thus, the first object is achieved.
[0017]
DETAILED DESCRIPTION OF THE INVENTION
A preferred embodiment of a document processing apparatus, a document processing method, and a storage medium storing a document processing program according to the present invention will be described in detail below with reference to FIGS.
(1) Outline of the embodiment
In this embodiment, the document vector of the target document is obtained, and the difference of the document vector from each document to be searched is obtained. The similarity of contents is determined based on whether the cosine value between these two consecutive documents is high or low. A summary sentence and a condition setting parameter of a document whose similarity is high (that is, when the similarity is high) are acquired, and a new summary sentence is generated with reference to the summary sentence.
[0018]
(2) Details of the embodiment
FIG. 1 is a block diagram showing the configuration of the document processing apparatus.
The document processing apparatus according to the present embodiment is configured as a computer system including a personal computer and a word processor, and is also configured as a LAN (local area network) server and a computer (personal computer) communication host including the Internet. Is possible.
As shown in FIG. 1, the document processing apparatus includes a control unit 11 for controlling the entire apparatus. The control unit 11 includes a keyboard 12 and a mouse 13 as input devices, a display device 14, a printing device 15, a storage device 16, a storage medium driving device 17, and a communication control device 18 via a bus line 21 such as a data bus. , And an input / output I / F 19 and a character recognition device 20 are connected.
The control unit 11 includes a CPU 111, a ROM 112, and a RAM 113.
The ROM 112 is a read-only memory in which various programs and data for the CPU 111 to perform various controls and calculations are stored in advance.
[0019]
The RAM 113 is a random access memory used as a working memory by the CPU 111. In the RAM 113, as a summary processing area according to the present embodiment, a summary target document storage area 1131, a summary parameter storage area 1132, a search original text and summary storage area 1133, a document vector storage area 1134, and a summary storage area 1135 are stored. Various other areas have been secured.
The document vector storage area 1134 stores a document vector for the summary target document and a document vector for each sub-document described later.
The summary storage area 1135 stores a subsummary for each subdocument group including each topic discovered by the present embodiment and a summary for the entire summary target document.
[0020]
The keyboard 12 is provided with various keys such as a kana key and a numeric keypad for inputting kana characters, function keys for executing various functions, and a cursor key.
The mouse 13 is a pointing device, and is an input device that designates a corresponding function by left-clicking a key, an icon, or the like displayed on the display device 14.
For example, a CRT or a liquid crystal display is used as the display device 14. This display device displays the contents of the document to be summarized, the contents of the summary automatically generated according to the present embodiment, and the like.
The printing device 15 is for printing texts displayed on the display device 14, documents stored in the document storage unit 164 of the storage device 16, and the like. As this printing apparatus, various printing apparatuses such as a laser printer, a dot printer, an ink jet printer, a page printer, a thermal printer, and a thermal transfer printer are used.
[0021]
The storage device 16 includes a readable / writable storage medium and a drive device for reading / writing various information such as programs and data from / to the storage medium. As a storage medium used for the storage device 16, a hard disk is mainly used. However, a readable / writable storage medium among various storage media used in the storage medium driving device 17 described later may be used. Good.
The storage device 16 includes a kana-kanji conversion dictionary 161, a program storage unit 162, a data storage unit 163, a document database 164, a summary database 165, a document vector database 166, and other storage units (not shown) (for example, stored in the storage device 16). A storage unit for backing up programs, data, etc., etc.
In the program storage unit 162, in addition to various programs such as an automatic summarization processing program, a document vector creation processing program, and a summary creation processing program in the present embodiment, a kana character string input using the kana-kanji conversion dictionary 161 is stored. Various programs such as a kana-kanji conversion program for converting into kanji mixed sentences are stored.
The data storage unit 163 stores various data such as default values of summary parameters. As default values of summary parameters, for example, the ratio of summaries to all documents = “25%”, quantity emphasis such as date / time, price information, physical quantity (size, weight, temperature, etc.) = “No”, URL ( Uniform Resource Locator) Stores values such as “important” = “no” or “is / mass / is” = “no”.
[0022]
The document database 164 stores a document created by a kana-kanji conversion program, and a document created by another device and read from the storage medium driving device 17 or the communication control device 18. The format of each document stored in the document database 164 is not particularly limited, and various types of documents such as a text document, an HTML (Hyper Text Markup Language) document, and a JIS document can be stored. Is possible. The document database 164 stores document data of these formats.
The summary database 165 and the document vector database 166 store summaries and document vectors corresponding to the respective documents stored in the document database 164.
[0023]
FIG. 2 conceptually shows the contents of the document vector database 166.
As shown in FIG. 2, the element value f (x) obtained for the keyword x automatically extracted from the document is stored as an element of the document vector. This document vector is stored for each document (A, B, C...) And is associated with each document stored in the document database 164.
The dimension of each document vector is the number of keywords x (important phrases) to be adopted. When the similarity between two documents is obtained from both document vectors, the number of unions of the keywords of both documents is the number of both document vectors. It becomes a dimension. In this case, the element value of the other document vector for a keyword included only in one document vector is defined as “0”.
[0024]
For example, in FIG. 2, the keyword of document B is “important, important words, importance,...”, The keyword of document C is “important,..., Politics, ...”, and the document vectors of both documents are as follows. It is.
Document vector of document B = (1, 18, 19,...)
Document vector of document C = (18,..., 21,...)
On the other hand, when calculating the similarity between document B and document C, the keywords of both documents are “important, important words, importance,..., Politics,...”, And the document vectors of both documents are as follows. Defined.
Document vector of document A = (1,18,19, ..., 0, ...),
Document vector of document C = (18, 0, 0,..., 21,...)
[0025]
The storage medium drive device 17 is a drive device for the CPU 111 to read data including computer programs and documents from an external storage medium. The computer program stored in the storage medium includes a program for various processes executed by the document processing apparatus according to the present embodiment, a dictionary used in the program, data, and the like.
Here, the storage medium refers to a storage medium in which computer programs, data, and the like are stored. Specifically, a magnetic storage medium such as a floppy disk, a hard disk, and a magnetic tape, and a semiconductor storage medium such as a memory chip and an IC card. , CD-ROM, MO, PD (phase change rewritable optical disc) and other optical storage media that can read information, and paper such as paper cards and paper tapes (and media with functions equivalent to paper) were used. Storage media and other storage media in which computer programs and the like are stored by various methods are included. As a storage medium used in the document processing apparatus according to the present embodiment, a CD-ROM or a floppy disk is mainly used.
The storage medium driving device 17 can read data stored in the RAM 113 and the storage device 16 in a writable storage medium such as a floppy disk in addition to reading the computer program from these various storage media. It is.
[0026]
In the document processing apparatus according to the present embodiment, the CPU 111 of the control unit 11 reads a computer program from an external storage medium set in the storage medium driving device 17 and stores it in each unit of the storage device 16. When various processing such as automatic summarization processing according to the present embodiment is executed, the corresponding program is read from the storage device 16 into the RAM 113 and executed.
However, it is also possible to read the program directly from the external storage medium into the RAM 113 by the storage medium driving device 17 instead of from the storage device 16 and execute it. Further, depending on the document processing apparatus, the automatic summarization processing program or the like according to the present embodiment may be stored in the ROM 112 in advance and executed by the CPU 111.
[0027]
The communication control device 18 can send and receive various types of data such as text format and HTML format and various data such as bitmap data to and from other personal computers and word processors.
The input / output I / F 19 is an interface for connecting various devices such as a speaker for outputting voice or music.
The character recognition device 20 is a device for recognizing characters written on paper or the like in various formats such as a text format or HTML, and is composed of an image scanner, a character recognition program, and the like.
[0028]
In this embodiment, various documents can be acquired as target documents (document acquisition means): a document created by an input operation on the keyboard 12 (and stored in a predetermined storage area of the RAM 113), a document created externally, stored in a predetermined storage medium, and read by the storage medium driving device 17, a document stored in the document database in advance, a document downloaded via the communication control device 18, and a document consisting of characters recognized by the character recognition device 20.
[0029]
The operation of automatic summarization processing for creating a summary from a plurality of documents by the document processing apparatus of the present embodiment configured as described above will be described with reference to FIGS.
FIG. 3 shows the main operation of the automatic summarization process. The document vectors shown in FIG. 4 are two-dimensionally displayed for easy understanding conceptually, but are actually N-dimensional vectors.
The CPU 111 acquires the summarization target document A (FIG. 4A) and stores it in the summarization target document storage area 1131 of the RAM 113 (step 10). In accordance with a user instruction, the summarization target document is acquired from the RAM 113 (for a document created on the apparatus itself), the document database 164 in the storage device 16 (for a document for which a summary has not yet been created), the storage medium driving device 17, or the communication control device 18 (for communication via personal computer communication services, the Internet, etc.).
[0030]
Next, when a summary parameter is input by the user from the keyboard 12 or the like, the CPU 111 acquires the input value; when there is no input by the user, it acquires the default value of the summary parameter stored in the data storage unit 163. The acquired value is then stored in the summary parameter storage area 1132 (step 11).
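Step 11 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the parameter names (`compression_ratio`, `priority`), the default values, and the per-key merging of user input with defaults are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical defaults standing in for the values held in the data storage unit 163.
DEFAULT_SUMMARY_PARAMS = {"compression_ratio": 0.25, "priority": "long_text"}


def resolve_summary_params(user_input: Optional[dict] = None) -> dict:
    """Return the parameters to place in the summary parameter storage area (step 11).

    Assumption: a user-supplied value overrides the default for the same key,
    and any key the user did not supply keeps its default.
    """
    params = dict(DEFAULT_SUMMARY_PARAMS)  # copy so the defaults are never mutated
    if user_input:
        params.update(user_input)
    return params
```

Calling `resolve_summary_params()` with no argument yields the defaults, which corresponds to the case where the user enters nothing.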
[0031]
Next, the CPU 111 obtains a document vector V (FIG. 4) for each sentence of the summary target document stored in the summary target document storage area 1131.
FIG. 5 is a flowchart showing the operation of the document vector creation process.
The CPU 111 extracts independent words from the text of the summary target document by performing morphological analysis (step 131), extracts candidate words (phrases) including noun phrases, compound noun phrases, and the like from the summary target document A, and stores them in a predetermined work area of the RAM 113 (step 132).
Then, the importance f(x) of each candidate word (phrase) is determined from the appearance frequency of the extracted candidate word (phrase) in the document to be summarized and from the evaluation function (step 133). Here, the evaluation function uses, for example, weighting for a predetermined important word when one has been designated in advance, and weighting by the type of candidate word (phrase), such as word, noun phrase, or compound noun phrase.
Further, the CPU 111 determines keywords a, b, ... of the summary target document A from the determined importance f(x) (step 134). Then, the document vector V = (f(a), f(b), ...), whose elements are the importance f(x) of each keyword, is stored in the document vector storage area 1134 of the RAM 113 (step 135).
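The document vector creation process (steps 131 to 135) might be sketched as below. The sketch rests on several assumptions that are not part of the embodiment: whitespace tokenization stands in for morphological analysis and phrase extraction, the evaluation function is simply frequency multiplied by a boost for pre-designated important words, and the names `extract_candidates`, `document_vector`, `boost`, and `top_k` are hypothetical.

```python
from collections import Counter


def extract_candidates(text: str) -> list[str]:
    """Stand-in for steps 131-132: lowercase whitespace tokens with punctuation stripped.

    Assumption: a real system would use morphological analysis instead.
    """
    tokens = (tok.strip(".,;:!?()\"'").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def document_vector(
    text: str,
    important_words: frozenset[str] = frozenset(),
    boost: float = 2.0,
    top_k: int = 5,
) -> dict[str, float]:
    """Steps 133-135: importance f(x) = frequency x weight; keep the top_k keywords."""
    freq = Counter(extract_candidates(text))
    importance = {
        word: n * (boost if word in important_words else 1.0)
        for word, n in freq.items()
    }
    # Sort by descending importance; ties are broken alphabetically for determinism.
    keywords = sorted(importance, key=lambda w: (-importance[w], w))[:top_k]
    # The vector is kept sparse: keyword -> importance f(keyword).
    return {word: importance[word] for word in keywords}
```

Representing V as a keyword-to-importance mapping rather than a fixed-length list is a design choice of this sketch; it lets vectors of documents with different keyword sets be compared without first agreeing on a common dimension order.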
[0032]
When the document vector V has been obtained for the summary target document, the CPU 111 obtains its similarity to the pairs of original text and summary stored in the database (step 12). This database stores past pairs of original texts and summaries as history. Summaries are accumulated regardless of whether they were created automatically or manually. In addition, for summaries created automatically, condition setting parameters such as the compression ratio are also stored. Furthermore, a document vector for each document can be obtained in advance and stored as data.
[0033]
The similarity s between a document stored in the database and the document to be summarized is obtained as the cosine of the angle between their document vectors b_n and b_{n+1}. That is, when the angle between the two document vectors b_n and b_{n+1} is q, their inner product is b_n · b_{n+1}, and their magnitudes are |b_n| and |b_{n+1}|, the similarity s is obtained by the following Expression 1.
[0034]
[Expression 1]
Similarity s = cos(q) = (b_n · b_{n+1}) / (|b_n| × |b_{n+1}|)
[0035]
The value of the similarity s lies in the range −1 ≦ s ≦ 1; the closer it is to 1, the closer the two document vectors are to each other, and the more similar the two documents can be considered to be.
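The cosine similarity of Expression 1 can be computed as in the following sketch. It assumes, for illustration only, that each document vector is held as a sparse mapping from keyword to importance, and that a vector of zero length is treated as having similarity 0; neither choice is stated in the embodiment.

```python
import math


def cosine_similarity(u: dict[str, float], v: dict[str, float]) -> float:
    """Similarity s = (u . v) / (|u| x |v|) for sparse keyword -> importance vectors."""
    # Inner product: only keywords present in both vectors contribute.
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an empty vector is treated as dissimilar
    return dot / (norm_u * norm_v)
```

Two vectors pointing in the same direction give s = 1, vectors with no keyword in common give s = 0, and negative values arise only if importance values can be negative.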
Thereafter, a condition setting parameter and a summary sentence are extracted from the pair of original text and summary determined to be similar (step 13). Then, a summary sentence is generated with reference to the extracted condition setting parameter and summary sentence (step 14). In this embodiment, when a stored document has a similar document structure, a summary sentence reflecting that similarity can be created, for example for newspaper articles, legal documents, or scientific and technical papers.
Also, a higher-quality summary can be created by referring to the condition setting parameters. For example, if the summary compression ratio is stored as a past condition setting parameter, it can be consulted to decide whether 25% or 15% is appropriate. Furthermore, settings such as long-sentence priority, short-sentence priority, and quantity priority can also be referred to when summarizing.
If many documents can be referenced, they can be displayed so that the user can select among them. In particular, a highly accurate summary sentence can be expected when priority is given to documents that have been extensively reused in the past, documents that have been read and rated favorably, and documents that have been referred to by important persons in the company.
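Choosing which past summary to consult (step 13) could look like the sketch below. Everything here is illustrative: the `HistoryEntry` record, the `min_similarity` threshold (which plays the role of a user-settable similarity condition), and the rule of preferring the most reused entry among sufficiently similar ones are assumptions, not the embodiment's stated procedure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HistoryEntry:
    doc_id: str
    similarity: float    # similarity s to the target document, computed beforehand
    params: dict         # condition setting parameters stored with the summary
    reuse_count: int = 0  # hypothetical measure of how often the entry was reused


def select_reference(
    entries: list[HistoryEntry], min_similarity: float = 0.5
) -> Optional[HistoryEntry]:
    """Pick the entry whose condition setting parameters will be reused.

    Assumption: entries below min_similarity are ignored; among the rest the
    most reused entry wins, with similarity as the tie-breaker.
    """
    candidates = [e for e in entries if e.similarity >= min_similarity]
    if not candidates:
        return None
    return max(candidates, key=lambda e: (e.reuse_count, e.similarity))
```

When `select_reference` returns `None`, a caller would fall back to the default summary parameters; when it returns an entry, `entry.params` supplies values such as the compression ratio.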
[0036]
FIG. 6 is a flowchart showing the operation of the summary creation process.
The CPU 111 first extracts independent words included in each document group by performing morphological analysis (step 221), extracts candidate words (phrases) including noun phrases and compound noun phrases from the document A to be summarized, and stores them in a predetermined work area of the RAM 113 (step 222).
Then, the importance f(y) of each candidate word (phrase) is determined from the summary parameters stored in the summary parameter storage area 1132 of the RAM 113, the appearance frequency of the extracted candidate words (phrases) in each document group, the evaluation function, and the like (step 223). Here, the evaluation function uses, for example, weighting for a predetermined important word when one has been designated in advance, and weighting according to the type of candidate word (phrase), such as word, noun phrase, or compound noun phrase.
[0037]
Further, the CPU 111 determines the importance F(z) of each sentence included in each document group from the determined importance f(y), the summary parameters stored in the summary parameter storage area 1132, and the like (step 224). Then, the sentences that fall within the summary ratio of the summary parameters (for example, the top 25% of the total number of sentences in the document group) are listed in descending order of the determined sentence importance F(z) (step 225).
Then, the CPU 111 arranges the listed sentences in their order of appearance in the document group to obtain a summary of the document, and stores this summary in a predetermined area of the summary storage area 1135 of the RAM 113 (step 226). Control then returns to the automatic summarization processing routine, and the automatic summarization processing according to the present embodiment ends.
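The summary creation process (steps 221 to 226) can be sketched as follows. The sketch makes simplifying assumptions that are not in the embodiment: sentences are split on end punctuation, word importance f(y) is plain frequency, sentence importance F(z) is the sum of the importance of the words in the sentence, and at least one sentence is always kept. The function name `summarize` and the parameter `compression_ratio` are hypothetical.

```python
import math
import re
from collections import Counter


def _words(sentence: str) -> list[str]:
    """Stand-in for morphological analysis: lowercase alphanumeric tokens."""
    return re.findall(r"\w+", sentence.lower())


def summarize(text: str, compression_ratio: float = 0.25) -> str:
    """Keep the top-scoring sentences and emit them in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return ""
    freq = Counter(w for s in sentences for w in _words(s))        # f(y), step 223
    scores = [sum(freq[w] for w in _words(s)) for s in sentences]  # F(z), step 224
    # Number of sentences allowed by the summary ratio (assumption: at least one).
    keep = max(1, math.floor(len(sentences) * compression_ratio))
    # Step 225: list sentences in descending order of importance.
    top = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:keep]
    # Step 226: restore the order of appearance.
    return " ".join(sentences[i] for i in sorted(top))
```

Summing word importance favors long sentences; a variant reflecting a short-sentence priority setting could, for instance, divide each score by the sentence length.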
[0038]
As described above, according to the automatic summarization process according to the present embodiment, since a summary sentence is created with reference to a summary made in the past, it is possible to create a summary with high accuracy and easy to read.
[0039]
When the above automatic summarization process is completed, the CPU 111 performs a storage process for each data stored in the RAM 113 in accordance with a user instruction.
That is, the summary target document is read from the summary target document storage area 1131 and stored in the document database 164 of the storage device 16. The created summary is read from the summary storage area 1135 and stored in the summary database 165 of the storage device 16 in association with the summary target document stored in the document database 164. Further, the document vector V obtained in the document vector creation process is read from the document vector storage area 1134 and stored in the document vector database 166 of the storage device 16 in association with the summary target document stored in the document database 164.
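One way to picture this storage process is the sketch below, in which three in-memory dictionaries stand in for the document database 164, the summary database 165, and the document vector database 166. The use of a shared integer identifier to associate the three records is an assumption made for illustration, as are the names `Storage` and `store`.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Storage:
    documents: dict = field(default_factory=dict)  # stands in for document database 164
    summaries: dict = field(default_factory=dict)  # stands in for summary database 165
    vectors: dict = field(default_factory=dict)    # stands in for document vector database 166
    _ids: count = field(default_factory=lambda: count(1), repr=False)

    def store(self, document: str, summary: str, vector: dict) -> int:
        """Store all three records under one shared id so they stay associated."""
        doc_id = next(self._ids)
        self.documents[doc_id] = document
        self.summaries[doc_id] = summary
        self.vectors[doc_id] = vector
        return doc_id
```

With this layout, looking up the summary or the vector of a stored document only requires the identifier returned by `store`.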
[0040]
The configuration of the present embodiment and the automatic summarization processing have been described above. However, the present invention is not limited to these embodiments, and various modifications are possible within the scope of the invention described in the claims.
For example, in the embodiment, the morphological analysis and the extraction of candidate words (phrases) are performed independently in the document vector creation process (step 131 and step 132 in FIG. 5) and in the summary creation process (step 221 and step 222 in FIG. 6). In the present invention, however, the candidate words (phrases) extracted in the document vector creation process may be stored in a predetermined area of the RAM 113 and used in the summary creation process.
[0041]
In the embodiment described above, in the storage process after the automatic summarization process is completed, only the summary target document, the summary, and the document vector V are stored in the respective databases 164, 165, and 166 of the storage device 16. However, in the present invention, the candidate words (phrases) extracted from the document to be summarized in step 132 of the document vector creation process (FIG. 5) and stored in the predetermined work area of the RAM 113 may also be stored, in association with the summary target document A, in the document database 164 or in a dedicated candidate word (phrase) database.
Likewise, the summary parameter may be read from the summary parameter storage area 1132 and stored in the summary database 165 or a dedicated summary parameter database in association with the summary.
[0042]
Furthermore, in the described embodiment, morphological analysis (steps 131 and 221) and candidate word (phrase) extraction (steps 132 and 222) are performed in both the document vector creation process and the summary creation process (step 22, FIG. 6).
However, since both processes operate on the same text, the extracted candidate words (phrases) are the same. Therefore, in the present invention, the candidate words (phrases) extracted in the document vector creation process may be stored in a predetermined area of the RAM 113 and reused in the summary creation process, so that steps 221 and 222 can be omitted.
This candidate word (phrase) may also be stored as a candidate word (phrase) for the document to be summarized in the document database 164 or a dedicated candidate word (phrase) database.
[0043]
In the embodiment described above, the method according to the flowchart of FIG. 5 has been described as an example of a method for creating a document vector. However, the present invention is not limited to this method. The keyword extraction method and the method of determining the importance (= element value of the document vector) of each extracted keyword can be replaced by various known methods.
Similarly, the summary creation processing for each sub-document group is not limited to the method shown in the flowchart of FIG. 6, and various known summarization methods, abstract creation methods, and the like can be used.
Furthermore, although the similarity between two document vectors is calculated according to Expression 1, the calculation is not limited to this expression; any other mathematical expression that can express the similarity between vectors may be used to calculate the similarity.
[0044]
In the described embodiment, the document is not limited to a document created in Japanese but can be a document created in any language. In that case, it is only necessary to change a part that does not affect the configuration of the present invention, such as using a morphological analysis algorithm for a language in which a target document is created.
In addition, each apparatus, unit, operation, process, and the like described in the above embodiment may be configured as a means (“... means”) that is a superordinate concept encompassing it.
For example, the description “determines keywords a, b, ... of the summary target document A from the determined importance f(x) (step 134)” may be configured as “keyword determination means”, and the description “the sentences that fall within the summary ratio of the summary parameters (for example, the top 25% of the total number of sentences in the sub-document group) are listed in descending order of the sentence importance F(z) (step 225)” may be configured as “sentence list-up means”.
Similarly, the other various operations may also be configured as superordinate concepts such as “... (operation) means”.
[0045]
[Effect of the invention]
According to the present invention, summarization is performed based on the condition setting parameters of summaries made in the past, so that it is possible to create a highly accurate summary from which the contents of the target document are easy to understand.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a document processing apparatus according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram conceptually showing the contents of a document vector database according to the embodiment.
FIG. 3 is a flowchart showing the main operation of automatic summarization processing in the embodiment.
FIG. 4 is a diagram showing a document vector obtained for a document A in the embodiment.
FIG. 5 is a flowchart showing the operation of document vector creation processing in the embodiment.
FIG. 6 is a flowchart showing the operation of summary creation processing in the embodiment.
[Explanation of symbols]
11 Control unit
112 ROM
113 RAM
1131 Summary target document storage area
1132 Summary parameter storage area
1133 Original text + summary storage area
1134 Document vector storage area
1135 Summary storage area
12 Keyboard
13 Mouse
14 Display device
15 Printing device
16 Storage device
161 Kana-Kanji conversion dictionary
162 Program storage
163 Data storage unit
164 Document database
165 Summary database
166 Document vector database
17 Storage medium drive
18 Communication control device
19 Input / output I / F
20 Character recognition device

Claims

1. A document processing apparatus comprising:
a database for storing documents together with a summary of each document and the condition setting parameters used at the time of the summarization;
document acquisition means for acquiring a document in a predetermined format;
similar document search means for searching the database for a document similar to the document acquired by the document acquisition means;
condition setting parameter acquisition means for acquiring, from the database, a condition setting parameter corresponding to the document found by the similar document search means; and
summarizing means for creating a summary sentence of the document acquired by the document acquisition means, based on the condition setting parameter acquired by the condition setting parameter acquisition means.

2. The document processing apparatus according to claim 1, further comprising document vector determining means for determining a document vector characterizing the document acquired by the document acquisition means,
wherein the similar document search means performs the search by determining the similarity between documents based on the document vector of each document determined by the document vector determining means.

3. The document processing apparatus according to claim 1, wherein a condition for similarity determination when searching for a document by the similar document search means can be set by a user.

4. A storage medium storing a computer-readable document processing program for causing a document processing apparatus, which has a database for storing a document together with a summary of the document and a condition setting parameter at the time of the summarization, to realize:
a document acquisition function for acquiring a document in a predetermined format;
a similar document search function for searching the database for a document similar to the document acquired by the document acquisition function;
a condition setting parameter acquisition function for acquiring, from the database, a condition setting parameter corresponding to the document found by the similar document search function; and
a summary function for creating a summary sentence of the document acquired by the document acquisition function, based on the condition setting parameter acquired by the condition setting parameter acquisition function.

5. The storage medium storing the document processing program according to claim 4, wherein the program further realizes a document vector determination function for determining a document vector characterizing the document acquired by the document acquisition function,
and wherein the similar document search function performs the search by determining the similarity between documents based on the document vector of each document determined by the document vector determination function.

6. The storage medium storing the document processing program according to claim 4 or claim 5, wherein a condition for similarity determination when searching for a document by the similar document search function can be set by a user.

7. A document processing method used when document processing is performed in a document processing apparatus comprising a database for storing a document together with a summary of the document and a condition setting parameter at the time of summarization, document acquisition means, similar document search means, condition setting parameter acquisition means, and summarizing means, the method comprising:
a first step in which the document acquisition means acquires a document in a predetermined format;
a second step in which the similar document search means searches the database for a document similar to the document acquired in the first step;
a third step in which the condition setting parameter acquisition means acquires, from the database, a condition setting parameter corresponding to the document found in the second step; and
a fourth step in which the summarizing means creates a summary sentence of the document acquired in the first step, based on the condition setting parameter acquired in the third step.

8. The document processing method according to claim 7, used when document processing is performed in a document processing apparatus further provided with document vector determining means,
the method further comprising a fifth step in which the document vector determining means determines a document vector characterizing the document acquired in the first step,
wherein the second step performs the search by determining the similarity between documents based on the document vector of each document determined in the fifth step.