JP4005925B2 - Document processing method, document processing apparatus, and program

Info

Publication number: JP4005925B2
Application number: JP2003012201A
Authority: JP
Inventors: 由美市村
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2003-01-21
Filing date: 2003-01-21
Publication date: 2007-11-14
Anticipated expiration: 2023-01-21
Also published as: JP2004227141A

Description

【０００１】
【発明の属する技術分野】
本発明は、文書中のプライバシーにかかわる固有名詞部分等を伏字で置き換えて特定不可能にする文書処理方法および装置に関する。
【０００２】
【従来の技術】
電子メール、社内報告書、日報・週報などの既存の電子化文書を共有あるいは流用しようとすると、文書によっては固有名詞のもつプライバシー情報が侵害される恐れがある。そのため、人手で文書中のプライバシー情報に関する固有名詞部分を削除・隠蔽するなどして対処する必要があった。
【０００３】
これに対して、文書からプライバシーにかかわる固有名詞部分を抽出し、抽出された固有名詞部分を伏字加工するものがある（例えば、特許文献１、特許文献２参照）。この手法では、単語辞書に特定不可能にしたい単語を登録しておき、その辞書を利用して形態素解析することにより、プライバシーに関する固有名詞部分を抽出する。
【０００４】
しかしながら、この手法は、抽出された固有名詞部分が誤っているときの修正手段や、単語辞書の更新手段は提供していない。
【０００５】
【特許文献１】
特開２００２−２５９３６３公報
【０００６】
【特許文献２】
特開２００２−２５９３６８公報
【０００７】
【発明が解決しようとする課題】
このように、従来は、文書中から固有名詞部分を検出し伏字加工する際、検出された固有名詞部分が誤っているときの修正、固有名詞部分の検出に利用する単語辞書を更新できないという問題点があった。
【０００８】
そこで、本発明は上記問題点に鑑み、文書中から検出した固有名詞部分の確認と修正が容易に行える文書処理方法および装置を提供することを目的とする。
【０００９】
また、文書中の固有名詞部分の検出に利用する単語辞書を容易に更新することができる文書処理方法および装置を提供することを目的とする。
【００１０】
【課題を解決するための手段】
本発明は、マスキングすべき文字列またはその一部を記憶する辞書を基に、入力文書からマスキング対象箇所を検出し、この検出されたマスキング対象箇所を記憶手段に記憶するとともに、マスキング対象箇所を表示画面上に表示し、記憶手段に記憶されたマスキング対象箇所を、表示画面上でユーザにより修正されたマスキング対象箇所に書き換え、入力文書中の当該記憶手段に記憶されたマスキング対象箇所をマスキングすることにより、文書中から検出した固有名詞部分の確認と修正が容易に行える。
【００１１】
また、表示画面上に表示された文書中で、ユーザにより指示された新たなマスキング対象箇所を上記記憶手段に記憶し、当該記憶手段に記憶された新たなマスキング対象箇所の文字列を上記辞書に記憶することにより、文書中の固有名詞部分の検出に利用する単語辞書を容易に更新することができる。
【００１２】
また、入力した複数の文字列の中から、各文字列の文字数と、各文字列を構成する文字種と、各文字列の既存文書中の出現頻度のうちの少なくとも１つを基に、前記辞書に記憶する文字列を選択し、この選択された文字列のうち形態素解析できない文字列と、選択された文字列のうち各文字列を形態素解析した結果得られた各文節と上記辞書を基にマスキング対象箇所として検出することができない文字列を上記辞書に記憶することにより、企業内の既存データベースや広く入手可能な市販データベースを利用して、文書中の固有名詞部分の検出に利用する単語辞書の構築、更新が容易に行える。
【００１３】
【発明の実施の形態】
以下、本発明の実施形態について図面を参照して説明する。
【００１４】
図１は、本実施形態に係る文書処理装置を適用した文書マスキング装置の構成を示すブロック図である。なお、本実施形態における文書マスキング装置は、たとえば磁気ディスクなどの記録媒体に記録されたプログラムを読み込み、このプログラムによって動作が制御されるコンピュータによって実現可能である。
【００１５】
文書マスキング装置は、入力部１０１と、制御部１０２と、出力部１０３と、一時記憶部１０４と、辞書登録部１０５と、データ選別部１０６と、頻度算出部１０７と、既存文書記憶部１０８と、マスキング対象特定部１０９と、形態素解析部１１０と、マスキングルール記憶部１１１と、単語辞書１１２と、マスキング修正部１１３と、マスキング確定部１１４と、伏字加工部１１５とから構成されている。
【００１６】
入力手段としての入力部１０１は、処理対象となる文書やデータを、たとえばメモリや磁気ディスク、光ディスクなどから取り込む（入力する）。また、ユーザの指示やキー入力を、たとえばキーボードやマイクなどから取り込む。
【００１７】
制御部１０２は、入力部１０１から入力した情報を受け取り解析した後、当該入力した情報を処理するために必要な各構成部へ、その処理のために必要な情報を送る。各構成部での処理結果は、再び制御部１０２に返されて出力部１０３を介して出力される。出力部１０３では、出力すべき情報（出力情報）を、たとえばディスプレイに表示したり、スピーカから音声にて出力したりする。制御部１０２の処理動作の詳細は後述する。
【００１８】
一時記憶部１０４は、処理結果などを一時的に記憶する記憶領域であり、たとえばＲＡＭや磁気ディスクなどからなる。一時記憶部１０４には、マスキング結果リスト１０４ａ、登録候補リスト１０４ｂが記憶される。各リストに記憶される情報については後述する。
【００１９】
辞書登録部１０５は、制御部１０２を介して、入力部１０１から入力した登録候補データを受け取ると、データ選別部１０６を起動して、登録候補データをデータ選別部１０６へ渡す。
【００２０】
データ選別部１０６は、頻度算出部１０７、形態素解析部１１０、マスキング対象特定部１０９を起動して、受け取った登録候補データの中から、単語辞書１１２への登録候補として有効なデータ（単語）を選別する。データ選別部１０６の処理動作の詳細は後述する。
【００２１】
頻度算出部１０７は、既存文書記憶部１０８に記憶されている文書（既存文書）を参照して、登録候補データの既存文書中の出現頻度を算出する。
【００２２】
マスキング対象特定部１０９は、制御部１０２を介して文書を受け取り、形態素解析部１１０を起動し、マスキングルール記憶部１１１に記憶されているマスキングルールと、マスキングすべき（伏字で置き換えるべき）文字列またはその一部を記憶する単語辞書１１２を参照して、文書中のマスキング対象箇所（伏字で置き換える文字列）を検出（特定）する。マスキング対象特定部１０９の処理動作の詳細は後述する。
【００２３】
形態素解析部１１０は、単語辞書１１２を参照して、形態素解析を行う。形態素解析部１１０の処理動作は広く公知であるので、説明を省略する。
【００２４】
マスキング対象特定部１０９の処理結果は、制御部１０２を介して、マスキング結果リスト１０４ａとして記憶され、出力部１０３を介して、たとえばディスプレイなどに表示される。
【００２５】
マスキング修正部１１３は、制御部１０２を介して、マスキング対象箇所に対するユーザの修正指示を受け取り、その情報をマスキング結果リスト１０４ａに記憶する。
【００２６】
マスキング確定部１１４は、制御部１０２を介して、マスキング対象箇所に対するユーザの確定指示を受け取り、その情報をマスキング結果リスト１０４ａに記憶する。
【００２７】
伏字加工部１１５は、制御部１０２を介して、ユーザの伏字加工指示を受け取り、確定されたマスキング対象箇所をあらかじめ設定された文字や記号や塗り潰し、空白等の伏字で置換えて、その結果を出力部１０３を介して、たとえばディスプレイなどに表示する。
【００２８】
次に、上記各部の詳細についてフローチャートを用いて説明する。
【００２９】
（ａ）制御部１０２の処理動作
図２は制御部１０２の処理動作を示すフローチャートである。
【００３０】
まず、ステップＳ２０１では、ユーザの指示がマスキング対象箇所の特定であるか否か判定する。マスキング対象箇所の特定である場合はステップＳ２０２に進む。そうでない場合は、ステップＳ２１１に進む。
【００３１】
ステップＳ２０２では、入力部１０１を介して処理対象の文書を取り込み、ステップＳ２０３に進む。
【００３２】
ステップＳ２０３では、マスキング対象特定部１０９を起動し、ステップＳ２０４に進む。
【００３３】
ステップＳ２０４では、ユーザの指示がマスキング対象箇所の修正であるか否か判定する。マスキング対象箇所の修正である場合はステップＳ２０５に進む。そうでない場合は、ステップＳ２０６に進む。
【００３４】
ステップＳ２０５では、マスキング修正部１１３を起動し、ステップＳ２０６に進む。ステップＳ２０６では、ユーザの指示がマスキング対象箇所の確定であるか否か判定する。マスキング対象箇所の確定である場合はステップＳ２０７に進む。そうでない場合は、処理を終了する。
【００３５】
ステップＳ２０７では、マスキング確定部１１４を起動し、ステップＳ２０８に進む。ステップＳ２０８では、ユーザの指示がマスキング対象箇所の伏字加工であるか否か判定する。マスキング対象箇所の伏字加工である場合はステップＳ２０９に進む。そうでない場合は、ステップＳ２１０に進む。
【００３６】
ステップＳ２０９では、伏字加工部１１５を起動し、ステップＳ２１０に進む。ステップＳ２１０では、ユーザの指示がマスキング対象箇所の辞書登録であるか否か判定する。マスキング対象箇所の辞書登録である場合には、ステップＳ２１３に進む。そうでない場合は、処理を終了する。
【００３７】
一方、ステップＳ２０１からステップＳ２１１に進んだ場合は、ステップＳ２１１では、ユーザの指示が一括辞書登録であるか否か判定する。一括辞書登録である場合はステップＳ２１２に進む。そうでない場合は処理を終了する。
【００３８】
ステップＳ２１２では、入力部１０１を介して登録候補データを取り込み、ステップＳ２１３に進む。ここで入力部１０１を介して入力する登録候補データとしては、例えば、電子化された電話帳等のデータに含まれている氏名、住所等である。一括登録の場合、電話帳などとして記録されている氏名、住所等に含まれている氏名や地名などの単語を単語辞書１１２に自動的に一括登録することができる。
【００３９】
ステップＳ２１３では、辞書登録部１０５を起動する。辞書登録部１０５は、データ選別部１０６を起動する。データ選別部１０６は、入力した登録候補データの中から、単語辞書１１２への登録候補として有効なデータ（単語）を選別するので、辞書登録部１０５はその選別された単語を単語辞書１１２に登録し、処理を終了する。
【００４０】
図５は、単語辞書１１２のデータ構造の一例を示す。単語辞書には、各単語の表記、読み、品詞、属性等の情報が記憶されている。特に、単語辞書１１２に記憶されている各単語の属性は、（ステップＳ２０３において）マスキング対象特定部１０９が、入力された文書中の単語がマスキング対象であるか否かを判定する際に用いられる。
【００４１】
このようにして、制御部１０２は、ユーザの指示に基づき、入力部１０１が取り込んだ情報を必要な処理部に送り、各処理部の起動の制御を行う。
【００４２】
（ｂ）マスキング対象特定部の処理動作
マスキング対象特定部１０９の処理動作について、図３に示すフローチャートを参照して説明する。まず、ステップＳ３０１では、マスキング対象特定部１０９では、入力部１０１から入力した処理対象となる文書を読み込み、ステップＳ３０２に進む。
【００４３】
ステップＳ３０２では、変数Ｎに文書件数を、文書数をカウントする変数ｉに初期値としての「１」をセットし、ステップＳ３０３に進む。ステップＳ３０３で、ｉがＮ以下であるか否か判定する。ｉがＮ以下である場合、ステップＳ３０４に進む。ｉがＮより大きい場合は、処理を終了する。
【００４４】
ステップＳ３０４で、形態素解析部１１０を起動し、ｉ番目に読み込まれた文書、すなわち、文書［ｉ］の形態素解析を行い、ステップＳ３０５に進む。
【００４５】
ここで、図７を参照して形態素解析について簡単に説明する。例えば、図７（ａ）に示すような文が、形態素解析の処理対象であるとする。この文を文節、単語に分割して、各単語に対する品詞を付加することにより、図７（ｂ）に示すような形態素解析結果が得られる。なお、文節、単語に分割できないときは、形態素解析ができない、あるいは形態素解析が失敗した、ということであり、文節、単語に分割できたときは形態素解析が成功した、ということである。図７（ｂ）において、記号「／」は文節の区切り、記号「＋」は文節内での単語の区切り、記号＜＞で囲まれた文字列は品詞を示している。
【００４６】
なお、文節、単語など少なくとも１つの文字からなるものを、簡単に文字列とも呼ぶ。
【００４７】
図３の説明に戻り、ステップＳ３０５では、変数Ｓに文節数を、文節数をカウントする変数ｋに初期値として「１」をセットし、ステップＳ３０６に進む。ステップＳ３０６では、ｋがＳ以下であるか否か判定する。ｋがＳ以下である場合は、ステップＳ３０７に進む。ｋがＳより大きい場合は、ステップＳ３１０に進む。なお、以下の説明で、第ｋ番目の文節を文節［ｋ］と呼ぶ。
【００４８】
ステップＳ３０７では、文節［ｋ］と、その次の文節［ｋ＋１］は、マスキングルール記憶部１１１に記憶されているマスキングルールの条件を満たすか否か判定する。その際、単語辞書１１２に登録された、「属性」情報等を参照する。２つの文節のそれぞれに含まれている各単語を単語辞書１１２から検索し、そのような単語が単語辞書１１２に存在するときには、その属性を読み出す。この２つの文節に含まれる単語の属性の対応関係がマスキングルールとして記憶されているときには（条件を満たすルールが存在する場合は）、ステップＳ３０８に進む。そうでない場合は、ステップＳ３０９に進む。
【００４９】
図６は、マスキングルール記憶部１１１に記憶されているマスキングルールの一例を示したものである。各ルールは、条件と結果により記述されている。たとえば、１番目のルールでは、文節［ｋ］の属性が企業名であったら、文節［ｋ］は企業名であると特定する。文節［ｋ＋１］の欄が空欄であるときは、そのルールが１文節のルールであることを示している。また、たとえば、４番目のルールでは、文節［ｋ］の品詞が未知語、文節［ｋ＋１］の属性が人名共起語であったら、文節［ｋ］は人名と特定する。ここでは、１文節または２文節のルールの例を示したが、３文節以上のルールであってもよい。なお、３文節以上のルールの場合は、その数に応じた文節（文節［ｋ］、文節［ｋ＋１］、文節［ｋ＋２］、…）とルールとのマッチングを行う。
【００５０】
図３の説明に戻り、ステップＳ３０８で、条件を満たす文節（ルールにマッチした少なくとも１つの文節）をマスキング対象箇所と特定し、マスキング結果リスト１０４ａとして記憶し、ステップＳ３０９に進む。
【００５１】
図８に、マスキング結果リスト１０４ａの一例を示す。マスキング結果リスト１０４ａには、マスキング対象箇所として特定された文節の表記、すなわち、マスキング対象表記と、当該文節の前側３文字、その後側３文字、マスキング対象表記の開始位置、文字数、種類、確定の有無、伏字加工の有無、登録の有無の９個の情報が記憶される。ステップＳ３０８が終了した段階では、図８に示すように、確定、伏字加工、登録の有無の欄は未記入である。
【００５２】
図３の説明に戻り、ステップＳ３０９では、ｋを１つインクリメントし、ステップＳ３０６へ戻り、文書［ｉ］中の全ての文節について、ステップＳ３０７〜ステップＳ３０８の処理を繰り返す。また、ステップＳ３１０では、ｉを１つインクリメントし、ステップＳ３０３へ戻り、入力された全ての文書について、ステップＳ３０４〜ステップＳ３０９の処理を繰り返す。
【００５３】
このようにして、マスキング対象特定部１０９は、形態素解析部１１０における形態素解析およびマスキングルールを用いて、各文書中のマスキング対象箇所を特定する。特定されたマスキング対象箇所は、マスキング結果リスト１０４ａとして、図８に示すように記憶され、出力部１０３を介して、たとえばディスプレイなどに表示される。
【００５４】
（ｃ）データ選別部の処理動作
図２のステップＳ２１３における辞書登録処理のデータ選別部１０６の処理動作について、図４に示すフローチャートを参照して説明する。
【００５５】
まず、ステップ４０１では、データ選別部１０６は、辞書登録部１０５に入力した処理対象となるデータを読み込み、登録候補リスト１０４ｂとして記憶し、ステップＳ４０２に進む。処理対象となるデータとは、一括辞書登録の場合には、入力部１０１から読み込まれる登録候補データであり、マスキング対象箇所の辞書登録の場合（すなわち、後述する、マスキング対象の修正結果に基づく単語辞書の更新の場合）には、マスキング結果リスト１０４ａに記憶されるデータのうち登録指示のあるデータ（たとえば、図９に示すように、「登録」欄に「○」印の付加されている単語）である。
【００５６】
ここでは、ユーザの指示が一括辞書登録である場合（図２のステップＳ２１１）の図２のステップＳ２１３における処理動作、すなわち、図２のステップＳ２１２において入力部１０１から読み込まれる登録候補データを基に単語辞書１１２を更新する場合を例にとり説明する。
【００５７】
図１０は、登録候補リスト１０４ｂとして記憶される情報の一例である。登録候補リスト１０４ｂには、入力部１０１から読み込まれた各登録候補データについて、その表記、種類、出現頻度、形態素解析が成功か否か、マスキング対象特定が成功か否か、選別結果の６個の情報が記述される。ステップＳ４０１が終了した段階では、図１０に示すように、各登録候補データの表記と種類以外の欄は未記入である。
【００５８】
図４の説明に戻り、ステップＳ４０２では、変数Ｎにデータ件数を、データ数をカウントする変数ｉに初期値「１」をセットし、ステップＳ４０３に進む。ステップＳ４０３で、ｉがＮ以下であるか否か判定する。ｉがＮ以下である場合は、ステップＳ４０４に進む。ｉがＮより大きい場合は、処理を終了する。なお、ここでは、第ｉ番目の登録候補データをデータ［ｉ］と呼ぶ。
【００５９】
ステップＳ４０４で、データ［ｉ］の表記の文字列長（文字数）は所定値α以上であるか否か判定する。ここで、αとはあらかじめ設定しておく閾値で、たとえばαは「２」と設定されているとする。文字列長がα以上である場合は、ステップＳ４０５に進む。文字列長がαより小さい場合は、当該データ［ｉ］は、単語辞書１１２への登録対象から除くべく、ステップＳ４１３に進む。
【００６０】
ステップＳ４０５では、データ［ｉ］の表記の文字列構成は平仮名のみであるか否か判定する。平仮名のみである場合は、当該データ［ｉ］は、単語辞書１１２への登録対象から除くべく、ステップＳ４１３に進む。平仮名以外の文字種を含む場合はステップＳ４０６に進む。
【００６１】
ステップＳ４０６では、まず頻度算出部１０７を起動する。頻度算出部１０７は、既存文書記憶部１０８に記憶されている既存文書中の、データ［ｉ］の表記の出現頻度を算出し、その情報を登録候補リスト１０４ｂの「出現頻度」の欄に記憶し、ステップＳ４０７に進む。ステップＳ４０７で、出現頻度が所定値β以上であるか否か判定する。ここで、βとはあらかじめ設定しておく閾値で、たとえばβは「３」と設定されているとする。出現頻度がβ以上である場合は、ステップＳ４０８に進む。出現頻度がβより小さい場合は、当該データ［ｉ］は、単語辞書１１２への登録対象から除くべく、ステップＳ４１３に進む。
【００６２】
ステップＳ４０８では、形態素解析部１１０を起動して、出現頻度がβ以上のデータ［ｉ］の表記を形態素解析し、ステップＳ４０９に進む。ステップＳ４０９で、形態素解析に成功したか否か判定する。成功した場合はステップＳ４１０に進む。失敗した場合は、当該データ［ｉ］を単語辞書１１２へ登録すべく、ステップＳ４１２に進む。なお、形態素解析の結果は、図１１に示すように、登録候補リスト１０４ｂの「形態素解析」欄に記憶される（図１１では、形態素解析に成功したときは「○」印、失敗したときは「×」印で表している）。
【００６３】
ステップＳ４１０では、マスキング対象特定部１０９を起動して、図３に示すように、マスキングルールを基に、データ［ｉ］の表記からマスキング対象箇所を特定し、ステップＳ４１１に進む。
【００６４】
ステップＳ４１１で、データ［ｉ］の表記をマスキング対象箇所として特定できたか否か判定する。特定できた場合は、当該データ［ｉ］は、現状の単語辞書１１２からマスキング対象箇所として特定可能であり、今回わざわざ単語辞書１１２へ新規登録する必要はないので、単語辞書１１２への登録対象から除くべく、ステップＳ４１３に進む。特定できなかった場合は、データ［ｉ］を単語辞書１１２へ登録すべく、ステップＳ４１２に進む。なお、ここでの判定結果は、図１１に示すように、登録候補リスト１０４ｂの「マスキング対象特定」欄に記憶される（図１１では、特定可能なときは「○」印、特定できないときは「×」印で表している）。
【００６５】
ステップＳ４０９またはステップＳ４１１から、ステップＳ４１２に進んだ場合、ステップＳ４１２で、データ［ｉ］は単語辞書１１２に登録すると判定し、その結果を登録候補リスト１０４ｂに記憶し（図１１では、「選別結果」欄に「○」印で表している）、ステップＳ４１４に進む。
【００６６】
一方、ステップＳ４０４、ステップＳ４０５、ステップＳ４０７、ステップＳ４１１から、ステップＳ４１３に進んだ場合、ステップＳ４１３では、データ［ｉ］は単語辞書１１２に登録しないと判定し、その結果を登録候補リスト１０４ｂに記憶し（図１１では、「選別結果」欄に「×」印で表している）、ステップＳ４１４に進む。
【００６７】
ステップＳ４１４で、ｉを１つインクリメントし、ステップＳ４０３に戻り、全てのデータの登録候補データについて、ステップＳ４０４〜ステップＳ４１３の処理を繰り返す。
【００６８】
このようにして、データ選別部１０６は、頻度算出部１０７、形態素解析部１１０、マスキング対象特定部１０９を用いて、文字列長がα以上で平仮名以外の文字を含む、既存文書中の出現頻度がβ以上である登録候補データのうち、形態素解析ができない、現状の単語辞書１１２でマスキング対象として特定することができない、のうちのいずれか１つを満たすものを、単語辞書１１２に登録する、有効な文字列として選別する。その結果は、図１１に示すように、登録候補リスト１０４ｂの「選別結果」欄に記憶される。
【００６９】
データ選別部１０６のステップＳ４１２の処理が終了した段階では、図１１に示すようになる。すなわち、「山田太朗」と「林政治」の２つが、単語辞書１１２への有効な登録候補として選別されている。この選別された登録候補データは、種類から得られる属性や品詞とともに、単語辞書１１２へ図１２に示したように追加登録される。その際、新規に登録する単語を、辞書登録部１０５，制御部１０２を介して、出力部１０３から、例えば、図１３に示すように表示して、ユーザに、読み方や、品詞、属性、登録の有無を問い合わせてから単語辞書１１２に登録してもよい。また、この画面上でユーザにより登録指示のあったものだけを単語辞書１１２に登録してもよい。その際、図１３に示した画面上に入力された「読み」や「品詞」、「属性」を、単語辞書１１２に登録する。
【００７０】
（ｄ）マスキング対象の修正結果に基づく単語辞書の更新
さて、図１に示した文書マスキング装置に対し、ユーザから、マスキング対象箇所の特定が指示されて、図２のステップＳ２０３のマスキング対象特定処理（図３参照）により、入力部１０１から入力した処理対象の文書から、図８に示したような、マスキング結果リスト１０４ａが得られたとする。
【００７１】
図１４は、マスキング結果リスト１０４ａの表示例であって、文書表示画面の一例を示したものである。
【００７２】
出力部１０３は、当該処理対象の文書から得られた図８に示したマスキング結果リスト１０４ａを基に、入力部１０１から入力した処理対象の文書を、図１４に示したように表示する。
【００７３】
図１４では、マスキング対象として求められた語、「Ａ社」「通信研究所」「山田太郎」などが、処理対象の文書中で、他の箇所と区別できるよう、反転表示や強調表示などの特殊表示が施される。
【００７４】
例えば、図２のステップＳ２０３のマスキング対象特定処理により、図１４に示したような画面が表示されたとき、図２のステップＳ２０４において、ユーザが、マスキング対象箇所の修正を指示（例えば、当該画面上に設けられた所定のボタンを選択する等）したとき、ステップＳ２０５において、マスキング修正部１１３が起動される。そして、このとき、ユーザが、たとえば「通信研究所」をマウス等を用いて選択し、その選択指示がマスキング修正部１１３に送られる。ユーザは、この選択したマスキング対象の文字列を「情報通信研究所」となるように、処理対象の文書中の当該マスキング対象箇所の直前直後の少なくとも１つの文字を追加する修正の指示をマスキング修正部１１３から入力すると、マスキング修正部１１３を介して制御部１０２により、マスキング結果リスト１０４ａは、図９に示すように、「マスキング対象表記」欄のユーザにより修正されたマスキング対象箇所が「通信研究所」から「情報通信研究所」に書き換えられる。図９に示したマスキング結果リストに基づき、画面表示も、図１５に示すように更新される。
【００７５】
なお、ユーザによるマスキング対象箇所の修正としては、上記のような修正の他に、当該マスキング対象箇所の文字列から少なくとも１つの文字を削除する修正もある。この修正は、例えば、処理対象の文書中から検出されたマスキング対象箇所「通信研究所」から先頭の２文字「通信」を削除して、「研究所」に修正するような場合である。この場合も、やはり、上記同様にして、マスキング結果リスト１０４ａは、「マスキング対象表記」欄のユーザにより修正されたマスキング対象箇所が「通信研究所」から「研究所」に書き換えられる。そして、この書き換えられたマスキング結果リストに基づき、画面表示も更新される。
【００７６】
また、ユーザによるマスキング対象箇所の修正としては、上記２例の他に、さらに、新たなマスキング対象箇所を追加指定する場合もある。例えば、図１４に示した画面上で、図１４には、図示されていないが、「正月一日」という人名が当該処理対象の文書中に存在するが、これが、マスキング対象箇所としてマスキング対象特定部１０９により検出（特定）されなかったとする。この場合、ユーザは、この文字列を指定すると、マスキング結果リスト１０４ａの「マスキング対象表記」欄にユーザにより追加されたマスキング対象箇所「正月一日」が書き加えられる。その結果としてのマスキング結果リストに基づき、画面表示も、上記同様にして更新される。
【００７７】
次に、ユーザがマスキング対象箇所の確定を指示（例えば、当該画面上に設けられた所定のボタンを選択する等）すると（図２のステップＳ２０６）、マスキング確定部１１４が起動し（図２のステップＳ２０７）、マスキング結果リスト１０４ａは、図９に示すように、各マスキング対象の「確定」欄に確定された旨が記録され（図９では、「○」印で表されている）、画面表示も、図１６に示すように、反転表示されていた箇所が下線表示に変わり、修正された個所を含めて、マスキング対象箇所が確定されたことを示している。
【００７８】
次に、ユーザがマスキング対象箇所の伏字加工を指示（例えば、当該画面上に設けられた所定のボタンを選択する等）すると（図２のステップＳ２０８）、伏字加工部１１５が起動し（図２のステップＳ２０９）、マスキング結果リスト１０４ａは、図９に示すように、各マスキング対象の「伏字加工」欄に伏字加工が指示された旨が記録され（図９では、「○」印で表されている）、画面表示も図１７に示すように、各マスキング対象箇所が、たとえば記号「×」で置き換えられる。
【００７９】
なお、図２の上記ステップＳ２０４〜ステップＳ２１０において、マスキング箇所の修正、マスキングの確定指示、伏字加工の指示、単語辞書１１２への登録指示は、図１８に示すような画面上からでも可能である。
【００８０】
図１８は、図２のステップＳ２０３で得られた、図８に示したマスキング結果リスト１０４ａの他の表示例を示したものである。
【００８１】
図１８に示す画面へは、図１４から図１７の文書表示画面上に設けられた「リスト一覧画面へ」ボタンＢ１のいずれかをマウス等を用いて選択する（押す）ことにより遷移することができる。
【００８２】
図１８に示す画面表示例では、処理対象の文書から検出されて、マスキング結果リスト１０４ａに記憶されたマスキング対象箇所の表記「Ａ社」「通信研究所」「山田太郎」などが、文脈を示す前後の文字列とともに、種類別にリストとして表示されている。なお、図１８に示すようなリスト一覧表示は、その表示指示がなされた時点におけるマスキング結果リスト１０４ａの内容を基に表示されるので、この表示指示がなされる以前に、ユーザによりマスキング対象箇所の修正がなされた場合には、その修正結果がマスキング結果リスト１０４ａに記憶されているので、その修正後のマスキング対象箇所が図１８に示すように表示されることになる。
【００８３】
さて、図１８に示すリスト一覧表示画面には、マスキング確定指示のための「確定」指示領域と、単語辞書への「登録」指示領域と、「伏字加工」指示領域とが、各マスキング対象に設けられている。
【００８４】
図１８に示した画面上で、ユーザがマスキング対象として検出された「通信研究所」をマウス等を用いて選択し、その選択指示が入力部１０１を介して制御部１０２に送られる。ユーザは、この選択したマスキング対象の単語を「情報通信研究所」となるように、マスキングする範囲を変更して（「情報通信研究所」に修正して）、「Ａ社」「情報通信研究所」「山田太郎」の「確定」指示領域と「伏字加工」指示領域にチェック（ここでは、「×」印）を入力し、「情報通信研究所」の「登録」指示領域にチェックを入力すると、画面は図１９のようになる。
【００８５】
図１９に示した画面表示の状態において、画面中央下「確定実行」ボタンＢ３を押すと、チェックを入れた表記が確定される。すなわち、マスキング修正部１１３が起動して（図２のステップＳ２０５）、マスキング結果リスト１０４ａは、図９に示すように、「マスキング対象表記」欄のユーザにより修正されたマスキング対象箇所が「通信研究所」から「情報通信研究所」に変更される。また、マスキング確定部１１４が起動して（図２のステップＳ２０７）、マスキング結果リスト１０４ａは、図９に示すように、各マスキング対象の「確定」欄に確定された旨が記録され（図９では、「○」印で表されている）。ここで、さらに、「文書画面へ」ボタンＢ２を押すと、画面は図１６に示したように、前述同様、反転表示されていた箇所が下線表示に変わり、修正された個所を含めて、マスキング対象箇所が確定されたことを示している。
【００８６】
また、図１９に示した画面表示の状態において、画面中央下「伏字加工実行」ボタンＢ５を押すと、チェックを入れた表記が伏字加工される。すなわち、伏字加工部１１５が起動し（図２のステップＳ２０９）、マスキング結果リスト１０４ａは、図９に示すように、各マスキング対象の「伏字加工」欄に伏字加工が指示された旨が記録される（図９では、「○」印で表されている）。ここで、さらに、「文書画面へ」ボタンＢ２を押すと、画面は図１７に示したように、前述同様、各マスキング対象箇所が、たとえば記号「×」で置換される。
【００８７】
また、図１９に示した画面表示の状態において、画面中央下「登録実行」ボタンＢ４を押すと（図２のステップＳ２１０）、辞書登録部１０５が起動し、単語辞書１１２に、登録欄にチェック（記号「×」）を入れた表記が登録される。このとき、図４に示したフローチャートに従って、チェックが入力された表記は、辞書登録部１０５を解してデータ選別部１０６へ入力し、この入力した単語のうち、文字列長、既存文書中の出現頻度、形態素解析の結果、マスキング対象特定結果を基に、有効な登録候補と判定されたものだけを単語辞書１１２に登録する。なお、マスキング結果リスト１０４ａの「登録」欄に登録する旨が記憶された（「○」印が記録された）語は、データ選別部１０６での図４に示した処理を経ずに、そのまま単語辞書１１２に登録するようにしてもよい。また、図１８や図１９に示したような、単語辞書への登録指示のための「登録」欄を設けずに、確定の指示のあった、マスキング対象箇所の文字列は全て、データ選別部１０６での図４に示した処理を経て、選別された文字列を単語辞書に登録するようにしてもよい。
【００８８】
マスキング結果リスト１０４ａは、マスキング修正部１１３、マスキング確定部１１４、伏字加工部１１５、辞書登録部１０５の各構成部の処理動作により、図９のように更新されている。すなわち、図１４〜図１７に示した画面上で操作したときと同様に、修正されたマスキング対象表記とその前後の表記が変更になり、「確定」欄、「伏字加工」欄、「登録」欄に、ユーザの指示に応じた情報が記入されている。
【００８９】
このように、図１４〜図１７に示した文書表示画面上、図１８〜図１９に示したリスト一覧表示画面上にて、マスキング対象箇所の確認、修正、確定、伏字加工を行うことができる。また、リスト一覧表示画面上からは、マスキング対象の修正内容を反映させた辞書登録を指示することができる。
【００９０】
なお、上記のように、マスキング対象の修正結果で単語辞書１１２を更新する場合、例えば、上記例の場合、表記「通信研究所」を「情報通信研究所」に修正する場合も、図１３と同様に、この「情報通信研究所」という語をユーザに表示するとともに、その読み方や、品詞、属性、登録の有無を問い合わせてもよい。そして、この画面上でユーザにより登録指示のあったものだけを単語辞書１１２に登録してもよい。その際、図１３に示した画面上に入力された「読み」や「品詞」、「属性」を、単語辞書１１２に登録してもよい。
【００９１】
なお、マスキング箇所の修正により、単語辞書から単語を削除する場合も、上記同様に行うことができる。例えば、図１５や図１８に示した画面に表示されたマスキング対象のうち、マスキング対象から除きたい語の反転表示や強調表示などの特殊表示を解除する操作を行い、その後、「確定実行」ボタンＢ３を操作したり、「登録実行」ボタンＢ４を操作するなどして、マスキングリスト結果リスト１０４ａ上で、削除の旨を記録する。例えば、マスキング結果リスト１０４ａには、このために、「削除」欄が設けられていてもよい。「削除」欄に「○」印が記録されている単語は、その後、辞書登録部１０５により、単語辞書１１２から削除する。
【００９２】
また、リスト一覧表示画面では、マスキング対象として特定された文字列を、その読み方、種類毎にソートして表示したり、同じ文字列が複数あるときは、そのうちの１つを表示するようにしてもよい。
【００９３】
以上説明したように、上記実施形態によれば、マスキングすべき文字列またはその一部を記憶する単語辞書を基に、入力部１０１から入力した文書からマスキング対象特定部１０９において、マスキング対象箇所を特定（検出）し、この検出されたマスキング対象箇所をマスキング結果リスト１０４ａ（記憶手段）に記憶するとともに、このマスキング結果リストに記憶されたマスキング対象箇所を表示画面上に表示する。表示されたマスキング対象箇所のいずれかがユーザにより修正されると、マスキング結果リストに記憶されたマスキング対象箇所を、ユーザにより修正されたマスキング対象箇所に書き換え、この書き換えられたマスキング結果リストに記憶されたマスキング対象箇所を基に、文書中の当該マスキング対象箇所をマスキングすることにより、検出されたマスキング対象箇所（伏字に置き換えるべき固有名詞等）の確認と修正が容易に行える。
【００９４】
また、表示画面上に表示された文書中で、ユーザにより指示された新たなマスキング対象箇所をマスキング結果リストに記憶し、後にこのリストに記憶された新たなマスキング対象箇所の文字列を単語辞書に記憶することにより、表示画面上に表示されたマスキング対象箇所の確認、修正とともに、この修正内容に基づき、固有名詞等の検出に利用する単語辞書を容易に更新することができる。
【００９５】
また、入力部１０１から複数の文字列を入力し、データ選別部１０６において、この複数の文字列の中から、各文字列の文字数と、各文字列を構成する文字種と、各文字列の既存文書中の出現頻度と、各文字列の形態素解析の結果のうちの少なくとも１つを基に、単語辞書に記憶する文字列を選択し、この選択された文字列のうち形態素解析できない文字列、選択された文字列のうち各文字列を形態素解析した結果得られた各文節と単語辞書を基にマスキング対象箇所として検出することができない文字列を、単語辞書に記憶することにより、マスキング対象箇所の検出に用いる単語辞書１１２が容易に構築、更新することができる。
【００９６】
このように、上記実施形態によれば、入力文書から検出されたマスキング対象箇所の確認と修正が容易に行えるとともに、入力文書中から検出されたマスキング対象箇所を高精度に検出することのできる単語辞書の構築と更新が容易に行える。
【００９７】
従って、上記実施形態に係る文書処理装置によれば、文書からマスキング対象箇所を高精度に検出し、マスキング対象箇所を伏字に置き換えたり、塗り潰す等して秘匿することができ、当該文書中の固有名詞等のプライバシー情報の侵害を事前に防止し、文書の共有および流通を容易にする。
【００９８】
本発明の実施の形態に記載した本発明の手法は、コンピュータに実行させることのできるプログラムとして、磁気ディスク（フレキシブルディスク、ハードディスクなど）、光ディスク（ＣＤ−ＲＯＭ、ＤＶＤなど）、半導体メモリなどの記録媒体に格納して頒布することもできる。
【００９９】
なお、本発明は、上記実施形態に限定されるものではなく、実施段階ではその要旨を逸脱しない範囲で種々に変形することが可能である。さらに、上記実施形態には種々の段階の発明は含まれており、開示される複数の構成要件における適宜な組み合わせにより、種々の発明が抽出され得る。例えば、実施形態に示される全構成要件から幾つかの構成要件が削除されても、発明が解決しようとする課題の欄で述べた課題（の少なくとも１つ）が解決でき、発明の効果の欄で述べられている効果（の少なくとも１つ）が得られる場合には、この構成要件が削除された構成が発明として抽出され得る。
【０１００】
【発明の効果】
以上説明したように本発明によれば、文書中から検出された固有名詞部分の確認と修正が容易に行える。
【０１０１】
また、文書中の固有名詞部分の検出に利用する単語辞書を容易に更新することができる。
【図面の簡単な説明】
【図１】本発明の一実施形態にかかる文書マスキング装置の構成例を示す図。
【図２】制御部の処理動作を説明するためのフローチャート。
【図３】マスキング対象特定部の処理動作を説明するためのフローチャート。
【図４】データ選別部の処理動作を説明するためのフローチャート。
【図５】単語辞書のデータ構造の一例を示した図。
【図６】マスキングルールの一例を示した図。
【図７】形態素解析について説明するための図。
【図８】マスキング結果リストに記憶される情報を説明するための図。
【図９】マスキング結果リストに記憶される情報を説明するための図。
【図１０】登録候補リストに記憶される情報を説明するための図。
【図１１】登録候補リストに記憶される情報を説明するための図。
【図１２】単語辞書の更新結果の一例を示した図。
【図１３】単語辞書に記憶する情報を入力するための画面表示例を示した図。
【図１４】マスキング対象箇所を表示する文書表示画面の一例を示した図。
【図１５】修正されたマスキング対象箇所を表示する文書表示画面の一例を示した図。
【図１６】マスキング対象箇所が確定されたときの文書表示画面上の一例を示した図。
【図１７】マスキング対象箇所をマスキングしたときの文書表示画面の一例を示した図。
【図１８】マスキング対象箇所を表示するリスト一覧表示画面の一例を示した図。
【図１９】修正されたマスキング対象箇所等を表示するリスト一覧表示画面の一例を示した図。
【符号の説明】
１０１…入力部、１０２…制御部、１０３…出力部、１０４…一時記憶部、１０５…辞書登録部、１０６…データ選別部、１０７…頻度算出部、１０８…既存文書記憶部、１０９…マスキング対象特定部、１１０…形態素解析部、１１１…マスキングルール記憶部、１１２…単語辞書、１１３…マスキング修正部、１１４…マスキング確定部、１１５…伏字加工部
[0001]
[Technical field of the invention]
The present invention relates to a document processing method and apparatus that replace privacy-related proper nouns and similar portions of a document with masking characters (fuseji) so that they cannot be identified.
[0002]
[Prior art]
When existing electronic documents such as e-mails, internal reports, and daily or weekly reports are shared or reused, the private information carried by proper nouns may be exposed, depending on the document. Such documents therefore had to be handled manually, for example by deleting or hiding the proper nouns that relate to private information.
[0003]
To address this, there are techniques that extract privacy-related proper nouns from a document and replace the extracted portions with masking characters (see, for example, Patent Document 1 and Patent Document 2). In this approach, the words to be made unidentifiable are registered in a word dictionary, and privacy-related proper nouns are extracted by morphological analysis using that dictionary.
[0004]
However, this approach provides neither a means of correcting an extracted proper noun portion that is wrong nor a means of updating the word dictionary.
[0005]
[Patent Document 1]
JP 2002-259363 A
[0006]
[Patent Document 2]
JP 2002-259368 A
[0007]
[Problems to be solved by the invention]
Thus, conventionally, when proper nouns were detected in a document and replaced with masking characters, there was the problem that a wrongly detected proper noun portion could not be corrected and that the word dictionary used for detecting proper nouns could not be updated.
[0008]
In view of the above problems, an object of the present invention is to provide a document processing method and apparatus capable of easily confirming and correcting a proper noun portion detected from a document.
[0009]
Another object of the present invention is to provide a document processing method and apparatus capable of easily updating a word dictionary used for detecting proper noun parts in a document.
[0010]
[Means for Solving the Problems]
The present invention detects masking target locations in an input document based on a dictionary that stores character strings to be masked, or parts of such strings; stores the detected masking target locations in a storage means and displays them on a display screen; rewrites a masking target location stored in the storage means to the masking target location as corrected by the user on the display screen; and masks, within the input document, the masking target locations stored in the storage means. As a result, proper noun portions detected in a document can be easily confirmed and corrected.
[0011]
Further, a new masking target location designated by the user in the document displayed on the display screen is stored in the storage means, and the character string of that new masking target location is then stored in the dictionary. The word dictionary used for detecting proper noun portions in a document can thereby be easily updated.
[0012]
Further, from among a plurality of input character strings, character strings to be stored in the dictionary are selected based on at least one of the number of characters in each string, the character types that make up each string, and the frequency with which each string appears in existing documents. Of the selected strings, those that cannot be morphologically analyzed, and those that cannot be detected as masking target locations based on the dictionary and the phrases obtained by morphologically analyzing them, are stored in the dictionary. A word dictionary used for detecting proper noun portions in documents can thereby be easily built and updated using existing in-house databases or widely available commercial databases.
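The selection just described can be sketched roughly as follows. This is a minimal illustration only, not the patented implementation: the function name `select_candidates`, the two callbacks `can_analyze` and `is_detected` (stand-ins for the morphological analyzer and the existing rule-based detector), and the substring-count frequency measure are all assumptions. The default thresholds 2 and 3 are the example values the embodiment later gives for α and β.

```python
import re
from typing import Callable, Iterable, List

# Matches strings made up of hiragana only (assumed Unicode range U+3041-U+309F).
HIRAGANA_ONLY = re.compile(r"[\u3041-\u309f]+")


def select_candidates(
    candidates: Iterable[str],
    existing_documents: List[str],
    can_analyze: Callable[[str], bool],   # hypothetical: True if morphological analysis succeeds
    is_detected: Callable[[str], bool],   # hypothetical: True if the current dictionary already detects it
    min_length: int = 2,                  # example threshold (alpha)
    min_frequency: int = 3,               # example threshold (beta)
) -> List[str]:
    """Return the candidate strings worth registering in the word dictionary."""
    selected = []
    for text in candidates:
        if len(text) < min_length:
            continue                      # too short
        if HIRAGANA_ONLY.fullmatch(text):
            continue                      # hiragana-only strings are excluded
        frequency = sum(doc.count(text) for doc in existing_documents)
        if frequency < min_frequency:
            continue                      # too rare in existing documents
        # Register only what the current dictionary cannot already handle.
        if not can_analyze(text) or not is_detected(text):
            selected.append(text)
    return selected
```

Passing the analyzer and detector in as callbacks keeps the sketch independent of any particular morphological analysis library.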
[0013]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings.
[0014]
FIG. 1 is a block diagram showing the configuration of a document masking apparatus to which the document processing apparatus according to this embodiment is applied. The document masking apparatus according to the present embodiment can be realized by a computer that reads a program recorded on a recording medium such as a magnetic disk and whose operation is controlled by this program.
[0015]
The document masking apparatus comprises an input unit 101, a control unit 102, an output unit 103, a temporary storage unit 104, a dictionary registration unit 105, a data selection unit 106, a frequency calculation unit 107, an existing document storage unit 108, a masking target specifying unit 109, a morphological analysis unit 110, a masking rule storage unit 111, a word dictionary 112, a masking correction unit 113, a masking confirmation unit 114, and a redaction processing unit 115.
[0016]
The input unit 101, which serves as the input means, takes in (inputs) the documents and data to be processed from, for example, a memory, a magnetic disk, or an optical disk. It also takes in user instructions and key input from, for example, a keyboard or a microphone.
[0017]
After receiving and analyzing the information input from the input unit 101, the control unit 102 sends information necessary for the processing to each component necessary for processing the input information. The processing results in each component are returned to the control unit 102 and output via the output unit 103. In the output unit 103, information to be output (output information) is displayed on, for example, a display, or is output by sound from a speaker. Details of the processing operation of the control unit 102 will be described later.
[0018]
The temporary storage unit 104 is a storage area for temporarily storing processing results and the like, and includes, for example, a RAM or a magnetic disk. The temporary storage unit 104 stores a masking result list 104a and a registration candidate list 104b. Information stored in each list will be described later.
[0019]
When the dictionary registration unit 105 receives registration candidate data input from the input unit 101 via the control unit 102, the dictionary registration unit 105 activates the data selection unit 106 and passes the registration candidate data to the data selection unit 106.
[0020]
The data selection unit 106 activates the frequency calculation unit 107, the morphological analysis unit 110, and the masking target specifying unit 109, and selects, from the received registration candidate data, the data (words) that are valid as candidates for registration in the word dictionary 112. Details of the processing operation of the data selection unit 106 will be described later.
[0021]
The frequency calculation unit 107 refers to the document (existing document) stored in the existing document storage unit 108 and calculates the appearance frequency of the registration candidate data in the existing document.
[0022]
The masking target specifying unit 109 receives a document via the control unit 102, activates the morphological analysis unit 110, and detects (specifies) the masking target locations in the document (the character strings to be replaced with masking characters) by referring to the masking rules stored in the masking rule storage unit 111 and to the word dictionary 112, which stores character strings to be masked (replaced with masking characters) or parts of them. Details of the processing operation of the masking target specifying unit 109 will be described later.
[0023]
The morphological analysis unit 110 performs morphological analysis with reference to the word dictionary 112. Since the processing operation of the morphological analysis unit 110 is widely known, its description is omitted.
[0024]
The processing result of the masking target specifying unit 109 is stored as the masking result list 104a via the control unit 102 and is displayed, for example on a display, via the output unit 103.
[0025]
The masking correction unit 113 receives a user's correction instruction for the masking target portion via the control unit 102, and stores the information in the masking result list 104a.
[0026]
The masking confirmation unit 114 receives a user confirmation instruction for the masking target portion via the control unit 102, and stores the information in the masking result list 104a.
[0027]
The redaction processing unit 115 receives the user's redaction instruction via the control unit 102, replaces the confirmed masking target locations with preset masking characters such as letters, symbols, solid fill, or blanks, and displays the result, for example on a display, via the output unit 103.
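As a rough illustration of this replacement step, the sketch below overwrites each confirmed location with a masking character. It assumes each location is held as a start offset and a character count, which matches the fields the masking result list is later said to store. The names `MaskSpan` and `apply_masks` are hypothetical, and "×" is only the example symbol used later in the description.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class MaskSpan:
    start: int              # 0-based offset of the first character (assumption)
    length: int             # number of characters to mask
    confirmed: bool = True  # only confirmed locations are redacted


def apply_masks(text: str, spans: Iterable[MaskSpan], mask_char: str = "×") -> str:
    """Replace every character of each confirmed span with mask_char."""
    chars = list(text)
    for span in spans:
        if not span.confirmed:
            continue
        end = min(span.start + span.length, len(chars))
        for i in range(span.start, end):
            chars[i] = mask_char
    return "".join(chars)
```

Replacing character by character keeps the masked text the same length as the original, so offsets stored for the other locations stay valid.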
[0028]
Next, details of each of the above-described units will be described using a flowchart.
[0029]
(A) Processing operation of the control unit 102
FIG. 2 is a flowchart showing the processing operation of the control unit 102.
[0030]
First, in step S201, it is determined whether the user's instruction is to specify masking target locations. If it is, the process proceeds to step S202. Otherwise, the process proceeds to step S211.
[0031]
In step S202, a document to be processed is acquired via the input unit 101, and the process proceeds to step S203.
[0032]
In step S203, the masking target specifying unit 109 is activated, and the process proceeds to step S204.
[0033]
In step S204, it is determined whether or not the user's instruction is correction of a masking target portion. When it is correction of the masking target part, the process proceeds to step S205. Otherwise, the process proceeds to step S206.
[0034]
In step S205, the masking correction unit 113 is activated, and the process proceeds to step S206. In step S206, it is determined whether the user's instruction is to confirm the masking target locations. If it is, the process proceeds to step S207. If not, the process ends.
[0035]
In step S207, the masking confirmation unit 114 is activated, and the process proceeds to step S208. In step S208, it is determined whether the user's instruction is to redact the masking target locations (replace them with masking characters). If it is, the process proceeds to step S209. Otherwise, the process proceeds to step S210.
[0036]
In step S209, the redaction processing unit 115 is activated, and the process proceeds to step S210. In step S210, it is determined whether the user's instruction is to register masking target locations in the dictionary. If it is, the process proceeds to step S213. If not, the process ends.
[0037]
On the other hand, if the process proceeds from step S201 to step S211, it is determined in step S211 whether or not the user instruction is batch dictionary registration. If it is batch dictionary registration, the process proceeds to step S212. Otherwise, the process is terminated.
[0038]
In step S212, registration candidate data is taken in via the input unit 101, and the process proceeds to step S213. Here, the registration candidate data input through the input unit 101 is, for example, a name, an address, and the like included in data such as an electronic telephone book. In the case of batch registration, words such as names and place names included in names, addresses, etc. recorded as a telephone book can be automatically registered in the word dictionary 112 in a batch.
[0039]
In step S213, the dictionary registration unit 105 is activated. The dictionary registration unit 105 activates the data selection unit 106. Since the data selection unit 106 selects valid data (words) as registration candidates for the word dictionary 112 from the input registration candidate data, the dictionary registration unit 105 registers the selected words in the word dictionary 112. Then, the process ends.
[0040]
FIG. 5 shows an example of the data structure of the word dictionary 112. The word dictionary stores, for each word, information such as its written form, reading, part of speech, and attribute. In particular, the attribute stored for each word is used when the masking target specifying unit 109 determines (in step S203) whether a word in the input document is a masking target.
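One possible in-memory shape for such a dictionary is sketched below. The four fields follow the description (written form, reading, part of speech, attribute). The class and method names, and keying the dictionary by written form, are illustrative assumptions, since FIG. 5 itself is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class DictionaryEntry:
    surface: str          # written form of the word
    reading: str          # pronunciation
    part_of_speech: str
    attribute: str        # e.g. "company name"; used when deciding masking targets


class WordDictionary:
    """Hypothetical container keyed by written form."""

    def __init__(self) -> None:
        self._entries: Dict[str, DictionaryEntry] = {}

    def register(self, entry: DictionaryEntry) -> None:
        self._entries[entry.surface] = entry

    def attribute_of(self, surface: str) -> Optional[str]:
        """Return the attribute of a registered word, or None if it is unknown."""
        entry = self._entries.get(surface)
        return entry.attribute if entry else None
```

A real dictionary would likely need to handle homographs (several entries per written form); the sketch keeps one entry per key for brevity.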
[0041]
In this manner, the control unit 102 sends information captured by the input unit 101 to a necessary processing unit based on a user instruction, and controls activation of each processing unit.
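The branching of steps S201 to S213 can be summarized in a small sketch. It models the user's instructions as a set of strings and returns the order in which the units would be activated. The instruction names and the returned labels are hypothetical, and the real control unit reacts to instructions interactively rather than receiving them all at once.

```python
from typing import List, Set


def run_control_flow(instructions: Set[str]) -> List[str]:
    """Trace which units FIG. 2 would activate for a given set of instructions."""
    activated: List[str] = []
    if "specify" in instructions:                            # S201
        activated.append("load_document")                    # S202
        activated.append("masking_target_specifying_unit")   # S203
        if "correct" in instructions:                        # S204
            activated.append("masking_correction_unit")      # S205
        if "confirm" not in instructions:                    # S206: no -> end
            return activated
        activated.append("masking_confirmation_unit")        # S207
        if "redact" in instructions:                         # S208
            activated.append("redaction_processing_unit")    # S209
        if "register" in instructions:                       # S210
            activated.append("dictionary_registration_unit") # S213
        return activated
    if "batch_register" in instructions:                     # S211
        activated.append("load_candidates")                  # S212
        activated.append("dictionary_registration_unit")     # S213
    return activated
```

Note that, as in the flowchart, redaction and dictionary registration are reachable only after the masking target locations have been confirmed.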
[0042]
(B) Processing operation of masking target specifying unit
The processing operation of the masking target specifying unit 109 is described with reference to the flowchart shown in FIG. 3. First, in step S301, the masking target specifying unit 109 reads the documents to be processed that were input from the input unit 101, and the process proceeds to step S302.
[0043]
In step S302, the number of documents is set in the variable N, and “1” as an initial value is set in the variable i for counting the number of documents, and the process proceeds to step S303. In step S303, it is determined whether i is N or less. If i is N or less, the process proceeds to step S304. If i is greater than N, the process ends.
[0044]
In step S304, the morphological analysis unit 110 is activated to perform morphological analysis of the i-th document read, that is, document [i], and the process proceeds to step S305.
[0045]
Here, morphological analysis is briefly described with reference to FIG. 7. Suppose, for example, that a sentence such as the one shown in FIG. 7A is the target of morphological analysis. Dividing this sentence into phrases and words and attaching a part of speech to each word yields a morphological analysis result such as the one shown in FIG. 7B. When the sentence cannot be divided into phrases and words, morphological analysis is not possible, or has failed; when it can be divided, morphological analysis has succeeded. In FIG. 7B, the symbol “/” marks a phrase boundary, the symbol “+” marks a word boundary within a phrase, and a character string enclosed in <> indicates a part of speech.
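To make the notation concrete, the sketch below parses a result string written with these delimiters into phrases and (word, part of speech) pairs. Since FIG. 7B is not reproduced here, the exact layout is an assumption: each word is taken to be immediately followed by its part of speech in angle brackets, and the sample string and tag names in the usage note are invented for illustration.

```python
import re
from typing import List, Tuple

# One token: a written form followed by its part of speech in <...> (assumed layout).
TOKEN = re.compile(r"(?P<surface>[^<>]+)<(?P<pos>[^<>]+)>")


def parse_analysis(result: str) -> List[List[Tuple[str, str]]]:
    """Split an analysis string into phrases ("/") and words ("+")."""
    phrases = []
    for phrase in result.split("/"):
        words = []
        for token in phrase.split("+"):
            match = TOKEN.fullmatch(token.strip())
            if match is None:
                raise ValueError(f"unparseable token: {token!r}")
            words.append((match["surface"], match["pos"]))
        phrases.append(words)
    return phrases
```

For example, `parse_analysis("山田<noun>+は<particle>/行く<verb>")` yields two phrases, the first with two words. The sketch does not handle words whose written form itself contains "/" or "+".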
[0046]
Anything consisting of at least one character, such as a clause or a word, is also simply called a character string.
[0047]
Returning to the description of FIG. 3, in step S305, the number of clauses is set in the variable S, and “1” is set as an initial value in the variable k for counting the number of clauses, and the process proceeds to step S306. In step S306, it is determined whether k is S or less. If k is equal to or less than S, the process proceeds to step S307. If k is larger than S, the process proceeds to step S310. In the following description, the kth clause is referred to as clause [k].
[0048]
In step S307, it is determined whether or not the clause [k] and the next clause [k + 1] satisfy the condition of any masking rule stored in the masking rule storage unit 111. For this determination, the “attribute” information and other information registered in the word dictionary 112 are referred to. Each word included in each of the two clauses is looked up in the word dictionary 112, and when the word exists in the word dictionary 112, its attribute is read. When the combination of attributes of the words included in the two clauses is stored as a masking rule (that is, when a rule whose condition is satisfied exists), the process proceeds to step S308. Otherwise, the process proceeds to step S309.
[0049]
FIG. 6 shows an example of the masking rules stored in the masking rule storage unit 111. Each rule is described by a condition and a result. For example, in the first rule, if the attribute of the clause [k] is a company name, the clause [k] is specified as a company name. When the clause [k + 1] field is blank, the rule is a one-clause rule. In the fourth rule, if the part of speech of the clause [k] is an unknown word and the attribute of the clause [k + 1] is a personal name co-occurrence word, the clause [k] is specified as a personal name. Rules of one or two clauses are shown here, but rules of three or more clauses may also be used. In that case, the rule is matched against as many consecutive clauses (clause [k], clause [k + 1], clause [k + 2], ...) as the rule contains.
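The rule matching described above can be sketched as follows. This is a minimal illustration, not the apparatus's actual implementation: the `Clause` and `Rule` classes and the `find_targets` function are hypothetical, each clause is assumed to carry a single attribute and part of speech, and indices are 0-based whereas the description counts clauses from 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Clause:
    text: str
    attribute: Optional[str] = None  # attribute read from the word dictionary
    pos: Optional[str] = None        # part of speech, e.g. "unknown word"


@dataclass
class Rule:
    conditions: list[dict]  # one condition per clause: clause[k], clause[k+1], ...
    result_type: str        # type assigned to clause[k] when the rule matches


def clause_matches(clause: Clause, condition: dict) -> bool:
    """A clause matches when every field named in the condition is equal."""
    return all(getattr(clause, key) == value for key, value in condition.items())


def find_targets(clauses: list[Clause], rules: list[Rule]) -> list[tuple]:
    """Return (index, text, type) for each clause specified as a masking target."""
    targets = []
    for k in range(len(clauses)):
        for rule in rules:
            window = clauses[k:k + len(rule.conditions)]
            if len(window) < len(rule.conditions):
                continue  # not enough clauses left for this rule
            if all(clause_matches(c, cond) for c, cond in zip(window, rule.conditions)):
                targets.append((k, clauses[k].text, rule.result_type))
                break  # first matching rule wins (an assumption of this sketch)
    return targets


# The first and fourth rules described for FIG. 6.
rules = [
    Rule([{"attribute": "company name"}], "company name"),
    Rule([{"pos": "unknown word"},
          {"attribute": "personal name co-occurrence word"}], "personal name"),
]
clauses = [
    Clause("Company A", attribute="company name"),
    Clause("Yamada", pos="unknown word"),
    Clause("san", attribute="personal name co-occurrence word"),
]
targets = find_targets(clauses, rules)
```

Because a rule is simply a list of per-clause conditions, rules spanning three or more clauses need no extra code in this sketch.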
[0050]
Returning to the description of FIG. 3, in step S308, a clause that satisfies the condition (at least one clause that matches the rule) is identified as a masking target portion, stored as a masking result list 104a, and the process proceeds to step S309.
[0051]
FIG. 8 shows an example of the masking result list 104a. For each clause specified as a masking target portion, the masking result list 104a stores nine pieces of information: the notation of the clause (the masking target notation), the three characters preceding it, the three characters following it, the start position of the masking target notation, its number of characters, its type, whether it has been confirmed, whether concealment processing has been instructed, and whether dictionary registration has been instructed. At the stage where step S308 has been completed, the list is in the state shown in FIG. 8.
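One possible in-memory representation of a row of the masking result list is sketched below. The `MaskingEntry` class, its field names, and the `make_entry` helper are hypothetical; the sketch also assumes a 0-based start position counted in characters, which the description does not state.

```python
from dataclasses import dataclass


@dataclass
class MaskingEntry:
    notation: str             # masking target notation
    before: str               # up to three preceding characters
    after: str                # up to three following characters
    start: int                # start position (0-based, an assumption)
    length: int               # number of characters
    type: str                 # e.g. "personal name"
    confirmed: bool = False   # "confirmation" column
    concealed: bool = False   # "concealment processing" column
    registered: bool = False  # "registration" column


def make_entry(document: str, start: int, length: int, type_: str) -> MaskingEntry:
    """Build a row for a target found at document[start:start + length]."""
    end = start + length
    return MaskingEntry(
        notation=document[start:end],
        before=document[max(0, start - 3):start],
        after=document[end:end + 3],
        start=start,
        length=length,
        type=type_,
    )


entry = make_entry("Meet Taro Yamada today", start=5, length=11, type_="personal name")
```

The three status flags default to unset, which matches a list that has just been produced by the masking target specifying process and has not yet been confirmed, concealed, or registered.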
[0052]
Returning to the description of FIG. 3, in step S309, k is incremented by one, and the process returns to step S306, and the processes in steps S307 to S308 are repeated for all the clauses in the document [i]. In step S310, i is incremented by 1, and the process returns to step S303 to repeat the processes in steps S304 to S309 for all input documents.
[0053]
In this way, the masking target specifying unit 109 uses the morpheme analysis and masking rules in the morpheme analysis unit 110 to specify the masking target part in each document. The identified masking target portion is stored as a masking result list 104a as shown in FIG. 8 and displayed on the display or the like via the output unit 103, for example.
[0054]
(C) Processing operation of the data selection unit
The processing operation of the data selection unit 106 in the dictionary registration process in step S213 in FIG. 2 will be described with reference to the flowchart shown in FIG. 4.
[0055]
First, in step S401, the data selection unit 106 reads the data to be processed that was input to the dictionary registration unit 105, stores it as a registration candidate list 104b, and proceeds to step S402. In the case of batch dictionary registration, the data to be processed is the registration candidate data read from the input unit 101. In the case of dictionary registration of masking target portions (that is, updating the word dictionary based on the correction result of the masking target, described later), it is the data in the masking result list 104a for which a registration instruction has been given (for example, as shown in FIG.).
[0056]
Here, the case where the user instruction is batch dictionary registration (step S211 in FIG. 2) will be described as an example. That is, the following describes the processing operation in step S213 of FIG. 2, in which the word dictionary 112 is updated based on the registration candidate data read from the input unit 101 in step S212 of FIG. 2.
[0057]
FIG. 10 is an example of the information stored as the registration candidate list 104b. For each item of registration candidate data read from the input unit 101, the registration candidate list 104b records six items: the notation, the type, the appearance frequency, whether morphological analysis succeeded, whether masking target identification succeeded, and the selection result. At the stage where step S401 is completed, the fields other than the notation and type of each item of registration candidate data are not yet filled in, as shown in FIG. 10.
[0058]
Returning to the description of FIG. 4, in step S402, the number of data is set in the variable N, and the initial value “1” is set in the variable i for counting the number of data, and the process proceeds to step S403. In step S403, it is determined whether i is N or less. If i is N or less, the process proceeds to step S404. If i is greater than N, the process ends. Here, the i-th registration candidate data is referred to as data [i].
[0059]
In step S404, it is determined whether the character string length (number of characters) in the notation of data [i] is equal to or greater than a predetermined value α. Here, α is a threshold value set in advance. For example, α is set to “2”. If the character string length is greater than or equal to α, the process proceeds to step S405. If the character string length is less than α, the data [i] proceeds to step S413 so as to be excluded from the registration target in the word dictionary 112.
[0060]
In step S405, it is determined whether or not the character string structure of the data [i] is only hiragana. In the case of hiragana only, the data [i] proceeds to step S413 to be excluded from the registration target in the word dictionary 112. If a character type other than hiragana is included, the process proceeds to step S406.
[0061]
In step S406, first, the frequency calculation unit 107 is activated. The frequency calculation unit 107 calculates the appearance frequency of the notation of the data [i] in the existing document stored in the existing document storage unit 108, and stores the information in the “appearance frequency” column of the registration candidate list 104b. Then, the process proceeds to step S407. In step S407, it is determined whether the appearance frequency is greater than or equal to a predetermined value β. Here, β is a threshold value set in advance. For example, β is set to “3”. When the appearance frequency is β or more, the process proceeds to step S408. If the appearance frequency is lower than β, the data [i] proceeds to step S413 to be excluded from the registration target in the word dictionary 112.
[0062]
In step S408, the morpheme analysis unit 110 is activated to analyze the notation of the data [i] whose appearance frequency is β or more, and the process proceeds to step S409. In step S409, it is determined whether the morphological analysis succeeded. If it succeeded, the process proceeds to step S410. If it failed, the process advances to step S412 to register the data [i] in the word dictionary 112. As shown in FIG. 11, the result of the morphological analysis is stored in the “morpheme analysis” column of the registration candidate list 104b (in FIG. 11, “○” is marked when the analysis succeeded and “×” when it failed).
[0063]
In step S410, the masking target specifying unit 109 is activated, and as shown in FIG. 3, the masking target part is specified from the notation of the data [i] based on the masking rule, and the process proceeds to step S411.
[0064]
In step S411, it is determined whether or not the notation of the data [i] could be specified as a masking target portion. If it could, the data [i] can already be specified as a masking target with the current word dictionary 112 and does not need to be newly registered, so the process proceeds to step S413 to exclude it from registration. If it could not, the process proceeds to step S412 to register the data [i] in the word dictionary 112. The determination result is stored in the “masking target specification” column of the registration candidate list 104b as shown in FIG. 11 (in FIG. 11, “○” is marked when it could be specified and “×” when it could not).
[0065]
If the process proceeds from step S409 or step S411 to step S412, it is determined in step S412 that the data [i] is to be registered in the word dictionary 112, the result is stored in the registration candidate list 104b (in FIG. 11, “○” is indicated in the “selection result” column), and the process proceeds to step S414.
[0066]
On the other hand, if the process proceeds from step S404, step S405, step S407, or step S411 to step S413, it is determined in step S413 that the data [i] is not registered in the word dictionary 112, and the result is stored in the registration candidate list 104b. (In FIG. 11, “x” is indicated in the “selection result” column), and the process proceeds to step S414.
[0067]
In step S414, i is incremented by one, and the process returns to step S403, so that the processing of steps S404 to S413 is repeated for all the registration candidate data.
[0068]
In this way, the data selection unit 106 uses the frequency calculation unit 107, the morpheme analysis unit 110, and the masking target specifying unit 109 to select character strings that are valid for registration in the word dictionary 112. Among the registration candidate data that include a character other than hiragana, have a character string length of α or more, and have an appearance frequency of β or more in the existing documents, it selects those that satisfy either of the following: they cannot be morphologically analyzed, or they cannot be specified as a masking target with the current word dictionary 112. The result is stored in the “selection result” column of the registration candidate list 104b as shown in FIG. 11.
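The selection flow of steps S404 to S413 can be summarized in a short Python sketch. It is illustrative only: `select_for_registration` is a hypothetical name, the morphological analysis and masking target identification are passed in as stand-in callables (`can_analyze`, `can_identify`), hiragana is approximated by the Unicode range U+3041 to U+309F, and the appearance frequency is counted as non-overlapping substring occurrences.

```python
import re
from typing import Callable, Iterable

ALPHA = 2  # minimum character string length (example value from step S404)
BETA = 3   # minimum appearance frequency (example value from step S407)
HIRAGANA_ONLY = re.compile(r"[\u3041-\u309f]+")


def select_for_registration(
    notation: str,
    existing_documents: Iterable[str],
    can_analyze: Callable[[str], bool],
    can_identify: Callable[[str], bool],
    alpha: int = ALPHA,
    beta: int = BETA,
) -> bool:
    """Return True when the notation should be registered in the word dictionary."""
    if len(notation) < alpha:              # S404: too short
        return False
    if HIRAGANA_ONLY.fullmatch(notation):  # S405: hiragana only
        return False
    frequency = sum(doc.count(notation) for doc in existing_documents)  # S406
    if frequency < beta:                   # S407: too rare
        return False
    if not can_analyze(notation):          # S409: analysis failed, so register
        return True
    return not can_identify(notation)      # S411: register only if not identifiable


docs = ["山田太郎 wrote this", "山田太郎 met 山田太郎"]
selected = select_for_registration(
    "山田太郎", docs, can_analyze=lambda s: False, can_identify=lambda s: False
)
```

The early returns mirror the branches to step S413, so a candidate reaches the morphological analysis only after it has passed the cheaper length, character type, and frequency checks.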
[0069]
When the processing of the data selection unit 106 through step S412 is completed, the registration candidate list is as shown in FIG. 11. That is, “Taro Yamada” and “Hayashi Politics” are selected as valid registration candidates for the word dictionary 112. The selected registration candidate data is additionally registered in the word dictionary 112 as shown in FIG. 12, together with the attributes and parts of speech obtained from the types. At that time, each word to be newly registered may be displayed from the output unit 103 via the dictionary registration unit 105 and the control unit 102, for example as shown in FIG. 13, and the user may be asked for its reading, part of speech, attribute, and whether to register it before it is registered in the word dictionary 112. Alternatively, only the words designated by the user on this screen may be registered in the word dictionary 112. In that case, the “reading”, “part of speech”, and “attribute” input on the screen shown in FIG. 13 are registered in the word dictionary 112.
[0070]
(D) Updating the word dictionary based on the correction result of the masking target
Now, suppose that the user has instructed the document masking apparatus shown in FIG. 1 to specify masking target portions, and that a masking result list 104a as shown in FIG. 8 has been obtained from the processing target document input from the input unit 101 by the masking target specifying process (see FIG. 3) in step S203 of FIG. 2.
[0071]
FIG. 14 is a display example of the masking result list 104a and shows an example of a document display screen.
[0072]
Based on the masking result list 104a shown in FIG. 8 obtained from the processing target document, the output unit 103 displays the processing target document input from the input unit 101 as shown in FIG. 14.
[0073]
In FIG. 14, words identified as masking targets, such as “Company A”, “communication laboratory”, and “Taro Yamada”, are given a special display such as reverse video or highlighting so that they can be distinguished from other parts of the processing target document.
[0074]
For example, suppose the screen shown in FIG. 14 is displayed as a result of the masking target specifying process in step S203 of FIG. 2. When the user instructs correction of a masking target portion on the screen in step S204 of FIG. 2, the masking correction unit 113 is activated in step S205. At this time, for example, the user selects “communication laboratory” using a mouse or the like, and the selection instruction is sent to the masking correction unit 113. Suppose the user then inputs a correction instruction that adds at least one character immediately before and/or immediately after the masking target portion in the processing target document, so that the selected character string to be masked becomes “information and communication laboratory”. The control unit 102 then, via the masking correction unit 113, rewrites the masking target portion in the “masking target notation” column of the masking result list 104a from “communication laboratory” to “information and communication laboratory”, as shown in FIG. 9. Based on the masking result list shown in FIG. 9, the screen display is also updated as shown in FIG. 15.
[0075]
Note that the correction of the masking target portion by the user includes correction for deleting at least one character from the character string of the masking target portion in addition to the above correction. This correction is, for example, a case where the first two characters “communication” are deleted from the masking target location “communication laboratory” detected in the document to be processed and corrected to “laboratory”. Also in this case, in the same manner as described above, in the masking result list 104a, the masking target portion modified by the user in the “masking target notation” column is rewritten from “communication laboratory” to “laboratory”. The screen display is also updated based on the rewritten masking result list.
[0076]
In addition to the above two examples, the user may designate a new masking target portion as a correction. For example, suppose that a person name “New Year's Day” exists in the document to be processed (although it is not shown in FIG. 14) and that it was not detected (specified) by the masking target specifying unit 109. In this case, when the user designates this character string, the masking target portion “New Year's Day” is added to the “masking target notation” column of the masking result list 104a. The screen display is also updated based on the resulting masking result list in the same manner as described above.
[0077]
Next, when the user gives an instruction to confirm the masking target portions (for example, by selecting a predetermined button provided on the screen) (step S206 in FIG. 2), the masking confirmation unit 114 is activated (step S207 in FIG. 2). As shown in FIG. 9, the fact that each masking target has been confirmed is recorded in its “confirmation” column in the masking result list 104a (indicated by “○” in FIG. 9). On the screen, as shown in FIG. 16, the highlighted portions change to an underlined display, indicating that the masking target portions, including the corrected one, have been confirmed.
[0078]
Next, when the user gives an instruction to conceal the masking target portions (for example, by selecting a predetermined button provided on the screen) (step S208 in FIG. 2), the concealment processing unit 115 is activated (step S209 in FIG. 2). As shown in FIG. 9, the fact that concealment processing has been instructed is recorded in the “concealment processing” column of each masking target in the masking result list 104a (represented by the mark “○” in FIG. 9). On the screen, as shown in FIG. 17, each masking target portion is replaced with, for example, the symbol “x”.
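The replacement of masking target portions with a symbol can be sketched as follows. The `conceal` function is a hypothetical helper; it assumes each target is given as a 0-based start position and a character count (as recorded in the masking result list), and it replaces every character of a target with one symbol, which is only one possible reading of the replacement shown in FIG. 17.

```python
def conceal(document: str, targets: list[tuple[int, int]], symbol: str = "x") -> str:
    """Replace each (start, length) span of the document with the symbol."""
    chars = list(document)
    for start, length in targets:
        # One symbol per original character keeps later positions valid.
        chars[start:start + length] = symbol * length
    return "".join(chars)


masked = conceal("Call Taro now", [(5, 4)])
```

Keeping the document length unchanged means the start positions stored for the remaining targets stay correct no matter in which order the targets are processed.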
[0079]
In steps S204 to S210 in FIG. 2, the correction of masking target portions, the masking confirmation instruction, the concealment processing instruction, and the instruction to register in the word dictionary 112 can all be issued from the document display screen.
[0080]
FIG. 18 shows another display example of the masking result list 104a shown in FIG. 8 obtained in step S203 of FIG.
[0081]
The display can be switched to the screen shown in FIG. 18 by selecting (pressing) the “To list screen” button B1 provided on the document display screens.
[0082]
In the screen display example shown in FIG. 18, the notations of the masking target portions detected from the document to be processed and stored in the masking result list 104a, such as “Company A”, “communication laboratory”, and “Taro Yamada”, are displayed as a list by type, together with the preceding and following character strings that show their context. The list display shown in FIG. 18 is based on the contents of the masking result list 104a at the time the display instruction is given. When a correction is made, the correction result is stored in the masking result list 104a, and the corrected masking target portion is displayed as shown in FIG. 19.
[0083]
In the list display screen shown in FIG. 18, a “confirmation” instruction area for the masking confirmation instruction, a “registration” instruction area for registration in the word dictionary, and a “concealment processing” instruction area are provided for each masking target.
[0084]
On the screen shown in FIG. 18, the user selects “communication laboratory”, which was detected as a masking target, using a mouse or the like, and the selection instruction is sent to the control unit 102 via the input unit 101. When the user changes the masking range so that the selected masking target becomes “information and communication laboratory”, enters a check (here, “×”) in the “confirmation” and “concealment processing” instruction areas of “Company A”, “information and communication laboratory”, and “Taro Yamada”, and enters a check in the “registration” instruction area of “information and communication laboratory”, the screen becomes as shown in FIG. 19.
[0085]
In the screen display state shown in FIG. 19, when the “confirm execution” button B3 at the bottom center of the screen is pressed, the checked notations are confirmed. That is, the masking correction unit 113 is activated (step S205 in FIG. 2), and the masking target portion corrected by the user in the “masking target notation” column of the masking result list 104a is changed to “information and communication laboratory”, as shown in FIG. 9. Further, the masking confirmation unit 114 is activated (step S207 in FIG. 2), and the fact of confirmation is recorded in the “confirmation” column of each masking target in the masking result list 104a, as shown in FIG. 9 (indicated by “○” in FIG. 9). When the “To document screen” button B2 is then pressed, the highlighted portions on the screen change to an underlined display as described above, as shown in FIG. 16, indicating that the masking target portions, including the corrected one, have been confirmed.
[0086]
Further, in the screen display state shown in FIG. 19, when the “concealment processing” button B5 at the bottom center of the screen is pressed, the checked notations are concealed. That is, the concealment processing unit 115 is activated (step S209 in FIG. 2), and the fact that concealment processing has been instructed is recorded in the “concealment processing” column of each masking target in the masking result list 104a, as shown in FIG. 9 (represented by the mark “○” in FIG. 9). When the “To document screen” button B2 is then pressed, the masking target portions are replaced with, for example, the symbol “x” as described above, as shown in FIG. 17.
[0087]
Further, in the screen display state shown in FIG. 19, when the “registration execution” button B4 at the bottom center of the screen is pressed (step S210 in FIG. 2), the dictionary registration unit 105 is activated, and the notations whose “registration” column contains a check (the symbol “×”) are registered in the word dictionary 112. At this time, following the flowchart shown in FIG. 4, the checked notations are input to the data selection unit 106 through the dictionary registration unit 105, and only those judged to be valid registration candidates based on the character string length, the appearance frequency, the result of morphological analysis, and the masking target specifying result are registered in the word dictionary 112. Alternatively, the words recorded in the “registration” column of the masking result list 104a (marked with “○”) may be registered in the word dictionary 112 without being subjected to the processing shown in FIG. 4. Further, instead of providing a “registration” column for instructing registration in the word dictionary as in FIG. 18 or FIG. 19, all the character strings of the masking target portions for which a confirmation instruction has been given may be passed through the processing shown in FIG. 4, and the character strings selected by that processing may be registered in the word dictionary.
[0088]
The masking result list 104a is updated as shown in FIG. 9 by the processing operations of the masking correction unit 113, the masking confirmation unit 114, the concealment processing unit 115, and the dictionary registration unit 105. That is, in the same manner as when operating on the screens shown in FIGS. 14 to 17, the corrected masking target notation and the character strings before and after it are changed, and information corresponding to the user's instructions is entered in the “confirmation” column, the “concealment processing” column, and the “registration” column.
[0089]
As described above, masking target portions can be checked, corrected, confirmed, and concealed on both the document display screens shown in FIGS. 14 to 17 and the list display screens shown in FIGS. 18 and 19. Further, from the list display screen, it is possible to instruct dictionary registration that reflects the corrections made to the masking targets.
[0090]
When the word dictionary 112 is updated with the correction result of a masking target as described above, for example when the notation “communication laboratory” is corrected to “information and communication laboratory” as in the above example, the word “information and communication laboratory” may be displayed to the user in the same way as in FIG. 13, and the user may be asked for its reading, part of speech, attribute, and whether to register it. Then, only the words designated by the user on this screen may be registered in the word dictionary 112. At this time, the “reading”, “part of speech”, and “attribute” input on the screen shown in FIG. 13 may be registered in the word dictionary 112.
[0091]
A word can also be deleted from the word dictionary through correction of the masking target portions, by the same kind of operation as described above. For example, among the masking targets displayed on the screen shown in FIG. 15 or FIG. 18, the user cancels the special display (such as reverse video or highlighting) of a word to be excluded from the masking targets and then operates the “confirm execution” button B3 or the “registration execution” button B4, whereupon the deletion is recorded in the masking result list 104a. For this purpose, the masking result list 104a may be provided with, for example, a “delete” column. Words with “○” recorded in the “delete” column are then deleted from the word dictionary 112 by the dictionary registration unit 105.
[0092]
Also, on the list display screen, the character strings specified as masking targets may be sorted and displayed by reading or by type, and when the same character string appears more than once, only one instance of it may be displayed.
[0093]
As described above, according to the above embodiment, the masking target specifying unit 109 detects masking target locations in the document input from the input unit 101 based on the word dictionary, which stores the character strings to be masked or parts of them. The detected masking target locations are stored in the masking result list 104a (storage means), and the masking target locations stored in the masking result list are displayed on the display screen. When one of the displayed masking target locations is corrected by the user, the masking target location stored in the masking result list is rewritten with the one corrected by the user, and the masking target locations in the document are masked based on the rewritten masking result list. This makes it easy to check and correct the detected masking target locations (such as proper nouns to be replaced with concealment characters).
[0094]
Further, in the document displayed on the display screen, the new masking target portion designated by the user is stored in the masking result list, and the character string of the new masking target portion stored later in this list is stored in the word dictionary. By storing the information, it is possible to easily update the word dictionary used for detecting proper nouns and the like based on the contents of the correction along with the confirmation and correction of the masking target portion displayed on the display screen.
[0095]
Also, a plurality of character strings are input from the input unit 101, and the data selection unit 106 selects, from among them, the character strings to be stored in the word dictionary based on at least one of the number of characters of each character string, the character types constituting each character string, the appearance frequency of each character string in existing documents, and the result of morphological analysis of each character string. Of the selected character strings, those that cannot be morphologically analyzed, and those that cannot be detected as a masking target location based on the word dictionary and the clauses obtained by morphological analysis, are stored in the word dictionary. In this way, the word dictionary 112 used for the detection can be easily constructed and updated.
[0096]
As described above, according to the above-described embodiment, the masking target portion detected from the input document can be easily confirmed and corrected, and the masking target portion detected from the input document can be detected with high accuracy. Dictionaries can be easily constructed and updated.
[0097]
Therefore, according to the document processing apparatus of the above-described embodiment, masking target portions can be detected from a document with high accuracy and concealed by replacing them or filling them in. This prevents infringement of privacy information carried by proper nouns and the like, and facilitates the sharing and distribution of documents.
[0098]
The method of the present invention described in the embodiment can be stored, as a program executable by a computer, in a recording medium such as a magnetic disk (flexible disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.), or a semiconductor memory, and distributed.
[0099]
The present invention is not limited to the above embodiment, and can be modified in various ways at the implementation stage without departing from its gist. Furthermore, the above embodiment includes inventions at various stages, and various inventions can be extracted by appropriately combining the disclosed constituent elements. For example, even if some constituent elements are deleted from all the constituent elements shown in the embodiment, as long as at least one of the problems described in the section on problems to be solved by the invention can be solved and at least one of the effects described in the section on effects of the invention is obtained, the configuration from which those constituent elements are deleted can be extracted as an invention.
[0100]
[Effects of the invention]
As described above, according to the present invention, the proper noun part detected from the document can be easily confirmed and corrected.
[0101]
In addition, the word dictionary used for detecting proper noun parts in the document can be easily updated.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration example of a document masking apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart for explaining a processing operation of a control unit.
FIG. 3 is a flowchart for explaining a processing operation of a masking target specifying unit.
FIG. 4 is a flowchart for explaining a processing operation of a data selection unit.
FIG. 5 is a diagram showing an example of a data structure of a word dictionary.
FIG. 6 is a diagram showing an example of a masking rule.
FIG. 7 is a diagram for explaining morphological analysis.
FIG. 8 is a diagram for explaining information stored in a masking result list.
FIG. 9 is a diagram for explaining information stored in a masking result list.
FIG. 10 is a diagram for explaining information stored in a registration candidate list.
FIG. 11 is a diagram for explaining information stored in a registration candidate list.
FIG. 12 is a diagram showing an example of a word dictionary update result.
FIG. 13 is a diagram showing a screen display example for inputting information stored in a word dictionary.
FIG. 14 is a diagram showing an example of a document display screen that displays a portion to be masked.
FIG. 15 is a diagram showing an example of a document display screen that displays a corrected masking target portion.
FIG. 16 is a diagram showing an example of a document display screen when a masking target portion is confirmed.
FIG. 17 is a diagram showing an example of a document display screen when a masking target portion is masked.
FIG. 18 is a diagram showing an example of a list display screen for displaying masking target portions.
FIG. 19 is a diagram showing an example of a list display screen that displays a corrected masking target portion and the like.
[Explanation of symbols]
101 ... Input unit
102 ... Control unit
103 ... Output unit
104 ... Temporary storage unit
105 ... Dictionary registration unit
106 ... Data selection unit
107 ... Frequency calculation unit
108 ... Existing document storage unit
109 ... Masking target specifying unit
110 ... Morpheme analysis unit
111 ... Masking rule storage unit
112 ... Word dictionary
113 ... Masking correction unit
114 ... Masking confirmation unit
115 ... Concealment processing unit

Claims

First storage means for storing a dictionary used for morphological analysis, including notation and attributes of each word;
Second storage means for storing an array of word attributes included in each clause in the character string to be masked;
A third storage means;
Document processing means;
A document processing method in a document processing apparatus comprising:
A first registration step in which the document processing means registers dictionary data including the notation of each word and its attributes in the dictionary stored in the first storage means;
An input step in which the document processing means inputs a document;
A detection step in which the document processing means divides the document into clauses and words by the morphological analysis and detects, from the document, one or a plurality of consecutive clauses having the attribute array stored in the second storage means as a masking target portion;
A first storage step in which the document processing means stores the masking target portion detected in the detection step in the third storage means;
The document processing means displaying the document on a display screen and displaying a masking target portion stored in the third storage means in the document; and
A correction step in which (a) when the masking target location in the document displayed on the display screen is corrected by the user, the document processing means rewrites the masking target location stored in the third storage means with the corrected masking target location, and (b) when a new masking target location is designated by the user in the document displayed on the display screen, the document processing means stores the new masking target location in the third storage means;
A masking step in which the document processing means masks a masking target portion stored in the third storage means in the document;
A second registration step in which the document processing means registers in the dictionary new dictionary data including the notation and attributes of the character string of a masking target location designated by the user, from among the new masking target locations stored in the third storage means;
A document processing method.
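
For illustration only, the detection and masking steps recited above can be sketched in Python. This is a minimal sketch under stated assumptions, not the claimed implementation: the document is assumed to be already divided into clauses and words, and the names `detect_targets`, `mask`, the `(notation, attribute)` word tuples, and the mask character are all hypothetical.

```python
from typing import List, Tuple

Word = Tuple[str, str]          # (notation, attribute); hypothetical representation
Clause = List[Word]             # one clause produced by morphological analysis
Rule = List[Tuple[str, ...]]    # one attribute array per consecutive clause


def detect_targets(clauses: List[Clause], rules: List[Rule]) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of clause runs that match a rule."""
    # Character offset at which each clause starts, plus the final end offset.
    offsets, pos = [], 0
    for clause in clauses:
        offsets.append(pos)
        pos += sum(len(notation) for notation, _ in clause)
    offsets.append(pos)

    targets = []
    i = 0
    while i < len(clauses):
        matched = 0  # length (in clauses) of the longest rule matching at i
        for rule in rules:
            window = clauses[i:i + len(rule)]
            if len(window) == len(rule) and all(
                tuple(attr for _, attr in clause) == attrs
                for clause, attrs in zip(window, rule)
            ):
                matched = max(matched, len(rule))
        if matched:
            targets.append((offsets[i], offsets[i + matched]))
            i += matched
        else:
            i += 1
    return targets


def mask(text: str, targets: List[Tuple[int, int]], mask_char: str = "○") -> str:
    """Replace every stored target span with mask characters of equal length."""
    chars = list(text)
    for start, end in targets:
        chars[start:end] = mask_char * (end - start)
    return "".join(chars)
```

In this sketch the list returned by `detect_targets` plays the role of the third storage means: a user correction would rewrite its entries before `mask` is applied.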

First storage means for storing a dictionary used for morphological analysis, including notation and attributes of each word;
Second storage means for storing an array of word attributes included in each clause in the character string to be masked;
A third storage means;
Document processing means;
A document processing method in a document processing apparatus comprising:
A first registration step in which the document processing means registers dictionary data including the notation of each word and its attributes in the dictionary stored in the first storage means;
An input step in which the document processing means inputs a document;
A detection step in which the document processing means divides the document into clauses and words by the morphological analysis and detects, from the document, one or a plurality of consecutive clauses having the attribute array stored in the second storage means as a masking target portion;
A first storage step in which the document processing means stores the masking target portion detected in the detection step in the third storage means;
The document processing means displaying the document on a display screen and displaying a masking target portion stored in the third storage means in the document; and
A correction step in which (a) when the masking target location in the document displayed on the display screen is corrected by the user, the document processing means rewrites the masking target location stored in the third storage means with the corrected masking target location, and (b) when a new masking target location is designated by the user in the document displayed on the display screen, the document processing means stores the new masking target location in the third storage means;
A masking step in which the document processing means masks a masking target portion stored in the third storage means in the document;
A first selection step in which the document processing means selects, from the character strings of masking target locations designated by the user among the new masking target locations stored in the third storage means, a character string whose length is equal to or greater than a predetermined value, which includes characters other than hiragana, whose appearance frequency in an existing document is equal to or higher than a predetermined value, and which cannot be detected as a masking target location with the current dictionary, as a character string to be stored in the dictionary;
A second registration step of registering in the dictionary new dictionary data including the notation of the character string selected in the first selection step and its attributes;
A document processing method.
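
The four selection conditions of the first selection step can be illustrated with a short Python sketch. It is only a sketch under assumptions: `select_for_dictionary`, the thresholds, and the `is_detectable` callback (standing in for a check against the current dictionary) are hypothetical, and appearance frequency is assumed here to mean the total number of occurrences across the existing documents.

```python
import re
from typing import Callable, Iterable, List

# Matches strings made up only of characters in the Unicode hiragana block.
HIRAGANA_ONLY = re.compile(r"[\u3040-\u309f]+")


def select_for_dictionary(
    candidates: Iterable[str],
    existing_docs: List[str],
    is_detectable: Callable[[str], bool],   # hypothetical current-dictionary check
    min_length: int = 2,                    # assumed "predetermined value"
    min_frequency: int = 1,                 # assumed "predetermined value"
) -> List[str]:
    """Keep only candidates that satisfy all four selection conditions."""
    selected = []
    for s in candidates:
        if len(s) < min_length:
            continue                        # too short
        if HIRAGANA_ONLY.fullmatch(s):
            continue                        # contains nothing but hiragana
        frequency = sum(doc.count(s) for doc in existing_docs)
        if frequency < min_frequency:
            continue                        # too rare in existing documents
        if is_detectable(s):
            continue                        # current dictionary already catches it
        selected.append(s)
    return selected
```

Strings that survive the filter would then be passed to the second registration step together with their attributes.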

The document processing method according to claim 1, wherein the correction made by the user to the masking target location displayed on the display screen is at least one of a correction that adds, to the masking target location, at least one character immediately before or after the masking target location in the document, and a correction that deletes at least one character from the character string of the masking target location.
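
As a rough illustration of the two kinds of correction, a masking target location can be modeled as a `(start, end)` character span that is widened or narrowed. The span representation and the helper names `extend_span` and `shrink_span` are assumptions made for this sketch only.

```python
from typing import Tuple

Span = Tuple[int, int]  # (start, end) offsets into the document, end exclusive


def extend_span(span: Span, text_length: int, before: int = 0, after: int = 0) -> Span:
    """Add characters immediately before and/or after the masking target."""
    start, end = span
    return (max(0, start - before), min(text_length, end + after))


def shrink_span(span: Span, from_start: int = 0, from_end: int = 0) -> Span:
    """Delete characters from the start and/or end of the masking target."""
    start, end = span
    new_start, new_end = start + from_start, end - from_end
    if new_start >= new_end:
        raise ValueError("correction would delete the whole masking target")
    return (new_start, new_end)
```

For example, with the text `"Mr. Yamada Taro came"` and a detected span `(4, 10)` covering `"Yamada"`, calling `extend_span((4, 10), len(text), after=5)` widens the target to cover `"Yamada Taro"`.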

The first registration step includes:
Entering a list containing multiple strings;
A second selection step of selecting, from a character string group obtained by excluding, from the plurality of character strings, any character string including a clause having an attribute array stored in the second storage means, a character string whose length is equal to or greater than a predetermined value, which includes characters other than hiragana, whose appearance frequency in an existing document is equal to or higher than a predetermined value, and which either cannot be morphologically analyzed or cannot be detected as a masking target location with the current dictionary, as a character string to be stored in the dictionary;
Registering new dictionary data including the notation of the character string selected in the second selection step and its attributes in the dictionary;
The document processing method according to claim 1, further comprising:

The first registration step includes:
Entering a list containing multiple strings;
A third selection step of selecting, from the plurality of character strings, a character string whose length is equal to or greater than a predetermined value, which includes characters other than hiragana, and whose appearance frequency in an existing document is equal to or greater than a predetermined value;
Of the character strings selected in the third selection step, registering new dictionary data including the notation of the character string that cannot be morphologically analyzed and its attributes in the dictionary;
Registering in the dictionary new dictionary data including the notation and attributes of each character string, among the character strings selected in the third selection step, for which the array of word attributes included in each clause obtained by morphological analysis of that character string is not stored in the second storage means;
The document processing method according to claim 1, further comprising:
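
The two registration branches recited above (strings that cannot be analyzed, and strings whose clause attribute arrays are not among the stored arrays) might be sketched as follows. The `analyze` callback, which returns `None` when analysis fails, the `default_attribute` value, and the reading that a string is registered when any one of its clauses has an unstored attribute array are all assumptions of this sketch.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

AttrArray = Tuple[str, ...]  # word attributes of one clause


def register_from_list(
    selected: Iterable[str],
    analyze: Callable[[str], Optional[List[AttrArray]]],  # hypothetical analyzer
    rule_arrays: Set[AttrArray],        # arrays held by the second storage means
    dictionary: Dict[str, str],         # notation -> attribute
    default_attribute: str = "proper noun",  # assumed attribute for new entries
) -> List[str]:
    """Register strings the current dictionary and rules would fail to detect."""
    added = []
    for s in selected:
        if s in dictionary:
            continue
        clauses = analyze(s)
        # Branch 1: analysis failed. Branch 2: some clause array is not stored.
        if clauses is None or any(c not in rule_arrays for c in clauses):
            dictionary[s] = default_attribute
            added.append(s)
    return added
```

In practice the attribute of each new entry would come from the input list or from the user rather than from a fixed default.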

First storage means for storing a dictionary used for morphological analysis, including notation and attributes of each word;
Second storage means for storing an array of word attributes included in each clause in the character string to be masked;
A means of entering a document;
Means for dividing the document into clauses and words by the morphological analysis;
Detecting means for detecting, from the document, one or a plurality of continuous clauses having the attribute array stored in the second storage means as masking target portions;
A third storage means for storing the masking target portion detected by the detection means;
Display means for displaying the document on a display screen and displaying a masking target portion stored in the third storage means in the document;
Correction means for (a) rewriting the masking target portion stored in the third storage means with the corrected masking target portion when the masking target portion in the document displayed on the display screen is corrected by the user, and (b) storing a new masking target location in the third storage means when the new masking target location is designated by the user in the document displayed on the display screen;
Means for masking a masking target portion stored in the third storage means in the document;
First registration means for registering in the dictionary new dictionary data including the notation and attributes of the character string of a masking target location designated by the user, from among the new masking target locations stored in the third storage means;
A document processing apparatus comprising:

First storage means for storing a dictionary used for morphological analysis, including notation and attributes of each word;
Second storage means for storing an array of word attributes included in each clause in the character string to be masked;
A means of entering a document;
Means for dividing the document into clauses and words by the morphological analysis;
Detecting means for detecting, from the document, one or a plurality of continuous clauses having the attribute array stored in the second storage means as masking target portions;
A third storage means for storing the masking target portion detected by the detection means;
Display means for displaying the document on a display screen and displaying a masking target portion stored in the third storage means in the document;
Correction means for (a) rewriting the masking target portion stored in the third storage means with the corrected masking target portion when the masking target portion in the document displayed on the display screen is corrected by the user, and (b) storing a new masking target location in the third storage means when the new masking target location is designated by the user in the document displayed on the display screen;
Means for masking a masking target portion stored in the third storage means in the document;
First selection means for selecting, from the character strings of masking target locations designated by the user among the new masking target locations stored in the third storage means, a character string whose length is equal to or greater than a predetermined value, which includes characters other than hiragana, whose appearance frequency in an existing document is equal to or higher than a predetermined value, and which cannot be detected as a masking target location with the current dictionary, as a character string to be stored in the dictionary;
First registration means for registering new dictionary data including the notation of the character string selected by the first selection means and its attributes in the dictionary;
A document processing apparatus comprising:

The document processing apparatus according to claim 6, wherein the correction made by the user to the masking target location displayed on the display screen is at least one of a correction that adds, to the masking target location, at least one character immediately before or after the masking target location in the document, and a correction that deletes at least one character from the character string of the masking target location.

Means for entering a list containing multiple strings;
Second selection means for selecting, from a character string group obtained by excluding, from the plurality of character strings, any character string including a clause having an attribute array stored in the second storage means, a character string whose length is equal to or greater than a predetermined value, which includes characters other than hiragana, whose appearance frequency in an existing document is equal to or higher than a predetermined value, and which either cannot be morphologically analyzed or cannot be detected as a masking target location with the current dictionary, as a character string to be stored in the dictionary;
Second registration means for registering new dictionary data including the notation of the character string selected by the second selection means and its attributes in the dictionary;
The document processing apparatus according to claim 7, further comprising:

Means for entering a list containing multiple strings;
Third selection means for selecting, from the plurality of character strings, a character string whose length is equal to or greater than a predetermined value, which includes characters other than hiragana, and whose appearance frequency in an existing document is equal to or greater than a predetermined value;
Of the character strings selected by the third selection means, means for registering new dictionary data including the notation of the character string that cannot be morphologically analyzed and its attributes in the dictionary;
Means for registering in the dictionary new dictionary data including the notation and attributes of each character string, among the character strings selected by the third selection means, for which the array of word attributes included in each clause obtained by morphological analysis of that character string is not stored in the second storage means;
The document processing apparatus according to claim 7, further comprising:

A program for causing a computer to function as:
First storage means for storing a dictionary used for morphological analysis, including the notation and attributes of each word;
Second storage means for storing an array of word attributes included in each clause in the character string to be masked;
Means for inputting a document;
Means for dividing the document into clauses and words by the morphological analysis;
Detecting means for detecting, from the document, one or a plurality of consecutive clauses having the attribute array stored in the second storage means as masking target portions;
Third storage means for storing the masking target portion detected by the detecting means;
Display means for displaying the document on a display screen and displaying a masking target portion stored in the third storage means in the document;
Correction means for (a) rewriting the masking target portion stored in the third storage means with the corrected masking target portion when the masking target portion in the document displayed on the display screen is corrected by the user, and (b) storing a new masking target location in the third storage means when the new masking target location is designated by the user in the document displayed on the display screen;
Means for masking a masking target portion stored in the third storage means in the document; and
Registration means for registering in the dictionary new dictionary data including the notation and attributes of the character string of a masking target location designated by the user, from among the new masking target locations stored in the third storage means.

A program for causing a computer to function as:
First storage means for storing a dictionary used for morphological analysis, including the notation and attributes of each word;
Second storage means for storing an array of word attributes included in each clause in the character string to be masked;
Means for inputting a document;
Means for dividing the document into clauses and words by the morphological analysis;
Detecting means for detecting, from the document, one or a plurality of consecutive clauses having the attribute array stored in the second storage means as masking target portions;
Third storage means for storing the masking target portion detected by the detecting means;
Display means for displaying the document on a display screen and displaying a masking target portion stored in the third storage means in the document;
Correction means for (a) rewriting the masking target portion stored in the third storage means with the corrected masking target portion when the masking target portion in the document displayed on the display screen is corrected by the user, and (b) storing a new masking target location in the third storage means when the new masking target location is designated by the user in the document displayed on the display screen;
Means for masking a masking target portion stored in the third storage means in the document;
First selection means for selecting, from the character strings of masking target locations designated by the user among the new masking target locations stored in the third storage means, a character string whose length is equal to or greater than a predetermined value, which includes characters other than hiragana, whose appearance frequency in an existing document is equal to or higher than a predetermined value, and which cannot be detected as a masking target location with the current dictionary, as a character string to be stored in the dictionary; and
Registration means for registering in the dictionary new dictionary data including the notation and attributes of the character string selected by the first selection means.