JP3687118B2

JP3687118B2 - Related word dictionary creation device and related word dictionary creation method

Info

Publication number: JP3687118B2
Application number: JP32120894A
Authority: JP
Inventors: 誠安藤; 明男山下; 一雄相原; 辰臣喜多; 裕子松尾; 真司川本; 浩山口
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1994-12-01
Filing date: 1994-12-01
Publication date: 2005-08-24
Anticipated expiration: 2020-08-24
Also published as: JPH08161343A

Description

【０００１】
【産業上の利用分野】
本発明は、テキスト検索装置のシソーラス作成保守に関し、特にテキスト検索装置に登録文書中のキーワードの共起関係に基づいてシソーラスの構築や保守を支援する装置に関する。
【０００２】
【従来の技術】
検索では漏れなく文書の検索を行わせるための一つの手段として、シソーラス辞書を用いてキーワードを展開して検索キーとするものがある。シソーラス辞書は、語とそれに関連する語を登録しており、検索において有効な手段である。しかし、それの構築が複雑であるという問題があり、従来より、個人シソーラスの更新作業を支援する装置、あるいは動的なシソーラスを自動的に作成する装置などが、例えば、下記のような提案がなされてきた。
【０００３】
（１）特開平４−３９７６９号公報「シソーラス生成装置」には、入力文書を形態素解析そして文構造解析し、そして単語間の意味関係を規則辞書を用いて決定し、得られた単語間の関係から関係がある単語群をツリー構造として自動的にまとめるシソーラスの自動生成技術に関する提案がなされている。
【０００４】
（２）特開平４−６４１７１号公報「キーワード連想生成装置」には、検索対象文のサンプルから、シソーラス上のノード重みを算出し、シソーラスに付与して作成された動的なシソーラスのノード重みを使用して連想キーワードを生成することが提案されている。
【０００５】
（３）特開平４−１２３２６４号公報「関連語テーブル作成装置および文書検索装置」には、自然言語文から意味表現テーブル、そしてそれから文解析・意味解析等をおこなって、関連語テーブルを作成する技術に関する提案がなされている。
【０００６】
（４）特開平４−２２２０５５号公報「個人シソーラス作成支援装置」には、検索時にシソーラスを使って解析・照合に失敗した文字列または単語に対して、シソーラス候補として登録しておき、ユーザのシソーラス更新時に使用するか否かの判断をするための整理された材料を提供し、シソーラス作成の支援をする技術に関する提案がなされている。
【０００７】
【発明が解決しようとする課題】
かしながら、前記従来技術（１）（特開平４−３９７６９号公報）や従来技術（３）（特開平４−１２３２６４号公報）のように文構造あるいは意味関係まで解析して、キーワード間の関連度、上位・下位関係、または反対語等を算出する方法は、文書登録時におこなう場合、処理が重くなり、大量の文書を登録する場合、時間が掛かり過ぎるという欠点があった。そして、従来技術（１）に関していえば、関連する単語群はツリー構造でシソーラスを構築するという制限があり構造の自由度にも制限があった。また、前記従来技術（２）（特開平４−６４１７１号公報）で連想キーワードを生成するために動的なシソーラスを作成する点に着目すると、抽出する文書が検索対象文すべてではなく、一部の検索対象文という限定では、シソーラス候補として必要である単語を漏らしてしまう可能性がある欠点があった。そして、前記従来技術（４）（特開平４−２２２０５５号公報）のように検索時の結果をフィードバックさせる方法も、検索式として与えたもののみしかシソーラス作成に反映されない、あるいはシソーラスとして登録すべき語は抽出できても、それと同じ単語群の他の語との関係が分からないという欠点があった。
【０００８】
本発明は、検索対象文書のデータから、キーワードとその関連語を比較的簡単に抽出判定でき、それを自動的に関連語辞書に登録することができるようにすることを目的とする。
本発明は、処理が重くならないようにすることにより、ユーザ自身が実使用に耐える程度のパフォーマンスで効率よく、上記キーワードの関連語辞書を作成できるようにすることを目的とする。
また、本発明は、まったく使わないような無関係なキーワードは省き、ユーザの関心の高い用語を選別して関連語として登録することができるようにすることを目的とする。
【０００９】
【課題を解決するための手段】
本発明は、検索対象となる文書のキーワードを抽出するキーワード抽出手段（２）と、抽出したキーワード群から、同一の文書単位を抽出の出所とするキーワード群を共起の関係にあるキーワードとし、その共起の関係にあるキーワード群からキーワードの対を求めるとともに、それらの各キーワード対の出現頻度を求め、共起ファイルに保持する共起ファイル手段（７、８）と、前記共起ファイルを参照し、前記出現頻度が設定された閾値を越えたキーワード対を、その一方を見出し語とし、他方をその関連語として、関連語辞書に登録する関連語辞書登録手段（９）とを備えた関連語辞書作成装置である。そして、前記共起ファイル手段は、前記キーワード対の出現頻度のほかに、各キーワードの出現頻度をも保持し、前記関連語辞書登録手段は、共起ファイル手段を参照して、キーワード対の出現頻度が設定された閾値を越えるとともに、そのキーワード対における見出し語とすべき一方のキーワードの出現頻度が他の設定された閾値を越えた場合に関連語辞書に登録するものである。
【００１０】
本発明の一態様では、前記関連語辞書登録手段はユーザによる関連語辞書の編集機能と、前記キーワード対の出現頻度の閾値または前記キーワードの出現頻度の閾値を変更する機能とを有し、また、ユーザの指定により、前記共起ファイル手段から、任意のキーワードに関する関連語の候補を表示する表示手段を有する。
【００１１】
また、本発明の他の態様では、前記関連語辞書登録手段はユーザによる関連語辞書の編集機能と、前記キーワード対の出現頻度の閾値または前記キーワードの出現頻度の閾値を変更する機能とを有し、また、ユーザの指定により、前記共起ファイル手段から、任意のキーワードに関する関連語の候補を表示する手段（１２）を有する。
【００１２】
【作用】
共起ファイル手段は、キーワード抽出手段により抽出したキーワード群から共起の関係にあるキーワード対を求める。同一の文書から抽出したキーワード群は共起の関係にあるものとする。あるいは、構造化文書のように文書の構成要素の単位でキーワード群の出所が判別できるときには、構成要素の単位から抽出されたキーワード群を共起の関係にあるものとする。このように、抽出の出所である文書単位が同一であるか否かによって共起の関係を判別するので、その判断処理が簡単となる。そして、共起の関係にあるキーワード対のそれぞれに対して、文書に現れた頻度を記録する。関連語辞書登録手段は、作成された共起ファイルを参照して、関連語辞書を作成する。その際、共起ファイルのキーワード対の出現頻度は、キーワード対のキーワード同士の関連の強さを表しているので、この出現頻度が適宜に設定した閾値よりも大きいときに、これらは関連語であるとみなし、その一方を見出し語とし、他方をその関連語として関連語辞書に登録する。
【００１３】
キーワードは、検索対象文中に頻繁に現れるものほど利用者により検索用キーワードとして使用される可能性が強い。従って、各キーワードの単独での出現頻度をも求め、これを前述のキーワード対の出現頻度のほかに、関連語を決定するための基準として加えることにより、あまり使われることのないキーワードが関連語辞書の見出し語として登録されてしまうということはなくなり、効率的な関連語辞書が得られる。
【００１４】
前述のように、関連語辞書登録手段には、ユーザによる関連語辞書の編集機能および前記各閾値の設定をユーザにより変更可能にする機能を持たせるようにすると共に、共起ファイル手段から関連語の候補を表示する表示手段を設けるようにすることができる。この場合には、ユーザは各閾値の設定を調節しながら、表示される関連語候補の範囲の変化を表示手段で確認することができ、どの範囲を関連語辞書に登録する語とするかを容易に判定することができる。
【００１５】
【実施例】
図１（ａ）〜図１（ｃ）は本発明の実施例の装置の概略構成を示す機能ブロック図である。図１（ａ）は文書を登録する際に、キーワードを抽出してインデックスファイルを作成するとともに、インデックスファイルに登録されたキーワードを基に、キーワードの共起の関係を保持しておく共起ファイルの作成登録のために必要な構成の概略を示している。この装置は、文書登録等の指示を与える入力手段１と、テキストベータベース部３に登録する文書からキーワードを抽出するキーワード抽出手段２と、文書等のテキストデータを保持するテキストデータベース部３と、抽出したキーワードを登録文書の識別子と対応させたインデックスをインデックスファイルに登録するインデックス登録手段４と、インデックスを保持するインデックスファイル部５と、抽出したキーワードと同一テキスト中のその他のキーワード群を、共起の関係にある語であるとして抽出する共起キーワード抽出手段６と、キーワードの共起およびその頻度を共起ファイルに登録する共起ファイル登録手段７と、キーワードの共起およびその頻度を保持する共起ファイル部８を備えている。
【００１６】
図１（ｂ）は共起ファイルから、関連語（シソーラス）の編集可能辞書ファイルを作成するための構成の概略を示しており、関連語（シソーラス）編集可能辞書ファイル構築手段９と、関連語編集可能辞書ファイル部１０を有している。
【００１７】
図１（ｃ）はユーザが関連語辞書（シソーラス辞書）を自ら構築する際、関連語候補を表示するための構成部分の概略を示している。共起ファイル部８から関連語候補を取り出して、表示可能な形式にして表示部１２に渡す関連語候補表示手段１１と，関連語候補の表示を行う表示部１２とを備えている。
【００１８】
以上のように構成された実施例の動作について説明する。
まず、本実施例では、同義語（関連語）の設定基準を以下のように設定する。
【００１９】
テキストデータベースに検索対象文書が登録されるたびに形態素解析を行いキーワード抽出をおこなうシステムの場合、
▲１▼ 形態素解析によりキーワード抽出された文字列に対して、その文字列が存在する登録文書より抽出されたその他の文字列を同義語・関連語の候補としてそれらを表すユニークなＩＤとともに関連語構築用の共起ファイルに保持しておく。
【００２０】
▲２▼ 同時に出現頻度を求めるため、既に抽出された文字列が共起ファイルあるいはインデックスファイルに存在している場合は出現頻度を１つインクリメントしておく。
【００２１】
▲３▼ また、その他の文字列が既に共起ファイルの抽出キーワードの同義語・関連語の候補として存在していれば一致回数のインクリメントを行い、存在していなければ一致回数を１とする。
【００２２】
▲４▼ 上記の操作を登録文書あるごとに繰り返す。
【００２３】
▲５▼ このように動的に同義語・関連語の候補の文字列が更新される共起ファイルの中で、キーワード抽出された文字列の出現頻度がある閾値（例えば１０箇所）以上あり、一致回数がある閾値（例えば５個）以上同一ファイルにある場合、同義語あるいは関連語とみなしそのまま、関連語ファイルに自動登録するか、あるいはユーザに提示し、ユーザの指示に応じて登録する。
【００２４】
▲６▼ 出現頻度ならびに一致回数の閾値はデフォルト値を与えかつユーザが変更可能とする。図４に出現頻度ならびに一致回数の閾値の内容の例を示す。
【００２５】
なお、出現頻度とは、あるキーワードに関して、すべての登録文書ファイルに対して、出現した回数の総和である。
またインデックスは図２（ａ）に示すようにキーワード（文字列あるいはＩＤ）とそれに対する文書ｆｉｌｅＩＤのエントリーを持つ構造のものと、図２（ｂ）に示すように文書ｆｉｌｅＩＤとそれに対するキーワード（文字列あるいはＩＤ）のエントリーを持つ構造のものとがある。
そして、関連語構築用の共起ファイルのエントリーの構造は、図４（ａ）（ｂ）に示すようにユーザ定義関連語ＩＤ、キーワード、出現頻度、対応キーワードペア（［対応キーワード，一致回数］［，］，［，］，．．．．）からなる。なお、図４（ａ）（ｂ）には関連語構築用の共起ファイルのエントリーの内容例も示されている。
【００２６】
図６は抽出したキーワードと同義語・関連語との関係を保持しておく共起ファイルを作成登録する処理のフローチャートである。まず、入力手段１により、文書登録の指示が行われる（Ｓ６０）。ここでインデックス登録をおこなう（Ｓ６１）がこのときの処理を図７のフローに示す。
【００２７】
図７において形態素解析を用いたキーワード抽出手段２により、キーワードが抽出され（Ｓ７０）、抽出されたキーワードを、インデックス登録手段４によりインデックスファイル部５における図２（ａ）に示すようなキーワード対文書ｆｉｌｅＩＤのインデックスに登録し（Ｓ７１）、また図２（ｂ）に示すような文書ｆｉｌｅＩＤ対キーワードのインデックスに登録をする（Ｓ７２）。次に関連語用の共起ファイル部８に抽出キーワードが見出し語としてあるかチェックを行い（Ｓ７３）、存在しなければ共起ファイル登録手段７は、ユーザ定義関連語ＩＤを設定しその抽出キーワードを見出し語として登録し、出現頻度を１に設定する（Ｓ７４）。また、すでに存在していれば、対応する共起ファイル部８のレコードの出現頻度をインクリメントする（Ｓ７５）。上記の操作を抽出するキーワードがなくなるまで繰り返す（Ｓ７６）。これが終了すると、図６のステップＳ６２に進む。
【００２８】
共起キーワード抽出手段６は、上記文書ｆｉｌｅＩＤ対キーワードのインデックスより、前記登録指示された対象文書のｆｉｌｅＩＤに係る１つのキーワードを対象登録文書のキーワードとして抽出する（Ｓ６２）。そして、抽出されたキーワード以外の対象登録文書に係るキーワードを共起関係にある対応キーワードとして抽出する（Ｓ６３）。次に共起ファイル登録手段７は、共起ファイル部８を参照し、対象登録文書のキーワードが見出し語となっている共起ファイル部８のエントリーに上記対応キーワードが存在しているかチェックを行う（Ｓ６４）。もし存在しなければ共起ファイル登録手段７は、対応する共起ファイル部８のエントリーの［対応キーワード，一致回数＝１］のレコードを追加する（Ｓ６５）。また、すでに存在していれば対応する共起ファイル部８のエントリーの［対応キーワード，一致回数］のレコードの一致回数をインクリメントする（Ｓ６６）。文書ｆｉｌｅＩＤ対キーワードのインデックスに残りの対応キーワードが存在しているかどうかのチェックを行い（Ｓ６７）、まだ存在していれば、抽出されたキーワード以外のキーワードを対応キーワードとして抽出する処理（Ｓ６３）に戻る。存在していなければキーワード抽出は終了したかの判断を行い（Ｓ６８）、終了していなければインデックスよりキーワードを抽出する処理に戻る（Ｓ６２）。登録した文書の全てのキーワード抽出が終了した時点で、文書登録処理が終了したかどうかの判断を行い（Ｓ６９）、登録する文書がある場合は、再度文書登録の処理（Ｓ６０）に戻る。登録する文書がなくなった時点で、処理を終了する。
【００２９】
図８は関連語の編集可能辞書ファイルを作成する処理のフローチャートである。まず、図１（ｂ）の構成において、入力手段１により、関連語の編集可能辞書ファイル構築の指示をする（Ｓ８０）。関連語編集可能辞書ファイル構築手段９は出現頻度の閾値はデフォルト値でよいかの判断をユーザに確認し（Ｓ８１）、デフォルト値以外を選択したい場合は、ユーザが指定する出現頻度の閾値に変更し（Ｓ８２）、デフォルト値でよい場合にはそのまま閾値を保持する。次に関連語編集可能辞書ファイル構築手段９は、一致回数の閾値はデフォルト値でよいかの判断をユーザに確認し（Ｓ８３）、デフォルト値以外を選択したい場合は、ユーザが指定する出現頻度の閾値に変更し（Ｓ８４）、デフォルト値でよい場合にはそのまま閾値を保持する。関連語編集可能辞書ファイル構築手段９は設定された出現頻度と一致回数の閾値以上の、条件を満足する共起ファイルのエントリーを抽出し（Ｓ８５）、抽出されたエントリーから関連語の編集可能辞書ファイルを作成する（Ｓ８６）。
【００３０】
図９は関連語候補を表示する処理のフローチャートである。まず、図１（ｃ）の構成において、入力手段１により、関連語の候補表示の指示を行う（Ｓ９０）。関連語候補表示手段１１は出現頻度の閾値はデフォルト値でよいかの判断をユーザに確認し（Ｓ９１）、デフォルト値以外を選択したい場合は、ユーザが指定する出現頻度の閾値に変更し（Ｓ９２）、デフォルト値でよい場合にはそのまま閾値を保持する。次に関連語候補表示手段１１は、一致回数の閾値はデフォルト値でよいかの判断をユーザに確認し（Ｓ９３）、デフォルト値以外を選択したい場合は、ユーザが指定する出現頻度の閾値に変更し（Ｓ９４）、デフォルト値でよい場合にはそのまま閾値を保持する。関連語候補表示手段１１は設定された出現頻度と一致回数の閾値以上の、条件を満足する共起ファイルのエントリーを抽出し（Ｓ９５）、抽出されたエントリーから関連語の候補を表示する（Ｓ９６）。
【００３１】
以下、図２（ａ）〜図６を適宜参照しながら本実施例の具体的な動作例について説明する。
【００３２】
図２（ａ）のキーワード対文書ｆｉｌｅＩＤのインデックス、図２（ｂ）の文書ｆｉｌｅＩＤ対キーワードのインデックス、図３（ａ）の関連語構築用の共起ファイルがすでに存在しており、新たに、「パンナコッタ」「ナタデココ」「ティラミス」を含む文書を登録したとする。図６のにステップＳ６１において、インデックス登録手段４はキーワード対文書ｆｉｌｅＩＤのインデックスに、図３（ａ）のように「パンナコッタ」「ナタデココ」「ティラミス」と、登録する文書のｆｉｌｅＩＤ２０２２２とのレコードとして追加する。もし「パンナコッタ」「ナタデココ」「ティラミス」がすでに登録されていれば、対応する文書のｆｉｌｅＩＤ２０２２２のみを文書ｆｉｌｅＩＤのフィールドに追加する。次にインデックス登録手段４は文書ｆｉｌｅＩＤ対キーワードのインデックスに図３（ｂ）のように登録する文書のｆｉｌｅＩＤ２０２２２のレコードを追加し、「パンナコッタ」「ナタデココ」「ティラミス」をキーワードフィールドに追加しレコードを更新する。図４（ａ）のように「パンナコッタ」がすでに９回の出現頻度で登録され、「ナタデココ」がすでに４回の出現頻度で登録されている。「ティラミス」はまだ登録されていないので、図４（ｂ）のように「ティラミス」に対してはユーザ定義関連語ＩＤＮ３を設定し「ティラミス」を見出し語として登録し、出現頻度を１に設定する。また「パンナコッタ」に対してはユーザ定義関連語ＩＤＮ１が、「ナタデココ」に対してはユーザ定義関連語ＩＤＮ２がすでに登録されており出現頻度を１インクリメントしそれぞれ１０と５とする。
【００３３】
次に、図６のＳ６２〜Ｓ６７において、図３（ｂ）の文書ｆｉｌｅＩＤ対キーワードのインデックスよりｆｉｌｅＩＤ２０２２２にあるキーワード「パンナコッタ」「ナタデココ」「ティラミス」を順次抽出する。共起ファイル登録手段７は関連語用共起ファイル部８に抽出キーワードが見出し語としてあるかチェックを行い、上記設定により「パンナコッタ」「ナタデココ」「ティラミス」が存在するので、それぞれ対応キーワードペアにｆｉｌｅＩＤ２０２２２に存在する残りのキーワードを一致回数とともに登録する。「パンナコッタ」の場合残りのキーワードは「ナタデココ」「ティラミス」であり、対応キーワードペアのフィールドに「ナタデココ」「ティラミス」は、それぞれ出現頻度＝４，１で存在しているので、図４（ｂ）のように、それぞれのペアを１インクリメントし（ナタデココ、５）、（ティラミス，２）と更新する。また「ナタデココ」の場合、残りのキーワードは「パンナコッタ」「ティラミス」であり、対応キーワードペアのフィールドに「パンナコッタ」が出現頻度＝１で存在しているので、図４（ｂ）のように、「パンナコッタ」のペアを１インクリメントし（パンナコッタ、２）と更新し、「ティラミス」という対応キーワードペアは存在しないので、新たに「ティラミス」の場合残りのキーワードは「パンナコッタ」「ナタデココ」のキーワードをであり（ティラミス，１）を追加登録する。そして「ティラミス」の場合には対応キーワードペアが存在しないので、新たに「ティラミス」の対応キーワードペアのフィールドに（パンナコッタ，１）、（ナタデココ，１）を新規登録する。
【００３４】
次に、図５（ａ）は本実施例での出現頻度ならびに一致回数の閾値のデフォルト値を示したものである。入力手段１により、関連語の編集可能辞書ファイル構築の指示をした場合、出現頻度の閾値を１０、一致回数の閾値を５と、デフォルト値のままでよいという判断をした場合、上記実施例で「パンナコッタ」「ナタデココ」「ティラミス」に着目すると、関連語編集可能辞書ファイル構築手段９は図４（ｂ）の「パンナコッタ」「ナタデココ」「ティラミス」のエントリーから出現頻度が１０以上のキーワードは「パンナコッタ」だけであるので「パンナコッタ」のみを抽出し、さらに抽出された「パンナコッタ」エントリーの対応キーワードペアから一致回数が５以上の「菓子」と「ナタデココ」を抽出し、図５（ｂ）のように関連語の編集可能辞書ファイルに登録する。出現頻度の閾値、一致回数の閾値を変更することによって、各抽出されるエントリーのキーワードも変わってくる。
【００３５】
関連語の候補表示の場合も、上記関連語の編集可能辞書ファイルの例と同様の場合を例にとれば、候補表示の指示を受けた関連語候補表示手段１１は、出現頻度と一致回数のそれぞれの設定閾値を越える「菓子」と「ナタデココ」が抽出され、表示部１２により関連語の候補の表示を行う。
【００３６】
【発明の効果】
本発明では、ユーザが検索対象としている文書から抽出したキーワードを利用して関連語辞書を構築するため、ユーザの要求に合った有用な関連語辞書を構築することができる。
【００３７】
また、本発明によれば、キーワードの共起の関係の判断を抽出した出所が同一文書単位において出現したか否かにより行って、共起の関係にあるキーワード対を生成するので、共起の関係にある語を漏れがなくかつ簡単に求めることができ、また、その出現頻度を記録して、キーワード間の関連の程度を判断する材料とするので、関連語の判断処理が簡単、確実となる。従って、データ量が増えても、実使用に十分に耐える程度のパフォーマンスで実施することができる。
【００３８】
抽出される各キーワードの出現頻度に基づいて登録する関連語を決定するようにした本発明の構成によれば、ほとんど使わないような無関係なキーワードを省くことができる。
【００３９】
また、共起ファイルをユーザの指示により表示して、ユーザの編集により関連語辞書の登録を行う本発明の構成によれば、共起ファイルからユーザに提示される「同義である」あるいは「関連する」キーワード群にはユーザの興味のある、あるいは専門として扱うキーワードを含んでいる可能性が高いので、ユーザが独自の関連語辞書を構築するための作業が容易となり、また、ユーザの要求に合った有用な関連語辞書を構築することができる。また、その際、幾つか出現する可能性のある不要な関連語も、出現頻度の閾値の設定をユーザの判断で適宜に変更することにより、よりユーザの要求に即して押さえることができる。
【図面の簡単な説明】
【図１】（ａ）は抽出したキーワードとの同義語・関連語との関係を保持しておく共起ファイルの作成登録のための構成の概略を示す図、（ｂ）は関連語の編集可能辞書ファイルを作成するための構成の概略を示す図、（ｃ）はユーザが関連語を自ら構築する際、関連語候補を表示するための構成の概略を示す図
【図２】（ａ）はキーワード対文書ｆｉｌｅＩＤのインデックスの内容、（ｂ）は文書ｆｉｌｅＩＤ対キーワードのインデックスの内容を示す図（共に、実施例で実施前の状態）
【図３】（ａ）はキーワード対文書ｆｉｌｅＩＤのインデックスの内容、（ｂ）は文書ｆｉｌｅＩＤ対キーワードのインデックスの内容を示す図（共に、実施例で実施後の状態）
【図４】（ａ）は関連語構築用の共起ファイルのエントリーの内容（実施例で実施前の状態）、（ｂ）は関連語構築用の共起ファイルのエントリーの内容（実施例で実施後の状態）を示す図
【図５】（ａ）は出現頻度ならびに一致回数の閾値の内容の一例、（ｂ）は関連語の編集可能辞書ファイルの内容の一例を示す図
【図６】抽出したキーワードと同義語・関連語との関係を保持しておく共起ファイルを作成登録する処理のフローチャート
【図７】インデックス登録処理のフローチャート
【図８】関連語の編集可能辞書ファイルを作成する処理のフローチャート
【図９】関連語候補を表示する処理のフローチャート。
【符号の説明】
１…入力手段、２…キーワード抽出手段、３…テキストデータベース部、４…インデックス登録手段、５…インデックスファイル部、６…共起キーワード抽出手段、７…共起ファイル登録手段、８…共起ファイル、９…関連語編集可能辞書ファイル構築手段、１０…関連語編集可能辞書ファイル部、１１…関連語候補表示手段、１２…表示部。
[0001]
[Industrial application fields]
The present invention relates to the creation and maintenance of a thesaurus for a text search apparatus, and more particularly to an apparatus that supports the construction and maintenance of a thesaurus based on co-occurrence relationships among keywords in the documents registered in the text search apparatus.
[0002]
[Prior art]
One way to retrieve documents without omission is to expand keywords with a thesaurus dictionary and use the expanded keywords as search keys. A thesaurus dictionary registers words together with their related words and is an effective aid to searching. However, building one is complicated. Devices that support the updating of a personal thesaurus, and devices that automatically create a dynamic thesaurus, have therefore been proposed, for example as follows.
[0003]
(1) Japanese Patent Laid-Open No. 4-39769, “Thesaurus Generation Device”, proposes a technique for automatically generating a thesaurus: an input document undergoes morphological analysis and sentence structure analysis, the semantic relationships between words are determined using a rule dictionary, and related words are automatically grouped into a tree structure based on the relationships obtained.
[0004]
(2) Japanese Patent Laid-Open No. 4-64171, “Keyword Association Generation Device”, proposes calculating node weights on a thesaurus from a sample of the search target text, attaching those weights to the thesaurus, and generating associated keywords using the node weights of the resulting dynamic thesaurus.
[0005]
(3) Japanese Patent Laid-Open No. 4-123264, “Related Word Table Creation Device and Document Search Device”, proposes a technique that builds a semantic representation table from natural language sentences and then creates a related word table from it through sentence analysis, semantic analysis, and the like.
[0006]
(4) Japanese Patent Laid-Open No. 4-222055, “Personal Thesaurus Creation Support Device”, proposes a technique that supports thesaurus creation: character strings or words that failed analysis or matching against the thesaurus at search time are registered as thesaurus candidates, giving the user organized material for deciding whether to adopt them when updating the thesaurus.
[0007]
[Problems to be solved by the invention]
However, methods that analyze documents down to sentence structure or semantic relationships in order to calculate the degree of association between keywords, broader/narrower relationships, antonyms, and the like, as in prior art (1) (Japanese Patent Laid-Open No. 4-39769) and prior art (3) (Japanese Patent Laid-Open No. 4-123264), have the drawback that the processing is heavy when performed at document registration, so registering a large number of documents takes too long. In prior art (1), moreover, related word groups must be organized as a tree-structured thesaurus, which limits the freedom of the structure. In prior art (2) (Japanese Patent Laid-Open No. 4-64171), which creates a dynamic thesaurus in order to generate associated keywords, words are extracted from only a part of the search target text rather than all of it, so words needed as thesaurus candidates may be missed. Finally, a method that feeds back search-time results, as in prior art (4) (Japanese Patent Laid-Open No. 4-222055), has the drawback that only what was given as a search expression is reflected in the thesaurus, or that even when a word that should be registered in the thesaurus can be extracted, its relationship to the other words in the same word group remains unknown.
[0008]
An object of the present invention is to make it possible to relatively easily extract and determine a keyword and its related word from data of a search target document and to automatically register it in a related word dictionary.
Another object of the present invention is to keep the processing light so that users themselves can efficiently create a related word dictionary for these keywords with performance sufficient for practical use.
It is another object of the present invention to eliminate irrelevant keywords that are not used at all, and to select and register terms of high interest of the user as related terms.
[0009]
[Means for Solving the Problems]
The present invention is a related word dictionary creation device comprising: keyword extraction means (2) for extracting keywords from documents to be searched; co-occurrence file means (7, 8) for treating, among the extracted keywords, those keywords whose source of extraction is the same document unit as keywords in a co-occurrence relationship, obtaining keyword pairs from the co-occurring keywords, obtaining the appearance frequency of each keyword pair, and holding these in a co-occurrence file; and related word dictionary registration means (9) for referring to the co-occurrence file and registering in a related word dictionary each keyword pair whose appearance frequency exceeds a set threshold, with one keyword as the headword and the other as its related word. The co-occurrence file means also holds the appearance frequency of each keyword in addition to the appearance frequency of each keyword pair, and the related word dictionary registration means, referring to the co-occurrence file means, registers a keyword pair in the related word dictionary when the appearance frequency of the pair exceeds its set threshold and the appearance frequency of the keyword to be used as the headword exceeds another set threshold.
[0010]
In one aspect of the present invention, the related word dictionary registration unit has a function of editing a related word dictionary by a user, and a function of changing a threshold value of the appearance frequency of the keyword pair or a threshold value of the appearance frequency of the keyword, And a display means for displaying related word candidates related to an arbitrary keyword from the co-occurrence file means in accordance with a user designation.
[0011]
In another aspect of the present invention, the related word dictionary registration means has a function of editing a related word dictionary by a user and a function of changing a threshold value of the appearance frequency of the keyword pair or a threshold value of the appearance frequency of the keyword. And a means (12) for displaying related word candidates relating to an arbitrary keyword from the co-occurrence file means in accordance with a user's designation.
[0012]
[Action]
The co-occurrence file means obtains keyword pairs having a co-occurrence relationship from the keyword group extracted by the keyword extraction means. Keywords extracted from the same document are treated as co-occurring. Alternatively, when the source of a keyword group can be identified at the level of a document component, as in a structured document, keywords extracted from the same component are treated as co-occurring. Because co-occurrence is judged solely by whether the document units from which the keywords were extracted are the same, the judgment is simple. For each co-occurring keyword pair, the frequency with which it appears in the documents is recorded. The related word dictionary registration means creates the related word dictionary by referring to the co-occurrence file thus created. The appearance frequency of a keyword pair in the co-occurrence file represents the strength of the relationship between the two keywords, so when this frequency is larger than an appropriately set threshold the two keywords are regarded as related words, and one is registered in the related word dictionary as the headword with the other as its related word.
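The pair counting described in this paragraph can be illustrated with a minimal sketch. It assumes each document unit is already reduced to a list of keywords, and it counts unordered pairs in a single table, which is a simplification of the per-headword co-occurrence file of the embodiment. The function names `count_pairs` and `related_pairs` are hypothetical and do not appear in the patent.

```python
from collections import Counter
from itertools import combinations


def count_pairs(doc_keywords):
    """Count, for every unordered keyword pair, the number of document
    units in which both keywords were extracted (hypothetical helper)."""
    pair_freq = Counter()
    for keywords in doc_keywords:
        # Keywords from the same document unit are treated as co-occurring.
        for a, b in combinations(sorted(set(keywords)), 2):
            pair_freq[(a, b)] += 1
    return pair_freq


def related_pairs(pair_freq, threshold):
    """Keep pairs whose frequency is larger than the threshold."""
    return {pair for pair, n in pair_freq.items() if n > threshold}


docs = [["a", "b", "c"], ["a", "b"], ["b", "c"]]
print(related_pairs(count_pairs(docs), threshold=1))
```

No parsing of sentence structure is involved; membership in the same document unit is the only test, which is the source of the low processing cost claimed above.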
[0013]
The more frequently a keyword appears in the search target text, the more likely a user is to use it as a search keyword. Therefore, by also obtaining the appearance frequency of each keyword on its own and adding it, alongside the appearance frequency of the keyword pair described above, as a criterion for determining related words, keywords that are rarely used are no longer registered as headwords of the related word dictionary, and an efficient related word dictionary is obtained.
[0014]
As described above, the related word dictionary registration means can be given a function that lets the user edit the related word dictionary and a function that lets the user change each threshold setting, and display means can be provided for displaying related word candidates from the co-occurrence file means. In this case, the user can adjust each threshold while checking on the display means how the range of displayed related word candidates changes, and can easily decide which range of words to register in the related word dictionary.
[0015]
[Embodiment]
FIG. 1A to FIG. 1C are functional block diagrams showing the schematic configuration of an apparatus according to an embodiment of the present invention. FIG. 1A outlines the configuration needed, when a document is registered, to extract keywords and create an index file, and to create and register a co-occurrence file that holds the co-occurrence relationships of keywords based on the keywords registered in the index file. The apparatus includes input means 1 for giving instructions such as document registration, keyword extraction means 2 for extracting keywords from a document to be registered in the text database unit 3, the text database unit 3 for holding text data such as documents, index registration means 4 for registering in an index file an index that associates each extracted keyword with the identifier of the registered document, an index file unit 5 for holding the index, co-occurrence keyword extraction means 6 for extracting the other keywords in the same text as an extracted keyword as words that co-occur with it, co-occurrence file registration means 7 for registering keyword co-occurrences and their frequencies in the co-occurrence file, and a co-occurrence file unit 8 for holding keyword co-occurrences and their frequencies.
[0016]
FIG. 1B shows an outline of the configuration for creating an editable dictionary file of related words (thesaurus) from the co-occurrence file. It comprises related word (thesaurus) editable dictionary file construction means 9 and a related word editable dictionary file unit 10.
[0017]
FIG. 1C shows an outline of the components for displaying related word candidates when the user builds a related word dictionary (thesaurus dictionary) personally. It comprises related word candidate display means 11, which retrieves related word candidates from the co-occurrence file unit 8 and passes them to the display unit 12 in a displayable format, and the display unit 12, which displays the related word candidates.
[0018]
The operation of the embodiment configured as described above will be described.
First, in this embodiment, the setting criteria for synonyms (related words) are set as follows.
[0019]
For a system that extracts keywords by performing morphological analysis each time a search target document is registered in the text database,
(1) For each character string extracted as a keyword by morphological analysis, the other character strings extracted from the registered document that contains it are held as synonym/related word candidates, together with unique IDs representing them, in a co-occurrence file used for building related words.
[0020]
(2) In order to obtain the appearance frequency at the same time, the appearance frequency is incremented by one when an already extracted character string exists in the co-occurrence file or the index file.
[0021]
(3) If another character string already exists as a synonym / related word candidate of the extracted keyword of the co-occurrence file, the number of matches is incremented, and if it does not exist, the number of matches is set to 1.
[0022]
(4) The above operations are repeated for each document registered.
[0023]
(5) In the co-occurrence file, whose synonym/related word candidate strings are dynamically updated in this way, when the appearance frequency of a character string extracted as a keyword is at or above a threshold (for example, 10) and its number of matches with a candidate in the same file is at or above a threshold (for example, 5), the candidate is regarded as a synonym or related word. It is then either registered automatically in the related word file as it is, or presented to the user and registered in accordance with the user's instruction.
[0024]
(6) Default values are given for the appearance frequency threshold and the number-of-matches threshold, and the user can change them. FIG. 5A shows an example of the contents of these thresholds.
[0025]
The appearance frequency is the total number of times a given keyword has appeared across all registered document files.
The index takes two forms: one, shown in FIG. 2A, whose entries map a keyword (character string or ID) to the document fileIDs associated with it, and one, shown in FIG. 2B, whose entries map a document fileID to its keywords (character strings or IDs).
Then, as shown in FIGS. 4A and 4B, the entry structure of the co-occurrence file for related word construction is as follows: user-defined related word ID, keyword, appearance frequency, corresponding keyword pair ([corresponding keyword, number of matches] [,], [,], ...). FIGS. 4 (a) and 4 (b) also show an example of the contents of a co-occurrence file entry for constructing a related word.
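As a rough illustration of the entry structure just described, the following sketch models one co-occurrence file entry as a Python dataclass. The class and field names are assumptions chosen for readability; the patent names only the four fields (user-defined related word ID, keyword, appearance frequency, corresponding keyword pairs) and does not prescribe a storage format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CooccurrenceEntry:
    """One entry of the co-occurrence file (field names are hypothetical)."""
    related_word_id: str          # user-defined related word ID, e.g. "N1"
    keyword: str                  # the headword of this entry
    appearance_freq: int = 0      # appearance frequency of the keyword
    # corresponding keyword pairs: corresponding keyword -> number of matches
    pairs: Dict[str, int] = field(default_factory=dict)


# State of the "Panna Cotta" entry before the operation example given later
# in the description (appearance frequency 9, pair counts 4 and 1).
entry = CooccurrenceEntry("N1", "Panna Cotta", 9,
                          {"Nata de Coco": 4, "Tiramisu": 1})
print(entry)
```

A mapping from corresponding keyword to count is used here instead of a list of [corresponding keyword, number of matches] records so that the existence check and the increment are both single lookups.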
[0026]
FIG. 6 is a flowchart of processing for creating and registering a co-occurrence file that holds the relationship between the extracted keyword and the synonym / related word. First, a document registration instruction is given by the input means 1 (S60). Here, index registration is performed (S61), and the processing at this time is shown in the flow of FIG.
[0027]
In FIG. 7, keywords are extracted by the keyword extraction means 2 using morphological analysis (S70). The index registration means 4 registers each extracted keyword in the keyword-to-document fileID index of the index file unit 5, as shown in FIG. 2A (S71), and also in the document fileID-to-keyword index, as shown in FIG. 2B (S72). Next, it is checked whether the extracted keyword exists as a headword in the co-occurrence file unit 8 for related words (S73). If it does not, the co-occurrence file registration means 7 assigns a user-defined related word ID, registers the extracted keyword as a headword, and sets its appearance frequency to 1 (S74). If it already exists, the appearance frequency in the corresponding record of the co-occurrence file unit 8 is incremented (S75). These operations are repeated until there are no more keywords to extract (S76). When they are finished, the process proceeds to step S62 in FIG. 6.
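Steps S71 to S76 can be sketched as follows, assuming the keywords have already been extracted (S70 is outside the sketch) and that both indexes and the co-occurrence file are plain in-memory dictionaries. The function name, the dictionary layout, and the way a new related word ID is generated ("N" plus a running number) are assumptions for illustration only.

```python
def register_keywords(cooc, kw_to_files, file_to_kws, file_id, keywords):
    """Index registration for one document (hypothetical sketch of FIG. 7)."""
    for kw in keywords:                                   # repeated until S76
        kw_to_files.setdefault(kw, []).append(file_id)    # S71
        file_to_kws.setdefault(file_id, []).append(kw)    # S72
        if kw not in cooc:                                # S73
            # S74: new headword; the ID scheme here is an assumption.
            cooc[kw] = {"id": f"N{len(cooc) + 1}", "freq": 1, "pairs": {}}
        else:
            cooc[kw]["freq"] += 1                         # S75


cooc = {
    "Panna Cotta": {"id": "N1", "freq": 9, "pairs": {}},
    "Nata de Coco": {"id": "N2", "freq": 4, "pairs": {}},
}
kw_to_files, file_to_kws = {}, {}
register_keywords(cooc, kw_to_files, file_to_kws, 20222,
                  ["Panna Cotta", "Nata de Coco", "Tiramisu"])
print({k: (v["id"], v["freq"]) for k, v in cooc.items()})
```

The sketch increments the appearance frequency once per keyword passed in, so whether a keyword repeated within one document counts once or several times depends on what the extraction step returns.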
[0028]
The co-occurrence keyword extraction means 6 extracts, from the document fileID-to-keyword index, one keyword associated with the fileID of the document designated for registration, as a keyword of the target registered document (S62). It then extracts a keyword of the target registered document other than the extracted keyword as a corresponding keyword in a co-occurrence relationship (S63). Next, the co-occurrence file registration means 7 refers to the co-occurrence file unit 8 and checks whether the corresponding keyword exists in the entry whose headword is the keyword of the target registered document (S64). If it does not exist, the co-occurrence file registration means 7 adds a record [corresponding keyword, number of matches = 1] to that entry (S65). If it already exists, the number of matches in the [corresponding keyword, number of matches] record of that entry is incremented (S66). It is then checked whether any corresponding keywords remain in the document fileID-to-keyword index (S67); if so, the process returns to the step of extracting a keyword other than the extracted keyword as a corresponding keyword (S63). If none remain, it is determined whether keyword extraction is complete (S68); if not, the process returns to the step of extracting a keyword from the index (S62). When all keywords of the registered document have been extracted, it is determined whether document registration is complete (S69). If there is another document to register, the process returns to document registration (S60); when there are no more documents to register, the process ends.
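The nested loop of steps S62 to S68 can be sketched as below, under the same assumptions as before: in-memory dictionaries, every keyword of the document already present as a headword after index registration, and hypothetical names throughout.

```python
def update_cooccurrence(cooc, file_to_kws, file_id):
    """Update number-of-matches counts for one registered document
    (hypothetical sketch of S62 to S68 in FIG. 6)."""
    keywords = file_to_kws[file_id]
    for head in keywords:                  # S62, repeated until S68
        pairs = cooc[head]["pairs"]
        for other in keywords:             # S63, repeated until S67
            if other == head:
                continue                   # only the *other* keywords count
            if other not in pairs:         # S64
                pairs[other] = 1           # S65: new record, matches = 1
            else:
                pairs[other] += 1          # S66: increment matches


# Pair counts taken from the operation example in the description.
cooc = {
    "Panna Cotta": {"freq": 10, "pairs": {"Nata de Coco": 4, "Tiramisu": 1}},
    "Nata de Coco": {"freq": 5, "pairs": {"Panna Cotta": 1}},
    "Tiramisu": {"freq": 1, "pairs": {}},
}
file_to_kws = {20222: ["Panna Cotta", "Nata de Coco", "Tiramisu"]}
update_cooccurrence(cooc, file_to_kws, 20222)
print({k: v["pairs"] for k, v in cooc.items()})
```

Each document with n keywords costs n(n-1) counter updates and needs no linguistic analysis beyond keyword extraction, which is consistent with the stated aim of keeping registration light.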
[0029]
FIG. 8 is a flowchart of the processing for creating an editable dictionary file of related words. First, in the configuration of FIG. 1B, construction of the editable dictionary file of related words is instructed through the input means 1 (S80). The related word editable dictionary file construction means 9 asks the user whether the default appearance frequency threshold is acceptable (S81). If the user wants a value other than the default, the threshold is changed to the appearance frequency threshold specified by the user (S82); otherwise the default is kept. Next, the related word editable dictionary file construction means 9 asks the user whether the default number-of-matches threshold is acceptable (S83). If the user wants a value other than the default, the threshold is changed to the number-of-matches threshold specified by the user (S84); otherwise the default is kept. The related word editable dictionary file construction means 9 then extracts the co-occurrence file entries that satisfy the condition of meeting or exceeding the set appearance frequency and number-of-matches thresholds (S85), and creates the editable dictionary file of related words from the extracted entries (S86).
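The selection in S85 and S86 amounts to a two-level filter, which the following sketch illustrates. It assumes the in-memory dictionary layout used for illustration here, uses "at or above the threshold" as in this embodiment (the summary of the invention speaks of exceeding the threshold), and takes the default values 10 and 5 from the example thresholds mentioned in the description. The function name is hypothetical.

```python
def build_related_word_dictionary(cooc, freq_threshold=10, match_threshold=5):
    """Return {headword: [related words]} for entries meeting both
    thresholds (hypothetical sketch of S85 and S86)."""
    dictionary = {}
    for headword, entry in cooc.items():
        # First filter: appearance frequency of the headword itself.
        if entry["freq"] < freq_threshold:
            continue
        # Second filter: number of matches of each corresponding keyword.
        related = [w for w, n in entry["pairs"].items()
                   if n >= match_threshold]
        if related:
            dictionary[headword] = related
    return dictionary


cooc = {
    "Panna Cotta": {"freq": 10, "pairs": {"Nata de Coco": 5, "Tiramisu": 2}},
    "Nata de Coco": {"freq": 5, "pairs": {"Panna Cotta": 2, "Tiramisu": 1}},
    "Tiramisu": {"freq": 1, "pairs": {"Panna Cotta": 1, "Nata de Coco": 1}},
}
print(build_related_word_dictionary(cooc))
```

Passing different `freq_threshold` and `match_threshold` values corresponds to the user overriding the defaults in S82 and S84; the same filter could serve the candidate display of FIG. 9, with the result shown instead of written to the dictionary file.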
[0030]
FIG. 9 is a flowchart of the processing for displaying related word candidates. First, in the configuration of FIG. 1C, display of related word candidates is instructed through the input means 1 (S90). The related word candidate display means 11 asks the user whether the default appearance frequency threshold is acceptable (S91). If the user wants a value other than the default, the threshold is changed to the appearance frequency threshold specified by the user (S92); otherwise the default is kept. Next, the related word candidate display means 11 asks the user whether the default number-of-matches threshold is acceptable (S93). If the user wants a value other than the default, the threshold is changed to the number-of-matches threshold specified by the user (S94); otherwise the default is kept. The related word candidate display means 11 then extracts the co-occurrence file entries that satisfy the condition of meeting or exceeding the set appearance frequency and number-of-matches thresholds (S95), and displays the related word candidates from the extracted entries (S96).
[0031]
Hereinafter, a specific operation example of the present embodiment will be described with reference to FIGS. 2A to 6 as appropriate.
[0032]
Assume that the keyword-to-document fileID index of FIG. 2A, the document fileID-to-keyword index of FIG. 2B, and the co-occurrence file for related word construction of FIG. 4A already exist, and that a document containing “Panna Cotta”, “Nata de Coco”, and “Tiramisu” is newly registered. In step S61 of FIG. 6, the index registration means 4 adds “Panna Cotta”, “Nata de Coco”, and “Tiramisu” to the keyword-to-document fileID index as records paired with fileID 20222 of the document being registered, as shown in FIG. 3A. If “Panna Cotta”, “Nata de Coco”, and “Tiramisu” are already registered, only fileID 20222 is added to the document fileID field of each. Next, the index registration means 4 adds a record for fileID 20222 to the document fileID-to-keyword index, as shown in FIG. 3B, and adds the three keywords to its keyword field. As shown in FIG. 4A, “Panna Cotta” is already registered with an appearance frequency of 9 and “Nata de Coco” with an appearance frequency of 4. “Tiramisu” is not yet registered, so, as shown in FIG. 4B, user-defined related word ID N3 is assigned to it, “Tiramisu” is registered as a headword, and its appearance frequency is set to 1. User-defined related word IDs N1 and N2 are already registered for “Panna Cotta” and “Nata de Coco”, so their appearance frequencies are incremented by 1 to 10 and 5, respectively.
[0033]
Next, in S62 to S67 of FIG. 6, the keywords “Panna Cotta”, “Nata de Coco”, and “Tiramisu” of fileID 20222 are sequentially extracted from the document fileID-to-keyword index shown in the figure. The co-occurrence file registration means 7 checks whether each extracted keyword is a headword in the co-occurrence file section 8 for related words. Since “Panna Cotta”, “Nata de Coco”, and “Tiramisu” all exist as a result of the settings above, the remaining keywords in fileID 20222 are registered for each of them together with the number of matches. In the case of “Panna Cotta”, the remaining keywords are “Nata de Coco” and “Tiramisu”. These already exist in the corresponding keyword pair field with match counts of 4 and 1, respectively, so each pair is incremented by 1 and updated to (Nata de Coco, 5) and (Tiramisu, 2). In the case of “Nata de Coco”, the remaining keywords are “Panna Cotta” and “Tiramisu”. “Panna Cotta” exists in the corresponding keyword pair field with a match count of 1, as shown in the figure, so that pair is incremented by 1 and updated to (Panna Cotta, 2); there is no corresponding keyword pair for “Tiramisu”, so (Tiramisu, 1) is additionally registered. In the case of “Tiramisu”, the remaining keywords are “Panna Cotta” and “Nata de Coco”. Since no corresponding keyword pair exists, (Panna Cotta, 1) and (Nata de Coco, 1) are newly registered in the corresponding keyword pair field of “Tiramisu”.
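The pair update described above amounts to incrementing a counter for every ordered pair of distinct keywords in the document. The following sketch reproduces the numbers in this paragraph; the nested-dictionary layout and the function name are assumptions, and the starting counts are only those the text states for the three keywords.

```python
from itertools import permutations

# headword -> {co-occurring keyword: number of matches}
# Starting values are the ones the text gives before fileID 20222 is processed.
cooccurrence = {
    "Panna Cotta": {"Nata de Coco": 4, "Tiramisu": 1},
    "Nata de Coco": {"Panna Cotta": 1},
}


def register_cooccurrence(table, keywords):
    """For each keyword, count every other keyword of the same document once.

    Assumes `keywords` contains no duplicates (one entry per keyword per document).
    """
    for head, other in permutations(keywords, 2):
        pairs = table.setdefault(head, {})       # new headword gets an empty field
        pairs[other] = pairs.get(other, 0) + 1   # new pair starts at 1


register_cooccurrence(cooccurrence, ["Panna Cotta", "Nata de Coco", "Tiramisu"])
```

Keeping one entry per headword, rather than one per unordered pair, mirrors the description, where each headword has its own corresponding keyword pair field.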
[0034]
Next, FIG. 5A shows the default threshold values for the appearance frequency and the number of matches in the present embodiment. Suppose that, when the input means 1 instructs the construction of the editable dictionary file of related words, the appearance frequency threshold is 10, the number-of-matches threshold is 5, and the user decides that the default values may be left as they are. Focusing on “Panna Cotta”, “Nata de Coco”, and “Tiramisu”, the related word editable dictionary file construction means 9 finds that, among these keywords in the co-occurrence file shown in the figure, only “Panna Cotta” has an appearance frequency of 10 or more, so only “Panna Cotta” is extracted. From the corresponding keyword pairs of the extracted “Panna Cotta” entry, “confectionery” and “Nata de Coco”, which have a match count of 5 or more, are extracted and registered as related words in the editable dictionary file, as shown in the figure. Changing the appearance frequency threshold or the number-of-matches threshold changes which keywords are extracted from each entry.
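The extraction in this example can be checked with a short sketch. The data layout and the function name are assumptions, and the match count of 6 for “confectionery” is a hypothetical value chosen only to satisfy "5 or more", since the text does not state it. The other numbers follow the embodiment.

```python
APPEARANCE_THRESHOLD = 10   # default appearance frequency threshold (FIG. 5A)
MATCH_THRESHOLD = 5         # default number-of-matches threshold (FIG. 5A)

appearance = {"Panna Cotta": 10, "Nata de Coco": 5, "Tiramisu": 1}
cooccurrence = {
    # The value 6 for "confectionery" is hypothetical; the text only says "5 or more".
    "Panna Cotta": {"confectionery": 6, "Nata de Coco": 5, "Tiramisu": 2},
    "Nata de Coco": {"Panna Cotta": 2, "Tiramisu": 1},
    "Tiramisu": {"Panna Cotta": 1, "Nata de Coco": 1},
}


def build_dictionary(appearance, cooccurrence, freq_min, match_min):
    """Return {headword: [related words]} for entries meeting both thresholds."""
    dictionary = {}
    for head, freq in appearance.items():
        if freq < freq_min:
            continue  # headword does not appear often enough
        related = [kw for kw, n in cooccurrence.get(head, {}).items() if n >= match_min]
        if related:
            dictionary[head] = related
    return dictionary


default_dict = build_dictionary(appearance, cooccurrence,
                                APPEARANCE_THRESHOLD, MATCH_THRESHOLD)
# Lower thresholds admit more headwords and more related words.
relaxed_dict = build_dictionary(appearance, cooccurrence, 5, 2)
```

With the defaults, only “Panna Cotta” is kept as a headword; with the lower thresholds, “Nata de Coco” also qualifies, which illustrates how the thresholds control what is extracted.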
[0035]
The display of related word candidates works in the same way. Taking the same case as the example of the related word editable dictionary file above, the related word candidate display means 11, having received the candidate display instruction, extracts “confectionery” and “Nata de Coco”, which satisfy the respective set thresholds, and the display unit 12 displays them as related word candidates.
[0036]
[Effect of the invention]
In the present invention, since the related word dictionary is constructed using the keywords extracted from the document that the user is searching for, it is possible to construct a useful related word dictionary that meets the user's request.
[0037]
In addition, according to the present invention, keywords are judged to co-occur when they were extracted from the same document unit, and keyword pairs having this co-occurrence relationship are generated, so related words can be obtained easily and without omission. Because the appearance frequency is recorded and used as the basis for judging the degree of association between keywords, the related word judgment process is simple and reliable. Therefore, even if the amount of data increases, the invention can be implemented with performance sufficient for practical use.
[0038]
According to the configuration of the present invention in which related words to be registered are determined based on the appearance frequency of each extracted keyword, it is possible to omit irrelevant keywords that are rarely used.
[0039]
In addition, according to the configuration of the present invention in which the co-occurrence file is displayed according to the user's instruction and the related word dictionary is registered through the user's editing, the group of “synonymous” or “related” keywords presented to the user from the co-occurrence file is likely to contain keywords that the user is interested in or treats as specialized terms. This makes it easy for users to build their own related word dictionaries, and a useful related word dictionary can be constructed. Unnecessary related words that may appear can be further suppressed, in line with the user's needs, by changing the appearance frequency threshold at the user's discretion.
[Brief description of the drawings]
FIG. 1A is a diagram showing an outline of the configuration for creating and registering a co-occurrence file that holds the relationships between extracted keywords and their synonyms and related words, FIG. 1B is a diagram showing an outline of the configuration for creating an editable dictionary file of related words, and FIG. 1C is a diagram showing an outline of the configuration for displaying related word candidates when the user constructs related words personally.
FIG. 2A shows the contents of the keyword vs. document file ID index, and FIG. 2B shows the contents of the document file ID vs. keyword index (both before implementation in the embodiment).
FIG. 3A shows the contents of the keyword vs. document file ID index, and FIG. 3B shows the contents of the document file ID vs. keyword index (both after implementation in the embodiment).
FIG. 4A shows the contents of a co-occurrence file entry for constructing related words (before implementation in the embodiment), and FIG. 4B shows the contents of a co-occurrence file entry for constructing related words (after implementation in the embodiment).
FIG. 5A shows an example of the threshold values for the appearance frequency and the number of matches, and FIG. 5B shows an example of the contents of an editable dictionary file of related words.
FIG. 6 is a flowchart of processing for creating and registering a co-occurrence file that holds the relationships between extracted keywords and their synonyms and related words.
FIG. 7 is a flowchart of index registration processing.
FIG. 8 is a flowchart of processing for creating an editable dictionary file of related words.
FIG. 9 is a flowchart of processing for displaying related word candidates.
[Explanation of symbols]
1 ... Input means, 2 ... Keyword extraction means, 3 ... Text database section, 4 ... Index registration means, 5 ... Index file section, 6 ... Co-occurrence keyword extraction means, 7 ... Co-occurrence file registration means, 8 ... Co-occurrence file section, 9 ... Related word editable dictionary file construction means, 10 ... Related word editable dictionary file section, 11 ... Related word candidate display means, 12 ... Display section.

Claims

1. A related word dictionary creation device comprising:
keyword extraction means for extracting keywords of a document to be searched;
co-occurrence file means for treating, among the extracted keywords, keywords extracted from the same document unit as keywords having a co-occurrence relationship, obtaining all keyword pairs from the keywords having the co-occurrence relationship, determining the appearance frequency of each keyword pair, and holding it in a co-occurrence file; and
related word dictionary registration means for referring to the co-occurrence file and registering in a related word dictionary a keyword pair whose appearance frequency exceeds a set threshold, with one keyword of the pair as a headword and the other as a related word,
wherein the co-occurrence file means holds the appearance frequency of each keyword in addition to the appearance frequency of each keyword pair, and
the related word dictionary registration means refers to the co-occurrence file means and registers a keyword pair in the related word dictionary when the appearance frequency of the keyword pair exceeds the set threshold and the appearance frequency of the keyword to be used as the headword in the keyword pair exceeds another set threshold.

2. The related word dictionary creation device according to claim 1, wherein the related word dictionary registration means has a function of allowing a user to edit the related word dictionary and a function of changing the threshold for the appearance frequency of keyword pairs or the threshold for the appearance frequency of keywords, the device further comprising display means for displaying, from the co-occurrence file means, candidate related words for an arbitrary keyword.

3. A related word dictionary creation method performed by a related word dictionary creation device that comprises: keyword extraction means for extracting keywords of a document to be searched; co-occurrence file means for treating, among the extracted keywords, keywords extracted from the same document unit as keywords having a co-occurrence relationship, obtaining all keyword pairs from the keywords having the co-occurrence relationship, determining the appearance frequency of each keyword pair, and holding it in a co-occurrence file; and related word dictionary registration means for referring to the co-occurrence file and registering in a related word dictionary a keyword pair whose appearance frequency exceeds a set threshold, with one keyword of the pair as a headword and the other as a related word, the method comprising:
a step in which the keyword extraction means extracts keywords of a document to be searched;
a step in which the co-occurrence file means treats, among the keywords extracted from the document to be searched, keywords extracted from the same document unit as keywords having a co-occurrence relationship, and creates and registers a co-occurrence file that holds the keyword pairs obtained from the keywords having the co-occurrence relationship, the appearance frequency of each keyword pair, and the appearance frequency of each keyword; and
a step in which the related word dictionary registration means refers to the co-occurrence file and registers in the related word dictionary a keyword pair whose appearance frequency exceeds a set threshold and whose keyword to be used as the headword has an appearance frequency exceeding another set threshold, with one keyword of the pair as the headword and the other as the related word.