JP2000148770A

JP2000148770A - Device and method for classifying question documents and record medium where program wherein same method is described is recorded

Info

Publication number: JP2000148770A
Application number: JP10315625A
Authority: JP
Inventors: Daijiro Mori; 大二郎森; Masakatsu Okubo; 雅且大久保; Masayuki Sugizaki; 正之杉崎; Kazuo Tanaka; 一男田中
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1998-11-06
Filing date: 1998-11-06
Publication date: 2000-05-30

Abstract

PROBLEM TO BE SOLVED: To retrieve, with high precision, a category matching the content of a question document, by calculating the feature quantity of each extracted category from feature quantities extracted from the constituent elements of the question documents that correspond to the answer documents included in that category. SOLUTION: Individual documents are extracted from an answer document set, and documents are extracted from a question document set (S1, S2). The individual answer documents are decomposed into words by morphological analysis, their feature vectors are calculated, and the documents are classified (S3). The classified answer documents are held (S4). Feature vectors are then calculated for the question documents, and the classification category feature quantities are calculated by associating them with the answer documents (S6). A new question document is then extracted (S7), its feature vector is calculated and its degree of fit to each held classification category is computed (S8), and a specified number of matching categories are collected from the top (S9).

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to an apparatus and a method for classifying inquiry documents, and to a recording medium on which a program describing the method is recorded. It concerns work in which answers are given to a large volume of inquiries: combinations with similar content are extracted from the history of past inquiries and answers in order to support or automate the answering work.

[0002]

[Prior art] With advances in natural language processing and improvements in computer processing power, it has become possible to extract documents with similar content from a large set of stored documents and classify them into a plurality of categories. Techniques for selecting, for a newly given document, the best-fitting category from among existing categories are also publicly known.

[0003] The following method is known for classifying a document set. First, each document to be classified is decomposed into elements in units of character strings, words, or phrases, and a feature quantity is calculated from the combination of those elements. Next, the similarity of the feature quantities is computed for every pair of documents, and clusters are formed in order, starting from the most similar pair. This process is repeated until a given degree of fit or cluster size is reached, and the resulting clusters are taken as the classification result.

[0004] Various schemes have been devised for calculating the feature quantity. For example, a known method decomposes a document into elements in units of character strings, words, or phrases as described above, determines a weight for each element from its frequency of occurrence in the document set and its frequency of occurrence in the document, and expresses the feature quantity as a vector made up of the elements and their weights.

[0005] Alternatively, a method is also known in which the degree of relatedness between the elements contained in documents is calculated by a predetermined method, the elements are placed in an n-dimensional vector space so that highly related elements lie near one another, and the feature vector of a document is calculated as the sum of the vectors of the elements it contains.

[0006] As for calculating the similarity between feature quantities, when the feature quantities are expressed as vectors, a widely used method computes the degree of fit from the inner product or the cosine of the two vectors.

[0007]

[Problem to be solved by the invention] In all of the conventional techniques, the document to be classified is itself the source of information, and classification is based on the elements that make up that document. However, in inquiry documents received from an unspecified large number of people, the vocabulary and expressions used differ from person to person, so the constituent elements of the documents may differ even when the content of the inquiries is the same.

[0008] This makes it difficult to classify a set of inquiry documents with high precision in a way that reflects the content of the inquiries.

[0009] The present invention has been made in view of the above problem in the conventional techniques, and its object is to perform classification that better reflects the content of the inquiry.

[0010]

[Means for solving the problem] Whereas the conventional techniques all use the document to be classified as the source of information and classify it on the basis of its own elements, the present invention classifies on the basis of the elements of the answer document corresponding to each inquiry document, and uses the elements of the inquiry documents when calculating the feature quantity of each category obtained as a result of the classification.

[0011] As noted above, the vocabulary and expressions used in inquiry documents are diverse, so the feature quantities extracted from them may not correspond to the content of the inquiry.

[0012] In work that answers inquiries from an unspecified large number of people, on the other hand, the number of people writing answer documents is smaller than the number of people writing inquiry documents, and the use of terminology is in many cases standardized among the respondents. The vocabulary and expressions in answer documents are therefore more uniform than those in inquiry documents, and answers to the same inquiry content strongly tend to use the same vocabulary and expressions. Consequently, even if the vocabulary in the inquiry documents varies, the feature quantities extracted from the corresponding answer documents can be expected to show high similarity as long as the content of the inquiries is the same.

[0013] Because answer documents and inquiry documents differ in the vocabulary and expressions they tend to use, feature quantities extracted from answer documents cannot be used to search for the category that fits a new inquiry document. In the present invention, however, the feature quantity of each extracted category is calculated from feature quantities extracted from the constituent elements of the inquiry documents that correspond to the answer documents in that category. A new inquiry document can therefore be taken as input, and the category that fits its content can be retrieved with high precision.

[0014]

[Embodiments of the invention] An implementation example of the inquiry document classification apparatus of the present invention is described below.

[0015] FIG. 1 shows the system configuration of this implementation example.

[0016] In the figure, reference numeral 1 denotes the answer document set, a collection of past answer documents; 2 denotes the inquiry document set, a collection of past inquiry documents; 3 denotes the answer document classifying means, which performs morphological analysis on the contents of answer document set 1, calculates feature vectors, and then classifies the contents of answer document set 1; 4 denotes the answer document classification categories; 5 denotes the classification category feature quantities; 6 denotes the classification category feature quantity calculating means; 7 denotes a new inquiry document; 8 denotes the matching category selecting means; and 9 denotes the matching category selection result. The components shown in FIG. 1 are described in turn below.

[0017] The answer document classifying means 3 is implemented by the flow shown in FIG. 2.

[0018] In FIG. 2, reference numerals 1, 3, and 4 correspond to those in FIG. 1; 31 denotes a morphological analysis unit, 32 an answer document feature vector calculation unit, 33 a classification target list, and 34 a similar document extraction unit.

[0019] As shown in FIG. 2, for each item in answer document set 1, the morphological analysis unit 31 performs morphological analysis to decompose each sentence of the answer document into words, and the answer document feature vector calculation unit 32 calculates a feature vector FV from those words. The results are held as the classification target list 33. The similar document extraction unit 34 then takes two elements at a time from the classification target list 33, calculates their similarity, and extracts highly similar combinations to form clusters, thereby obtaining the answer document classification categories 4. The details are as follows.

[0020] The morphological analysis unit 31 decomposes each sentence in an answer document into words using a well-known technique. The answer document feature vector calculation unit 32 calculates the feature vector FV from the words contained in the answer document.

[0021]
FV(i) = (w(i,1), ..., w(i,j), ..., w(i,N))
w(i,j) = tf(i,j) * log(M / df(j))
Here, i is the document number, N is the total number of words in the answer document set, tf(i,j) is the number of occurrences of word j in document i, M is the total number of answer documents, and df(j) is the number of occurrences of word j in the document set.
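The weighting in paragraph [0021] can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tfidf_vectors` and the sample tokens are hypothetical, documents are assumed to be already split into words, the natural logarithm is used, and df(j) is counted as the number of documents containing word j (the usual tf*idf reading). The text describes df(j) as a number of occurrences in the document set, so that reading may need to be revisited.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocab, one weight vector per document)."""
    vocab = sorted({w for d in docs for w in d})
    m = len(docs)  # M: total number of documents
    # ASSUMPTION: df(j) = number of documents that contain word j
    df = Counter(w for d in docs for w in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)  # tf(i, j): occurrences of word j in this document
        vectors.append([tf[w] * math.log(m / df[w]) for w in vocab])
    return vocab, vectors

# Hypothetical, already tokenized answer documents
docs = [["password", "reset", "password"], ["reset", "mail"], ["invoice", "mail"]]
vocab, fv = tfidf_vectors(docs)
for row in fv:
    print([round(x, 3) for x in row])
```

Each vector has one component per word in the vocabulary, so words that a document does not contain get weight 0.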

[0022] The feature vector FV(i) of document i is expressed as a vector in a space with as many dimensions as there are words making up the document set. The component in each dimension indicates how strongly the document is related to the element for that dimension. The term tf(i,j) is the number of occurrences of word j in document i; it reflects the view that a word appearing many times in a document is an important word characterizing that document.

[0023] The term log(M / df(j)) is the logarithm of the reciprocal of the frequency of occurrence of word j; it reflects the interpretation that the less frequently a word occurs across the documents as a whole, the more important it is. If the reciprocal of the frequency were used directly as the importance, a word whose frequency of occurrence is 1/100 would be 100 times as important, which is unnatural, so the logarithm is taken. This calculation follows the tf*idf method developed by G. Salton, which is in wide general use in text retrieval.
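The effect of taking the logarithm can be checked with a small calculation. The helper `idf_weights` and the corpus size of 1000 are hypothetical, and the natural logarithm is an assumption, since the text does not state the base.

```python
import math

def idf_weights(m, df):
    """Return (raw reciprocal frequency M/df, its logarithm) for one word."""
    raw = m / df
    return raw, math.log(raw)

m = 1000  # hypothetical total number of documents
for df in (1000, 100, 10):
    raw, logged = idf_weights(m, df)
    print(f"df={df:5d}  M/df={raw:6.1f}  log(M/df)={logged:.3f}")
```

The raw reciprocal grows by a factor of 100 between the first and last row, while the logarithm grows only additively, which is the damping the paragraph describes.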

[0024] In this embodiment the feature quantity is expressed by a vector whose elements are words in general, but other choices of vector elements are conceivable: using only independent (content) words, including affixes and compound words, or using the word sequences contained in noun phrases. Various other ways of calculating the feature quantity are also conceivable. The present invention can be applied with any method, provided that the feature quantity is a quantity calculated from the constituent elements of a document and that a similarity between such quantities can be calculated.

[0025] The similar document extraction unit 34 takes two elements at a time from the classification target list 33 and calculates the similarity for every combination.

[0026] The similarity R(i,k) between element i and element k is obtained by the following formula.

[0027]
R(i,k) = FV(i)·FV(k) / (|FV(i)|·|FV(k)|)
FV(i) and FV(k) are the vectors representing the features of the respective documents, FV(i)·FV(k) is the inner product of the vectors, and |FV(i)|·|FV(k)| is the product of their magnitudes. In other words, R(i,k) is the cosine of the angle between the two vectors. This rests on the idea that the smaller the angle between two vectors, the higher the similarity.
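A minimal sketch of the similarity R(i,k) follows. The function name `cosine` is illustrative, and returning 0.0 for a zero-length vector is an assumption, since the text does not say how that case is handled.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))       # FV(i) . FV(k)
    norm_a = math.sqrt(sum(x * x for x in a))    # |FV(i)|
    norm_b = math.sqrt(sum(y * y for y in b))    # |FV(k)|
    if norm_a == 0 or norm_b == 0:
        return 0.0  # ASSUMPTION: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

print(cosine([1.0, 2.0], [2.0, 4.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Because the value depends only on direction, a long and a short document with the same word proportions are treated as identical.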

[0028] In the initial state, all answer documents form the classification target list. The combination with the highest similarity in the classification target list is extracted, and a cluster C composed of the elements of that combination is generated. When the elements of cluster C are i and k, the feature vector of cluster C is calculated by the following formula.

[0029]
FV(C) = (u(i)*FV(i) + u(k)*FV(k)) / (u(i) + u(k))
Here, u(i) is 1 if i is a document, and the number of documents contained in i if i is a cluster.
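The weighted centroid can be written directly from the formula for FV(C). The function name `merge_centroid` and its return shape (vector plus combined document count) are illustrative choices, not taken from the text.

```python
def merge_centroid(fv_i, u_i, fv_k, u_k):
    """Combine two elements into cluster C.

    u_i and u_k are 1 for a single document, or the number of documents
    already contained in a cluster. Returns (FV(C), u(C)).
    """
    total = u_i + u_k
    centroid = [(u_i * x + u_k * y) / total for x, y in zip(fv_i, fv_k)]
    return centroid, total

# Two single documents, then the resulting cluster merged with a third document
c, u = merge_centroid([1.0, 0.0], 1, [0.0, 1.0], 1)
print(c, u)
print(merge_centroid(c, u, [2.0, 2.0], 1))
```

Carrying the document count along with the vector keeps every original document equally weighted, however many merges a cluster has been through.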

[0030] This computes the centroid of the feature vectors. Because a cluster is a group containing several documents, the centroid of the vectors is computed with each cluster weighted by the number of documents it contains.

[0031] Next, i and k are removed from the classification target list 33 and cluster C is added. The combination with the highest similarity in the classification target list is then extracted again and a cluster is generated.

[0032] The above process is repeated until the maximum similarity falls below a predetermined threshold. When the repetition is complete, the clusters that are not elements of any other cluster are enumerated, and the set of documents contained in each such cluster is taken as a set of similar documents.
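The loop of paragraphs [0028] to [0032] can be sketched as follows. This is a simplified illustration, not the apparatus itself: the names `cluster` and `cosine` are hypothetical, all pairwise similarities are recomputed on every pass for clarity rather than speed, and documents that were never merged are returned as single-member groups, which is an assumption about how leftovers are reported.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster(vectors, threshold):
    """Merge the most similar pair repeatedly until the best similarity
    falls below threshold. Returns lists of document indices."""
    # Classification target list: (member document indices, feature vector)
    items = [([i], list(v)) for i, v in enumerate(vectors)]
    while len(items) > 1:
        sim, a, b = max((cosine(items[a][1], items[b][1]), a, b)
                        for a, b in combinations(range(len(items)), 2))
        if sim < threshold:
            break
        (docs_a, fv_a), (docs_b, fv_b) = items[a], items[b]
        u_a, u_b = len(docs_a), len(docs_b)
        # Centroid weighted by the number of documents on each side
        centroid = [(u_a * x + u_b * y) / (u_a + u_b) for x, y in zip(fv_a, fv_b)]
        # Remove the two merged elements and add the new cluster
        items = [it for n, it in enumerate(items) if n not in (a, b)]
        items.append((docs_a + docs_b, centroid))
    return [sorted(docs) for docs, _ in items]

groups = cluster([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]], threshold=0.8)
print(sorted(groups))
```

The threshold controls granularity: a lower value lets more distant documents merge and yields fewer, broader categories.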

[0033] The classification category feature quantity calculating means 6 is implemented by the flow shown in FIG. 3.

[0034] In FIG. 3, reference numerals 2, 4, and 6 correspond to those in FIG. 1; 61 denotes an inquiry document/answer document correspondence management unit, 62 a classification category feature vector calculation unit, 63 a classification category feature vector, and 64 a morphological analysis unit.

[0035] For every answer document included in each classification category, the inquiry document/answer document correspondence management unit 61 obtains, from the answer document classification categories 4, the inquiry document corresponding to that answer document. It manages each sentence of those inquiry documents in a form decomposed into words by way of the inquiry document set 2 and the morphological analysis unit 64. The classification category feature vector calculation unit 62 obtains the following feature vector FV, which is taken as the classification category feature vector 63.

[0036]

(Equation 1)

[0037] Here, i is the number of the classification category, N is the total number of words in the inquiry document set, C(i) is the number of answer documents included in classification category i, tf(c,j) is the number of occurrences of word j in the inquiry document corresponding to answer document c included in the classification category, M is the total number of inquiry documents, and df(j) is the number of occurrences of word j in the inquiry document set.

[0038] A classification category is obtained as a collection of similar documents. To represent the feature vector of a classification category, information on the words in the documents it contains is used. The idea is basically the same as for the feature vector of an individual document, but because the category contains several documents, the feature vector of the category as a whole is obtained by adding up the occurrence counts of the words in each document.

[0039] The matching category selecting means 8 is implemented by the flow shown in FIG. 4.

[0040] In FIG. 4, reference numerals 4, 7, and 8 correspond to those in FIG. 1; 81 denotes a morphological analysis unit, 82 a new inquiry document feature quantity calculation unit, 83 a new inquiry document feature vector, 84 a degree-of-fit calculation unit, 85 a matching category list, and 86 a sort processing unit.

[0041] The morphological analysis unit 81 takes each sentence of the newly given inquiry document from the new inquiry document 7 and decomposes it into words.

[0042] Based on the word information extracted by the morphological analysis unit 81, the new inquiry document feature quantity calculation unit 82 calculates the following inquiry document feature vector QV as the new inquiry document feature vector 83.

[0043]
QV = (tf(1), ..., tf(j), ..., tf(N))
Here, tf(j) is the number of occurrences of word j in the new inquiry document, and N is the total number of words in the inquiry document set.

[0044] The degree-of-fit calculation unit 84 extracts, from the answer document classification categories 4, all the classification categories obtained via the classification category feature quantity calculating means 6. Using the following formula, it calculates the degree of fit, score, between the feature vector obtained by the new inquiry document feature quantity calculation unit 82 and the feature vector of each classification category, and thereby obtains the matching category list 85.

[0045]
score(i) = QV·FV(i) / (|QV|·|FV(i)|)
Here, QV is the feature vector of the new inquiry document, and FV(i) is the feature vector of classification category i.
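A short sketch of QV and score(i) follows. The names `query_vector` and `score` and the three-word vocabulary are hypothetical, and dropping words that do not occur in the inquiry document set is an assumption that follows from QV having only N components.

```python
import math
from collections import Counter

def query_vector(tokens, vocab):
    """QV: raw occurrence counts of each vocabulary word in the new inquiry.
    ASSUMPTION: words outside the vocabulary are ignored."""
    tf = Counter(tokens)
    return [tf[w] for w in vocab]

def score(qv, fv):
    """Cosine between the new inquiry vector and a category vector."""
    dot = sum(x * y for x, y in zip(qv, fv))
    nq = math.sqrt(sum(x * x for x in qv))
    nf = math.sqrt(sum(y * y for y in fv))
    return dot / (nq * nf) if nq and nf else 0.0

vocab = ["bill", "login", "password"]   # hypothetical inquiry vocabulary
qv = query_vector(["password", "password", "login", "unknown"], vocab)
print(qv)
print(score(qv, [0.0, 0.4, 1.1]), score(qv, [1.0, 0.0, 0.0]))
```

Unlike the category vectors, QV carries no log(M / df) factor in the text, so rare and common words in the new inquiry are weighted only by the category side of the product.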

[0046] The aim is to find the classification category that shares many words with the new inquiry document. For this reason, QV is first decomposed into a vector of words, and score is then calculated. score is calculated as the cosine of the angle between the feature vector of each classification category and the feature vector of the new inquiry document.

[0047] The sort processing unit 86 sorts the categories in descending order of the degree of fit calculated by the degree-of-fit calculation unit 84, extracts a predetermined number of classification categories from the top, and outputs them as the search result.
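The sort-and-cut step amounts to a descending sort followed by a slice. The function name `top_categories`, the category labels, and the scores are made up for illustration.

```python
def top_categories(scores, k):
    """scores: dict mapping a category id to its degree of fit.
    Returns the k best (category, score) pairs, highest score first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

print(top_categories({"a": 0.2, "b": 0.9, "c": 0.5}, k=2))
```

Python's sort is stable, so categories with equal scores keep their original relative order.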

[0048] The inquiry document classification apparatus has been described above. A corresponding inquiry document classification method can also be considered, and a program describing that method can be prepared and recorded on a recording medium.

[0049] FIG. 5 shows a processing flow representing the inquiry document classification method according to the present invention. Steps (S1) to (S4) and steps (S6) to (S9) in FIG. 5 correspond to the processing of the means and other elements denoted by reference numerals 1 to 4 and 6 to 9 in FIG. 1 above. That is:

Step (S1): Individual documents are extracted from answer document set 1.

[0050] Step (S2): Documents are extracted from inquiry document set 2.

[0051] Step (S3): Each document is decomposed into words by morphological analysis, its feature vector is calculated, and the documents are then classified.

[0052] Step (S4): The classified answer documents are held.

[0053] Step (S6): Feature vectors are calculated for the inquiry documents, and the classification category feature quantities are calculated by associating them with the answer documents. The process then returns to step (S4), where the results are held.

[0054] Step (S7): A new inquiry document is extracted.

[0055] Step (S8): A feature vector is calculated for the new inquiry document, and its degree of fit with the classification categories held in step (S6) is calculated.

[0056] Step (S9): A predetermined number of the matching categories are collected from the top.
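Steps (S1) to (S9) can be strung together in one compact sketch. It is an illustration under several assumptions, not the patented implementation: whitespace splitting stands in for morphological analysis, each answer is paired with exactly one inquiry, df counts documents containing a word, the category weight is summed tf times log(M / df) because Equation 1 is not reproduced here, and the names `build` and `classify` and the sample texts are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def weights(tf, df, m, vocab):
    return [tf[w] * math.log(m / df[w]) for w in vocab]

def build(pairs, threshold):
    """pairs: list of (inquiry_text, answer_text). Returns (vocab, categories)."""
    answers = [a.split() for _, a in pairs]      # S1
    inquiries = [q.split() for q, _ in pairs]    # S2
    m = len(pairs)
    # S3: cluster the answer documents on their own vocabulary
    a_vocab = sorted({w for t in answers for w in t})
    a_df = Counter(w for t in answers for w in set(t))
    items = [([i], weights(Counter(t), a_df, m, a_vocab)) for i, t in enumerate(answers)]
    while len(items) > 1:
        sim, a, b = max((cosine(items[a][1], items[b][1]), a, b)
                        for a, b in combinations(range(len(items)), 2))
        if sim < threshold:
            break
        (da, va), (db, vb) = items[a], items[b]
        cen = [(len(da) * x + len(db) * y) / (len(da) + len(db)) for x, y in zip(va, vb)]
        items = [it for n, it in enumerate(items) if n not in (a, b)] + [(da + db, cen)]
    # S4, S6: describe each category by the wording of its inquiries
    q_vocab = sorted({w for t in inquiries for w in t})
    q_df = Counter(w for t in inquiries for w in set(t))
    categories = []
    for members, _ in items:
        tf = Counter()
        for c in members:
            tf.update(inquiries[c])
        categories.append((sorted(members), weights(tf, q_df, m, q_vocab)))
    return q_vocab, categories

def classify(text, q_vocab, categories, k):
    tf = Counter(text.split())                   # S7
    qv = [tf[w] for w in q_vocab]                # S8
    ranked = sorted(categories, key=lambda c: cosine(qv, c[1]), reverse=True)
    return [members for members, _ in ranked[:k]]  # S9

pairs = [
    ("forgot my password", "use the reset link to set a new password"),
    ("cannot log in", "use the reset link to set a new password now"),
    ("charged twice this month", "a refund for the duplicate charge will be issued"),
    ("bill shows double charge", "a refund for the duplicate charge will be issued soon"),
]
q_vocab, categories = build(pairs, threshold=0.5)
print(classify("I forgot how to log in", q_vocab, categories, k=1))
```

In this toy data the first two inquiries share no words, yet they end up in the same category because their answers are nearly identical, and a new inquiry that mixes wording from both is matched to that category.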

[0057] The inquiry classification method shown in FIG. 5 can be described in the form of a program, and that program can be recorded on a recording medium. The present invention therefore also includes such a recording medium within its technical scope.

[0058]

[Effects of the invention] As described above, according to the present invention, accumulated inquiry and answer document information can be classified into a plurality of categories on the basis of its content, and for a newly received inquiry, the category whose content is closest can be selected with high precision.

[0059] For example, when an inquiry similar to a previously answered case arrives at a company's inquiry desk, the case from the earlier answer can be extracted with high precision. A company's inquiry desk, however, often answers a very large number of inquiries, and a very large number of similar cases may exist. Presenting every similar case one after another would burden the operator, so similar cases are first grouped into categories, and these categories are presented to the operator as the search result. The operator can then pick the correct information from a smaller number of candidates, which reduces the workload.

[Brief description of the drawings]

FIG. 1 is a system block diagram of the inquiry document classification apparatus of the present invention.

FIG. 2 is a processing flow of the answer document classifying means of the inquiry document classification apparatus of the present invention.

FIG. 3 is a processing flow of the classification category feature quantity calculating means of the inquiry document classification apparatus of the present invention.

FIG. 4 is a processing flow of the matching category selecting means of the inquiry document classification apparatus of the present invention.

FIG. 5 is a processing flow of the inquiry document classification method of the present invention.

[Explanation of symbols]

1 Answer document set
2 Inquiry document set
3 Answer document classifying means
4 Answer document classification categories
5 Classification category feature quantities
6 Classification category feature quantity calculating means
7 New inquiry document
8 Matching category selecting means
9 Matching category selection result

Continuation of front page
(72) Inventor: Masayuki Sugizaki, c/o Nippon Telegraph and Telephone Corporation, 3-19-2 Nishi-Shinjuku, Shinjuku-ku, Tokyo
(72) Inventor: Kazuo Tanaka, c/o Nippon Telegraph and Telephone Corporation, 3-19-2 Nishi-Shinjuku, Shinjuku-ku, Tokyo
F-terms (reference): 5B075 ND03 ND36 NK06 NK32 UU06

Claims

1. An apparatus for managing a plurality of inquiry documents and the answer documents to each inquiry in association with each other, the apparatus comprising: answer document classifying means for classifying a set of answer documents into a plurality of categories according to feature quantities extracted from each answer document; classification category feature quantity calculating means for calculating the feature quantity of each category classified by the answer document classifying means, using the feature quantities of the inquiry documents corresponding to the answer documents included in that category; and matching category selecting means for extracting, for a newly given inquiry document, from among the categories classified by the answer document classifying means, those whose feature quantity obtained by the classification category feature quantity calculating means matches the feature quantity of the new inquiry document.

2. A method for managing a plurality of inquiry documents and the answer documents to each inquiry in association with each other, the method comprising: an answer document classifying step of classifying a set of answer documents into a plurality of categories according to feature quantities extracted from each answer document; a classification category feature quantity calculating step of calculating the feature quantity of each category classified in the answer document classifying step, using the feature quantities of the inquiry documents corresponding to the answer documents included in that category; and a matching category selecting step of extracting, for a newly given inquiry document, from among the categories classified in the answer document classifying step, those whose feature quantity obtained in the classification category feature quantity calculating step matches the feature quantity of the new inquiry document.

3. A recording medium on which a program is recorded, the program describing a method for managing a plurality of inquiry documents and the answer documents to each inquiry in association with each other, the method comprising: an answer document classifying step of classifying a set of answer documents into a plurality of categories according to feature quantities extracted from each answer document; a classification category feature quantity calculating step of calculating the feature quantity of each category classified in the answer document classifying step, using the feature quantities of the inquiry documents corresponding to the answer documents included in that category; and a matching category selecting step of extracting, for a newly given inquiry document, from among the categories classified in the answer document classifying step, those whose feature quantity obtained in the classification category feature quantity calculating step matches the feature quantity of the new inquiry document.