JPH1185797A

JPH1185797A - Automatic document classification device, learning device, classification device, automatic document classification method, learning method, classification method and storage medium

Info

Publication number: JPH1185797A
Application number: JP9250126A
Authority: JP
Inventors: Noriko Otani (大谷 紀子); Shiro Ito (伊藤 史朗); Shogo Shibata (柴田 昇吾); Takanari Ueda (上田 隆也); Yuji Ikeda (池田 裕治)
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1997-09-01
Filing date: 1997-09-01
Publication date: 1999-03-30

Abstract

PROBLEM TO BE SOLVED: To provide an automatic document classification device that can form a vector space in which topics are accurately reflected and can therefore classify documents appropriately. SOLUTION: The automatic document classification device selects effective words from learning documents (effective word selection unit 103). By referring to the learning documents and the effective words, it obtains the number of effective words contained in each paragraph (intra-paragraph effective word count calculation unit 105). From these counts it obtains the intra-paragraph co-occurrence frequency of each pair of effective words (intra-paragraph co-occurrence frequency calculation unit 107). The effective word vector of each effective word is obtained from the intra-paragraph co-occurrence frequencies, and document vectors are obtained for the learning documents and for the document to be classified by referring to the effective word vectors. The folder vector of each category, obtained from the document vectors of the learning documents, is compared with the document vector of the document to be classified, and the category to which that document belongs is determined according to the comparison result.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an automatic document classification device that classifies documents to be classified in accordance with a user's intention, to a learning device and a classification device used in that device, to an automatic document classification method, to a learning method and a classification method used in that method, and to a storage medium for constructing the automatic document classification device.

[0002]

2. Description of the Related Art

One method of classifying documents in accordance with a user's intention uses a vector space model. In this model, the words useful for classification, the documents, and the categories are each represented as vectors, and the category to which a document belongs is determined from the directions of the vectors. Automatic document classification based on the vector space model is divided into a learning phase and a classification phase. In the learning phase, words useful for classification (hereinafter, "effective words") are selected from learning documents that have been correctly classified in advance, and each effective word is represented as a vector. This vector is called an effective word vector, and its components are obtained from appearance frequencies, word co-occurrence probabilities, and the like. The learning documents are also represented as vectors, and a folder vector representing the characteristics of each category is calculated. In the classification phase, the document to be classified is represented as a vector (hereinafter, a "document vector") using the effective word dictionary obtained in the learning phase, the document vector is compared with the folder vectors, and the category to which the document belongs is determined according to the comparison result.

[0003] The configuration of an automatic document classification device that employs this method will be described with reference to FIGS. 7 to 9. FIG. 7 is a block diagram showing the configuration of a conventional automatic document classification device, FIG. 8 is a flowchart showing the learning-phase processing procedure in the device of FIG. 7, and FIG. 9 is a flowchart showing the classification-phase processing procedure in the device of FIG. 7.

[0004] As shown in FIG. 7, the automatic document classification device includes a learning document holding unit 501 that holds learning documents, a classification target document holding unit 502 that holds documents to be classified, an effective word selection unit 503 that selects effective words from the learning documents, an effective word holding unit 504 that holds the selected effective words, an intra-document effective word count calculation unit 505 that refers to the learning documents and the effective words to obtain the number of effective words contained in each document, and an intra-document effective word count holding unit 506 that holds the obtained counts.

[0005] The intra-document effective word counts held in the holding unit 506 are supplied to an intra-document co-occurrence frequency calculation unit 507, which uses them to obtain the intra-document co-occurrence frequency of each pair of effective words. The obtained frequencies are held in an intra-document co-occurrence frequency holding unit 508 and then supplied to an effective word vector calculation unit 509, which uses them to obtain the effective word vector of each effective word. Let $c_{i,j}$ be the co-occurrence probability of effective words $T_i$ and $T_j$, and let $N$ be the number of effective words. The effective word vector $\mathbf{T}_i$ of effective word $T_i$ is then given by equation (1):

$$\mathbf{T}_i = (c_{i,1}, c_{i,2}, \ldots, c_{i,N}) \tag{1}$$

The co-occurrence probability $c_{i,j}$ is defined by equation (2).

[0006]

$$c_{i,j} = \frac{\text{number of documents containing both } T_i \text{ and } T_j}{\text{number of documents containing } T_i} \tag{2}$$

The effective word vectors obtained by the effective word vector calculation unit 509 are held in an effective word vector holding unit 510 and then supplied to a document vector calculation unit 511. The document vector calculation unit 511 refers to the effective word vectors to obtain a document vector for each learning document and each document to be classified, and the resulting document vectors are held in a document vector holding unit 512. The document vectors of the learning documents held in the document vector holding unit 512 are supplied to a folder vector calculation unit 513, which uses them to obtain the folder vector of each category. The obtained folder vectors are held in a folder vector holding unit 514.
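The document-level co-occurrence probability of equation (2) and the effective word vector of equation (1) can be sketched as follows. This is a minimal illustration rather than the device's actual implementation: the function name `cooccurrence_vectors`, the toy corpus, and the convention that a word appearing in no document receives an all-zero vector are assumptions made here, not details stated in the source.

```python
from typing import Dict, List, Set


def cooccurrence_vectors(units: List[Set[str]], effective_words: List[str]) -> Dict[str, List[float]]:
    """Return T_i = (c_{i,1}, ..., c_{i,N}) for every effective word T_i.

    c_{i,j} = (#units containing both T_i and T_j) / (#units containing T_i).
    In the conventional device a "unit" is a whole document.
    """
    vectors: Dict[str, List[float]] = {}
    for ti in effective_words:
        containing_ti = [u for u in units if ti in u]
        row: List[float] = []
        for tj in effective_words:
            if not containing_ti:
                # Assumption: a word seen in no unit gets a zero row.
                row.append(0.0)
            else:
                both = sum(1 for u in containing_ti if tj in u)
                row.append(both / len(containing_ti))
        vectors[ti] = row
    return vectors


# Hypothetical toy corpus: each document reduced to the set of effective words it contains.
documents = [
    {"camera", "lens"},
    {"camera", "printer"},
    {"printer", "toner"},
]
words = ["camera", "lens", "printer", "toner"]
vectors = cooccurrence_vectors(documents, words)
print(vectors["camera"])  # [1.0, 0.5, 0.5, 0.0]
```

Note that $c_{i,j}$ is not symmetric in general, because the denominator counts only the documents containing $T_i$.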

[0007] The folder vector of each category held in the folder vector holding unit 514 is supplied to a classification determination unit 515 together with the document vector of the document to be classified held in the document vector holding unit 512. The classification determination unit 515 compares the document vector of the document to be classified with the folder vector of each category and determines, according to the comparison result, the category to which the document belongs. The determined category is held in a classification result holding unit 516.

[0008] Next, the learning-phase processing procedure in the automatic document classification device will be described with reference to FIG. 8.

[0009] First, in step S601, words useful for classification are selected as effective words from among the words contained in the learning documents. In the following step S602, the number of the selected effective words contained in each document is obtained.

[0010] The process then proceeds to step S603, where the intra-document co-occurrence frequency of each pair of effective words is obtained from the intra-document effective word counts. In the following step S604, the effective word vectors are calculated from the intra-document co-occurrence frequencies. In step S605, the effective words are extracted from each learning document with reference to the effective word vectors, and in the following step S606 the effective word vectors of the extracted effective words are averaged to obtain the document vector of the learning document.

[0011] The process then proceeds to step S607, where the document vectors of the learning documents belonging to each category are averaged, the folder vector of the category is obtained from that average, and the process ends.
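Steps S605 to S607 amount to two rounds of averaging, which can be sketched as below. The helper names, the toy vectors, and two conventions are assumptions introduced for illustration: every occurrence of an effective word contributes to the average (the source does not say whether occurrences or distinct words are averaged), and a document with no effective words receives a zero vector.

```python
from typing import Dict, List, Sequence, Tuple


def mean_vector(vectors: Sequence[Sequence[float]], dim: int) -> List[float]:
    """Component-wise mean; an empty input yields a zero vector (an assumption)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def document_vector(tokens: Sequence[str], word_vectors: Dict[str, List[float]], dim: int) -> List[float]:
    """S605-S606: pick out the effective words and average their vectors."""
    hits = [word_vectors[t] for t in tokens if t in word_vectors]
    return mean_vector(hits, dim)


def folder_vectors(
    labelled_docs: Sequence[Tuple[str, Sequence[str]]],
    word_vectors: Dict[str, List[float]],
    dim: int,
) -> Dict[str, List[float]]:
    """S607: average the document vectors of the learning documents in each category."""
    grouped: Dict[str, List[List[float]]] = {}
    for category, tokens in labelled_docs:
        grouped.setdefault(category, []).append(document_vector(tokens, word_vectors, dim))
    return {category: mean_vector(vecs, dim) for category, vecs in grouped.items()}


# Hypothetical two-dimensional effective word vectors and labelled learning documents.
word_vectors = {"camera": [1.0, 0.0], "printer": [0.0, 1.0]}
learning_docs = [
    ("imaging", ["camera", "the", "printer"]),
    ("imaging", ["camera"]),
    ("office", ["printer", "paper"]),
]
folders = folder_vectors(learning_docs, word_vectors, dim=2)
print(folders)  # {'imaging': [0.75, 0.25], 'office': [0.0, 1.0]}
```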

[0012] When the learning phase ends, the classification phase starts. The processing procedure of the classification phase will be described with reference to FIG. 9.

[0013] In the classification phase, first, in step S701, the effective words are extracted from the document to be classified with reference to the effective word vectors obtained in step S604. In the following step S702, the vectors of the extracted effective words (the effective word vectors obtained in step S604) are averaged, and the document vector of the document to be classified is obtained from that average.

[0014] The process then proceeds to step S703, where the document vector of the document to be classified is compared with the folder vectors obtained in the learning phase, the category to which the document belongs is determined according to the comparison result, and the process ends.
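The source does not specify how the document vector and the folder vectors are compared in step S703. Because paragraph [0002] says the category is determined from the direction of the vectors, the sketch below assumes cosine similarity and picks the category whose folder vector is most similar; the function names and the sample vectors are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(doc_vec: Sequence[float], folders: Dict[str, Sequence[float]]) -> str:
    """S703 (assumed form): return the category whose folder vector is closest in direction."""
    return max(folders, key=lambda category: cosine(doc_vec, folders[category]))


# Hypothetical folder vectors from the learning phase.
folders = {"imaging": [0.75, 0.25], "office": [0.0, 1.0]}
print(classify([0.9, 0.1], folders))  # imaging
print(classify([0.1, 0.9], folders))  # office
```

Other comparison rules, such as an inner product or a similarity threshold below which no category is assigned, would fit the same interface.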

[0015]

Problems to Be Solved by the Invention

However, because the conventional automatic document classification device described above derives the effective word vectors from the intra-document co-occurrence frequency of effective words in the learning documents, effective words that appear in two paragraphs discussing different topics are also judged to co-occur. As a result, a vector space that accurately reflects topics may not be formed, and classification may not be performed properly.
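The problem can be made concrete with a single document whose two paragraphs cover unrelated topics. In the sketch below, the document, its words, and the helper `cooccurrence` are hypothetical; the ratio computed is the co-occurrence probability of equation (2), evaluated once with the whole document as the counting unit and once with each paragraph as the counting unit.

```python
from typing import List, Set


def cooccurrence(units: List[Set[str]], ti: str, tj: str) -> float:
    """(#units containing both ti and tj) / (#units containing ti); 0.0 if ti never appears."""
    with_ti = [u for u in units if ti in u]
    if not with_ti:
        return 0.0
    return sum(1 for u in with_ti if tj in u) / len(with_ti)


# One hypothetical document: paragraph 1 is about cameras, paragraph 2 about finance.
paragraphs = [["camera", "lens"], ["stock", "dividend"]]

as_document = [{word for paragraph in paragraphs for word in paragraph}]  # one unit
as_paragraphs = [set(paragraph) for paragraph in paragraphs]              # two units

doc_level = cooccurrence(as_document, "camera", "stock")
par_level = cooccurrence(as_paragraphs, "camera", "stock")
print(doc_level, par_level)  # 1.0 0.0
```

Counted per document, "camera" and "stock" appear to co-occur with probability 1.0 even though they never share a topic; counted per paragraph, the probability is 0.0.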

[0016] An object of the present invention is to provide an automatic document classification device, an automatic document classification method, and a storage medium that can form a vector space accurately reflecting topics and can therefore perform classification properly.

[0017] Another object of the present invention is to provide a learning device, a classification device, a learning method, a classification method, and a storage medium that make it possible to realize an automatic document classification system capable of forming a vector space accurately reflecting topics and of performing classification properly.

[0018]

Means for Solving the Problems

The invention according to claim 1 is an automatic document classification device that classifies documents to be classified in accordance with a user's intention by using learning documents and effective words selected from the learning documents, the device comprising: intra-text-unit effective word count calculation means for obtaining, for each text unit of the learning documents and with reference to the effective words, the number of each effective word contained in that text unit; intra-text-unit co-occurrence frequency calculation means for obtaining, with reference to the effective word counts, the intra-text-unit co-occurrence frequency of each pair of effective words; effective word vector calculation means for obtaining the effective word vector of each effective word with reference to the intra-text-unit co-occurrence frequencies; document vector calculation means for obtaining a document vector for each of the learning documents and the documents to be classified with reference to the effective word vectors; folder vector calculation means for obtaining the folder vector of each category using the document vectors obtained for the learning documents; and classification determination means for comparing the document vector obtained for a document to be classified with the folder vector of each category and determining, according to the comparison result, the category to which the document belongs.

[0019] The invention according to claim 2 is a learning device used in an automatic document classification system that classifies documents to be classified in accordance with a user's intention, the learning device obtaining the folder vector of each category that serves as the reference for determining the category to which a document to be classified belongs, and comprising: learning document holding means for holding learning documents; effective word selection means for selecting effective words from the learning documents; intra-text-unit effective word count calculation means for obtaining, for each text unit of the learning documents and with reference to the effective words, the number of each effective word contained in that text unit; intra-text-unit co-occurrence frequency calculation means for obtaining, with reference to the effective word counts, the intra-text-unit co-occurrence frequency of each pair of effective words; effective word vector calculation means for obtaining the effective word vector of each effective word with reference to the intra-text-unit co-occurrence frequencies; document vector calculation means for obtaining document vectors with reference to the effective word vectors; and folder vector calculation means for obtaining the folder vector of each category using the document vectors.

[0020] The invention according to claim 3 is a classification device used together with the learning device according to claim 2 in an automatic document classification system that classifies documents to be classified in accordance with a user's intention, the classification device determining the category to which a document to be classified belongs, and comprising: classification target document holding means for holding the document to be classified; document vector calculation means for obtaining a document vector for the document to be classified with reference to the effective word vectors obtained by the learning device; and classification determination means for comparing the document vector obtained for the document to be classified with the folder vector of each category obtained by the learning device and determining, according to the comparison result, the category to which the document belongs.

[0021] The invention according to claim 4 is an automatic document classification method for classifying documents to be classified in accordance with a user's intention by using learning documents and effective words selected from the learning documents, the method comprising the steps of: obtaining, for each text unit of the learning documents and with reference to the effective words, the number of each effective word contained in that text unit; obtaining, with reference to the effective word counts, the intra-text-unit co-occurrence frequency of each pair of effective words; obtaining the effective word vector of each effective word with reference to the intra-text-unit co-occurrence frequencies; obtaining a document vector for each of the learning documents and the documents to be classified with reference to the effective word vectors; obtaining the folder vector of each category using the document vectors obtained for the learning documents; and comparing the document vector obtained for a document to be classified with the folder vector of each category and determining, according to the comparison result, the category to which the document belongs.

[0022] The invention according to claim 5 is a learning method used in an automatic document classification system that classifies documents to be classified in accordance with a user's intention, the learning method obtaining the folder vector of each category that serves as the reference for determining the category to which a document to be classified belongs, and comprising the steps of: holding learning documents; selecting effective words from the learning documents; obtaining, for each text unit of the learning documents and with reference to the effective words, the number of each effective word contained in that text unit; obtaining, with reference to the effective word counts, the intra-text-unit co-occurrence frequency of each pair of effective words; obtaining the effective word vector of each effective word with reference to the intra-text-unit co-occurrence frequencies; obtaining document vectors with reference to the effective word vectors; and obtaining the folder vector of each category using the document vectors.

[0023] The invention according to claim 6 is a classification method used together with the learning method according to claim 5 in an automatic document classification system that classifies documents to be classified in accordance with a user's intention, the classification method determining the category to which a document to be classified belongs, and comprising the steps of: holding the document to be classified; obtaining a document vector for the document to be classified with reference to the effective word vectors obtained by the learning method; and comparing the document vector obtained for the document to be classified with the folder vector of each category obtained by the learning method and determining, according to the comparison result, the category to which the document belongs.

[0024] The invention according to claim 7 is a storage medium storing a program for constructing an automatic document classification device that classifies documents to be classified in accordance with a user's intention by using learning documents and effective words selected from the learning documents, the program comprising: an intra-text-unit effective word count calculation module that obtains, for each text unit of the learning documents and with reference to the effective words, the number of each effective word contained in that text unit; an intra-text-unit co-occurrence frequency calculation module that obtains, with reference to the effective word counts, the intra-text-unit co-occurrence frequency of each pair of effective words; an effective word vector calculation module that obtains the effective word vector of each effective word with reference to the intra-text-unit co-occurrence frequencies; a document vector calculation module that obtains a document vector for each of the learning documents and the documents to be classified with reference to the effective word vectors; a folder vector calculation module that obtains the folder vector of each category using the document vectors obtained for the learning documents; and a classification determination module that compares the document vector obtained for a document to be classified with the folder vector of each category and determines, according to the comparison result, the category to which the document belongs.

[0025] The invention according to claim 8 is a storage medium storing a learning program for constructing a learning device used in an automatic document classification system that classifies documents to be classified in accordance with a user's intention, the learning device obtaining the folder vector of each category that serves as the reference for determining the category to which a document to be classified belongs, the learning program comprising: a learning document holding module that holds learning documents; an effective word selection module that selects effective words from the learning documents; an intra-text-unit effective word count calculation module that obtains, for each text unit of the learning documents and with reference to the effective words, the number of each effective word contained in that text unit; an intra-text-unit co-occurrence frequency calculation module that obtains, with reference to the effective word counts, the intra-text-unit co-occurrence frequency of each pair of effective words; an effective word vector calculation module that obtains the effective word vector of each effective word with reference to the intra-text-unit co-occurrence frequencies; a document vector calculation module that obtains document vectors with reference to the effective word vectors; and a folder vector module that obtains the folder vector of each category using the document vectors.

[0026] The invention according to claim 9 is a storage medium storing a classification program for constructing a classification device that is used together with the storage medium according to claim 8 in an automatic document classification system that classifies documents to be classified in accordance with a user's intention, the classification device determining the category to which a document to be classified belongs, the classification program comprising: a classification target document holding module that holds the document to be classified; a document vector calculation module that obtains a document vector for the document to be classified with reference to the effective word vectors obtained by the learning program of the storage medium according to claim 8; and a classification determination module that compares the document vector obtained for the document to be classified with the folder vector of each category obtained by the learning program of the storage medium according to claim 8 and determines, according to the comparison result, the category to which the document belongs.

[0027]

Embodiments of the Invention

Embodiments of the present invention will be described below with reference to the drawings.

[0028] FIG. 1 is a block diagram showing the functional configuration of an embodiment of the automatic document classification device of the present invention, and FIG. 2 is a block diagram showing the hardware configuration of the automatic document classification device of FIG. 1.

[0029] As shown in FIG. 1, the automatic document classification device includes a learning document holding unit 101 that holds learning documents, a classification target document holding unit 102 that holds documents to be classified, an effective word selection unit 103 that selects effective words from the learning documents, an effective word holding unit 104 that holds the selected effective words, an intra-paragraph effective word count calculation unit 105 that refers to the learning documents and the effective words to obtain the number of effective words contained in each paragraph, and an intra-paragraph effective word count holding unit 106 that holds the obtained counts.

[0030] The intra-paragraph effective word counts held in the holding unit 106 are supplied to an intra-paragraph co-occurrence frequency calculation unit 107, which uses them to obtain the intra-paragraph co-occurrence frequency of each pair of effective words. The obtained frequencies are held in an intra-paragraph co-occurrence frequency holding unit 108 and then supplied to an effective word vector calculation unit 109, which uses them to obtain the effective word vector of each effective word. Let $c'_{i,j}$ be the co-occurrence probability of effective words $T_i$ and $T_j$, and let $N$ be the number of effective words. The effective word vector $\mathbf{T}'_i$ of effective word $T_i$ is then given by equation (3):

$$\mathbf{T}'_i = (c'_{i,1}, c'_{i,2}, \ldots, c'_{i,N}) \tag{3}$$

The co-occurrence probability $c'_{i,j}$ is defined by equation (4).

【００３１】
$$c'_{i,j} = \frac{\text{number of paragraphs containing both } T_i \text{ and } T_j}{\text{number of paragraphs containing } T_i} \qquad (4)$$

The valid word vectors obtained by the valid word vector calculation unit 109 are held in the valid word vector holding unit 110 and then supplied to the document vector calculation unit 111. The document vector calculation unit 111 obtains a document vector for each learning document and each classification target document by referring to the valid word vectors, and the resulting document vectors are held in the document vector holding unit 112. The document vectors of the learning documents held in the document vector holding unit 112 are supplied to the folder vector calculation unit 113, which uses them to obtain the folder vector of each category. The obtained folder vectors are held in the folder vector holding unit 114.
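Equations (3) and (4) can be illustrated with a minimal Python sketch. The function name `valid_word_vectors`, the representation of a paragraph as a list of tokens, and the zero vector for a valid word that appears in no paragraph are assumptions made for this illustration; they are not specified in the description.

```python
def valid_word_vectors(paragraphs, valid_words):
    """Return {T_i: [c'_{i,1}, ..., c'_{i,N}]} following equations (3) and (4).

    paragraphs  -- list of paragraphs, each a list of word tokens (assumed input form)
    valid_words -- ordered list of valid words T_1 .. T_N
    """
    vocab = set(valid_words)
    containing = {w: 0 for w in valid_words}                     # paragraphs containing T_i
    both = {(a, b): 0 for a in valid_words for b in valid_words}  # containing T_i and T_j
    for tokens in paragraphs:
        present = vocab.intersection(tokens)
        for a in present:
            containing[a] += 1
            for b in present:
                both[(a, b)] += 1
    vectors = {}
    for a in valid_words:
        n = containing[a]
        # Assumption: a valid word found in no paragraph gets an all-zero vector.
        vectors[a] = [both[(a, b)] / n if n else 0.0 for b in valid_words]
    return vectors
```

Under equation (4) the component $c'_{i,i}$ is always 1 for a word that occurs at least once, and $c'_{i,j}$ is generally not equal to $c'_{j,i}$ because the denominators differ.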

【００３２】The folder vector of each category held in the folder vector holding unit 114 is supplied to the classification determination unit 115 together with the document vector of the classification target document held in the document vector holding unit 112. The classification determination unit 115 compares the document vector of the classification target document with the folder vector of each category and determines, according to the comparison result, the category to which the classification target document belongs. The determined category is held in the classification result holding unit 116.
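The description states only that the document vector is compared with each folder vector. The sketch below assumes cosine similarity as the comparison and picks the category with the highest score; the measure and the names `cosine` and `decide_category` are illustrative assumptions, not taken from the description.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def decide_category(doc_vector, folder_vectors):
    """Return the category whose folder vector is most similar to doc_vector.

    folder_vectors -- {category: vector}; cosine similarity is an assumed measure.
    """
    return max(folder_vectors, key=lambda c: cosine(doc_vector, folder_vectors[c]))
```

Any other comparison, such as Euclidean distance or an inner product, could be substituted without changing the surrounding flow.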

【００３３】As shown in FIG. 2, the hardware configuration of this automatic document classification apparatus includes a central processing unit 203 that executes a control program stored in a ROM 201 to perform the control described later (the control shown in FIGS. 3 and 4). A RAM 202 serves as the work area for the arithmetic processing of the central processing unit 203 and also provides the storage areas for the valid word holding unit 104, the intra-paragraph co-occurrence frequency holding unit 108, the document vector holding unit 112, and the classification result holding unit 116.

【００３４】A hard disk device 204 is connected to the central processing unit 203 via a bus 205, together with the ROM 201 and the RAM 202. The hard disk device 204 constitutes the learning document holding unit 101, the classification target document holding unit 102, the valid word vector holding unit 110, and the folder vector holding unit 114. These four holding units may also be implemented on another storage medium in place of the hard disk device 204.

【００３５】Next, the processing executed by this automatic document classification apparatus will be described with reference to FIGS. 3 and 4. FIG. 3 is a flowchart showing the processing procedure of the learning phase in the automatic document classification apparatus of FIG. 1, and FIG. 4 is a flowchart showing the processing procedure of the classification phase in the same apparatus.

【００３６】The processing in this automatic document classification apparatus is divided into a learning phase and a classification phase. The processing procedure of the learning phase is described first, with reference to FIG. 3.

【００３７】In the learning phase, as shown in FIG. 3, words useful for classification are first selected as valid words from among the words contained in the learning documents in step S301. Then, in step S302, the number of the selected valid words contained in each paragraph is obtained.
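Step S302 can be sketched as follows, assuming each paragraph is already tokenized into a list of words and the valid words from step S301 are given. The function name `valid_word_counts` and the use of `collections.Counter` are choices made for this illustration.

```python
from collections import Counter

def valid_word_counts(paragraphs, valid_words):
    """For each paragraph, count how many times each selected valid word occurs.

    paragraphs  -- list of paragraphs, each a list of word tokens (assumed input form)
    valid_words -- iterable of the valid words selected in step S301
    """
    vocab = set(valid_words)
    return [Counter(t for t in tokens if t in vocab) for tokens in paragraphs]
```

A paragraph with no valid words yields an empty counter, so it contributes nothing to the later co-occurrence counts.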

【００３８】The process then proceeds to step S303, where the intra-paragraph co-occurrence frequency of each pair of valid words is obtained from the number of valid words in each paragraph. In step S304, the valid word vectors are calculated from the intra-paragraph co-occurrence frequencies. In step S305, valid words are extracted from each learning document by referring to the valid word vectors, and in step S306 the valid word vectors of the extracted valid words are averaged to obtain the document vector of the learning document.
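Steps S305 and S306 amount to a lookup followed by an element-wise mean. In the sketch below, the name `document_vector`, the choice to count every occurrence of a valid word (rather than each distinct word once), and the `None` result for a document with no valid words are assumptions; the description does not settle these points.

```python
def document_vector(tokens, word_vectors):
    """Average the valid word vectors of the valid words found in a document.

    tokens       -- the document as a list of word tokens (assumed input form)
    word_vectors -- {valid word: vector}, as calculated in step S304
    """
    hits = [word_vectors[t] for t in tokens if t in word_vectors]  # step S305
    if not hits:
        return None  # Assumption: no valid words means no document vector.
    dim = len(hits[0])
    return [sum(v[k] for v in hits) / len(hits) for k in range(dim)]  # step S306
```

The same function would serve steps S401 and S402 of the classification phase, since those steps apply the same averaging to a classification target document.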

【００３９】The process then proceeds to step S307, where the document vectors of the learning documents belonging to each category are averaged and the folder vector of the category is obtained from that average. This completes the processing.
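Step S307 can be sketched as a per-category mean of document vectors. The name `folder_vectors` and the input form, a sequence of (category, vector) pairs, are assumptions made for this illustration.

```python
from collections import defaultdict

def folder_vectors(labelled_doc_vectors):
    """Average the document vectors of the learning documents in each category.

    labelled_doc_vectors -- iterable of (category, document vector) pairs (assumed form)
    """
    groups = defaultdict(list)
    for category, vec in labelled_doc_vectors:
        groups[category].append(vec)
    # zip(*vecs) walks the vectors component by component.
    return {c: [sum(col) / len(vecs) for col in zip(*vecs)]
            for c, vecs in groups.items()}
```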

【００４０】When the learning phase ends, the classification phase starts. The processing procedure of the classification phase will be described with reference to FIG. 4.

【００４１】In the classification phase, as shown in FIG. 4, valid words are first extracted from the classification target document in step S401 by referring to the valid word vectors calculated in step S304. In step S402, the vectors of the extracted valid words (the valid word vectors calculated in step S304) are averaged, and the document vector of the classification target document is obtained from that average.

【００４２】The process then proceeds to step S403, where the document vector of the classification target document is compared with the folder vectors obtained in the learning phase, and the category to which the classification target document belongs is determined according to the comparison result. This completes the processing.

【００４３】As described above, in this embodiment the valid word vectors are obtained from intra-paragraph co-occurrence frequencies, making use of the paragraph structure that follows changes of content within a document. Valid words that appear in two paragraphs discussing different topics are therefore not judged to co-occur. As a result, a vector space in which meaning is based on word co-occurrence and which accurately reflects topics can be formed, and classification can be performed properly.

【００４４】In this embodiment, the number of valid words in each paragraph is obtained after the selection of valid words from the learning documents has finished. Alternatively, the number of occurrences of each valid word within each paragraph may be calculated at the time the valid word candidates are extracted.

【００４５】Also, in this embodiment the valid word vector of each valid word is obtained in the learning phase after the co-occurrence frequencies of the valid word pairs have been obtained. The calculation of the co-occurrence frequencies and the calculation of the valid word vectors may instead be performed in parallel.

【００４６】Further, although this embodiment obtains the co-occurrence frequency within each paragraph, the invention is not limited to paragraphs. Other text units, such as sentences or clauses, can also be used.
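The choice of text unit can be isolated in a single splitting step, so that the co-occurrence counting itself does not change. The sketch below assumes that paragraphs are separated by blank lines and that sentences end with `.`, `!` or `?` followed by whitespace. These delimiters and the name `split_units` are illustrative assumptions, and Japanese text would need a different segmentation rule.

```python
import re

def split_units(text, unit="paragraph"):
    """Split text into the units within which co-occurrence is counted."""
    if unit == "paragraph":
        parts = re.split(r"\n\s*\n", text)          # blank line between paragraphs
    elif unit == "sentence":
        parts = re.split(r"(?<=[.!?])\s+", text)    # whitespace after end punctuation
    else:
        raise ValueError(f"unsupported unit: {unit}")
    return [p.strip() for p in parts if p.strip()]
```

A smaller unit narrows what counts as co-occurrence, which reduces counts between unrelated words but also leaves fewer observations per word pair.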

【００４７】Further, although this embodiment shows an example in which the program for executing the above processing (the function of each block) is stored in the ROM, the program may instead be supplied using another storage medium. The apparatus may also be built from circuit configurations that each provide the function of one block.

【００４８】The apparatus can also be built on an information processing apparatus such as a computer. In this case, a storage medium storing a program for executing the above processing (the function of each block) is prepared, and the automatic document classification apparatus is realized when a CPU or the like reads the program from the storage medium and executes it. Storage media that can be used to supply the program include a floppy disk, hard disk, optical disk, magneto-optical disk, CD-ROM, CD-R, magnetic tape, nonvolatile memory card, and ROM. Realizing the automatic document classification apparatus by executing the program also covers the case in which the OS running on the computer performs part or all of the processing contained in the program. It likewise covers the case in which the program supplied from the storage medium is written to a function expansion board installed in the computer or to a connected peripheral expansion unit, and a CPU provided on that board or unit then executes the written program.

【００４９】Further, the principle of the present invention can be applied both to a system composed of a plurality of devices and to an apparatus consisting of a single device.

【００５０】Further, although this embodiment describes an example in which the learning phase and the classification phase are performed on a single apparatus, the invention is not limited to this. For example, an apparatus that performs the learning phase and an apparatus that performs the classification phase may be provided separately, and documents may be classified using the two together. In this case, the apparatus that performs the learning phase obtains the valid word vectors and the folder vectors, which are then supplied to the apparatus that performs the classification phase via a portable storage medium or by communication, and classification is performed there.

【００５１】The apparatus that performs the learning phase and the apparatus that performs the classification phase will be described with reference to FIGS. 5 and 6. FIG. 5 is a block diagram showing the configuration of an embodiment of the learning apparatus of the present invention, and FIG. 6 is a block diagram showing the configuration of an embodiment of the classification apparatus of the present invention.

【００５２】As shown in FIG. 5, the apparatus that performs the learning phase includes a learning document holding unit 801 that holds learning documents, a valid word selection unit 802 that selects valid words from the learning documents, a valid word holding unit 803 that holds the selected valid words, an intra-paragraph valid word count calculation unit 804 that obtains the number of valid words contained in each paragraph by referring to the learning documents and the valid words, and an intra-paragraph valid word count holding unit 805 that holds the obtained counts.

【００５３】The number of valid words in each paragraph, held in the intra-paragraph valid word count holding unit 805, is supplied to the intra-paragraph co-occurrence frequency calculation unit 806, which uses these counts to obtain the intra-paragraph co-occurrence frequency of each pair of valid words. The obtained intra-paragraph co-occurrence frequencies are held in the intra-paragraph co-occurrence frequency holding unit 807 and then supplied to the valid word vector calculation unit 808. The valid word vector calculation unit 808 obtains the valid word vector of each valid word from the intra-paragraph co-occurrence frequencies.

【００５４】The valid word vectors obtained by the valid word vector calculation unit 808 are held in the valid word vector holding unit 809 and then supplied to the document vector calculation unit 810. The document vector calculation unit 810 obtains a document vector for each learning document by referring to the valid word vectors, and the obtained document vectors are held in the document vector holding unit 811. The document vectors of the learning documents held in the document vector holding unit 811 are supplied to the folder vector calculation unit 812, which uses them to obtain the folder vector of each category. The obtained folder vectors are held in the folder vector holding unit 813.

【００５５】The folder vector of each category held in the folder vector holding unit 813 and the valid word vectors held in the valid word vector holding unit 809 are supplied to the apparatus that performs the classification phase, either by being stored on a portable storage medium or by communication.
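The description does not define a format for handing the valid word vectors and folder vectors from the learning apparatus to the classification apparatus. As one possible illustration, the sketch below serializes both to a single JSON string that could be written to a portable medium or sent over a network; the JSON layout and the names `export_model` and `import_model` are assumptions.

```python
import json

def export_model(word_vectors, folder_vectors):
    """Serialize the learning-phase results for transfer (assumed JSON layout)."""
    return json.dumps(
        {"word_vectors": word_vectors, "folder_vectors": folder_vectors},
        ensure_ascii=False,  # keep non-ASCII words readable in the payload
    )

def import_model(payload):
    """Restore (word_vectors, folder_vectors) on the classification side."""
    data = json.loads(payload)
    return data["word_vectors"], data["folder_vectors"]
```

Only these two tables need to cross the boundary, because the classification phase never refers back to the learning documents themselves.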

【００５６】As shown in FIG. 6, the apparatus that performs the classification phase includes a classification target document holding unit 901 that holds classification target documents, a valid word vector holding unit 902 that holds the valid word vectors supplied from the apparatus that performs the learning phase via a portable storage medium or communication, a folder vector holding unit 905 that holds the folder vectors supplied from the apparatus that performs the learning phase via a portable storage medium or communication, a document vector calculation unit 903 that obtains a document vector for each classification target document by referring to the valid word vectors, and a document vector holding unit 904 that holds the document vectors obtained for the classification target documents.

【００５７】The document vector of the classification target document held in the document vector holding unit 904 is supplied to the classification determination unit 906 together with the folder vector of each category held in the folder vector holding unit 905. The classification determination unit 906 compares the document vector of the classification target document with the folder vector of each category and determines, according to the comparison result, the category to which the classification target document belongs. The determined category is held in the classification result holding unit 907.

【００５８】[0058]

【発明の効果】[Effects of the Invention] As described above, the automatic document classification apparatus according to claim 1 comprises: intra-text-unit valid word count calculation means for obtaining, for the learning documents, the number of each valid word contained in each text unit by referring to the valid words; intra-text-unit co-occurrence frequency calculation means for obtaining the intra-text-unit co-occurrence frequency of each pair of valid words by referring to the valid word counts; valid word vector calculation means for obtaining the valid word vector of each valid word by referring to the intra-text-unit co-occurrence frequencies; document vector calculation means for obtaining a document vector for each of the learning documents and the classification target document by referring to the valid word vectors; folder vector calculation means for obtaining the folder vector of each category using the document vectors obtained for the learning documents; and classification determination means for comparing the document vector obtained for the classification target document with the folder vector of each category and determining, according to the comparison result, the category to which the classification target document belongs. A vector space that accurately reflects topics can therefore be formed, and classification can be performed properly.

【００５９】The learning apparatus according to claim 2 comprises: learning document holding means for holding learning documents; valid word selection means for selecting valid words from the learning documents; intra-text-unit valid word count calculation means for obtaining, for the learning documents, the number of each valid word contained in each text unit by referring to the valid words; intra-text-unit co-occurrence frequency calculation means for obtaining the intra-text-unit co-occurrence frequency of each pair of valid words by referring to the valid word counts; valid word vector calculation means for obtaining the valid word vector of each valid word by referring to the intra-text-unit co-occurrence frequencies; document vector calculation means for obtaining document vectors by referring to the valid word vectors; and folder vector calculation means for obtaining the folder vector of each category using the document vectors. A vector space that accurately reflects topics can therefore be formed, and an automatic document classification system capable of proper classification can be realized.

【００６０】The classification apparatus according to claim 3 comprises: classification target document holding means for holding classification target documents; document vector calculation means for obtaining a document vector for a classification target document by referring to the valid word vectors obtained by the learning apparatus; and classification determination means for comparing the document vector obtained for the classification target document with the folder vector of each category obtained by the learning apparatus and determining, according to the comparison result, the category to which the classification target document belongs. A vector space that accurately reflects topics can therefore be formed, and an automatic document classification system capable of proper classification can be realized.

【００６１】The automatic document classification method according to claim 4 comprises the steps of: obtaining, for the learning documents, the number of each valid word contained in each text unit by referring to the valid words; obtaining the intra-text-unit co-occurrence frequency of each pair of valid words by referring to the valid word counts; obtaining the valid word vector of each valid word by referring to the intra-text-unit co-occurrence frequencies; obtaining a document vector for each of the learning documents and the classification target document by referring to the valid word vectors; obtaining the folder vector of each category using the document vectors obtained for the learning documents; and comparing the document vector obtained for the classification target document with the folder vector of each category and determining, according to the comparison result, the category to which the classification target document belongs. A vector space that accurately reflects topics can therefore be formed, and classification can be performed properly.

【００６２】The learning method according to claim 5 comprises the steps of: holding learning documents; obtaining, for the learning documents, the number of each valid word contained in each text unit by referring to the valid words; selecting valid words from the learning documents; obtaining the intra-text-unit co-occurrence frequency of each pair of valid words by referring to the valid word counts; obtaining the valid word vector of each valid word by referring to the intra-text-unit co-occurrence frequencies; obtaining document vectors by referring to the valid word vectors; and obtaining the folder vector of each category using the document vectors. A vector space that accurately reflects topics can therefore be formed, and an automatic document classification system capable of proper classification can be realized.

【００６３】The classification method according to claim 6 comprises the steps of: holding classification target documents; obtaining a document vector for a classification target document by referring to the valid word vectors obtained by the learning method; and comparing the document vector obtained for the classification target document with the folder vector of each category obtained by the learning method and determining, according to the comparison result, the category to which the classification target document belongs. A vector space that accurately reflects topics can therefore be formed, and an automatic document classification system capable of proper classification can be realized.

【００６４】In the storage medium according to claim 7, the program comprises: an intra-text-unit valid word count calculation module for obtaining, for the learning documents, the number of each valid word contained in each text unit by referring to the valid words; an intra-text-unit co-occurrence frequency calculation module for obtaining the intra-text-unit co-occurrence frequency of each pair of valid words by referring to the valid word counts; a valid word vector calculation module for obtaining the valid word vector of each valid word by referring to the intra-text-unit co-occurrence frequencies; a document vector calculation module for obtaining a document vector for each of the learning documents and the classification target document by referring to the valid word vectors; a folder vector calculation module for obtaining the folder vector of each category using the document vectors obtained for the learning documents; and a classification determination module for comparing the document vector obtained for the classification target document with the folder vector of each category and determining, according to the comparison result, the category to which the classification target document belongs. A vector space that accurately reflects topics can therefore be formed, and classification can be performed properly.

【００６５】In the storage medium according to claim 8, the learning program comprises: a learning document holding module for holding learning documents; a valid word selection module for selecting valid words from the learning documents; an intra-text-unit valid word count calculation module for obtaining, for the learning documents, the number of each valid word contained in each text unit by referring to the valid words; an intra-text-unit co-occurrence frequency calculation module for obtaining the intra-text-unit co-occurrence frequency of each pair of valid words by referring to the valid word counts; a valid word vector calculation module for obtaining the valid word vector of each valid word by referring to the intra-text-unit co-occurrence frequencies; a document vector calculation module for obtaining document vectors by referring to the valid word vectors; and a folder vector module for obtaining the folder vector of each category using the document vectors. A vector space that accurately reflects topics can therefore be formed, and an automatic document classification system capable of proper classification can be realized.

【００６６】In the storage medium according to claim 9, the classification program comprises: a classification target document holding module for holding classification target documents; a document vector calculation module for obtaining a document vector for a classification target document by referring to the valid word vectors obtained by the learning program of the storage medium according to claim 8; and a classification determination module for comparing the document vector obtained for the classification target document with the folder vector of each category obtained by the learning program of the storage medium according to claim 8 and determining, according to the comparison result, the category to which the classification target document belongs. A vector space that accurately reflects topics can therefore be formed, and an automatic document classification system capable of proper classification can be realized.

[Brief description of the drawings]

【図１】FIG. 1 is a block diagram showing the functional configuration of an embodiment of the automatic document classification apparatus of the present invention.

【図２】FIG. 2 is a block diagram showing the hardware configuration of the automatic document classification apparatus of FIG. 1.

【図３】FIG. 3 is a flowchart showing the processing procedure of the learning phase in the automatic document classification apparatus of FIG. 1.

【図４】FIG. 4 is a flowchart showing the processing procedure of the classification phase in the automatic document classification apparatus of FIG. 1.

【図５】FIG. 5 is a block diagram showing the configuration of an embodiment of the learning apparatus of the present invention.

【図６】FIG. 6 is a block diagram showing the configuration of an embodiment of the classification apparatus of the present invention.

【図７】FIG. 7 is a block diagram showing the configuration of a conventional automatic document classification apparatus.

【図８】FIG. 8 is a flowchart showing the processing procedure of the learning phase in the automatic document classification apparatus of FIG. 7.

【図９】FIG. 9 is a flowchart showing the processing procedure of the classification phase in the automatic document classification apparatus of FIG. 7.

[Explanation of symbols]

101, 801: learning document holding unit
102, 901: classification target document holding unit
103, 802: valid word selection unit
104, 803: valid word holding unit
105, 804: intra-paragraph valid word count calculation unit
106, 805: intra-paragraph valid word count holding unit
107, 806: intra-paragraph co-occurrence frequency calculation unit
108, 807: intra-paragraph co-occurrence frequency holding unit
109, 808: valid word vector calculation unit
110, 809, 902: valid word vector holding unit
111, 810, 903: document vector calculation unit
112, 811, 904: document vector holding unit
113, 812: folder vector calculation unit
114, 813, 905: folder vector holding unit
115, 906: classification determination unit
116, 907: classification result holding unit
201: ROM
202: RAM
203: central processing unit
204: hard disk device

Continued from the front page
(72) Inventor: Takaya Ueda, c/o Canon Inc., 3-30-2 Shimomaruko, Ota-ku, Tokyo
(72) Inventor: Yuji Ikeda, c/o Canon Inc., 3-30-2 Shimomaruko, Ota-ku, Tokyo

Claims

1. An automatic document classification apparatus for classifying a classification target document according to a user's intention using a learning document and effective words selected from the learning document, the apparatus comprising: intra-sentence-unit effective word number calculating means for obtaining, for the learning document, the number of effective words contained in each sentence unit by referring to the effective words; intra-sentence-unit co-occurrence frequency calculating means for obtaining the intra-sentence-unit co-occurrence frequency of each set of effective words by referring to the number of effective words; effective word vector calculating means for obtaining an effective word vector of each effective word by referring to the intra-sentence-unit co-occurrence frequency; document vector calculating means for obtaining a document vector for each of the learning document and the classification target document by referring to the effective word vectors; folder vector calculating means for obtaining a folder vector of each category using the document vector obtained for the learning document; and classification determining means for comparing the document vector obtained for the classification target document with the folder vector of each category and determining the category to which the classification target document belongs according to the comparison result.

2. A learning device, used in an automatic document classification system for classifying a classification target document according to a user's intention, for obtaining a folder vector of each category that serves as a reference for determining the category to which the classification target document belongs, the learning device comprising: learning document holding means for holding a learning document; effective word selecting means for selecting effective words from the learning document; intra-sentence-unit effective word number calculating means for obtaining, for the learning document, the number of effective words contained in each sentence unit by referring to the effective words; intra-sentence-unit co-occurrence frequency calculating means for obtaining the intra-sentence-unit co-occurrence frequency of each set of effective words by referring to the number of effective words; effective word vector calculating means for obtaining an effective word vector of each effective word by referring to the intra-sentence-unit co-occurrence frequency; document vector calculating means for obtaining a document vector by referring to the effective word vectors; and folder vector calculating means for obtaining a folder vector of each category using the document vector.

3. A classification device, used together with the learning device according to claim 2 in an automatic document classification system for classifying a classification target document according to a user's intention, for determining the category to which the classification target document belongs, the classification device comprising: classification target document holding means for holding the classification target document; document vector calculating means for obtaining a document vector for the classification target document by referring to the effective word vectors obtained by the learning device; and classification determining means for comparing the document vector obtained for the classification target document with the folder vector of each category obtained by the learning device and determining the category to which the classification target document belongs according to the comparison result.

4. An automatic document classification method for classifying a classification target document according to a user's intention using a learning document and effective words selected from the learning document, the method comprising the steps of: obtaining, for the learning document, the number of effective words contained in each sentence unit by referring to the effective words; obtaining the intra-sentence-unit co-occurrence frequency of each set of effective words by referring to the number of effective words; obtaining an effective word vector of each effective word by referring to the intra-sentence-unit co-occurrence frequency; obtaining a document vector for each of the learning document and the classification target document by referring to the effective word vectors; obtaining a folder vector of each category using the document vector obtained for the learning document; and comparing the document vector obtained for the classification target document with the folder vector of each category and determining the category to which the classification target document belongs according to the comparison result.

5. A learning method, used in an automatic document classification system for classifying a classification target document according to a user's intention, for obtaining a folder vector of each category that serves as a reference for determining the category to which the classification target document belongs, the method comprising the steps of: holding a learning document; selecting effective words from the learning document; obtaining, for the learning document, the number of effective words contained in each sentence unit by referring to the effective words; obtaining the intra-sentence-unit co-occurrence frequency of each set of effective words by referring to the number of effective words; obtaining an effective word vector of each effective word by referring to the intra-sentence-unit co-occurrence frequency; obtaining a document vector by referring to the effective word vectors; and obtaining a folder vector of each category using the document vector.

6. A classification method, used together with the learning method according to claim 5 in an automatic document classification system for classifying a classification target document according to a user's intention, for determining the category to which the classification target document belongs, the method comprising the steps of: holding the classification target document; obtaining a document vector for the classification target document by referring to the effective word vectors obtained by the learning method; and comparing the document vector obtained for the classification target document with the folder vector of each category obtained by the learning method and determining the category to which the classification target document belongs according to the comparison result.

7. A storage medium storing a program for constructing an automatic document classification apparatus for classifying a classification target document according to a user's intention using a learning document and effective words selected from the learning document, wherein the program comprises: an intra-sentence-unit effective word number calculation module for obtaining, for the learning document, the number of effective words contained in each sentence unit by referring to the effective words; an intra-sentence-unit co-occurrence frequency calculation module for obtaining the intra-sentence-unit co-occurrence frequency of each set of effective words by referring to the number of effective words; an effective word vector calculation module for obtaining an effective word vector of each effective word by referring to the intra-sentence-unit co-occurrence frequency; a document vector calculation module for obtaining a document vector for each of the learning document and the classification target document by referring to the effective word vectors; a folder vector calculation module for obtaining a folder vector of each category using the document vector obtained for the learning document; and a classification determination module for comparing the document vector obtained for the classification target document with the folder vector of each category and determining the category to which the classification target document belongs according to the comparison result.

8. A storage medium storing a learning program for constructing a learning device that is used in an automatic document classification system for classifying a classification target document according to a user's intention and that obtains a folder vector of each category serving as a reference for determining the category to which the classification target document belongs, wherein the learning program comprises: a learning document holding module for holding a learning document; an effective word selecting module for selecting effective words from the learning document; an intra-sentence-unit effective word number calculation module for obtaining, for the learning document, the number of effective words contained in each sentence unit by referring to the effective words; an intra-sentence-unit co-occurrence frequency calculation module for obtaining the intra-sentence-unit co-occurrence frequency of each set of effective words by referring to the number of effective words; an effective word vector calculation module for obtaining an effective word vector of each effective word by referring to the intra-sentence-unit co-occurrence frequency; a document vector calculation module for obtaining a document vector by referring to the effective word vectors; and a folder vector calculation module for obtaining a folder vector of each category using the document vector.

9. A storage medium storing a classification program for constructing a classification device that is used, together with the learning device constructed by the learning program of the storage medium according to claim 8, in an automatic document classification system for classifying a classification target document according to a user's intention, and that determines the category to which the classification target document belongs, wherein the classification program comprises: a classification target document holding module for holding the classification target document; a document vector calculation module for obtaining a document vector for the classification target document by referring to the effective word vectors obtained by the learning program of the storage medium according to claim 8; and a classification determination module for comparing the document vector obtained for the classification target document with the folder vector of each category obtained by the learning program and determining the category to which the classification target document belongs according to the comparison result.
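
As an illustration only, the learning and classification steps recited in the claims above can be sketched in Python. The claims do not fix the formulas, so the following choices are assumptions made for this sketch: a paragraph separated by a blank line serves as the sentence unit, the co-occurrence frequency of a pair is the sum over paragraphs of the product of the two counts, an effective word vector is the normalized row of the co-occurrence matrix, a document vector is the normalized sum of the vectors of the effective words it contains, a folder vector is the normalized sum of the document vectors in a category, and the comparison is a dot product (cosine similarity). The effective words are taken as given, and all function names and sample data are hypothetical.

```python
import math
from collections import Counter

def paragraph_counts(doc, effective_words):
    """Number of each effective word in every paragraph of one document."""
    vocab = set(effective_words)
    return [Counter(w for w in para.lower().split() if w in vocab)
            for para in doc.split("\n\n")]

def cooccurrence(docs, effective_words):
    """Intra-paragraph co-occurrence frequency of each pair of effective words.
    Assumption: frequency = sum over paragraphs of count_i * count_j."""
    index = {w: k for k, w in enumerate(effective_words)}
    n = len(effective_words)
    matrix = [[0.0] * n for _ in range(n)]
    for doc in docs:
        for counts in paragraph_counts(doc, effective_words):
            for wi, ci in counts.items():
                for wj, cj in counts.items():
                    matrix[index[wi]][index[wj]] += ci * cj
    return matrix

def normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def word_vectors(matrix, effective_words):
    """Assumption: an effective word vector is its normalized matrix row."""
    return {w: normalize(matrix[k]) for k, w in enumerate(effective_words)}

def document_vector(doc, vectors):
    """Normalized sum of the vectors of the effective words in the document."""
    total = [0.0] * len(next(iter(vectors.values())))
    for w in doc.lower().split():
        if w in vectors:
            total = [a + b for a, b in zip(total, vectors[w])]
    return normalize(total)

def folder_vectors(labeled_docs, vectors):
    """Normalized sum of the learning-document vectors of each category."""
    sums = {}
    for doc, category in labeled_docs:
        dv = document_vector(doc, vectors)
        prev = sums.get(category, [0.0] * len(dv))
        sums[category] = [a + b for a, b in zip(prev, dv)]
    return {c: normalize(v) for c, v in sums.items()}

def classify(doc, vectors, folders):
    """Pick the category whose folder vector is most similar to the document."""
    dv = document_vector(doc, vectors)
    return max(folders, key=lambda c: sum(a * b for a, b in zip(dv, folders[c])))

# Hypothetical learning data: (document text, category chosen by the user).
effective_words = ["engine", "wheel", "goal", "match"]
training = [("engine wheel engine\n\nwheel engine", "cars"),
            ("goal match\n\nmatch goal goal", "sports")]

matrix = cooccurrence([d for d, _ in training], effective_words)
vectors = word_vectors(matrix, effective_words)
folders = folder_vectors(training, vectors)
print(classify("the wheel came off", vectors, folders))  # prints "cars"
```

In this sketch the learning phase ends with `folder_vectors`, and only `vectors` and `folders` need to be handed to the classification phase, which mirrors the split between the learning device and the classification device in claims 2 and 3.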