CN107862051A

CN107862051A - Text classification method, system, and text classification device

Info

Publication number: CN107862051A
Application number: CN201711091476.0A
Authority: CN
Inventors: 毕银龙
Original assignee: Zhengzhou Yunhai Information Technology Co Ltd
Current assignee: Zhengzhou Yunhai Information Technology Co Ltd
Priority date: 2017-11-08
Filing date: 2017-11-08
Publication date: 2018-03-30

Abstract

The invention discloses a text classification method comprising: preprocessing the collected text and segmenting the preprocessed text into words to obtain a word sequence; removing the stop words from the word sequence to obtain a current word sequence, and adding the feature words in the current word sequence whose TF-IDF weight exceeds a preset value to a corpus dictionary; representing the feature words in the corpus dictionary as vectors using a VSM (vector space model) to obtain a vector matrix; and inputting the vector matrix into a classification model to train the classification model, so that unknown text can be classified. The text classification method provided by the embodiments of the invention thus integrates the whole path from text preprocessing to the output of classification results. The invention also discloses a text classification system, a text classification device, and a computer-readable storage medium, which achieve the same technical effect.

Description

Text classification method, system, and text classification device

Technical field

The present invention relates to the field of text classification technology, and more specifically to a text classification method, a text classification system, a text classification device, and a computer-readable storage medium.

Background

In the current era of big data, the network is filled with massive amounts of data. How to use these data, process them further, and mine useful information to create value is a current research focus in big data and data mining. Identifying the data a user is interested in by means of a classification algorithm is a common data mining technique.

Text classification in the prior art mainly comprises: word segmentation, feature extraction, vectorized text representation, classification model training, and class prediction. Improvements to the classification process have concentrated on feature extraction, vectorized text representation, and the classification algorithm. The main feature extraction methods include TF, TF-IDF, and information gain; the main vectorized representations include the vector space model (VSM), word vectors, and agent models; the main classification algorithms include naive Bayes, logistic regression, SVM, random forest, and neural networks. Much current research on text classification concentrates on representing text with the low-dimensional word vectors of feature items produced by word2vec and then classifying with a deep learning model.

Therefore, how to integrate the whole path from text preprocessing to the output of classification results is a problem that those skilled in the art need to solve.

Summary of the invention

The object of the invention is to provide a text classification method, a text classification system, a text classification device, and a computer-readable storage medium that integrate the whole path from text preprocessing to the output of classification results.

To achieve the above object, an embodiment of the invention provides a text classification method, comprising:

Preprocessing the collected text, and segmenting the preprocessed text into words to obtain a word sequence;

Removing the stop words from the word sequence to obtain a current word sequence, and adding the feature words in the current word sequence whose TF-IDF weight exceeds a preset value to a corpus dictionary;

Representing the feature words in the corpus dictionary as vectors using a VSM to obtain a vector matrix;

Inputting the vector matrix into a classification model to train the classification model, so that unknown text can be classified.

The method further comprises:

After the unknown text has been preprocessed, segmented into words, and stripped of stop words, inputting it into the trained classification model, so that the trained classification model outputs the class of the unknown text.

Preprocessing the collected text comprises:

Removing non-body text from the collected text;

Wherein the non-body text includes non-text data and/or interfering data items.

Adding the feature words in the current word sequence whose TF-IDF weight exceeds the preset value to the corpus dictionary comprises:

Calculating the TF and IDF of each feature word in the current word sequence, where TF is the frequency of the feature word in the current text and IDF is the reciprocal of the number of texts containing the feature word;

Taking the product of the TF and the IDF as the TF-IDF weight of the feature word;

Judging whether the TF-IDF weight exceeds the preset value, and if so, adding the feature word to the corpus dictionary.
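The weighting rule above can be restated compactly. The symbols $f_{t,d}$, $n_t$, $w_{t,d}$, $\theta$, and $D$ are notation introduced here for illustration and do not appear in the original text: $f_{t,d}$ is the frequency of feature word $t$ in the current text $d$, $n_t$ is the number of texts containing $t$, $\theta$ is the preset value, and $D$ is the corpus dictionary.

```latex
\mathrm{TF}(t,d) = f_{t,d}, \qquad
\mathrm{IDF}(t) = \frac{1}{n_t}, \qquad
w_{t,d} = \mathrm{TF}(t,d)\cdot\mathrm{IDF}(t), \qquad
w_{t,d} > \theta \;\Rightarrow\; t \in D
```

Note that IDF is defined here as a plain reciprocal; many other TF-IDF formulations use a logarithmic form instead, which is not what this description specifies.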

To achieve the above object, an embodiment of the invention provides a text classification system, comprising:

A preprocessing module, configured to preprocess the collected text and segment the preprocessed text into words to obtain a word sequence;

An adding module, configured to remove the stop words from the word sequence to obtain a current word sequence, and to add the feature words in the current word sequence whose TF-IDF weight exceeds a preset value to a corpus dictionary;

A vectorization module, configured to represent the feature words in the corpus dictionary as vectors using a VSM to obtain a vector matrix;

A training module, configured to input the vector matrix into a classification model to train the classification model, so that unknown text can be classified.

The system further comprises:

An input module, configured to input the unknown text, after it has been preprocessed, segmented into words, and stripped of stop words, into the trained classification model, so that the trained classification model outputs the class of the unknown text.

The preprocessing module specifically comprises:

A first removal unit, configured to remove non-body text from the collected text, wherein the non-body text includes non-text data and/or interfering data items;

A word segmentation unit, configured to segment the preprocessed text into words to obtain a word sequence.

The adding module specifically comprises:

A second removal unit, configured to remove the stop words from the word sequence;

A computing unit, configured to calculate the TF and IDF of each feature word in the current word sequence, where TF is the frequency of the feature word in the current text and IDF is the reciprocal of the number of texts containing the feature word;

A determining unit, configured to take the product of the TF and the IDF as the TF-IDF weight of the feature word;

A judging unit, configured to judge whether the TF-IDF weight exceeds the preset value, and if so, to add the feature word to the corpus dictionary.

To achieve the above object, an embodiment of the invention provides a text classification device, comprising:

A memory, configured to store a text classification program;

A processor, configured to implement the steps of the above text classification method when executing the text classification program.

To achieve the above object, an embodiment of the invention provides a computer-readable storage medium on which a text classification program is stored; when the text classification program is executed by a processor, the above text classification method is implemented.

As the above scheme shows, the text classification method provided by the embodiments of the invention comprises: preprocessing the collected text and segmenting the preprocessed text into words to obtain a word sequence; removing the stop words from the word sequence to obtain a current word sequence, and adding the feature words in the current word sequence whose TF-IDF weight exceeds a preset value to a corpus dictionary; representing the feature words in the corpus dictionary as vectors using a VSM to obtain a vector matrix; and inputting the vector matrix into a classification model to train the classification model, so that unknown text can be classified.

The text classification method provided by the embodiments of the invention thus screens the feature words in the word sequence by TF-IDF weight, represents the screened feature words as vectors using a VSM, and trains the classification model on the resulting vector matrix so that unknown text can be classified, integrating the whole path from text preprocessing to the output of classification results. The invention also discloses a text classification system, a text classification device, and a computer-readable storage medium, which achieve the same technical effect.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of a text classification method disclosed in an embodiment of the invention;

Fig. 2 is a flow chart of another text classification method disclosed in an embodiment of the invention;

Fig. 3 is a structural diagram of a text classification system disclosed in an embodiment of the invention;

Fig. 4 is a structural diagram of a text classification device disclosed in an embodiment of the invention.

Detailed description

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the invention without creative effort fall within the scope of protection of the invention.

An embodiment of the invention discloses a text classification method that integrates the whole path from text preprocessing to the output of classification results.

Referring to Fig. 1, a flow chart of a text classification method disclosed in an embodiment of the invention, the method includes:

S101: Preprocess the collected text, and segment the preprocessed text into words to obtain a word sequence;

In a specific implementation, the collected text is preprocessed, and the preprocessed text is then segmented with a word segmentation tool, cutting the entire news text into a word sequence.

S102: Remove the stop words from the word sequence to obtain a current word sequence, and add the feature words in the current word sequence whose TF-IDF weight exceeds a preset value to a corpus dictionary;

This step cleans out the stop words contained in the segmented news text, reducing their effect on the classification result.
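A minimal sketch of the stop-word cleaning in this step follows. The stop-word list, the function name `remove_stop_words`, and the English sample sentence are assumptions made for illustration; the description does not specify a particular list or language.

```python
# Hypothetical stop-word list; a real system would load a full list
# for the language of the corpus.
STOP_WORDS = {"the", "a", "of", "is", "and"}

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Return the word sequence with stop words filtered out, order preserved."""
    return [w for w in words if w not in stop_words]

# Word sequence as it might come out of the segmentation step (S101).
segmented = ["the", "price", "of", "oil", "is", "rising"]
current_words = remove_stop_words(segmented)
print(current_words)  # ['price', 'oil', 'rising']
```

Keeping the list in a set makes each membership check constant-time, which matters when the same filter is applied to every word of a large corpus.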

S103: Represent the feature words in the corpus dictionary as vectors using a VSM to obtain a vector matrix;

The VSM (vector space model) is the most commonly used similarity computation model and is widely applied in natural language processing. Each word corresponds to one dimension, and the frequency of the word is the value of the feature vector along that dimension. The words of each article and their frequencies thus form a figure in an n-dimensional space, and the similarity of two documents is the closeness of the two figures.
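The vector space representation described above can be sketched as follows. The names `to_vector` and `cosine_similarity`, the sample dictionary, and the sample documents are illustrative assumptions. Cosine similarity is used here as one common way of measuring the "closeness" of two vectors; the description does not name a specific measure.

```python
import math
from collections import Counter

def to_vector(words, dictionary):
    """Map a word sequence to a term-frequency vector, one dimension per dictionary word."""
    counts = Counter(words)
    return [counts[term] for term in dictionary]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical corpus dictionary and three already-cleaned documents.
dictionary = ["oil", "price", "match", "goal"]
doc_a = ["oil", "price", "oil"]
doc_b = ["price", "oil"]
doc_c = ["match", "goal", "goal"]

# One row per document: this is the vector matrix passed on to training.
matrix = [to_vector(d, dictionary) for d in (doc_a, doc_b, doc_c)]
print(matrix)  # [[2, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 2]]
print(cosine_similarity(matrix[0], matrix[2]))  # 0.0
```

Documents that share no dictionary words are orthogonal and score 0.0, while `doc_a` and `doc_b` share both of their words and score close to 1.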

S104: Input the vector matrix into a classification model to train the classification model, so that unknown text can be classified.

The invention places no specific restriction on the classification model, which may include naive Bayes, logistic regression, neural networks, SVM, random forest, and so on.
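Since the classification model is left open, the sketch below picks one of the listed options, naive Bayes, purely as an example. The class name `NaiveBayes`, the add-one smoothing, the toy matrix, and the labels are all assumptions for illustration, not details from the description.

```python
import math
from collections import defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over term-frequency vectors, with add-one smoothing."""

    def fit(self, matrix, labels):
        n_features = len(matrix[0])
        class_docs = defaultdict(int)
        class_term_counts = defaultdict(lambda: [0] * n_features)
        # Accumulate per-class document counts and per-class term counts.
        for row, label in zip(matrix, labels):
            class_docs[label] += 1
            counts = class_term_counts[label]
            for j, value in enumerate(row):
                counts[j] += value
        total_docs = len(labels)
        self.log_prior = {c: math.log(n / total_docs) for c, n in class_docs.items()}
        self.log_likelihood = {}
        for c, counts in class_term_counts.items():
            denom = sum(counts) + n_features  # add-one smoothing
            self.log_likelihood[c] = [math.log((k + 1) / denom) for k in counts]
        return self

    def predict(self, row):
        """Return the class with the highest log posterior for one vector."""
        def score(c):
            return self.log_prior[c] + sum(
                x * lp for x, lp in zip(row, self.log_likelihood[c])
            )
        return max(self.log_prior, key=score)

# Toy vector matrix (rows = texts, columns = dictionary words) and class labels.
matrix = [[2, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
labels = ["finance", "finance", "sports", "sports"]
model = NaiveBayes().fit(matrix, labels)
print(model.predict([1, 0, 0, 0]))  # finance
```

Any of the other listed models could be substituted, provided it accepts the same vector matrix as input.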

The text classification method provided by the embodiments of the invention screens the feature words in the word sequence by TF-IDF weight, represents the screened feature words as vectors using a VSM, and trains the classification model on the resulting vector matrix so that unknown text can be classified, integrating the whole path from text preprocessing to the output of classification results.

On the basis of the above embodiment, the method preferably further includes:

When classifying text, the unknown text to be classified is preprocessed, segmented into words, and stripped of stop words, and is then input into the classification model trained in the above embodiment, which outputs the class of the unknown text.
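The prediction path can be sketched as below. Everything named here is hypothetical: `classify_unknown`, the whitespace split standing in for a real word segmentation tool, the URL regex standing in for preprocessing, and `toy_predict` standing in for a trained model. The sketch also assumes that the unknown text is vectorized over the same corpus dictionary that was built during training, so that its dimensions match those the model was trained on.

```python
import re
from collections import Counter

def classify_unknown(text, dictionary, stop_words, predict):
    """Run unknown text through the training-time steps, then call the trained model."""
    cleaned = re.sub(r"https?://\S+", " ", text)        # preprocessing (stand-in)
    words = cleaned.lower().split()                      # word segmentation (stand-in)
    words = [w for w in words if w not in stop_words]    # stop-word removal
    counts = Counter(words)
    vector = [counts[term] for term in dictionary]       # same dimensions as training
    return predict(vector)

def toy_predict(vector):
    """Stand-in for a trained classifier: compares counts of two word groups."""
    return "finance" if vector[0] + vector[1] >= vector[2] + vector[3] else "sports"

dictionary = ["oil", "price", "match", "goal"]
result = classify_unknown(
    "The oil price rose http://example.com/a", dictionary, {"the"}, toy_predict
)
print(result)  # finance
```

Passing the model in as a callable keeps the preparation steps independent of whichever classification model was chosen at training time.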

An embodiment of the invention discloses another text classification method. Relative to the previous embodiment, this embodiment further explains and optimizes the technical solution. Specifically:

Referring to Fig. 2, a flow chart of another text classification method provided by an embodiment of the invention, the method includes:

S211: Remove non-body text from the collected text, wherein the non-body text includes non-text data and/or interfering data items.

In a specific implementation, the original news text corpus often contains a large amount of non-body text such as non-text data or interfering data items, for example URL links and pictures, which hinders later feature extraction. Such data therefore needs to be processed so that only the body of the news item is retained.
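One way to remove such items is with regular expressions, sketched below. The patterns, the function name `strip_non_body`, and the assumption that pictures appear as HTML `<img>` tags are illustrative choices; the description does not say how the non-body text is detected.

```python
import re

# Assumed forms of non-body items: http(s) links and HTML image tags.
URL_PATTERN = re.compile(r"https?://\S+")
IMG_TAG_PATTERN = re.compile(r"<img\b[^>]*>", re.IGNORECASE)

def strip_non_body(text):
    """Remove image tags and URLs, then collapse the leftover whitespace."""
    text = IMG_TAG_PATTERN.sub(" ", text)  # tags first, so URLs inside src=... go with them
    text = URL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

raw = 'Oil prices rose today <img src="chart.png"> see http://example.com/news for details'
print(strip_non_body(raw))  # Oil prices rose today see for details
```

Image tags are removed before URLs so that a link inside a `src` attribute does not leave a dangling fragment of the tag behind.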

S212: Segment the preprocessed text into words to obtain a word sequence;

S221: Remove the stop words from the word sequence to obtain a current word sequence;

S222: Calculate the TF and IDF of each feature word in the current word sequence, where TF is the frequency of the feature word in the current text and IDF is the reciprocal of the number of texts containing the feature word;

S223: Take the product of the TF and the IDF as the TF-IDF weight of the feature word;

S224: Judge whether the TF-IDF weight exceeds the preset value, and if so, add the feature word to the corpus dictionary;

S203: Represent the feature words in the corpus dictionary as vectors using a VSM to obtain a vector matrix;

S204: Input the vector matrix into a classification model to train the classification model, so that unknown text can be classified.
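Steps S222 to S224 can be sketched as follows, using the definitions given in those steps: TF as the frequency of the word in the current text and IDF as the reciprocal of the number of texts containing it. The function name `build_corpus_dictionary`, the use of a raw count for TF, the sample texts, and the threshold of 0.5 are assumptions for illustration.

```python
from collections import Counter

def build_corpus_dictionary(texts, threshold):
    """Collect words whose TF-IDF weight exceeds the threshold in at least one text.

    texts: list of word sequences with stop words already removed.
    """
    # Number of texts containing each word (each text counts once per word).
    doc_count = Counter()
    for words in texts:
        doc_count.update(set(words))

    dictionary = set()
    for words in texts:
        tf = Counter(words)                  # S222: word frequency in the current text
        for word, freq in tf.items():
            idf = 1.0 / doc_count[word]      # S222: reciprocal of the text count
            weight = freq * idf              # S223: TF-IDF weight
            if weight > threshold:           # S224: compare with the preset value
                dictionary.add(word)
    return sorted(dictionary)

texts = [
    ["oil", "price", "oil", "market"],
    ["match", "goal", "goal", "market"],
    ["market", "report"],
]
print(build_corpus_dictionary(texts, threshold=0.5))
# ['goal', 'match', 'oil', 'price', 'report']
```

In this toy corpus "market" appears in all three texts, so its weight is 1/3 in each and it is excluded, while words confined to a single text pass the threshold.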

A text classification system provided by an embodiment of the invention is introduced below. The text classification system described below and the text classification method described above may be referred to one another.

Referring to Fig. 3, a structural diagram of a text classification system provided by an embodiment of the invention, the system includes:

A preprocessing module 301, configured to preprocess the collected text and segment the preprocessed text into words to obtain a word sequence;

An adding module 302, configured to remove the stop words from the word sequence to obtain a current word sequence, and to add the feature words in the current word sequence whose TF-IDF weight exceeds a preset value to a corpus dictionary;

A vectorization module 303, configured to represent the feature words in the corpus dictionary as vectors using a VSM to obtain a vector matrix;

A training module 304, configured to input the vector matrix into a classification model to train the classification model, so that unknown text can be classified.

The text classification system provided by the embodiments of the invention screens the feature words in the word sequence by TF-IDF weight, represents the screened feature words as vectors using a VSM, and trains the classification model on the resulting vector matrix so that unknown text can be classified, integrating the whole path from text preprocessing to the output of classification results.

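On the basis of the above embodiment, the system preferably further includes:

An input module, configured to input the unknown text, after it has been preprocessed, segmented into words, and stripped of stop words, into the trained classification model, so that the trained classification model outputs the class of the unknown text.

On the basis of the above embodiment, the preprocessing module preferably includes:

A first removal unit, configured to remove non-body text from the collected text, wherein the non-body text includes non-text data and/or interfering data items;

A word segmentation unit, configured to segment the preprocessed text into words to obtain a word sequence.

On the basis of the above embodiment, the adding module preferably includes:

A second removal unit, configured to remove the stop words from the word sequence;

A computing unit, configured to calculate the TF and IDF of each feature word in the current word sequence, where TF is the frequency of the feature word in the current text and IDF is the reciprocal of the number of texts containing the feature word;

A determining unit, configured to take the product of the TF and the IDF as the TF-IDF weight of the feature word;

A judging unit, configured to judge whether the TF-IDF weight exceeds the preset value, and if so, to add the feature word to the corpus dictionary.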

The present application also provides a text classification device. Referring to Fig. 4, a structural diagram of a text classification device provided by an embodiment of the invention, the device includes:

A memory 401, configured to store a text classification program;

A processor 402, configured to implement the steps provided by the above embodiments when executing the text classification program. Of course, the text classification device may also include components such as various network interfaces and a power supply.

The text classification device provided by the embodiments of the invention screens the feature words in the word sequence by TF-IDF weight, represents the screened feature words as vectors using a VSM, and trains the classification model on the resulting vector matrix so that unknown text can be classified, integrating the whole path from text preprocessing to the output of classification results.

The present application also provides a computer-readable storage medium on which a text classification program is stored; when executed by a processor, the text classification program implements the steps provided by the above embodiments. The storage medium may include a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, or any other medium that can store program code.

The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be referred to one another.

The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A text classification method, characterized by comprising:

Preprocessing the collected text, and segmenting the preprocessed text into words to obtain a word sequence;

Removing the stop words from the word sequence to obtain a current word sequence, and adding the feature words in the current word sequence whose TF-IDF weight exceeds a preset value to a corpus dictionary;

Representing the feature words in the corpus dictionary as vectors using a VSM to obtain a vector matrix;

Inputting the vector matrix into a classification model to train the classification model, so that unknown text can be classified.

2. The text classification method according to claim 1, characterized by further comprising:

After the unknown text has been preprocessed, segmented into words, and stripped of stop words, inputting it into the trained classification model, so that the trained classification model outputs the class of the unknown text.

3. The text classification method according to claim 1, characterized in that preprocessing the collected text comprises:

Removing non-body text from the collected text;

Wherein the non-body text includes non-text data and/or interfering data items.

4. The text classification method according to any one of claims 1 to 3, characterized in that adding the feature words in the current word sequence whose TF-IDF weight exceeds the preset value to the corpus dictionary comprises:

Calculating the TF and IDF of each feature word in the current word sequence, where TF is the frequency of the feature word in the current text and IDF is the reciprocal of the number of texts containing the feature word;

Taking the product of the TF and the IDF as the TF-IDF weight of the feature word;

Judging whether the TF-IDF weight exceeds the preset value, and if so, adding the feature word to the corpus dictionary.

5. A text classification system, characterized by comprising:

A preprocessing module, configured to preprocess the collected text and segment the preprocessed text into words to obtain a word sequence;

An adding module, configured to remove the stop words from the word sequence to obtain a current word sequence, and to add the feature words in the current word sequence whose TF-IDF weight exceeds a preset value to a corpus dictionary;

A vectorization module, configured to represent the feature words in the corpus dictionary as vectors using a VSM to obtain a vector matrix;

A training module, configured to input the vector matrix into a classification model to train the classification model, so that unknown text can be classified.

6. The text classification system according to claim 5, characterized by further comprising:

An input module, configured to input the unknown text, after it has been preprocessed, segmented into words, and stripped of stop words, into the trained classification model, so that the trained classification model outputs the class of the unknown text.

7. The text classification system according to claim 5, characterized in that the preprocessing module specifically comprises:

A first removal unit, configured to remove non-body text from the collected text, wherein the non-body text includes non-text data and/or interfering data items;

A word segmentation unit, configured to segment the preprocessed text into words to obtain a word sequence.

8. The text classification system according to any one of claims 5 to 7, characterized in that the adding module specifically comprises:

A second removal unit, configured to remove the stop words from the word sequence;

A computing unit, configured to calculate the TF and IDF of each feature word in the current word sequence, where TF is the frequency of the feature word in the current text and IDF is the reciprocal of the number of texts containing the feature word;

A determining unit, configured to take the product of the TF and the IDF as the TF-IDF weight of the feature word;

A judging unit, configured to judge whether the TF-IDF weight exceeds the preset value, and if so, to add the feature word to the corpus dictionary.

9. A text classification device, characterized by comprising:

A memory, configured to store a text classification program;

A processor, configured to implement the steps of the text classification method according to any one of claims 1 to 4 when executing the text classification program.

10. A computer-readable storage medium, characterized in that a text classification program is stored on the computer-readable storage medium, and when the text classification program is executed by a processor, the text classification method according to any one of claims 1 to 4 is implemented.