
CN107862051A - Text classification method, system, and text classification device - Google Patents

Info

Publication number
CN107862051A
Authority
CN
China
Prior art keywords
text
sequence
feature words
idf
terms
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711091476.0A
Other languages
Chinese (zh)
Inventor
毕银龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd filed Critical Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201711091476.0A priority Critical patent/CN107862051A/en
Publication of CN107862051A publication Critical patent/CN107862051A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text classification method comprising: preprocessing collected text and performing word segmentation on the preprocessed text to obtain a word sequence; removing the stop words from the word sequence to obtain a current word sequence, and adding the feature words in the current word sequence whose TF-IDF weight is greater than a preset value to a corpus dictionary; vectorizing the feature words in the corpus dictionary using a VSM model to obtain a vector matrix; and inputting the vector matrix into a classification model to train the classification model, so that unknown text can be classified. The text classification method provided by the embodiments of the invention thus integrates the whole process from text preprocessing to the output of classification results. The invention also discloses a text classification system, a text classification device, and a computer-readable storage medium, which achieve the same technical effect.

Description

Text classification method, system, and text classification device
Technical field
The present invention relates to the field of text classification technology, and more specifically to a text classification method, a text classification system, a text classification device, and a computer-readable storage medium.
Background
In the current era of big data, the network is flooded with massive amounts of data. How to use this data, process it further, and mine useful information from it to create value is a research hotspot in big data and data mining. Using classification algorithms to identify the data a user is interested in is a common data mining technique.
Text classification methods in the prior art mainly comprise: word segmentation, feature extraction, vectorized text representation, classification model training, and class prediction. Improvements to the classification process have concentrated on feature extraction, vectorized text representation, and the classification algorithm. The main feature extraction methods include TF, TF-IDF, and information gain; the main vectorized representations include the vector space model (VSM), word vectors, and agent models; the main classification algorithms include naive Bayes, logistic regression, SVM, random forest, and neural networks. Current research on text classification algorithms mostly uses word2vec to produce low-dimensional word-vector representations of the feature items, and then classifies the text with a deep learning model.
Therefore, how to integrate the whole process from text preprocessing to the output of classification results is a problem that those skilled in the art need to solve.
Summary of the invention
The object of the invention is to provide a text classification method, a text classification system, a text classification device, and a computer-readable storage medium that integrate the whole process from text preprocessing to the output of classification results.
To achieve the above object, an embodiment of the invention provides a text classification method, comprising:
preprocessing the collected text, and performing word segmentation on the preprocessed text to obtain a word sequence;
removing the stop words from the word sequence to obtain a current word sequence, and adding the feature words in the current word sequence whose TF-IDF weight is greater than a preset value to a corpus dictionary;
vectorizing the feature words in the corpus dictionary using a VSM model to obtain a vector matrix;
inputting the vector matrix into a classification model to train the classification model, so as to classify unknown text.
The method further comprises:
after the unknown text has undergone preprocessing, word segmentation, and stop-word removal, inputting it into the trained classification model, so that the trained classification model outputs the class of the unknown text.
Preprocessing the collected text comprises:
removing non-body text from the collected text;
wherein the non-body text includes non-text data and/or interfering data items.
Adding the feature words in the current word sequence whose TF-IDF weight is greater than the preset value to the corpus dictionary comprises:
calculating the TF and IDF of each feature word in the current word sequence, where TF is the word frequency of the feature word in the current text and IDF is the reciprocal of the number of texts containing the feature word;
taking the product of the TF and the IDF as the TF-IDF weight of the feature word;
judging whether the TF-IDF weight is greater than the preset value, and if so, adding the feature word to the corpus dictionary.
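The weighting rule above can be written as a formula. The symbols t (feature word), d (current text), n_t (number of texts containing t), and θ (the preset value) are notation introduced here for illustration and do not appear in the original text.

```latex
\mathrm{IDF}(t) = \frac{1}{n_t}, \qquad
w(t, d) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t), \qquad
t \text{ is added to the corpus dictionary if } w(t, d) > \theta
```

Note that this follows the reciprocal definition given in the text; many TF-IDF formulations instead use a logarithmic IDF such as log(N / n_t), where N is the total number of texts.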
To achieve the above object, an embodiment of the invention provides a text classification system, comprising:
a preprocessing module, configured to preprocess the collected text and perform word segmentation on the preprocessed text to obtain a word sequence;
an adding module, configured to remove the stop words from the word sequence to obtain a current word sequence, and to add the feature words in the current word sequence whose TF-IDF weight is greater than a preset value to a corpus dictionary;
a vectorization module, configured to vectorize the feature words in the corpus dictionary using a VSM model to obtain a vector matrix;
a training module, configured to input the vector matrix into a classification model to train the classification model, so as to classify unknown text.
The system further comprises:
an input module, configured to input the unknown text, after preprocessing, word segmentation, and stop-word removal, into the trained classification model, so that the trained classification model outputs the class of the unknown text.
The preprocessing module specifically comprises:
a first removal unit, configured to remove non-body text from the collected text, wherein the non-body text includes non-text data and/or interfering data items;
a word segmentation unit, configured to perform word segmentation on the preprocessed text to obtain a word sequence.
The adding module specifically comprises:
a second removal unit, configured to remove the stop words from the word sequence;
a calculation unit, configured to calculate the TF and IDF of each feature word in the current word sequence, where TF is the word frequency of the feature word in the current text and IDF is the reciprocal of the number of texts containing the feature word;
a determining unit, configured to take the product of the TF and the IDF as the TF-IDF weight of the feature word;
a judging unit, configured to judge whether the TF-IDF weight is greater than the preset value, and if so, to add the feature word to the corpus dictionary.
To achieve the above object, an embodiment of the invention provides a text classification device, comprising:
a memory, configured to store a text classification program;
a processor, configured to implement the steps of the above text classification method when executing the text classification program.
To achieve the above object, an embodiment of the invention provides a computer-readable storage medium storing a text classification program which, when executed by a processor, implements the above text classification method.
According to the above scheme, the text classification method provided by an embodiment of the invention comprises: preprocessing the collected text, and performing word segmentation on the preprocessed text to obtain a word sequence; removing the stop words from the word sequence to obtain a current word sequence, and adding the feature words in the current word sequence whose TF-IDF weight is greater than a preset value to a corpus dictionary; vectorizing the feature words in the corpus dictionary using a VSM model to obtain a vector matrix; and inputting the vector matrix into a classification model to train the classification model, so as to classify unknown text.
It can be seen that the text classification method provided by the embodiments of the invention screens the feature words in the word sequence by TF-IDF weight, vectorizes the screened feature words using a VSM model, and trains a classification model on the resulting vector matrix so as to classify unknown text, thereby integrating the whole process from text preprocessing to the output of classification results. The invention also discloses a text classification system, a text classification device, and a computer-readable storage medium, which achieve the same technical effect.
Brief description of the drawings
To describe the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a text classification method disclosed in an embodiment of the invention;
Fig. 2 is a flowchart of another text classification method disclosed in an embodiment of the invention;
Fig. 3 is a structural diagram of a text classification system disclosed in an embodiment of the invention;
Fig. 4 is a structural diagram of a text classification device disclosed in an embodiment of the invention.
Detailed description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the invention without creative effort fall within the scope of protection of the invention.
An embodiment of the invention discloses a text classification method that integrates the whole process from text preprocessing to the output of classification results.
Referring to Fig. 1, a flowchart of a text classification method disclosed in an embodiment of the invention, the method comprises:
S101: Preprocess the collected text, and perform word segmentation on the preprocessed text to obtain a word sequence;
In a specific implementation, the collected text is preprocessed and the preprocessed text is segmented with a word segmentation tool, which cuts an entire news text into a word sequence.
S102: Remove the stop words from the word sequence to obtain a current word sequence, and add the feature words in the current word sequence whose TF-IDF weight is greater than a preset value to a corpus dictionary;
This step cleans out the series of stop words (for example " ", " ", and " ") contained in the segmented news text, reducing their impact on the classification result.
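Segmentation followed by stop-word removal can be sketched as below. This is a minimal illustration, not the method of the original text: the regular-expression `segment` function is a stand-in for a real word segmentation tool, and the `STOP_WORDS` set is a hypothetical English list chosen only for the example.

```python
import re

# Hypothetical stop-word list; a real system would load a full list
# for the target language.
STOP_WORDS = {"the", "a", "of", "and", "is"}

def segment(text):
    """Stand-in for a word segmentation tool: lowercase, split on non-word characters."""
    return re.findall(r"\w+", text.lower())

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Return the current word sequence: the input sequence without stop words."""
    return [w for w in words if w not in stop_words]

words = segment("The price of the stock is rising")
current = remove_stop_words(words)
print(current)  # ['price', 'stock', 'rising']
```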
S103: Vectorize the feature words in the corpus dictionary using a VSM model to obtain a vector matrix;
The VSM (vector space model) is the most commonly used similarity calculation model and is widely applied in natural language processing. Each word corresponds to one dimension, and the frequency of the word is the value of the feature vector in that dimension, so the words of each article and their frequencies form a figure in n-dimensional space; the similarity of two documents is the closeness of the two figures.
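A minimal sketch of this vectorization, assuming a small hypothetical corpus dictionary and using cosine similarity as the closeness measure. The original text does not name a specific measure, so cosine is an assumption here, as are the function names.

```python
import math
from collections import Counter

def vectorize(words, dictionary):
    """Map a word sequence to a frequency vector, one dimension per dictionary word."""
    counts = Counter(words)
    return [counts[term] for term in dictionary]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

dictionary = ["market", "stock", "match", "goal"]  # hypothetical corpus dictionary
docs = [
    ["stock", "market", "stock"],
    ["market", "stock"],
    ["match", "goal", "goal"],
]
matrix = [vectorize(d, dictionary) for d in docs]  # vector matrix, one row per text
print(matrix)  # [[1, 2, 0, 0], [1, 1, 0, 0], [0, 0, 1, 2]]
```

Rows 0 and 1 share vocabulary and score close to 1, while rows 0 and 2 share none and score 0.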
S104: Input the vector matrix into a classification model to train the classification model, so as to classify unknown text.
The invention places no specific restriction on the classification model, which may include Bayes, logistic regression, neural networks, SVM, random forest, and so on.
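As one possible instance of such a model, the sketch below trains a multinomial naive Bayes classifier with add-one smoothing on a toy vector matrix. The class name `NaiveBayes`, the labels, and the data are all hypothetical; the original text leaves the choice of model open.

```python
import math
from collections import defaultdict

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over frequency vectors."""

    def fit(self, matrix, labels):
        n_features = len(matrix[0])
        totals = defaultdict(lambda: [0] * n_features)  # per-class feature counts
        class_counts = defaultdict(int)
        for row, label in zip(matrix, labels):
            class_counts[label] += 1
            for i, value in enumerate(row):
                totals[label][i] += value
        n = len(labels)
        self.log_prior = {c: math.log(k / n) for c, k in class_counts.items()}
        self.log_likelihood = {}
        for c, counts in totals.items():
            denom = sum(counts) + n_features  # add-one smoothing
            self.log_likelihood[c] = [math.log((x + 1) / denom) for x in counts]
        return self

    def predict(self, row):
        """Return the class with the highest log posterior for one vector."""
        def score(c):
            return self.log_prior[c] + sum(
                v * ll for v, ll in zip(row, self.log_likelihood[c])
            )
        return max(self.log_prior, key=score)

# Toy vector matrix over a hypothetical dictionary ["market", "stock", "match", "goal"].
matrix = [[1, 2, 0, 0], [1, 1, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
labels = ["finance", "finance", "sports", "sports"]
model = NaiveBayes().fit(matrix, labels)
print(model.predict([0, 1, 0, 0]))  # finance
```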
The text classification method provided by the embodiments of the invention screens the feature words in the word sequence by TF-IDF weight, vectorizes the screened feature words using a VSM model, and trains a classification model on the resulting vector matrix so as to classify unknown text, thereby integrating the whole process from text preprocessing to the output of classification results.
On the basis of the above embodiment, as a preferred implementation, the method further comprises:
after the unknown text has undergone preprocessing, word segmentation, and stop-word removal, inputting it into the trained classification model, so that the trained classification model outputs the class of the unknown text.
When classifying text, the unknown text to be classified undergoes preprocessing, word segmentation, and stop-word removal and is then input into the classification model trained in the above embodiment, which outputs the class of the unknown text.
An embodiment of the invention discloses a text classification method. Relative to the previous embodiment, this embodiment further explains and optimizes the technical solution. Specifically:
Referring to Fig. 2, a flowchart of another text classification method provided by an embodiment of the invention, the method comprises:
S211: Remove non-body text from the collected text, wherein the non-body text includes non-text data and/or interfering data items.
In a specific implementation, the original news text corpus often contains a large amount of non-body text such as non-text data or interfering data items, for example URL links and pictures, which hinders later feature extraction. Such data therefore needs to be processed so that only the main body of the news is retained.
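A minimal sketch of this cleaning step, assuming the interfering items are HTML tags (such as image tags) and http(s) URLs. The patterns and the function name `strip_non_body` are illustrative assumptions; real news corpora would likely need more rules.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")          # HTML tags such as <img ...>
URL_RE = re.compile(r"https?://\S+")     # bare URL links

def strip_non_body(raw):
    """Remove HTML tags first, then URLs, then collapse runs of whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = URL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

raw = 'Shares rose today <img src="chart.png"> see http://example.com/news for details'
print(strip_non_body(raw))  # Shares rose today see for details
```

Tags are stripped before URLs so that a URL inside a tag attribute is removed together with its tag rather than leaving a broken tag behind.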
S212: Perform word segmentation on the preprocessed text to obtain a word sequence;
S221: Remove the stop words from the word sequence to obtain a current word sequence;
S222: Calculate the TF and IDF of each feature word in the current word sequence, where TF is the word frequency of the feature word in the current text and IDF is the reciprocal of the number of texts containing the feature word;
S223: Take the product of the TF and the IDF as the TF-IDF weight of the feature word;
S224: Judge whether the TF-IDF weight is greater than the preset value, and if so, add the feature word to the corpus dictionary;
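Steps S222 to S224 can be sketched as follows. The sketch takes TF to be the raw count of the word in the current text, which is an assumption since the text only says "word frequency", and uses the reciprocal IDF exactly as stated. The function name and sample data are hypothetical.

```python
from collections import Counter

def build_corpus_dictionary(texts, threshold):
    """Collect feature words whose TF-IDF weight exceeds threshold in any text.

    TF: raw count of the word in the current text (assumption).
    IDF: 1 / (number of texts containing the word), as the description states.
    """
    doc_freq = Counter()
    for words in texts:
        doc_freq.update(set(words))  # count each word once per text
    dictionary = set()
    for words in texts:
        for word, tf in Counter(words).items():
            weight = tf * (1.0 / doc_freq[word])
            if weight > threshold:
                dictionary.add(word)
    return sorted(dictionary)

texts = [
    ["stock", "market", "stock", "report"],
    ["match", "goal", "goal", "report"],
    ["market", "report"],
]
print(build_corpus_dictionary(texts, threshold=0.5))  # ['goal', 'match', 'stock']
```

Here "report" appears in all three texts (weight 1/3) and "market" in two (weight 0.5, not strictly greater than the threshold), so both are excluded.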
S203: Vectorize the feature words in the corpus dictionary using a VSM model to obtain a vector matrix;
S204: Input the vector matrix into a classification model to train the classification model, so as to classify unknown text.
A text classification system provided by an embodiment of the invention is introduced below. The text classification system described below and the text classification method described above may be cross-referenced.
Referring to Fig. 3, a structural diagram of a text classification system provided by an embodiment of the invention, the system comprises:
a preprocessing module 301, configured to preprocess the collected text and perform word segmentation on the preprocessed text to obtain a word sequence;
an adding module 302, configured to remove the stop words from the word sequence to obtain a current word sequence, and to add the feature words in the current word sequence whose TF-IDF weight is greater than a preset value to a corpus dictionary;
a vectorization module 303, configured to vectorize the feature words in the corpus dictionary using a VSM model to obtain a vector matrix;
a training module 304, configured to input the vector matrix into a classification model to train the classification model, so as to classify unknown text.
The text classification system provided by the embodiments of the invention screens the feature words in the word sequence by TF-IDF weight, vectorizes the screened feature words using a VSM model, and trains a classification model on the resulting vector matrix so as to classify unknown text, thereby integrating the whole process from text preprocessing to the output of classification results.
On the basis of the above embodiment, as a preferred implementation, the system further comprises:
an input module, configured to input the unknown text, after preprocessing, word segmentation, and stop-word removal, into the trained classification model, so that the trained classification model outputs the class of the unknown text.
On the basis of the above embodiment, as a preferred implementation, the preprocessing module specifically comprises:
a first removal unit, configured to remove non-body text from the collected text, wherein the non-body text includes non-text data and/or interfering data items;
a word segmentation unit, configured to perform word segmentation on the preprocessed text to obtain a word sequence.
On the basis of the above embodiment, as a preferred implementation, the adding module specifically comprises:
a second removal unit, configured to remove the stop words from the word sequence;
a calculation unit, configured to calculate the TF and IDF of each feature word in the current word sequence, where TF is the word frequency of the feature word in the current text and IDF is the reciprocal of the number of texts containing the feature word;
a determining unit, configured to take the product of the TF and the IDF as the TF-IDF weight of the feature word;
a judging unit, configured to judge whether the TF-IDF weight is greater than the preset value, and if so, to add the feature word to the corpus dictionary.
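One way to picture how the modules fit together at prediction time is a small pipeline object whose stages are plain callables. This is only a structural sketch under assumed names: `TextClassificationSystem`, its fields, and the lambda stand-ins are hypothetical and do not come from the original text.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TextClassificationSystem:
    """Chains the modules; each stage is a callable supplied by the caller."""
    preprocess: Callable[[str], List[str]]          # preprocessing module: clean + segment
    filter_words: Callable[[List[str]], List[str]]  # adding module: stop-word removal
    vectorize: Callable[[List[str]], List[int]]     # vectorization module
    classify: Callable[[List[int]], str]            # trained classification model

    def predict(self, raw_text: str) -> str:
        """Input module: run unknown text through the same stages used in training."""
        words = self.filter_words(self.preprocess(raw_text))
        return self.classify(self.vectorize(words))

# Hypothetical stand-in stages, for illustration only.
dictionary = ["stock", "goal"]
system = TextClassificationSystem(
    preprocess=lambda text: text.lower().split(),
    filter_words=lambda words: [w for w in words if w not in {"the", "a"}],
    vectorize=lambda words: [words.count(t) for t in dictionary],
    classify=lambda vec: "finance" if vec[0] >= vec[1] else "sports",
)
print(system.predict("The stock fell"))  # finance
```

Keeping the stages as injected callables means the unknown text is guaranteed to pass through the same preprocessing, stop-word removal, and vectorization as the training data.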
The application also provides a text classification device. Referring to Fig. 4, a structural diagram of a text classification device provided by an embodiment of the invention, the device comprises:
a memory 401, configured to store a text classification program;
a processor 402, configured to implement the steps provided by the above embodiments when executing the text classification program. Of course, the text classification device may also include components such as various network interfaces and a power supply.
The text classification device provided by the embodiments of the invention screens the feature words in the word sequence by TF-IDF weight, vectorizes the screened feature words using a VSM model, and trains a classification model on the resulting vector matrix so as to classify unknown text, thereby integrating the whole process from text preprocessing to the output of classification results.
The application also provides a computer-readable storage medium on which a text classification program is stored; when executed by a processor, the program implements the steps provided by the above embodiments. The storage medium may include a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, or any other medium that can store program code.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be cross-referenced.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. The invention is therefore not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

  1. A text classification method, characterized by comprising:
    preprocessing the collected text, and performing word segmentation on the preprocessed text to obtain a word sequence;
    removing the stop words from the word sequence to obtain a current word sequence, and adding the feature words in the current word sequence whose TF-IDF weight is greater than a preset value to a corpus dictionary;
    vectorizing the feature words in the corpus dictionary using a VSM model to obtain a vector matrix;
    inputting the vector matrix into a classification model to train the classification model, so as to classify unknown text.
  2. The text classification method according to claim 1, characterized by further comprising:
    after the unknown text has undergone preprocessing, word segmentation, and stop-word removal, inputting it into the trained classification model, so that the trained classification model outputs the class of the unknown text.
  3. The text classification method according to claim 1, characterized in that preprocessing the collected text comprises:
    removing non-body text from the collected text;
    wherein the non-body text includes non-text data and/or interfering data items.
  4. The text classification method according to any one of claims 1-3, characterized in that adding the feature words in the current word sequence whose TF-IDF weight is greater than the preset value to the corpus dictionary comprises:
    calculating the TF and IDF of each feature word in the current word sequence, where TF is the word frequency of the feature word in the current text and IDF is the reciprocal of the number of texts containing the feature word;
    taking the product of the TF and the IDF as the TF-IDF weight of the feature word;
    judging whether the TF-IDF weight is greater than the preset value, and if so, adding the feature word to the corpus dictionary.
  5. A text classification system, characterized by comprising:
    a preprocessing module, configured to preprocess the collected text and perform word segmentation on the preprocessed text to obtain a word sequence;
    an adding module, configured to remove the stop words from the word sequence to obtain a current word sequence, and to add the feature words in the current word sequence whose TF-IDF weight is greater than a preset value to a corpus dictionary;
    a vectorization module, configured to vectorize the feature words in the corpus dictionary using a VSM model to obtain a vector matrix;
    a training module, configured to input the vector matrix into a classification model to train the classification model, so as to classify unknown text.
  6. The text classification system according to claim 5, characterized by further comprising:
    an input module, configured to input the unknown text, after preprocessing, word segmentation, and stop-word removal, into the trained classification model, so that the trained classification model outputs the class of the unknown text.
  7. The text classification system according to claim 5, characterized in that the preprocessing module specifically comprises:
    a first removal unit, configured to remove non-body text from the collected text, wherein the non-body text includes non-text data and/or interfering data items;
    a word segmentation unit, configured to perform word segmentation on the preprocessed text to obtain a word sequence.
  8. The text classification system according to any one of claims 5-7, characterized in that the adding module specifically comprises:
    a second removal unit, configured to remove the stop words from the word sequence;
    a calculation unit, configured to calculate the TF and IDF of each feature word in the current word sequence, where TF is the word frequency of the feature word in the current text and IDF is the reciprocal of the number of texts containing the feature word;
    a determining unit, configured to take the product of the TF and the IDF as the TF-IDF weight of the feature word;
    a judging unit, configured to judge whether the TF-IDF weight is greater than the preset value, and if so, to add the feature word to the corpus dictionary.
  9. A text classification device, characterized by comprising:
    a memory, configured to store a text classification program;
    a processor, configured to implement the steps of the text classification method according to any one of claims 1 to 4 when executing the text classification program.
  10. A computer-readable storage medium, characterized in that a text classification program is stored on the computer-readable storage medium, and the text classification program, when executed by a processor, implements the text classification method according to any one of claims 1 to 4.
CN201711091476.0A 2017-11-08 2017-11-08 A kind of file classifying method, system and a kind of document classification equipment Pending CN107862051A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711091476.0A CN107862051A (en) 2017-11-08 2017-11-08 A kind of file classifying method, system and a kind of document classification equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711091476.0A CN107862051A (en) 2017-11-08 2017-11-08 A kind of file classifying method, system and a kind of document classification equipment

Publications (1)

Publication Number Publication Date
CN107862051A true CN107862051A (en) 2018-03-30

Family

ID=61701264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711091476.0A Pending CN107862051A (en) 2017-11-08 2017-11-08 A kind of file classifying method, system and a kind of document classification equipment

Country Status (1)

Country Link
CN (1) CN107862051A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095996A (en) * 2016-06-22 2016-11-09 量子云未来(北京)信息科技有限公司 Method for text classification
CN106250372A (en) * 2016-08-17 2016-12-21 国网上海市电力公司 A kind of Chinese electric power data text mining method for power system
CN106503254A (en) * 2016-11-11 2017-03-15 上海智臻智能网络科技股份有限公司 Language material sorting technique, device and terminal
CN106528642A (en) * 2016-10-13 2017-03-22 广东广业开元科技有限公司 TF-IDF feature extraction based short text classification method
CN107145560A (en) * 2017-05-02 2017-09-08 北京邮电大学 A kind of file classification method and device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108563722A (en) * 2018-04-03 2018-09-21 有米科技股份有限公司 Trade classification method, system, computer equipment and the storage medium of text message
CN109189920A (en) * 2018-08-02 2019-01-11 上海欣方智能系统有限公司 Sweep-black case classification method and system
CN109684121A (en) * 2018-12-20 2019-04-26 鸿秦(北京)科技有限公司 A kind of file access pattern method and system
CN109977327A (en) * 2019-03-20 2019-07-05 新华三信息安全技术有限公司 A kind of Web page classification method and device
CN111177386A (en) * 2019-12-27 2020-05-19 安徽商信政通信息技术股份有限公司 Proposal classification method and system
CN111177386B (en) * 2019-12-27 2021-05-14 安徽商信政通信息技术股份有限公司 Proposal classification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180330