CN107862051A - Text classification method, system, and text classification device - Google Patents
- Publication number
- CN107862051A (application CN201711091476.0A)
- Authority
- CN
- China
- Prior art keywords
- text
- sequence
- feature words
- idf
- terms
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a text classification method comprising: preprocessing the collected text and performing word segmentation on the preprocessed text to obtain a word sequence; removing the stop words from the word sequence to obtain a current word sequence, and adding the feature words in the current word sequence whose TF-IDF weight is greater than a preset value to a corpus dictionary; using a vector space model (VSM) to represent the feature words in the corpus dictionary as vectors, obtaining a vector matrix; and inputting the vector matrix into a classification model to train it, so that unknown text can be classified. The text classification method provided by the embodiments of the invention thus integrates the whole process from text preprocessing to the output of classification results. The invention also discloses a text classification system, a text classification device, and a computer-readable storage medium, which achieve the same technical effect.
Description
Technical field
The present invention relates to the field of text classification, and more specifically to a text classification method and system, a text classification device, and a computer-readable storage medium.
Background
In the current era of big data, the network holds massive amounts of data. How to use these data, process them further, and mine useful information from them to create value is a research focus in big data and data mining. Using classification algorithms to identify the data a user is interested in is a common data mining technique.
Text classification in the prior art mainly comprises: word segmentation, feature extraction, vectorized text representation, classification model training, and class prediction. Improvements to the classification process concentrate on feature extraction, vectorized text representation, and the classification algorithm. The main feature extraction methods include TF, TF-IDF, and information gain; the main vectorized representations include the vector space model (VSM), word vectors, and agent models; the main classification algorithms include naive Bayes, logistic regression, SVM, random forest, and neural networks. Current research on text classification mostly represents text with low-dimensional word vectors of feature items produced by word2vec and then classifies it with a deep learning model.
Therefore, how to integrate the whole process from text preprocessing to the output of classification results is a problem that those skilled in the art need to solve.
Summary of the invention
An object of the present invention is to provide a text classification method and system, a text classification device, and a computer-readable storage medium that integrate the whole process from text preprocessing to the output of classification results.
To achieve the above object, an embodiment of the invention provides a text classification method, comprising:
preprocessing the collected text, and performing word segmentation on the preprocessed text to obtain a word sequence;
removing the stop words from the word sequence to obtain a current word sequence, and adding the feature words in the current word sequence whose TF-IDF weight is greater than a preset value to a corpus dictionary;
using a VSM model to represent the feature words in the corpus dictionary as vectors, obtaining a vector matrix;
inputting the vector matrix into a classification model and training the classification model, so as to classify unknown text.
The method further comprises:
after the unknown text has undergone preprocessing, word segmentation, and stop-word removal, inputting it into the trained classification model so that the trained classification model outputs the class of the unknown text.
Preprocessing the collected text comprises:
removing non-main text from the collected text;
wherein the non-main text comprises non-text data and/or interfering data items.
Adding the feature words in the current word sequence whose TF-IDF weight is greater than the preset value to the corpus dictionary comprises:
calculating the TF and the IDF of each feature word in the current word sequence, where TF is the frequency of the feature word in the current text and IDF is the reciprocal of the number of texts containing the feature word;
taking the product of the TF and the IDF as the TF-IDF weight of the feature word;
judging whether the TF-IDF weight is greater than the preset value and, if so, adding the feature word to the corpus dictionary.
To achieve the above object, an embodiment of the invention provides a text classification system, comprising:
a preprocessing module, configured to preprocess the collected text and perform word segmentation on the preprocessed text to obtain a word sequence;
an adding module, configured to remove the stop words from the word sequence to obtain a current word sequence, and to add the feature words in the current word sequence whose TF-IDF weight is greater than a preset value to a corpus dictionary;
a vectorization module, configured to use a VSM model to represent the feature words in the corpus dictionary as vectors, obtaining a vector matrix;
a training module, configured to input the vector matrix into a classification model and train the classification model, so as to classify unknown text.
The system further comprises:
an input module, configured to input the unknown text, after preprocessing, word segmentation, and stop-word removal, into the trained classification model so that the trained classification model outputs the class of the unknown text.
The preprocessing module specifically comprises:
a first removal unit, configured to remove non-main text from the collected text, wherein the non-main text comprises non-text data and/or interfering data items;
a word segmentation unit, configured to perform word segmentation on the preprocessed text to obtain a word sequence.
The adding module specifically comprises:
a second removal unit, configured to remove the stop words from the word sequence;
a calculation unit, configured to calculate the TF and the IDF of each feature word in the current word sequence, where TF is the frequency of the feature word in the current text and IDF is the reciprocal of the number of texts containing the feature word;
a determination unit, configured to take the product of the TF and the IDF as the TF-IDF weight of the feature word;
a judging unit, configured to judge whether the TF-IDF weight is greater than the preset value and, if so, to add the feature word to the corpus dictionary.
To achieve the above object, an embodiment of the invention provides a text classification device, comprising:
a memory, configured to store a text classification program;
a processor, configured to implement the steps of the above text classification method when executing the text classification program.
To achieve the above object, an embodiment of the invention provides a computer-readable storage medium storing a text classification program which, when executed by a processor, implements the above text classification method.
According to the above scheme, a text classification method provided by an embodiment of the invention comprises: preprocessing the collected text and performing word segmentation on the preprocessed text to obtain a word sequence; removing the stop words from the word sequence to obtain a current word sequence, and adding the feature words in the current word sequence whose TF-IDF weight is greater than a preset value to a corpus dictionary; using a VSM model to represent the feature words in the corpus dictionary as vectors, obtaining a vector matrix; and inputting the vector matrix into a classification model and training the classification model, so as to classify unknown text.
The text classification method provided by the embodiments of the invention thus screens the feature words in the word sequence by TF-IDF weight, represents the screened feature words as vectors with the VSM model, and trains the classification model on the resulting vector matrix so that unknown text can be classified, integrating the whole process from text preprocessing to the output of classification results. The invention also discloses a text classification system, a text classification device, and a computer-readable storage medium, which achieve the same technical effect.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a text classification method disclosed in an embodiment of the invention;
Fig. 2 is a flow chart of another text classification method disclosed in an embodiment of the invention;
Fig. 3 is a structural diagram of a text classification system disclosed in an embodiment of the invention;
Fig. 4 is a structural diagram of a text classification device disclosed in an embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.
An embodiment of the invention discloses a text classification method that integrates the whole process from text preprocessing to the output of classification results.
Referring to Fig. 1, a flow chart of a text classification method disclosed in an embodiment of the invention, the method comprises:
S101: preprocessing the collected text, and performing word segmentation on the preprocessed text to obtain a word sequence;
In a specific implementation, the collected text is preprocessed, the preprocessed text is segmented with a word segmentation tool, and the entire news text is cut into a word sequence.
S102: removing the stop words from the word sequence to obtain a current word sequence, and adding the feature words in the current word sequence whose TF-IDF weight is greater than a preset value to a corpus dictionary;
This step cleans out the stop words contained in the segmented news text, such as " ", " " and " ", to reduce their effect on classification performance.
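The stop-word removal in S102 can be sketched in Python as follows. This is a minimal illustration that assumes the text has already been segmented into a list of words; the stop-word set and the function name `remove_stop_words` are hypothetical and not taken from the patent, and a real system would load a full stop-word list for the target language.

```python
# Illustrative stop-word list (an assumption; not taken from the patent).
STOP_WORDS = {"the", "a", "of", "and", "is"}


def remove_stop_words(word_sequence, stop_words=STOP_WORDS):
    """Return the current word sequence: the input words minus stop words."""
    return [word for word in word_sequence if word not in stop_words]


words = ["the", "price", "of", "oil", "is", "rising"]
current_sequence = remove_stop_words(words)
print(current_sequence)
```

A set is used for the stop words so that each membership check takes constant time, which matters when the segmented corpus is large.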
S103: using a VSM model to represent the feature words in the corpus dictionary as vectors, obtaining a vector matrix;
The vector space model (VSM) is the most commonly used similarity calculation model and is widely applied in natural language processing. Each word corresponds to one dimension, and the word's frequency is the value of the feature vector in that dimension. The words of an article and their frequencies thus form a figure in n-dimensional space, and the similarity of two documents is the closeness of their two spatial figures.
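As a rough sketch of the vector space model described above, the following Python code maps each word sequence to a frequency vector with one dimension per dictionary word and stacks the vectors into a matrix. The description does not say how the closeness of two documents is measured; cosine similarity is used here as an assumption. The dictionary contents and the names `to_vector` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def to_vector(word_sequence, dictionary):
    """Map a word sequence to a frequency vector, one dimension per dictionary word."""
    counts = Counter(word_sequence)
    return [counts[word] for word in dictionary]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


dictionary = ["oil", "price", "match", "goal"]  # hypothetical corpus dictionary
doc_a = ["oil", "price", "oil"]
doc_b = ["price", "oil"]
doc_c = ["match", "goal", "goal"]

# Each row of the vector matrix is one text; each column is one dictionary word.
matrix = [to_vector(doc, dictionary) for doc in (doc_a, doc_b, doc_c)]
print(matrix)
print(cosine_similarity(matrix[0], matrix[1]))
print(cosine_similarity(matrix[0], matrix[2]))
```

In this toy data the two texts about oil prices share dimensions and score close to 1, while the sports text shares none with them and scores 0.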
S104: inputting the vector matrix into a classification model and training the classification model, so as to classify unknown text.
The invention places no specific restriction on the classification model, which may be, for example, Bayes, logistic regression, a neural network, SVM, or random forest.
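Since the description leaves the choice of classification model open, the sketch below picks one option, a multinomial naive Bayes classifier with add-one smoothing, purely as an illustration of training on a vector matrix. The training data, class labels, and function names are hypothetical.

```python
import math


def train_naive_bayes(matrix, labels):
    """Fit a multinomial naive Bayes model on rows of term counts."""
    n_features = len(matrix[0])
    model = {}
    for c in sorted(set(labels)):
        rows = [row for row, label in zip(matrix, labels) if label == c]
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + n_features  # Laplace (add-one) smoothing
        model[c] = {
            "log_prior": math.log(len(rows) / len(matrix)),
            "log_likelihood": [math.log((t + 1) / denom) for t in totals],
        }
    return model


def predict(model, vector):
    """Return the class with the highest log posterior for one count vector."""
    def score(c):
        params = model[c]
        return params["log_prior"] + sum(
            x * ll for x, ll in zip(vector, params["log_likelihood"])
        )
    return max(model, key=score)


# Hypothetical training data: columns are dictionary words, rows are texts.
train_matrix = [[2, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
train_labels = ["finance", "finance", "sports", "sports"]

model = train_naive_bayes(train_matrix, train_labels)
print(predict(model, [1, 2, 0, 0]))
print(predict(model, [0, 0, 1, 1]))
```

Working in log space avoids numeric underflow when the vectors have many dimensions; any of the other listed models could be substituted without changing the surrounding steps.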
The text classification method provided by this embodiment screens the feature words in the word sequence by TF-IDF weight, represents the screened feature words as vectors with the VSM model, and trains the classification model on the resulting vector matrix so that unknown text can be classified, thereby integrating the whole process from text preprocessing to the output of classification results.
On the basis of the above embodiment, the method preferably further comprises:
after the unknown text has undergone preprocessing, word segmentation, and stop-word removal, inputting it into the trained classification model so that the trained classification model outputs the class of the unknown text.
When performing text classification, the unknown text to be classified undergoes preprocessing, word segmentation, and stop-word removal and is then input into the classification model trained in the above embodiment, which outputs the class of the unknown text.
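The prediction path for an unknown text might be wired together as in the sketch below. It assumes, although the description does not state it explicitly, that the unknown text is vectorized against the same corpus dictionary used in training before it reaches the model. The function `classify_unknown` and all of the stand-in components (lower-casing as preprocessing, whitespace splitting as segmentation, a toy rule as the trained model) are hypothetical placeholders.

```python
from collections import Counter


def classify_unknown(raw_text, preprocess, segment, stop_words, dictionary, predict):
    """Run an unknown text through the same steps as the training texts."""
    text = preprocess(raw_text)                      # preprocessing
    words = segment(text)                            # word segmentation
    words = [w for w in words if w not in stop_words]  # stop-word removal
    counts = Counter(words)
    vector = [counts[w] for w in dictionary]         # same dictionary as training
    return predict(vector)


def toy_predict(vector):
    """Stand-in for a trained model: compares finance and sports word counts."""
    return "finance" if vector[0] + vector[1] >= vector[2] + vector[3] else "sports"


dictionary = ["oil", "price", "match", "goal"]
label = classify_unknown(
    "The OIL price rose", str.lower, str.split, {"the"}, dictionary, toy_predict
)
print(label)
```

Passing the components in as arguments keeps the sketch independent of any particular segmentation tool or classifier.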
An embodiment of the invention discloses another text classification method. Relative to the previous embodiment, this embodiment further explains and optimizes the technical solution. Specifically:
Referring to Fig. 2, a flow chart of another text classification method provided by an embodiment of the invention, the method comprises:
S211: removing non-main text from the collected text, wherein the non-main text comprises non-text data and/or interfering data items.
In a specific implementation, the original news text corpus often contains a large amount of non-main text such as non-text data or interfering data items, for example URL links and pictures, which hinders later feature extraction. Such data therefore needs to be processed so that only the main body of the news is retained.
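One way to approximate the removal of non-main text in S211 is with regular expressions, as in the following sketch. The two patterns (markup tags such as picture elements, and URL links) are illustrative assumptions about what counts as non-main text; the patent does not prescribe them, and the name `remove_non_main_text` is hypothetical.

```python
import re

# Patterns are illustrative assumptions about what counts as non-main text.
TAG_PATTERN = re.compile(r"<[^>]+>")       # e.g. <img ...> picture markup
URL_PATTERN = re.compile(r"https?://\S+")  # bare URL links


def remove_non_main_text(raw_text):
    """Strip markup tags and URL links, then collapse leftover whitespace."""
    text = TAG_PATTERN.sub(" ", raw_text)  # tags first, so URLs inside tags go with them
    text = URL_PATTERN.sub(" ", text)
    return " ".join(text.split())


raw = 'Oil prices rose today <img src="chart.png"> see https://example.com/news for details'
print(remove_non_main_text(raw))
```

Tags are removed before URLs so that a link embedded in a tag attribute does not leave a dangling fragment of markup behind.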
S212: performing word segmentation on the preprocessed text to obtain a word sequence;
S221: removing the stop words from the word sequence to obtain a current word sequence;
S222: calculating the TF and the IDF of each feature word in the current word sequence, where TF is the frequency of the feature word in the current text and IDF is the reciprocal of the number of texts containing the feature word;
S223: taking the product of the TF and the IDF as the TF-IDF weight of the feature word;
S224: judging whether the TF-IDF weight is greater than the preset value and, if so, adding the feature word to the corpus dictionary;
S203: using a VSM model to represent the feature words in the corpus dictionary as vectors, obtaining a vector matrix;
S204: inputting the vector matrix into a classification model and training the classification model, so as to classify unknown text.
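Steps S222 to S224 can be sketched as follows. The IDF is taken literally from the description as the reciprocal of the number of texts containing the word. Treating TF as the word's count divided by the length of the text is an assumption, since the description only says "frequency". The threshold value, sample texts, and function name are hypothetical.

```python
from collections import Counter


def build_corpus_dictionary(texts, threshold):
    """Collect words whose TF-IDF weight exceeds `threshold` in at least one text.

    TF  = occurrences of the word in a text / number of words in that text (assumption).
    IDF = 1 / number of texts containing the word (reciprocal, as the description states).
    """
    doc_freq = Counter()
    for words in texts:
        doc_freq.update(set(words))  # count each word once per text

    dictionary = set()
    for words in texts:
        for word, count in Counter(words).items():
            tf = count / len(words)
            idf = 1.0 / doc_freq[word]
            if tf * idf > threshold:  # S224: keep only words above the preset value
                dictionary.add(word)
    return sorted(dictionary)


texts = [
    ["oil", "price", "oil", "market"],
    ["match", "goal", "goal", "market"],
    ["market", "price", "report", "today"],
]
print(build_corpus_dictionary(texts, threshold=0.2))
```

With these sample texts, "market" appears in all three texts and "price" in two, so both fall below the threshold and are left out of the dictionary. The widely used TF-IDF formulation instead takes a logarithm of the total number of texts divided by the number of texts containing the word; the sketch follows the reciprocal form given in the description.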
A text classification system provided by an embodiment of the invention is introduced below. The text classification system described below and the text classification method described above may be cross-referenced.
Referring to Fig. 3, a structural diagram of a text classification system provided by an embodiment of the invention, the system comprises:
a preprocessing module 301, configured to preprocess the collected text and perform word segmentation on the preprocessed text to obtain a word sequence;
an adding module 302, configured to remove the stop words from the word sequence to obtain a current word sequence, and to add the feature words in the current word sequence whose TF-IDF weight is greater than a preset value to a corpus dictionary;
a vectorization module 303, configured to use a VSM model to represent the feature words in the corpus dictionary as vectors, obtaining a vector matrix;
a training module 304, configured to input the vector matrix into a classification model and train the classification model, so as to classify unknown text.
The text classification system provided by this embodiment screens the feature words in the word sequence by TF-IDF weight, represents the screened feature words as vectors with the VSM model, and trains the classification model on the resulting vector matrix so that unknown text can be classified, thereby integrating the whole process from text preprocessing to the output of classification results.
On the basis of the above embodiment, the system preferably further comprises:
an input module, configured to input the unknown text, after preprocessing, word segmentation, and stop-word removal, into the trained classification model so that the trained classification model outputs the class of the unknown text.
On the basis of the above embodiment, the preprocessing module preferably comprises:
a first removal unit, configured to remove non-main text from the collected text, wherein the non-main text comprises non-text data and/or interfering data items;
a word segmentation unit, configured to perform word segmentation on the preprocessed text to obtain a word sequence.
On the basis of the above embodiment, the adding module preferably comprises:
a second removal unit, configured to remove the stop words from the word sequence;
a calculation unit, configured to calculate the TF and the IDF of each feature word in the current word sequence, where TF is the frequency of the feature word in the current text and IDF is the reciprocal of the number of texts containing the feature word;
a determination unit, configured to take the product of the TF and the IDF as the TF-IDF weight of the feature word;
a judging unit, configured to judge whether the TF-IDF weight is greater than the preset value and, if so, to add the feature word to the corpus dictionary.
The present application also provides a text classification device. Referring to Fig. 4, a structural diagram of a text classification device provided by an embodiment of the invention, the device comprises:
a memory 401, configured to store a text classification program;
a processor 402, configured to implement the steps provided by the above embodiments when executing the text classification program. Of course, the text classification device may also include components such as various network interfaces and a power supply.
The text classification device provided by this embodiment screens the feature words in the word sequence by TF-IDF weight, represents the screened feature words as vectors with the VSM model, and trains the classification model on the resulting vector matrix so that unknown text can be classified, thereby integrating the whole process from text preprocessing to the output of classification results.
The present application also provides a computer-readable storage medium storing a text classification program which, when executed by a processor, implements the steps provided by the above embodiments. The storage medium may include any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and the identical or similar parts of the embodiments may be cross-referenced.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. The invention is therefore not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
- 1. A text classification method, characterized by comprising: preprocessing the collected text, and performing word segmentation on the preprocessed text to obtain a word sequence; removing the stop words from the word sequence to obtain a current word sequence, and adding the feature words in the current word sequence whose TF-IDF weight is greater than a preset value to a corpus dictionary; using a VSM model to represent the feature words in the corpus dictionary as vectors, obtaining a vector matrix; inputting the vector matrix into a classification model and training the classification model, so as to classify unknown text.
- 2. The text classification method according to claim 1, characterized by further comprising: after the unknown text has undergone preprocessing, word segmentation, and stop-word removal, inputting it into the trained classification model so that the trained classification model outputs the class of the unknown text.
- 3. The text classification method according to claim 1, characterized in that preprocessing the collected text comprises: removing non-main text from the collected text; wherein the non-main text comprises non-text data and/or interfering data items.
- 4. The text classification method according to any one of claims 1 to 3, characterized in that adding the feature words in the current word sequence whose TF-IDF weight is greater than the preset value to the corpus dictionary comprises: calculating the TF and the IDF of each feature word in the current word sequence, where TF is the frequency of the feature word in the current text and IDF is the reciprocal of the number of texts containing the feature word; taking the product of the TF and the IDF as the TF-IDF weight of the feature word; judging whether the TF-IDF weight is greater than the preset value and, if so, adding the feature word to the corpus dictionary.
- 5. A text classification system, characterized by comprising: a preprocessing module, configured to preprocess the collected text and perform word segmentation on the preprocessed text to obtain a word sequence; an adding module, configured to remove the stop words from the word sequence to obtain a current word sequence, and to add the feature words in the current word sequence whose TF-IDF weight is greater than a preset value to a corpus dictionary; a vectorization module, configured to use a VSM model to represent the feature words in the corpus dictionary as vectors, obtaining a vector matrix; a training module, configured to input the vector matrix into a classification model and train the classification model, so as to classify unknown text.
- 6. The text classification system according to claim 5, characterized by further comprising: an input module, configured to input the unknown text, after preprocessing, word segmentation, and stop-word removal, into the trained classification model so that the trained classification model outputs the class of the unknown text.
- 7. The text classification system according to claim 5, characterized in that the preprocessing module specifically comprises: a first removal unit, configured to remove non-main text from the collected text, wherein the non-main text comprises non-text data and/or interfering data items; a word segmentation unit, configured to perform word segmentation on the preprocessed text to obtain a word sequence.
- 8. The text classification system according to any one of claims 5 to 7, characterized in that the adding module specifically comprises: a second removal unit, configured to remove the stop words from the word sequence; a calculation unit, configured to calculate the TF and the IDF of each feature word in the current word sequence, where TF is the frequency of the feature word in the current text and IDF is the reciprocal of the number of texts containing the feature word; a determination unit, configured to take the product of the TF and the IDF as the TF-IDF weight of the feature word; a judging unit, configured to judge whether the TF-IDF weight is greater than the preset value and, if so, to add the feature word to the corpus dictionary.
- 9. A text classification device, characterized by comprising: a memory, configured to store a text classification program; a processor, configured to implement the steps of the text classification method according to any one of claims 1 to 4 when executing the text classification program.
- 10. A computer-readable storage medium, characterized in that a text classification program is stored on the computer-readable storage medium, and the text classification program, when executed by a processor, implements the text classification method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711091476.0A CN107862051A (en) | 2017-11-08 | 2017-11-08 | Text classification method, system, and text classification device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711091476.0A CN107862051A (en) | 2017-11-08 | 2017-11-08 | Text classification method, system, and text classification device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107862051A (en) | 2018-03-30 |
Family
ID=61701264
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711091476.0A Pending CN107862051A (en) | Text classification method, system, and text classification device | 2017-11-08 | 2017-11-08 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107862051A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108563722A (en) * | 2018-04-03 | 2018-09-21 | 有米科技股份有限公司 | Trade classification method, system, computer equipment and the storage medium of text message |
CN109189920A (en) * | 2018-08-02 | 2019-01-11 | 上海欣方智能系统有限公司 | Sweep-black case classification method and system |
CN109684121A (en) * | 2018-12-20 | 2019-04-26 | 鸿秦(北京)科技有限公司 | A kind of file access pattern method and system |
CN109977327A (en) * | 2019-03-20 | 2019-07-05 | 新华三信息安全技术有限公司 | A kind of Web page classification method and device |
CN111177386A (en) * | 2019-12-27 | 2020-05-19 | 安徽商信政通信息技术股份有限公司 | Proposal classification method and system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106095996A (en) * | 2016-06-22 | 2016-11-09 | 量子云未来(北京)信息科技有限公司 | Method for text classification |
CN106250372A (en) * | 2016-08-17 | 2016-12-21 | 国网上海市电力公司 | A kind of Chinese electric power data text mining method for power system |
CN106503254A (en) * | 2016-11-11 | 2017-03-15 | 上海智臻智能网络科技股份有限公司 | Language material sorting technique, device and terminal |
CN106528642A (en) * | 2016-10-13 | 2017-03-22 | 广东广业开元科技有限公司 | TF-IDF feature extraction based short text classification method |
CN107145560A (en) * | 2017-05-02 | 2017-09-08 | 北京邮电大学 | A kind of file classification method and device |
-
2017
- 2017-11-08 CN CN201711091476.0A patent/CN107862051A/en active Pending
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108563722A (en) * | 2018-04-03 | 2018-09-21 | 有米科技股份有限公司 | Trade classification method, system, computer equipment and the storage medium of text message |
CN109189920A (en) * | 2018-08-02 | 2019-01-11 | 上海欣方智能系统有限公司 | Sweep-black case classification method and system |
CN109684121A (en) * | 2018-12-20 | 2019-04-26 | 鸿秦(北京)科技有限公司 | A kind of file access pattern method and system |
CN109977327A (en) * | 2019-03-20 | 2019-07-05 | 新华三信息安全技术有限公司 | A kind of Web page classification method and device |
CN111177386A (en) * | 2019-12-27 | 2020-05-19 | 安徽商信政通信息技术股份有限公司 | Proposal classification method and system |
CN111177386B (en) * | 2019-12-27 | 2021-05-14 | 安徽商信政通信息技术股份有限公司 | Proposal classification method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107862051A (en) | Text classification method, system, and text classification device | |
CN111581355B (en) | Threat information topic detection method, device and computer storage medium | |
CN110516074B (en) | Website theme classification method and device based on deep learning | |
CN111104510B (en) | Text classification training sample expansion method based on word embedding | |
CN103995876A (en) | Text classification method based on chi square statistics and SMO algorithm | |
CN102799647A (en) | Method and device for webpage reduplication deletion | |
US20190163737A1 (en) | Method and apparatus for constructing binary feature dictionary | |
Barua et al. | Multi-class sports news categorization using machine learning techniques: resource creation and evaluation | |
CN110888983B (en) | Positive and negative emotion analysis method, terminal equipment and storage medium | |
CN108228612B (en) | Method and device for extracting network event keywords and emotional tendency | |
CN115098690A (en) | Multi-data document classification method and system based on cluster analysis | |
CN107357895A (en) | A kind of processing method of the text representation based on bag of words | |
CN107861945A (en) | Finance data analysis method, application server and computer-readable recording medium | |
Jayady et al. | Theme Identification using Machine Learning Techniques | |
Qureshi et al. | Aspect Level Songs Rating Based Upon Reviews in English. | |
Agustina et al. | The Implementation of TF-IDF and Word2Vec on Booster Vaccine Sentiment Analysis Using Support Vector Machine Algorithm | |
CN105550292B (en) | A kind of Web page classification method based on von Mises-Fisher probabilistic models | |
Lei et al. | Automatically classify chinese judgment documents utilizing machine learning algorithms | |
Vadivukarassi et al. | A comparison of supervised machine learning approaches for categorized tweets | |
Shah et al. | An automatic text summarization on Naive Bayes classifier using latent semantic analysis | |
CN112487263A (en) | Information processing method, system, equipment and computer readable storage medium | |
CN111062219A (en) | Latent semantic analysis text processing method and device based on tensor | |
Wang et al. | Discriminant mutual information for text feature selection | |
CN108733733B (en) | Biomedical text classification method, system and storage medium based on machine learning | |
CN111488452A (en) | Webpage tampering detection method, detection system and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20180330 |