CN110019763B - Text filtering method, system, equipment and computer readable storage medium - Google Patents
- Publication number
- CN110019763B (application CN201711449882.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- junk
- text data
- target text
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text data filtering method, system, device and computer-readable storage medium. The method comprises the following steps: creating a junk text information base in which at least one piece of junk text data is stored; performing feature extraction on the junk text data to generate junk text feature vectors, and training a junk text prediction model by combining the weight of each feature; performing feature extraction on target text data to generate a target text feature vector, inputting the target text feature vector into the junk text prediction model, and calculating the probability that the target text data is junk text data; and judging whether the target text data is junk text data according to the probability. The invention overcomes the defects of the prior art, in which manually auditing and managing content published in forums, communities, post boards and the like depends too heavily on administrators and occupies considerable resources; it intelligently filters out target text data that is junk text data and improves judgment efficiency.
Description
Technical Field
The present invention relates to the field of text processing, and in particular, to a text filtering method, system, device, and computer-readable storage medium.
Background
At present, the internet hosts many forums, communities, post boards and other websites or channels where people can publish their own opinions or comments. While these websites or channels give people space to speak freely, meaningless spam comments or inappropriate remarks on sensitive subjects also appear, so the websites or channels need to be properly supervised.
In the current supervision mode, a website administrator usually screens and filters forum content, community article content, post content, comment content and the like manually, with the aid of preset keywords, and deletes meaningless junk information or sensitive information.
This type of supervision relies heavily on manual audit management. The administrator needs to browse forums, communities or post boards in real time. For popular content, the number of viewers and the amount of information are so large that the administrator can hardly filter items one by one and is prone to mistakes; supervision depends too heavily on the administrator and occupies considerable resources.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art, in which manually auditing and managing content published in forums, communities, post boards and the like depends too heavily on administrators and occupies considerable resources, and to provide a text filtering method, system, device and computer-readable storage medium capable of automatically filtering junk text.
The invention solves the technical problems through the following technical scheme:
the invention provides a text data filtering method, which is characterized by comprising the following steps:
creating a junk text information base, wherein at least one piece of junk text data is stored in the junk text information base;
extracting features of the junk text data to generate junk text feature vectors, and training a junk text prediction model by combining the weight of each feature;
performing feature extraction on target text data to generate a target text feature vector, and inputting the target text feature vector into the junk text prediction model to calculate the probability that the target text data is the junk text data;
and judging whether the target text data is junk text data or not according to the probability.
Preferably, the junk text data comprises junk text content, and the target text data comprises target text content;
performing feature extraction on the junk text data comprises: converting the junk text content into a numerical representation;
performing feature extraction on the target text data comprises: converting the target text content into a numerical representation.
Preferably, converting the spam text content into a numerical representation includes:
extracting keywords from the spam text content;
counting the occurrence frequency of each keyword in the junk text content;
listing the occurrence times of each keyword according to the index sequence of the keywords to form a first space vector, wherein the first space vector is used as the junk text feature vector or the value of partial dimensionality in the junk text feature vector when the junk text feature vector is generated;
converting the target text content into a numerical representation, comprising:
extracting keywords from the target text content;
counting the occurrence frequency of each keyword in the target text content;
and listing the occurrence times of each keyword according to the index sequence of the keywords to form a second space vector, wherein the second space vector is used as the target text feature vector or the value of partial dimension in the target text feature vector when the target text feature vector is generated.
Preferably, the junk text data comprises junk text publishing time, and the target text data comprises target text publishing time;
performing feature extraction on the junk text data further comprises: converting the junk text publishing time into a numerical representation;
performing feature extraction on the target text data further comprises: converting the target text publishing time into a numerical representation.
Preferably, converting the spam text publishing time into a numerical representation comprises:
dividing a plurality of time periods, and respectively setting a numerical value for each time period;
judging a first time period to which the junk text publishing time belongs, determining a numerical value corresponding to the first time period, wherein the numerical value corresponding to the first time period is used as a value of one dimension of the junk text feature vector when the junk text feature vector is generated, and combining the numerical value with the first space vector to form the junk text feature vector;
converting the target text publishing time into a numerical representation, comprising:
and judging a second time period to which the target text publishing time belongs according to the divided time periods, determining a numerical value corresponding to the second time period, wherein the numerical value corresponding to the second time period is used as a value of one dimension of the target text feature vector when the target text feature vector is generated, and combining the numerical value with the second space vector to form the target text feature vector.
Preferably, the weight of each feature is calculated by a ReliefF algorithm, and the spam text prediction model is trained based on the ReliefF algorithm.
Preferably, the text data filtering method further includes:
manually checking the target text data which are judged to be the junk text data;
and/or storing the target text data which is judged to be the junk text data by the junk text prediction model or the target text data which is confirmed to be the junk text data through manual verification into the junk text information base.
The present invention also provides a text data filtering system, which is characterized in that the text data filtering system comprises: a data unit, a model unit and a judgment unit;
the data unit is used for creating a junk text information base, and at least one piece of junk text data is stored in the junk text information base;
the model unit includes:
the first feature extraction module is used for extracting features of the junk text data;
the first feature vector module is used for generating a junk text feature vector;
the model training module is used for training the junk text prediction model by combining the weight of each feature;
the judging unit includes:
the second feature extraction module is used for extracting features of the target text data;
the second feature vector module is used for generating a target text feature vector;
and the probability calculation module is used for inputting the target text feature vector into the junk text prediction model so as to calculate the probability that the target text data is the junk text data, and judging whether the target text data is the junk text data or not according to the probability.
Preferably, the junk text data comprises junk text content, and the target text data comprises target text content;
the first feature extraction module is used for converting the junk text content into a numerical representation;
the second feature extraction module is used for converting the target text content into a numerical representation.
Preferably, converting the spam text content into a numerical representation includes:
extracting keywords from the spam text content;
counting the occurrence frequency of each keyword in the junk text content;
listing the occurrence times of each keyword according to the index sequence of the keywords to form a first space vector, wherein the first space vector is used as the junk text feature vector or the value of partial dimensionality in the junk text feature vector when the junk text feature vector is generated;
converting the target text content into a numerical representation, comprising:
extracting keywords from the target text content;
counting the occurrence frequency of each keyword in the target text content;
and listing the occurrence times of each keyword according to the index sequence of the keywords to form a second space vector, wherein the second space vector is used as the target text feature vector or the value of partial dimension in the target text feature vector when the target text feature vector is generated.
Preferably, the junk text data comprises junk text publishing time, and the target text data comprises target text publishing time;
the first feature extraction module is further used for converting the junk text release time into a numerical representation;
the second feature extraction module is further configured to convert the target text release time into a numerical representation.
Preferably, converting the spam text publishing time into a numerical representation comprises:
dividing a plurality of time periods, and respectively setting a numerical value for each time period;
judging a first time period to which the junk text publishing time belongs, determining a numerical value corresponding to the first time period, wherein the numerical value corresponding to the first time period is used as a value of one dimension of the junk text feature vector when the junk text feature vector is generated, and combining the numerical value with the first space vector to form the junk text feature vector;
converting the target text publishing time into a numerical representation, comprising:
and judging a second time period to which the target text publishing time belongs according to the divided time periods, determining a numerical value corresponding to the second time period, wherein the numerical value corresponding to the second time period is used as a value of one dimension of the target text feature vector when the target text feature vector is generated, and combining the numerical value with the second space vector to form the target text feature vector.
Preferably, in the model training module, the weight of each feature is calculated by a ReliefF algorithm, and the spam text prediction model is trained based on the ReliefF algorithm.
Preferably, the text data filtering system further comprises:
the checking unit is used for manually checking the target text data which are judged to be the junk text data;
and/or the storage unit is used for storing the target text data which is judged to be the junk text data by the junk text prediction model or the target text data which is confirmed to be the junk text data through manual verification into the junk text information base.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, and is characterized in that the processor executes the program to realize the text data filtering method with any combination of the preferable conditions.
The present invention also provides a computer-readable storage medium, on which a computer program is stored, which is characterized in that the program, when being executed by a processor, carries out the steps of the text data filtering method in any combination of the above-mentioned preferred conditions.
On the basis of common knowledge in the field, the above preferred conditions can be combined arbitrarily to obtain the preferred embodiments of the invention.
The positive effects of the invention are as follows: the invention can train the junk text prediction model on the junk text data in the junk text information base and use the model to intelligently filter out target text data that is junk text data, which reduces the dependence on administrators, reduces the resources occupied and improves judgment efficiency.
Drawings
Fig. 1 is a flowchart of a text data filtering method according to a preferred embodiment 1 of the present invention.
Fig. 2 is a flowchart of step 102 in the text data filtering method according to the preferred embodiment 1 of the present invention.
Fig. 3 is a flowchart of step 103 of the text data filtering method according to the preferred embodiment 1 of the present invention.
Fig. 4 is a schematic block diagram of a text data filtering system according to a preferred embodiment 2 of the present invention.
Fig. 5 is a schematic hardware structure diagram of an electronic device according to a preferred embodiment 3 of the invention.
Detailed Description
The invention is further illustrated by the following examples, which are not intended to limit the scope of the invention.
Example 1
Fig. 1 shows a flowchart of the text data filtering method of the present embodiment. The method is mainly used for judging whether target text data is junk text data, so that published junk text data can be filtered. Generally, junk text data refers to text in any form, such as comments, posts and articles, whose content is meaningless or relates to sensitive subject matter and is therefore not suitable for publication in public places.
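The text data filtering method comprises the following steps:
Step 101, creating a junk text information base, wherein at least one piece of junk text data is stored in the junk text information base.
Step 102, extracting features of the junk text data to generate junk text feature vectors, and training a junk text prediction model by combining the weight of each feature. The junk text prediction model is used for predicting whether text data is junk text data.
Step 103, performing feature extraction on target text data to generate a target text feature vector, and inputting the target text feature vector into the junk text prediction model to calculate the probability that the target text data is junk text data.
Step 104, judging whether the target text data is junk text data according to the probability.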
In this embodiment, the spam text data includes the spam text publishing time and the spam text content, but the present invention is not limited to this, and may also include other related information, such as an account number, an IP, etc. for publishing the spam text. The following table gives one specific format that may be used to store data:
Step 102 is further described below, taking as an example the case where the spam text data includes spam text content and spam text publishing time. As shown in Fig. 2, step 102 specifically includes the following steps:
The specific process of converting the junk text content into numerical representation comprises the following steps:
extracting keywords from the junk text content, wherein the keywords are vulgar words, words relating to sensitive subjects, or words that frequently appear in junk text content, and the keywords are preset with a fixed and unique index sequence;
counting the occurrence frequency of each keyword in the junk text content;
and listing the occurrence times of each keyword according to the index sequence of the keywords to form a first space vector.
In specific implementation, the keywords can be extracted from the junk text content through a Word2vec model, or in other ways according to actual requirements. Word2vec is an efficient tool for representing words as real-valued vectors. Drawing on ideas from deep learning, it is trained so that the processing of text content is reduced to vector operations in a K-dimensional vector space: words in sentences are converted into low-dimensional continuous values, words with similar meanings are mapped to nearby positions in the vector space, and similarity in the vector space can be used to represent similarity in text semantics. For the count-based representation illustrated below, the basic assumption is that the word order and syntax of a text are ignored, the text is treated only as a set of words, and each word in the text is independent. An advantage of the Word2vec model is that the hidden layer of the neural network is removed, which reduces the amount of calculation.
Assume two simple texts as follows:
John likes to watch movies. Mary likes too.
John also likes to watch football games.
Based on the words appearing in the two texts, the following dictionary is constructed:
{"John":1,"likes":2,"to":3,"watch":4,"movies":5,"also":6,"football":7,"games":8,"Mary":9,"too":10}
The dictionary above contains 10 words, each with a unique index, so each text can be represented by a 10-dimensional vector, as follows:
[1,2,1,1,1,0,0,0,1,1]
[1,1,1,1,0,1,1,1,0,0]
The generated vectors are independent of the order in which the words appear in the original text; each element represents the number of times the corresponding word appears in the text.
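The counting step in this example can be reproduced with a short sketch. It assumes a simple letters-only tokenizer (a hypothetical stand-in for whatever word segmentation a real system would use), and the name `count_vector` is illustrative and does not come from the patent.

```python
import re

# Dictionary from the example above: word -> 1-based index.
DICTIONARY = {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5,
              "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}


def count_vector(text, dictionary):
    """Return the occurrence count of each dictionary word, ordered by index."""
    vector = [0] * len(dictionary)
    # Assumption: tokens are runs of ASCII letters; punctuation is discarded.
    for token in re.findall(r"[A-Za-z]+", text):
        index = dictionary.get(token)
        if index is not None:  # words outside the dictionary are ignored
            vector[index - 1] += 1
    return vector


text_1 = "John likes to watch movies. Mary likes too."
text_2 = "John also likes to watch football games."
print(count_vector(text_1, DICTIONARY))  # [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
print(count_vector(text_2, DICTIONARY))  # [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
```

Because the dictionary fixes the index order, the same function can be applied unchanged to junk text content and to target text content.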
In the embodiment, each keyword is taken as a feature, the representation of each keyword on a vector space is obtained, and the junk text content is finally converted into a first space vector.
The specific process of converting the junk text release time into numerical representation comprises the following steps:
dividing a plurality of time periods, and respectively setting a numerical value for each time period;
and judging a first time period to which the junk text publishing time belongs, and determining a numerical value corresponding to the first time period.
The time periods can be freely divided or combined with the concentrated time period division of junk text release in past experience, and the numerical value corresponding to each time period can also be freely set. In this embodiment, the time of day is divided into 4 time segments, wherein,
0:00-10:00 is the morning period, and its corresponding value is set to 0;
10:00-14:00 is the noon period, and its corresponding value is set to 1;
14:00-19:00 is the afternoon period, and its corresponding value is set to 2;
19:00-24:00 is the evening period, and its corresponding value is set to 3.
If the publishing time of a piece of junk text data is 11:00, the time period to which it belongs is 10:00-14:00, and the corresponding value is 1.
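The mapping from publishing time to a period value might be sketched as follows. The four periods and their values come from this embodiment; treating each period as including its start and excluding its end (so that 10:00 exactly falls in the noon period) is an assumption, since the text does not say how boundaries are handled, and `time_period_value` is a hypothetical name.

```python
from datetime import time

# (start hour inclusive, end hour exclusive, value) for the four periods.
TIME_PERIODS = [(0, 10, 0), (10, 14, 1), (14, 19, 2), (19, 24, 3)]


def time_period_value(publish_time):
    """Return the value of the period that contains publish_time."""
    for start, end, value in TIME_PERIODS:
        if start <= publish_time.hour < end:
            return value
    raise ValueError("publish_time is outside all configured periods")


print(time_period_value(time(11, 0)))  # 1
```

Keeping the periods in a table rather than in branches makes it easy to re-divide them, which the embodiment explicitly allows.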
Step 1022, taking the numerical value corresponding to the first time period as the value of one dimension of the spam text feature vector, and combining it with the first space vector to form the spam text feature vector, thereby generating the spam text feature vector.
For example, for a piece of spam text data published at 7:00, the corresponding feature vector is:
[0,1,2,3,2,1,0,4, … ], wherein the first number 0 is the numerical value representing the spam text publishing time, and the following numbers represent the first space vector into which the spam text content is converted by Word2vec.
Of course, if the spam text data only includes spam text content but not spam text release time, the first space vector can be directly used as a spam text feature vector; if the junk text data also comprises other related information, the junk text data can be digitalized and then used as a value of a part of dimensions in the junk text feature vector to participate in the calculation of the junk text prediction model.
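A minimal sketch of assembling the feature vector is given here. Placing the time value first follows the example in this embodiment; appending other digitized information at the end, and the names `build_feature_vector` and `extra_values`, are assumptions made for illustration.

```python
def build_feature_vector(space_vector, time_value=None, extra_values=()):
    """Combine the optional time-period value, the keyword count vector and
    any other digitized information into one feature vector."""
    vector = []
    if time_value is not None:
        vector.append(time_value)    # one dimension for the publishing time
    vector.extend(space_vector)      # keyword occurrence counts
    vector.extend(extra_values)      # assumption: other information goes last
    return vector


# Text published at 7:00 (period value 0) with illustrative keyword counts.
print(build_feature_vector([1, 2, 3, 2, 1, 0, 4], time_value=0))
# [0, 1, 2, 3, 2, 1, 0, 4]
```

When no publishing time is available, calling the function with only the space vector returns that vector unchanged, which matches the fallback described above.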
Step 1023, training the junk text prediction model by combining the weight of each feature. In this embodiment, the weight of each feature is calculated by a ReliefF algorithm, and the weights are stored in a model of the ReliefF algorithm, so that a spam text prediction model based on the ReliefF algorithm is trained. Of course, other algorithms may also be used to calculate the weight of each feature and train the corresponding algorithm model.
In the ReliefF algorithm, the relevance between features and classes is based on the ability of features to distinguish between samples that are close to each other. The algorithm randomly selects a sample R from a training set D, then searches for the nearest neighbor sample H among the samples of the same class as R, called the Near Hit, and the nearest neighbor sample M among the samples of a different class from R, called the Near Miss. It then updates the weight of each feature according to the following rule: if the distance between R and the Near Hit on a feature is smaller than the distance between R and the Near Miss, the feature helps distinguish nearest neighbors of the same class from those of different classes, and the weight of the feature is increased; conversely, if the distance between R and the Near Hit on a feature is greater than the distance between R and the Near Miss, the feature has a negative effect on distinguishing them, and the weight of the feature is reduced. The above process is repeated m times to obtain the average weight of each feature. The larger the weight of a feature, the stronger its classification capability; the smaller the weight, the weaker its classification capability. In the multi-class case, in each iteration a sample R is randomly drawn from the sample set, K nearest neighbors (Near Hits) are found among the samples of the same class as R, K nearest neighbors (Near Misses) are found in each class different from that of R, and the weight of each feature is then updated.
The running time of the ReliefF algorithm increases linearly with the number of sampling iterations m and the number of original features N, so the algorithm runs very efficiently.
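The weight update can be illustrated with a sketch of the basic two-class Relief procedure with a single Near Hit and Near Miss, which is a simplification of ReliefF (no K neighbors, no multi-class handling). It assumes that labelled samples of both classes are available, that each class has at least two samples, and that feature differences are scaled by each feature's value range; `relief_weights` and the toy data are hypothetical.

```python
import random


def relief_weights(samples, labels, m, seed=0):
    """Basic two-class Relief: average per-feature weight over m draws."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    # Value range of each feature, used to scale differences into [0, 1].
    spans = []
    for j in range(n_features):
        column = [s[j] for s in samples]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def distance(a, b):
        return sum(diff(j, a, b) for j in range(n_features))

    weights = [0.0] * n_features
    for _ in range(m):
        i = rng.randrange(len(samples))
        r = samples[i]
        hits = [s for k, s in enumerate(samples)
                if k != i and labels[k] == labels[i]]
        misses = [s for k, s in enumerate(samples) if labels[k] != labels[i]]
        near_hit = min(hits, key=lambda s: distance(r, s))
        near_miss = min(misses, key=lambda s: distance(r, s))
        for j in range(n_features):
            # Weight rises when R is farther from the Near Miss than from
            # the Near Hit on feature j, and falls in the opposite case.
            weights[j] += (diff(j, r, near_miss) - diff(j, r, near_hit)) / m
    return weights


# Toy data: feature 0 separates the classes, feature 1 does not.
samples = [[0, 5], [0, 1], [0, 3], [4, 5], [4, 1], [4, 3]]
labels = [1, 1, 1, 0, 0, 0]
weights = relief_weights(samples, labels, m=20)
print(weights[0] > weights[1])  # True
```

In the toy data the class-separating feature receives the larger weight, which is the behavior the description above attributes to features with strong classification capability.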
In this embodiment, the target text data includes the target text publishing time and the target text content, but the present invention is not limited to this, and may also include other related information, such as an account number, an IP, and the like for publishing the target text.
Step 103 is further explained below by taking the example that the target text data includes the target text content and the target text publishing time, as shown in fig. 3, step 103 specifically includes the following steps:
Step 1031, converting the target text content into a numerical representation and converting the target text publishing time into a numerical representation, thereby realizing the feature extraction of the target text data.
The specific process of converting the target text content into numerical representation comprises the following steps:
extracting keywords from the target text content, wherein the keywords are the same as the keywords set in the step 1021 and have the same index sequence;
counting the occurrence frequency of each keyword in the target text content;
and listing the occurrence times of each keyword according to the index sequence of the keywords to form a second space vector.
In specific implementation, the keywords can be extracted from the target text content through Word2vec, or in other ways according to actual requirements. For the specific process of forming the second space vector, refer to the process of forming the first space vector, which is not repeated here.
The specific process of converting the target text release time into numerical representation comprises the following steps:
and judging a second time period to which the target text publishing time belongs according to the divided time periods, and determining a numerical value corresponding to the second time period.
For example, for a piece of target text data published at 18:00, the corresponding feature vector is:
[2,1,3,0,1,2,0,4, … ], wherein the first number 2 is the numerical value representing the target text publishing time, and the following numbers represent the second space vector into which the target text content is converted by Word2vec.
Of course, if the target text data only includes the target text content but not the target text publishing time, the second space vector may be directly used as the target text feature vector; if the target text data also comprises other related information, the target text data can also be digitalized and then used as a value of a part of dimensionality in the target text feature vector, and finally the target text feature vector is formed.
In addition, step 104 may specifically be configured as: judging whether the probability is greater than a probability threshold; if so, the target text data is judged to be junk text data, and if not, it is judged to be non-junk text data. The probability threshold can be set as needed: the higher the threshold, the stricter the criterion for judging text to be junk text data; conversely, the lower the threshold, the looser the criterion.
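The threshold comparison of step 104 can be sketched as below. The default threshold of 0.5 and the probability 0.72 are illustrative assumptions; the patent leaves the threshold to the implementer and does not give example probabilities.

```python
def judge_spam(probability, threshold=0.5):
    """Return True (junk text) when the probability exceeds the threshold."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return probability > threshold


probability = 0.72  # hypothetical output of the prediction model
print(judge_spam(probability, threshold=0.5))  # True
print(judge_spam(probability, threshold=0.9))  # False
```

The same probability is judged differently under the two thresholds, which shows how raising the threshold tightens the criterion.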
The target text data determined as the spam text data may be automatically deleted or may be processed by an administrator.
In order to further confirm whether the determination result of step 104 is correct, the text data filtering method may further include, after step 104:
manually checking the target text data that is judged to be junk text data. For target text data that manual checking finds to have been misjudged as junk text data, the judgment result is corrected and the cause of the misjudgment is traced, so that the junk text prediction model can be further corrected and the judgment accuracy improved.
In order to collect more spam text data and expand the spam text information base, the text data filtering method may further comprise, after step 104:
storing, in the junk text information base, the target text data that is judged to be junk text data by the junk text prediction model, or the target text data that is confirmed to be junk text data through manual checking.
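One possible way to combine the manual check with the storing step is sketched here. It assumes the information base is a plain list and that a reviewer's verdict, when present, overrides the model's judgment; `record_result` and its parameters are hypothetical and not taken from the patent.

```python
def record_result(spam_base, target_text, model_says_spam, manual_check=None):
    """Store target_text in spam_base when the final judgment is junk text.

    manual_check is None when no manual check took place; otherwise it is
    the reviewer's verdict and overrides the model's judgment.
    """
    final_is_spam = model_says_spam if manual_check is None else manual_check
    if final_is_spam:
        spam_base.append(target_text)
    return final_is_spam


spam_base = ["historical spam text"]
record_result(spam_base, "text A", model_says_spam=True)
record_result(spam_base, "text B", model_says_spam=True, manual_check=False)
print(spam_base)  # ['historical spam text', 'text A']
```

Here "text B" was judged junk by the model but rejected on manual check, so it is not added to the information base.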
Example 2
Fig. 4 shows a schematic block diagram of the text data filtering system of the present embodiment. The text data filtering system is mainly used for judging whether the target text data is junk text data or not so as to filter the published junk text data.
The text data filtering system includes: a data unit 201, a model unit 202 and a judgment unit 203.
The data unit 201 is configured to create a spam text information base, where at least one spam text data is stored. The junk text information base is formed by collecting historical junk text data and can be specifically established in a database form. In this embodiment, the spam text data includes the spam text publishing time and the spam text content, but the present invention is not limited to this, and may also include other related information, such as an account number, an IP, etc. for publishing the spam text.
The model unit 202 includes: a first feature extraction module 2021, a first feature vector module 2022, and a model training module 2023.
The first feature extraction module 2021 is configured to perform feature extraction on the spam text data.
The first feature vector module 2022 is configured to generate a spam text feature vector.
The model training module 2023 is used to train the spam text prediction model in conjunction with the weight of each feature. The junk text prediction model is used for predicting whether the text data is junk text data.
The judgment unit 203 includes: a second feature extraction module 2031, a second feature vector module 2032, and a probability calculation module 2033.
The second feature extraction module 2031 is configured to perform feature extraction on the target text data. The target text data may be any form of text such as comments, posts, articles published or published in forums, communities, posts or other websites, or other text. In this embodiment, the target text data includes the target text publishing time and the target text content, but the present invention is not limited to this, and may also include other related information, such as an account number, an IP, and the like for publishing the target text.
The second feature vector module 2032 is configured to generate a target text feature vector.
The probability calculation module 2033 is configured to input the target text feature vector into the spam text prediction model to calculate a probability that the target text data is spam text data, and determine whether the target text data is spam text data according to the probability.
The first feature extraction module 2021, the first feature vector module 2022, and the model training module 2023 are further described below:
The first feature extraction module 2021 converts the spam text content into a numerical representation and converts the spam text publishing time into a numerical representation, thereby realizing feature extraction of the junk text data.
Wherein, converting the spam text content into a numerical representation comprises:
extracting keywords from the junk text content, wherein the keywords are preset and each keyword is assigned a fixed, unique index;
counting the number of occurrences of each keyword in the junk text content;
and listing the number of occurrences of each keyword in the index order of the keywords to form a first space vector.
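By way of illustration only, the three steps above can be sketched as follows. The keyword list, the whitespace tokenization and the function name `content_to_vector` are hypothetical and are not specified by this embodiment; real keyword extraction (particularly for Chinese text) would need a proper word segmenter.

```python
from collections import Counter

# Hypothetical preset keyword list; a keyword's position in the list
# serves as its fixed, unique index.
KEYWORDS = ["free", "winner", "click", "loan"]


def content_to_vector(text, keywords=KEYWORDS):
    """Return the occurrence count of each preset keyword, in index order."""
    # Naive lower-casing and whitespace tokenization (an assumption made
    # only to keep the sketch self-contained).
    tokens = text.lower().split()
    counts = Counter(tokens)
    # Counter returns 0 for keywords that never occur.
    return [counts[keyword] for keyword in keywords]


first_space_vector = content_to_vector("Click now free loan free gift click")
```

Because the index order is fixed, every text yields a vector of the same length, so vectors built from different texts are directly comparable.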
Converting the junk text publishing time into a numerical representation, comprising:
dividing a plurality of time periods, and respectively setting a numerical value for each time period;
and judging a first time period to which the junk text publishing time belongs, and determining a numerical value corresponding to the first time period.
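As a minimal sketch of this conversion, assume the day is divided into four six-hour periods. The embodiment does not state how the time periods are divided or which values are assigned, so the `TIME_PERIODS` table and the function name `time_to_value` below are hypothetical.

```python
from datetime import datetime

# Hypothetical division: (start_hour inclusive, end_hour exclusive, value).
TIME_PERIODS = [(0, 6, 0), (6, 12, 1), (12, 18, 2), (18, 24, 3)]


def time_to_value(published_at, periods=TIME_PERIODS):
    """Map a publishing time to the value of the period it falls in."""
    hour = published_at.hour
    for start, end, value in periods:
        if start <= hour < end:
            return value
    raise ValueError("publishing time falls outside every configured period")


first_period_value = time_to_value(datetime(2017, 12, 27, 2, 30))
```

The same table must be reused for the target text publishing time so that both feature vectors encode time in the same way.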
The first feature vector module 2022 uses the numerical value corresponding to the first time period as the value of one dimension of the spam text feature vector, and combines that value with the first space vector to form the spam text feature vector. Of course, if the spam text data includes only spam text content and no spam text publishing time, the first space vector can be used directly as the spam text feature vector; if the junk text data also includes other related information, that information can be converted into numerical form and used as the values of some of the dimensions of the junk text feature vector, so that it takes part in the calculation of the junk text prediction model.
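The combination described above might be sketched as follows. Appending the time value after the keyword counts is an assumption, since the embodiment does not fix the position of that dimension; the function name `build_feature_vector` and the `extra` parameter are likewise hypothetical.

```python
from typing import List, Optional, Sequence


def build_feature_vector(
    space_vector: Sequence[int],
    period_value: Optional[int] = None,
    extra: Sequence[float] = (),
) -> List[float]:
    """Combine keyword counts, an optional time value and optional extras."""
    features = [float(x) for x in space_vector]
    # If no publishing time is available, the space vector is used alone.
    if period_value is not None:
        features.append(float(period_value))
    # Other related information, already converted to numbers by the caller.
    features.extend(float(x) for x in extra)
    return features
```

For example, `build_feature_vector([2, 0, 2, 1], 0)` yields a five-dimensional vector, while `build_feature_vector([2, 0, 2, 1])` yields the four keyword dimensions only.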
The model training module 2023 calculates the weight of each feature using the ReliefF algorithm; the weights are stored in the ReliefF model, and the spam text prediction model is trained on that basis. Of course, other algorithms may also be used to calculate the weight of each feature and to train a corresponding model.
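For orientation only, the following is a simplified sketch of Relief-style feature weighting. It is the basic single-neighbour, two-class Relief procedure rather than full ReliefF (which averages over k nearest neighbours and handles multiple classes), and it assumes that labelled non-junk samples are available alongside the junk samples. How the weights are then turned into a probability is not specified in this embodiment and is not covered here. The function name `relief_weights` is hypothetical.

```python
def relief_weights(samples, labels):
    """Basic Relief: reward features that differ on the nearest miss
    (other class) and penalise features that differ on the nearest hit
    (same class)."""
    n_features = len(samples[0])
    # Value range of each feature, used to scale differences into [0, 1].
    spans = []
    for f in range(n_features):
        column = [s[f] for s in samples]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(f, a, b):
        return abs(a[f] - b[f]) / spans[f]

    def distance(a, b):
        return sum(diff(f, a, b) for f in range(n_features))

    weights = [0.0] * n_features
    m = len(samples)
    for i, (x, y) in enumerate(zip(samples, labels)):
        hits = [s for j, (s, lab) in enumerate(zip(samples, labels))
                if j != i and lab == y]
        misses = [s for s, lab in zip(samples, labels) if lab != y]
        if not hits or not misses:
            continue  # cannot score a sample without both neighbour types
        near_hit = min(hits, key=lambda s: distance(x, s))
        near_miss = min(misses, key=lambda s: distance(x, s))
        for f in range(n_features):
            weights[f] += (diff(f, x, near_miss) - diff(f, x, near_hit)) / m
    return weights


# Toy data: feature 0 separates the classes, feature 1 does not.
toy_weights = relief_weights([[3, 0], [2, 1], [0, 0], [0, 1]], [1, 1, 0, 0])
```

On the toy data the first feature receives a positive weight and the second a negative one, which is the behaviour expected of a discriminative versus a non-discriminative feature.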
The second feature extraction module 2031, the second feature vector module 2032, and the probability calculation module 2033 are further described below:
The second feature extraction module 2031 converts the target text content into a numerical representation and converts the target text publishing time into a numerical representation, thereby realizing feature extraction of the target text data. Converting the target text content into a numerical representation comprises:
extracting keywords from the target text content;
counting the number of occurrences of each keyword in the target text content;
and listing the number of occurrences of each keyword in the index order of the keywords to form a second space vector.
In a specific implementation, keywords can be extracted from the target text content using Word2vec; keyword extraction, including extraction from the junk text content, can also be implemented in other ways according to actual requirements. The second space vector is formed in the same way as the first space vector, and the process is not described again here.
Converting the target text publishing time into a numerical representation, comprising:
judging, according to the divided time periods, a second time period to which the target text publishing time belongs, and determining the numerical value corresponding to the second time period.
The second feature vector module 2032 uses the numerical value corresponding to the second time period as the value of one dimension of the target text feature vector, and combines that value with the second space vector to form the target text feature vector. Of course, if the target text data includes only the target text content and no target text publishing time, the second space vector may be used directly as the target text feature vector; if the target text data also includes other related information, that information can likewise be converted into numerical form and used as the values of some of the dimensions of the target text feature vector.
The probability calculation module 2033 is configured to input the target text feature vector into the spam text prediction model and calculate a model output, where the model output represents the probability that the target text data is spam text data. If the probability is greater than a probability threshold, the target text data is determined to be spam text data; if the probability is not greater than the probability threshold, the target text data is determined to be non-junk text data. The probability threshold can be configured as required: the higher the threshold, the stricter the criterion for determining that text is junk text data; conversely, the lower the threshold, the looser the criterion.
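The threshold comparison can be sketched as follows. The default value of 0.5 and the function name `is_junk` are hypothetical; the embodiment only requires that the threshold be configurable.

```python
def is_junk(probability: float, threshold: float = 0.5) -> bool:
    """Return True only when the probability strictly exceeds the threshold.

    A probability equal to the threshold is treated as non-junk, matching
    the "not greater than" branch of the description.
    """
    return probability > threshold
```

Raising the threshold (for example to 0.9) means fewer texts are flagged as junk; lowering it flags more.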
The target text data determined as the spam text data may be automatically deleted or may be processed by an administrator.
In order to further confirm whether the judgment result of the judging unit 203 is correct, the text data filtering system further includes:
a checking unit 204, configured for manual checking of the target text data determined to be spam text data. For target text data that manual checking finds to have been misjudged as junk text data, the judgment result is corrected and the cause of the misjudgment is traced, so that the junk text prediction model can be further corrected and the judgment accuracy improved.
In order to collect more spam text data and expand the spam text information base, the text data filtering system further includes:
a storage unit 205, configured to store the target text data determined as spam text data by using the spam text prediction model or the target text data determined as spam text data through manual verification into the spam text information base.
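One possible way to combine the checking unit and the storage unit is sketched below. Letting a manual verdict override the model verdict whenever one is present is an assumption made for the sketch; the embodiment allows either verdict to trigger storage. The in-memory list standing in for the spam text information base and the function name `store_if_junk` are hypothetical.

```python
from typing import List, Optional


def store_if_junk(
    junk_base: List[dict],
    record: dict,
    model_says_junk: bool,
    reviewer_says_junk: Optional[bool] = None,
) -> bool:
    """Append the record to the junk base when the final verdict is junk."""
    # Assumption: a manual verdict, when present, overrides the model's.
    if reviewer_says_junk is None:
        verdict = model_says_junk
    else:
        verdict = reviewer_says_junk
    if verdict:
        junk_base.append(record)
    return verdict
```

Records stored in this way can later be used as additional training samples when the junk text prediction model is retrained.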
Example 3
Fig. 5 is a schematic structural diagram of an electronic device according to embodiment 3 of the present invention. The electronic device comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the text data filtering method of embodiment 1 when executing the program. The electronic device 30 shown in fig. 5 is only an example, and should not impose any limitation on the functions or the scope of use of the embodiments of the present invention.
As shown in fig. 5, the electronic device 30 may be embodied in the form of a general purpose computing device, which may be, for example, a server device. The components of the electronic device 30 may include, but are not limited to: at least one processor 31, at least one memory 32, and a bus 33 connecting the various system components (including the memory 32 and the processor 31).
The bus 33 includes a data bus, an address bus, and a control bus.
The memory 32 may include volatile memory, such as Random Access Memory (RAM) 321 and/or cache memory 322, and may further include Read Only Memory (ROM) 323.
The processor 31 executes various functional applications and data processing, such as a text data filtering method provided in embodiment 1 of the present invention, by executing the computer program stored in the memory 32.
The electronic device 30 may also communicate with one or more external devices 34 (e.g., a keyboard, a pointing device, etc.). Such communication may take place through input/output (I/O) interfaces 35. The electronic device 30 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via a network adapter 36. As shown, the network adapter 36 communicates with the other modules of the electronic device 30 via the bus 33. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 30, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, and data backup storage systems.
It should be noted that although several units/modules or sub-units/modules of the electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided so as to be embodied by a plurality of units/modules.
Example 4
The present embodiment provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the steps of the text data filtering method provided in embodiment 1.
More specific examples of the readable storage medium include, but are not limited to: a portable disk, a hard disk, random access memory, read only memory, erasable programmable read only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a possible implementation manner, the present invention can also be implemented in the form of a program product, which includes program code for causing a terminal device to execute the steps of implementing the text data filtering method described in embodiment 1 when the program product runs on the terminal device.
Program code for carrying out the invention may be written in any combination of one or more programming languages, and the program code may be executed entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that these are by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.
Claims (10)
1. A text data filtering method, characterized in that the text data filtering method comprises:
creating a junk text information base, wherein at least one piece of junk text data is stored in the junk text information base;
extracting features of the junk text data to generate junk text feature vectors, and training a junk text prediction model by combining the weight of each feature;
performing feature extraction on target text data to generate a target text feature vector, and inputting the target text feature vector into the junk text prediction model to calculate the probability that the target text data is the junk text data;
judging whether the target text data is junk text data or not according to the probability;
the junk text data comprise junk text publishing time, and the target text data comprise target text publishing time;
extracting the features of the junk text data, and further comprising: converting the junk text release time into numerical representation;
performing feature extraction on the target text data, further comprising: converting the target text release time into numerical representation;
the junk text data comprises junk text content, and the target text data comprises target text content;
and performing feature extraction on the junk text data, wherein the feature extraction comprises the following steps: converting the junk text content into numerical representation;
performing feature extraction on the target text data, including: converting the target text content into a numerical representation;
converting the spam text content into a numerical representation, comprising:
extracting keywords from the spam text content, the keywords being words that are not elegant or relate to sensitive subject matter or words that occur frequently in spam text content;
counting the occurrence frequency of each keyword in the junk text content;
listing the occurrence times of each keyword according to the index sequence of the keywords to form a first space vector, wherein the first space vector is used as the junk text feature vector or the value of partial dimensionality in the junk text feature vector when the junk text feature vector is generated;
converting the target text content into a numerical representation, comprising:
extracting keywords from the target text content;
counting the occurrence frequency of each keyword in the target text content;
and listing the occurrence times of each keyword according to the index sequence of the keywords to form a second space vector, wherein the second space vector is used as the target text feature vector or the value of partial dimension in the target text feature vector when the target text feature vector is generated.
2. The method of text data filtering according to claim 1, wherein converting the spam text publication time to a numerical representation comprises:
dividing a plurality of time periods, and respectively setting a numerical value for each time period;
judging a first time period to which the junk text publishing time belongs, determining a numerical value corresponding to the first time period, wherein the numerical value corresponding to the first time period is used as a value of one dimension of a junk text feature vector when the junk text feature vector is generated, and combining the numerical value with the first space vector to form the junk text feature vector;
converting the target text publishing time into a numerical representation, comprising:
and judging a second time period to which the target text publishing time belongs according to the divided time periods, determining a numerical value corresponding to the second time period, wherein the numerical value corresponding to the second time period is used as a value of one dimension of the target text feature vector when the target text feature vector is generated, and combining the numerical value with the second space vector to form the target text feature vector.
3. The text data filtering method according to claim 1, wherein the weight of each feature is calculated by a ReliefF algorithm, and the spam text prediction model is trained based on the ReliefF algorithm.
4. The text data filtering method according to claim 1, further comprising:
manually checking the target text data which are judged to be the junk text data;
and/or storing the target text data which is judged to be the junk text data by the junk text prediction model or the target text data which is confirmed to be the junk text data through manual verification into the junk text information base.
5. A text data filtering system, comprising: a data unit, a model unit and a judgment unit;
the data unit is used for creating a junk text information base, and at least one piece of junk text data is stored in the junk text information base;
the model unit includes:
the first feature extraction module is used for extracting features of the junk text data;
the first feature vector module is used for generating a junk text feature vector;
the model training module is used for training the junk text prediction model by combining the weight of each feature;
the judging unit includes:
the second feature extraction module is used for extracting features of the target text data;
the second feature vector module is used for generating a target text feature vector;
the probability calculation module is used for inputting the target text feature vector into the junk text prediction model so as to calculate the probability that the target text data is the junk text data, and judging whether the target text data is the junk text data or not according to the probability;
the junk text data comprise junk text publishing time, and the target text data comprise target text publishing time;
the first feature extraction module is further used for converting the junk text release time into a numerical representation;
the second feature extraction module is further used for converting the target text release time into a numerical representation;
the junk text data comprises junk text content, and the target text data comprises target text content;
the first feature extraction module is used for converting the junk text content into a numerical representation;
the second feature extraction module is used for converting the target text content into a numerical representation;
converting the spam text content into a numerical representation, comprising:
extracting keywords from the spam text content, the keywords being words that are not elegant or relate to sensitive subject matter or words that occur frequently in spam text content;
counting the occurrence frequency of each keyword in the junk text content;
listing the occurrence times of each keyword according to the index sequence of the keywords to form a first space vector, wherein the first space vector is used as the junk text feature vector or the value of partial dimensionality in the junk text feature vector when the junk text feature vector is generated;
converting the target text content into a numerical representation, comprising:
extracting keywords from the target text content;
counting the occurrence frequency of each keyword in the target text content;
and listing the occurrence times of each keyword according to the index sequence of the keywords to form a second space vector, wherein the second space vector is used as the target text feature vector or the value of partial dimension in the target text feature vector when the target text feature vector is generated.
6. The text data filtering system of claim 5, wherein converting the spam text publication time to a numerical representation comprises:
dividing a plurality of time periods, and respectively setting a numerical value for each time period;
judging a first time period to which the junk text publishing time belongs, determining a numerical value corresponding to the first time period, wherein the numerical value corresponding to the first time period is used as a value of one dimension of a junk text feature vector when the junk text feature vector is generated, and combining the numerical value with the first space vector to form the junk text feature vector;
converting the target text publishing time into a numerical representation, comprising:
and judging a second time period to which the target text publishing time belongs according to the divided time periods, determining a numerical value corresponding to the second time period, wherein the numerical value corresponding to the second time period is used as a value of one dimension of the target text feature vector when the target text feature vector is generated, and combining the numerical value with the second space vector to form the target text feature vector.
7. The text data filtering system of claim 5, wherein in the model training module, the weight of each feature is calculated by a ReliefF algorithm, and the spam text prediction model is trained based on the ReliefF algorithm.
8. The text data filtering system of claim 5, wherein the text data filtering system further comprises:
the checking unit is used for manually checking the target text data which are judged to be the junk text data;
and/or the storage unit is used for storing the target text data which is judged to be the junk text data by the junk text prediction model or the target text data which is confirmed to be the junk text data through manual verification into the junk text information base.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the text data filtering method of any one of claims 1 to 4 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the text data filtering method according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711449882.XA CN110019763B (en) | 2017-12-27 | 2017-12-27 | Text filtering method, system, equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110019763A | 2019-07-16 |
CN110019763B | 2022-04-12 |
Family
ID=67187050
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711449882.XA Active CN110019763B (en) | 2017-12-27 | 2017-12-27 | Text filtering method, system, equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110019763B (en) |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |