
CN103313248B - Method and device for identifying junk information - Google Patents

Method and device for identifying junk information

Info

Publication number
CN103313248B
CN103313248B CN201310156662.3A CN201310156662A CN103313248B CN 103313248 B CN103313248 B CN 103313248B CN 201310156662 A CN201310156662 A CN 201310156662A CN 103313248 B CN103313248 B CN 103313248B
Authority
CN
China
Prior art keywords
information
identified
probability
behavior
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310156662.3A
Other languages
Chinese (zh)
Other versions
CN103313248A (en
Inventor
陈冬梁
张友明
谢龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiaomi Inc
Original Assignee
Xiaomi Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiaomi Inc
Priority to CN201310156662.3A
Publication of CN103313248A
Application granted
Publication of CN103313248B

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The embodiment of the invention provides a method and a device for identifying junk information, which address the problem of low identification accuracy. The method comprises the following steps: converting information to be identified according to a preset comparison table; extracting keywords from the converted information to be identified, wherein variable character strings in the converted information are identified and variable character strings of the same type are determined to be the same keyword; looking up the junk information probability and the non-junk information probability corresponding to each keyword in a pre-generated first keyword library; and determining whether the information to be identified is junk information according to the junk information probability and the non-junk information probability corresponding to the keywords. First, because the information to be identified is converted, the keywords in it can be identified more accurately, which reduces missed judgments and misjudgments of junk information. Second, because the keywords corresponding to variable character strings are also used in identification, the identification accuracy can be further improved.

Description

Method and device for identifying junk information
Technical Field
The embodiment of the invention relates to the technical field of information identification, in particular to a method and a device for identifying junk information.
Background
Spam has grown along with the internet since its creation, and its forms have multiplied, from the original spam e-mail to the spam messages, shares, and the like found in today's social networks. The network is full of junk information of various kinds, such as advertisements, promotion of illegal activities, pornography, and abuse; this junk information occupies network resources and greatly degrades the user experience.
In order to reduce the influence of spam on users, anti-spam filtering techniques are also being developed. At present, most websites have a spam filtering function, such as filtering by adopting methods of delayed release, manual review, keyword filtering, intelligent identification by using certain algorithms, and the like. Moreover, with the increasing amount of network information, a simple manual identification method cannot meet the requirement, and therefore spam identification tends to be automated and intelligent more and more.
At present, the method for automatically identifying spam mainly comprises the following steps: extracting keywords of the information to be identified according to a preset keyword library, and calculating the probability that the information to be identified is junk information according to the keywords, thereby judging whether the information to be identified is the junk information.
However, to evade review, publishers usually change the form of some words when releasing information. The converted words no longer carry their original meaning and cannot be stored in the keyword library as keywords, so they cannot be recognized as keywords. Automatic identification of junk information by the above method is therefore prone to missed judgments and misjudgments, and its identification accuracy is low.
Disclosure of Invention
The embodiment of the invention provides a method and a device for identifying junk information, which can reduce the missing judgment or the erroneous judgment of the junk information and improve the accuracy of information identification.
In order to solve the above problem, an embodiment of the present invention discloses a method for identifying spam, which is characterized by comprising:
converting information to be identified according to a preset comparison table;
extracting keywords from the converted information to be identified, wherein variable character strings in the converted information to be identified are identified, and the same type of variable character strings are determined as the same keyword;
searching a junk information probability and a non-junk information probability corresponding to the keywords from a pre-generated first keyword library;
and determining whether the information to be identified is junk information or not according to the junk information probability and the non-junk information probability corresponding to the keyword.
Optionally, identifying the variable character strings in the converted information to be identified includes:
matching a regular expression against the converted information to be identified, so as to identify the variable character strings in the converted information to be identified.
Optionally, extracting a keyword from the converted information to be recognized further includes:
judging whether at least one word identical to a keyword in a pre-generated second keyword library exists in the converted information to be identified, wherein the second keyword library comprises at least one keyword;
and if so, determining the at least one word which is the same as a keyword in the pre-generated second keyword library as a keyword of the converted information to be identified.
Optionally, the comparison table comprises: a traditional character and simplified character comparison table, and a special character and common character comparison table;
converting the information to be identified according to a preset comparison table comprises:
acquiring the traditional characters and special characters in the information to be identified;
looking up each traditional character of the information to be identified in the traditional character and simplified character comparison table, and converting the traditional character into a simplified character;
and looking up each special character of the information to be identified in the special character and common character comparison table, and converting the special character into a common character.
Optionally, the first keyword library is generated by:
collecting junk information samples and non-junk information samples;
respectively converting a junk information sample and a non-junk information sample according to a preset comparison table;
extracting key words from the converted junk information samples and non-junk information samples;
calculating the probability of the keywords appearing in the converted spam samples and the probability of the keywords appearing in the converted non-spam samples;
taking the probability of the keyword appearing in the converted spam sample as the spam probability corresponding to the keyword, and taking the probability of the keyword appearing in the converted non-spam sample as the non-spam probability corresponding to the keyword;
and storing the keywords, and the spam probability and the non-spam probability corresponding to the keywords to generate a first keyword library.
Optionally, determining whether the information to be identified is spam according to the spam probability and the non-spam probability corresponding to the keyword includes:
calculating the probability that the information to be identified is junk information according to the junk information probability and the non-junk information probability corresponding to the keyword;
comparing the probability that the information to be identified is junk information with a preset junk information threshold value;
acquiring a behavior record of a user issuing information to be identified, wherein the behavior record comprises: violation or suspicious behavior;
and determining whether the information to be identified is junk information according to the comparison result and the behavior record.
Optionally, determining whether the information to be identified is spam according to the comparison result and the behavior record, including:
when the probability that the information to be identified is the junk information is not smaller than a junk information threshold value, judging whether violation behaviors or suspicious behaviors exist in the behavior records;
if the probability that the information to be identified is the junk information is not smaller than the junk information threshold value and the violation behavior or the suspicious behavior exists in the behavior record, determining that the information to be identified is the junk information, and recording the behavior of issuing the information to be identified as the violation behavior in the behavior record;
if the probability that the information to be identified is the junk information is not smaller than the junk information threshold value, and no violation behavior or suspicious behavior exists in the behavior record, the probability that the information to be identified is the junk information is reduced, the information to be identified is determined to be the non-junk information, and the behavior of issuing the information to be identified at this time is recorded as the suspicious behavior in the behavior record;
when the probability that the information to be identified is the junk information is smaller than a junk information threshold value, judging whether violation behaviors or suspicious behaviors exist in the behavior records;
if the probability that the information to be identified is the junk information is smaller than the junk information threshold value and the violation behavior or the suspicious behavior exists in the behavior record, the probability that the information to be identified is the junk information is increased, the information to be identified is determined to be the junk information, and the behavior of issuing the information to be identified at this time is recorded as the violation behavior in the behavior record;
and if the probability that the information to be identified is the junk information is smaller than the junk information threshold value and no violation behaviors or suspicious behaviors exist in the behavior record, determining that the information to be identified is the non-junk information, and recording the behavior of issuing the information to be identified as the normal behavior in the behavior record.
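The four cases above can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the patented implementation: the names `BehaviorRecord` and `classify` are hypothetical, and because the text does not say by how much the probability is reduced or increased in the two mixed cases, the sketch only records the resulting decision and the label written to the behavior record.

```python
from dataclasses import dataclass, field

# Labels written to the behavior record (names are assumptions for illustration).
VIOLATION, SUSPICIOUS, NORMAL = "violation", "suspicious", "normal"


@dataclass
class BehaviorRecord:
    """Hypothetical per-user history of labelled publishing behaviors."""
    events: list = field(default_factory=list)

    def has_violation_or_suspicious(self) -> bool:
        return any(e in (VIOLATION, SUSPICIOUS) for e in self.events)


def classify(spam_probability: float, threshold: float, record: BehaviorRecord) -> bool:
    """Return True if the message is judged to be junk, and update the record."""
    flagged = record.has_violation_or_suspicious()
    above = spam_probability >= threshold  # "not smaller than" the threshold
    if above and flagged:
        is_junk, label = True, VIOLATION
    elif above:
        # High score but clean history: treated as non-junk, marked suspicious.
        is_junk, label = False, SUSPICIOUS
    elif flagged:
        # Low score but flagged history: treated as junk, marked as a violation.
        is_junk, label = True, VIOLATION
    else:
        is_junk, label = False, NORMAL
    record.events.append(label)
    return is_junk
```

Note that under these rules a user with a clean record gets one high-scoring message through as "suspicious", after which any further message from that user is judged against a flagged history.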
On the other hand, the invention also discloses a device for identifying junk information, which is characterized by comprising:
the information conversion module is used for converting the information to be identified according to a preset comparison table;
the information extraction module is used for extracting keywords from the converted information to be identified; identifying variable character strings in the converted information to be identified, and determining the variable character strings of the same type as the same keyword;
the probability searching module is used for searching the junk information probability and the non-junk information probability corresponding to the keywords from a pre-generated first keyword library;
and the information judgment module is used for judging whether the information to be identified is junk information according to the junk information probability and the non-junk information probability corresponding to the keyword.
Optionally, the information extraction module includes: the identification submodule, used for identifying the variable character strings in the converted information to be identified; and the first determining submodule, used for determining the variable character strings of the same type as the same keyword;
the identification submodule is used for matching a regular expression against the converted information to be identified, so as to identify the variable character strings in the converted information to be identified.
Optionally, the information extraction module includes:
the keyword judgment sub-module is used for judging whether at least one word identical to a keyword in a pre-generated second keyword library exists in the converted information to be identified, and the second keyword library comprises at least one keyword;
and the second determining submodule is used for determining, when the judgment result of the keyword judgment sub-module is that such a word exists, the at least one word which is the same as a keyword in the pre-generated second keyword library as a keyword of the converted information to be identified.
Optionally, the comparison table comprises: a traditional character and simplified character comparison table, and a special character and common character comparison table;
the information conversion module includes:
the acquisition submodule is used for acquiring the traditional characters and special characters in the information to be identified;
the conversion submodule is used for looking up each traditional character of the information to be identified in the traditional character and simplified character comparison table and converting the traditional character into a simplified character; and for looking up each special character of the information to be identified in the special character and common character comparison table and converting the special character into a common character.
Optionally, the apparatus further comprises:
the sample collection module is used for collecting junk information samples and non-junk information samples;
the sample conversion module is used for converting the junk information samples and the non-junk information samples according to a preset comparison table respectively;
the sample extraction module is used for extracting keywords from the converted junk information samples and non-junk information samples;
the probability calculation module is used for calculating the probability of the keyword appearing in the converted junk information sample and the probability of the keyword appearing in the converted non-junk information sample, taking the probability of the keyword appearing in the converted junk information sample as the junk information probability corresponding to the keyword, and taking the probability of the keyword appearing in the converted non-junk information sample as the non-junk information probability corresponding to the keyword;
and the generating module is used for storing the keywords together with the junk information probability and the non-junk information probability corresponding to the keywords, so as to generate the first keyword library.
Optionally, the information determining module includes:
the calculation submodule is used for calculating the probability that the information to be identified is the junk information according to the junk information probability and the non-junk information probability corresponding to the keywords;
the comparison submodule is used for comparing the probability that the information to be identified is the junk information with a preset junk information threshold;
the record obtaining submodule is used for obtaining the behavior record of the user issuing the information to be identified, and the behavior record comprises: violation or suspicious behavior;
and the modification determining submodule is used for determining whether the information to be identified is junk information according to the comparison result and the behavior record.
Optionally, the modification determination sub-module includes:
the first behavior judgment subunit is used for judging whether violation behavior or suspicious behavior exists in the behavior record when the probability that the information to be identified is junk information is not smaller than the junk information threshold value;
the first modification determining subunit is used for, when the judgment result of the first behavior judgment subunit is that such behavior exists, determining that the information to be identified is junk information and recording the behavior of issuing the information to be identified as violation behavior in the behavior record; and, when the judgment result of the first behavior judgment subunit is that no such behavior exists, reducing the probability that the information to be identified is junk information, determining that the information to be identified is non-junk information, and recording the behavior of issuing the information to be identified as suspicious behavior in the behavior record;
the second behavior judgment subunit is used for judging whether violation behavior or suspicious behavior exists in the behavior record when the probability that the information to be identified is junk information is smaller than the junk information threshold value;
the second modification determining subunit is used for, when the judgment result of the second behavior judgment subunit is that such behavior exists, increasing the probability that the information to be identified is junk information, determining that the information to be identified is junk information, and recording the behavior of issuing the information to be identified as violation behavior in the behavior record; and, when the judgment result of the second behavior judgment subunit is that no such behavior exists, determining that the information to be identified is non-junk information, and recording the behavior of issuing the information to be identified as normal behavior in the behavior record.
Compared with the background art, the embodiment of the invention has the following advantages:
firstly, the embodiment of the invention presets a comparison table, converts the information to be identified according to the comparison table, extracts keywords from the converted information to be identified, searches the probability of junk information and the probability of non-junk information corresponding to the keywords according to the obtained keywords, and judges the information to be identified according to the obtained probability. By converting the information to be recognized, the situation that certain keywords cannot be recognized due to the fact that certain words in the information are subjected to form conversion when the information is issued can be avoided, the keywords in the information can be recognized more accurately, and the missing judgment or the misjudgment of the junk information is reduced.
Secondly, the embodiment of the invention can also match the variable character strings in the information to be identified by using the regular expression, and determine the variable character strings of the same type as the same keyword. Because the variable character strings have higher probability of appearing in the spam information, the variable character strings are combined with the corresponding keywords for identification, and the identification accuracy can be further improved.
And thirdly, the embodiment of the invention can be used for identifying by combining with the behavior record of the user, thereby reducing the missing judgment or the misjudgment of the junk information.
Drawings
Fig. 1 is a flowchart of a method for identifying spam according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a method for identifying spam according to a second embodiment of the present invention;
Fig. 3 is a block diagram of an apparatus for identifying spam according to a fourth embodiment of the present invention;
Fig. 4 is a block diagram of an apparatus for identifying spam according to a fifth embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the embodiments of the present invention more comprehensible, embodiments of the present invention are described in detail below with reference to the accompanying drawings and the detailed description.
In the embodiment of the invention, the information to be identified can be converted firstly, then the keyword is extracted from the converted information to be identified, when the keyword is extracted, the variable character strings in the information to be identified can be matched, the variable character strings of the same type are determined as the same keyword, and finally whether the information to be identified is the junk information is judged according to the extracted keyword. The keyword in the information can be more accurately identified through the process, the accuracy of identifying the junk information is improved, the misjudgment rate is reduced, the harm to the user is reduced, and the user experience is enhanced.
Embodiment one:
At present, the method for automatically identifying spam mainly comprises the following steps: extracting keywords of the information to be identified according to a preset keyword library, and calculating the probability that the information to be identified is junk information according to the keywords, thereby judging whether the information to be identified is junk information.
However, to evade review, publishers usually change the form of some words when releasing information. The converted words no longer carry their original meaning and cannot be stored in the keyword library as keywords, so they cannot be recognized as keywords. Automatic identification of junk information by the above method is therefore prone to missed judgments and misjudgments, and its identification accuracy is low.
In order to solve the above problem, an embodiment of the present invention provides a method for identifying spam information, which can solve the above problem by converting information to be identified.
Referring to fig. 1, a flowchart of a method for identifying spam information according to an embodiment of the present invention is shown, where the method may include:
Step 101, converting information to be identified according to a preset comparison table.
To evade review, publishers usually change the form of some words when information is published; for example, some simplified characters are converted into traditional characters, or common characters are converted into special characters. The converted words no longer carry their original meaning and cannot be stored in the keyword library as valid keywords.
Therefore, the embodiment of the invention first converts the information to be identified, so that words whose form was changed when the information was issued can still be recognized as valid keywords.
A look-up table may be set in advance, and conversion is performed according to the look-up table. For example, a comparison table of traditional and simplified characters, a comparison table of special and common characters, a comparison table of homophones, and the like may be provided, which is not limited in the embodiments of the present invention.
Step 102, extracting keywords from the converted information to be identified, wherein variable character strings in the converted information to be identified are identified, and variable character strings of the same type are determined as the same keyword.
After the information to be recognized is converted in the above step 101, keywords may be extracted from the converted information to be recognized, and then a judgment may be performed according to the keywords.
In this embodiment of the present invention, extracting the keyword from the converted information to be recognized may include:
Substep 1021, identifying the variable character strings in the converted information to be identified;
Substep 1022, determining variable character strings of the same type as the same keyword.
Of course, the process of extracting keywords may also include other steps, and the specific process will be discussed in detail in the following embodiments.
Step 103, searching the spam probability and the non-spam probability corresponding to the keywords from a pre-generated first keyword library.
For each keyword, it may correspond to a spam probability and a non-spam probability. In the embodiment of the invention, the keywords can be extracted through the pre-collected samples, the probability of junk information and the probability of non-junk information corresponding to the keywords are calculated, and the keywords and the probability of junk information and the probability of non-junk information corresponding to the keywords are stored in the first keyword library.
Therefore, after the keywords are extracted from the converted information to be identified, the spam probability and the non-spam probability corresponding to the keywords can be directly searched from the first keyword library.
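A first keyword library of this kind can be sketched as a dictionary from keyword to a pair of probabilities. The sketch below is illustrative only: the function names are hypothetical, and it assumes that "the probability of the keyword appearing" means the fraction of samples that contain the keyword, which the text does not state explicitly.

```python
from collections import Counter


def build_keyword_library(junk_samples, normal_samples):
    """Map keyword -> (junk probability, non-junk probability).

    Each sample is a list of keywords already extracted from converted text.
    Assumption: a keyword's probability is the share of samples containing it.
    """
    junk_counts = Counter(kw for sample in junk_samples for kw in set(sample))
    normal_counts = Counter(kw for sample in normal_samples for kw in set(sample))
    library = {}
    for kw in set(junk_counts) | set(normal_counts):
        p_junk = junk_counts[kw] / len(junk_samples) if junk_samples else 0.0
        p_normal = normal_counts[kw] / len(normal_samples) if normal_samples else 0.0
        library[kw] = (p_junk, p_normal)
    return library


def look_up(library, keywords):
    """Return the stored probability pair for each keyword found in the library."""
    return {kw: library[kw] for kw in keywords if kw in library}
```

Because the library is built once from pre-collected samples, identifying a new message only requires dictionary lookups for its extracted keywords.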
Step 104, determining whether the information to be identified is junk information according to the junk information probability and the non-junk information probability corresponding to the keyword.
After obtaining the spam probability and the non-spam probability corresponding to the keyword, it can be determined whether the information to be identified is spam according to the two probabilities, and for the specific determination process, the following embodiments will discuss in detail.
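This part of the description does not give the formula used to combine the per-keyword probabilities. One common choice, shown here purely as an assumption, is a naive Bayes combination computed in log space; the function name `combined_junk_probability`, the prior, and the probability floor are all hypothetical.

```python
import math


def combined_junk_probability(keyword_probs, prior_junk=0.5, floor=1e-6):
    """Combine (junk probability, non-junk probability) pairs, naive Bayes style.

    Assumption: keywords are treated as independent given the class.
    `floor` keeps a zero probability from making the logarithm undefined.
    """
    log_junk = math.log(prior_junk)
    log_normal = math.log(1.0 - prior_junk)
    for p_junk, p_normal in keyword_probs:
        log_junk += math.log(max(p_junk, floor))
        log_normal += math.log(max(p_normal, floor))
    # Normalise relative to the larger term to avoid underflow on long messages.
    largest = max(log_junk, log_normal)
    junk = math.exp(log_junk - largest)
    normal = math.exp(log_normal - largest)
    return junk / (junk + normal)
```

The result is a value between 0 and 1 that can then be compared with a preset junk information threshold.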
The embodiment of the invention presets a comparison table, converts the information to be identified according to the comparison table, extracts keywords from the converted information to be identified, searches the probability of junk information and the probability of non-junk information corresponding to the keywords according to the obtained keywords, and judges the information to be identified according to the obtained probabilities. By converting the information to be recognized, the situation that certain keywords cannot be recognized due to the fact that certain words in the information are subjected to form conversion when the information is issued can be avoided, the keywords in the information can be recognized more accurately, and the missing judgment or the misjudgment of the junk information is reduced.
Embodiment two:
In the second embodiment, the steps of the first embodiment are described in detail.
Referring to fig. 2, a flowchart of a method for identifying spam according to a second embodiment of the present invention is shown, where the method may include:
Step 201, converting the information to be identified according to a preset comparison table.
In the embodiment of the invention, a comparison table can be preset, and the information to be identified is converted according to the comparison table.
For example, for the cases in which simplified characters are converted into traditional characters or common characters are converted into special characters, a traditional character and simplified character comparison table and a special character and common character comparison table may be set, and step 201 may include:
a1, acquiring traditional characters and special characters in the information to be recognized.
For example, suppose the information to be recognized reads "Taobao, add QQ" followed by a nine-digit QQ number, with the "bao" of "Taobao" written as the traditional character "寶" and each digit of the number written as a special character. The traditional character "寶" and the nine special characters are then acquired.
a2, searching the traditional character in the information to be identified from the traditional character and simplified character comparison table, and converting the traditional character into the simplified character.
By searching the traditional character and simplified character comparison table, the traditional character "寶" can be found and then converted into the simplified character "宝" ("bao") according to the correspondence in the table.
a3, searching the special character in the information to be recognized from the special character and common character comparison table, and converting the special character into the common character.
By searching the special character and common character comparison table, each of the nine special characters can be found and converted, according to the correspondence in the table, into the common characters "1", "2", "3", "4", "5", "6", "7", "8", and "9" respectively.
Of course, other forms of conversion may also be present in the information to be identified, such as homophone conversion (for example, replacing the word for "one night" in "one-night affair" with a homophone meaning "one leaf") and conversion to an alternative name or nickname (for example, writing "QQ" as "ball" or "penguin"), and so forth.
Therefore, in the embodiment of the present invention, the comparison table is not limited to the two comparison tables above and may include other types of comparison tables; for example, a homophone comparison table may be used to convert the homophone form "one-leaf affair" back into "one-night affair". The details are not discussed here.
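The table-driven conversion of steps a1 to a3 can be sketched with per-character lookup tables. This is only an illustration: the table contents are tiny stand-ins, the use of circled digits as the "special characters" is an assumption, and `convert` is a hypothetical name.

```python
# Illustrative stand-ins for the comparison tables; real tables would be far larger.
TRADITIONAL_TO_SIMPLIFIED = {"寶": "宝"}
# Assumption: the special characters are the circled digits U+2460..U+2468.
SPECIAL_TO_COMMON = {chr(0x2460 + i): str(i + 1) for i in range(9)}


def convert(text, *tables):
    """Apply every single-character comparison table to the text in one pass."""
    merged = {}
    for table in tables:
        merged.update(table)
    return text.translate(str.maketrans(merged))
```

For instance, `convert("淘寶，加QQ①②③④⑤⑥⑦⑧⑨", TRADITIONAL_TO_SIMPLIFIED, SPECIAL_TO_COMMON)` yields the text with "寶" replaced by "宝" and the circled digits replaced by "123456789". A homophone or nickname table maps multi-character words rather than single characters, so it would need substring replacement instead of `str.translate`.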
Step 202, extracting keywords from the converted information to be identified.
After the information to be identified is converted, keywords can be extracted from the converted information to be identified.
In the embodiment of the present invention, the extracted keywords mainly include two forms: keywords derived from regular words in the information to be recognized and keywords derived from variable character strings in the information to be recognized.
Thus, this step 202 may comprise:
b1, judging whether at least one word identical to the keyword in the pre-generated second keyword library exists in the converted information to be identified.
First, in the embodiment of the present invention, a second keyword library containing at least one keyword may be generated in advance. The second keyword library may be an existing keyword library used in the prior art for identifying junk information; preferably, the number of keywords in such an existing library is further increased, and the enlarged library is used as the second keyword library in the embodiment of the present invention.
The keywords in the second keyword library may be obtained by selecting a corpus and performing word segmentation on it, for example by segmenting the corpus into N-grams.
b2, if such a word exists, determining the at least one word that is identical to a keyword in the pre-generated second keyword library as a keyword of the converted information to be identified.
That is, every word in the converted information to be identified that matches a keyword in the second keyword library is taken as a keyword of the converted information to be identified.
For example, if the converted information to be identified is "pan, plus QQ 123456789", and "pan" and "QQ" are found in the second keyword library, then "pan" and "QQ" may be determined as keywords of the converted information to be identified.
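Steps b1 and b2 amount to a lookup of library keywords in the converted text. A minimal sketch, assuming a plain substring test (the patent does not say how the match is performed) and a hypothetical function name:

```python
def extract_regular_keywords(text: str, keyword_library: set[str]) -> list[str]:
    """Return the library keywords that occur in text, ordered by first position."""
    hits = [(text.find(word), word) for word in keyword_library if word in text]
    return [word for _, word in sorted(hits)]
```

A substring test also matches a keyword that is embedded in a longer word, so a production system would more likely match against segmented words.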
In addition, the converted information to be identified, "pan, plus QQ 123456789", further contains the variable character string "123456789" (a QQ number). Such variable character strings are key points in junk information and have a high probability of appearing in it. However, a variable character string cannot be recognized by ordinary word segmentation; that is, the second keyword library may not contain keywords of the variable character string class, so the variable character string cannot be extracted through steps b1 and b2.
Therefore, the embodiment of the present invention further provides a process of processing the variable character string, where the step 202 further includes:
b3, identifying the variable character string in the converted information to be identified.
In the embodiment of the invention, regular expressions can be matched against the converted information to be identified, so that the variable character strings in it are identified.
Of course, those skilled in the art may also recognize the variable character string by other methods according to practical experience, and the embodiment of the present invention is not limited thereto.
b4, determining the variable character strings of the same type as the same keyword.
After the variable character strings are identified, variable character strings of the same type can be determined as the same keyword. For example, all identified QQ numbers are determined to be the same keyword "_QQ number", all identified mobile phone numbers the same keyword "_mobile phone number", all identified Uniform Resource Locators (URLs) the same keyword "_URL", all identified mailboxes the same keyword "_mailbox", and so on.
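Steps b3 and b4 can be illustrated with a small set of regular expressions. The patterns below are assumptions chosen for illustration (an 11-digit number starting with 1 for mobile phones, 5 to 11 digits following the letters "QQ" for QQ numbers); the patent does not specify any pattern, and the function name is hypothetical.

```python
import re

# Hypothetical patterns, tried in order. Each match is blanked out so that a
# later pattern cannot re-match part of it (for example, digits inside a URL).
VARIABLE_PATTERNS = [
    ("_URL", re.compile(r"https?://\S+")),
    ("_mailbox", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("_QQ number", re.compile(r"(?<=QQ)\s*:?\s*\d{5,11}(?!\d)")),
    ("_mobile phone number", re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")),
]


def extract_variable_keywords(text: str) -> list[str]:
    """Return one type keyword per variable character string found in text."""
    keywords = []
    for keyword, pattern in VARIABLE_PATTERNS:
        text, count = pattern.subn(" ", text)
        keywords.extend([keyword] * count)
    return keywords
```

Every concrete string of a given type yields the same keyword, so three different URLs contribute three occurrences of "_URL" rather than three unrelated keywords.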
Step 203, searching the spam probability and the non-spam probability corresponding to the keywords from a pre-generated first keyword library.
In the embodiment of the invention, a first keyword library can be generated in advance, and a plurality of keywords and the spam probability and the non-spam probability corresponding to each keyword are stored in the first keyword library.
The first keyword library may be generated by the following steps:
c1, collecting junk information samples and non-junk information samples.
c2, respectively converting the junk information samples and the non-junk information samples according to a preset comparison table.
In the embodiment of the present invention, the conversion process for the sample may be similar to the conversion process for the information to be identified. For example, the step c2 may include:
c21, acquiring traditional characters and special characters in the junk information sample and the non-junk information sample respectively;
c22, searching the traditional Chinese characters in the junk information sample and the non-junk information sample from the traditional Chinese character and simple Chinese character comparison table, and converting the traditional Chinese characters into simple Chinese characters;
c23, searching the special characters in the junk information sample and the non-junk information sample from the special character and common character comparison table, and converting the special characters into common characters.
Similarly, homophones and the like in the spam samples and the non-spam samples can be converted, and the embodiment of the invention is not discussed in detail here.
c3, extracting keywords from the converted spam samples and non-spam samples.
The process for extracting keywords from the converted spam samples and non-spam samples may be similar to the process for extracting keywords from the converted information to be identified. For example, the step c3 may include:
c31, judging whether one or more words in the converted spam information sample and non-spam information sample exist in the pre-generated second keyword library, wherein the second keyword library comprises a plurality of keywords;
c32, if the words exist, using the words as the keywords of the converted junk information sample and non-junk information sample;
c33, matching the converted junk information samples and non-junk information samples by using a regular expression, and identifying variable character strings in the converted junk information samples and non-junk information samples;
c34, determining the variable character strings of the same type as the same keyword.
c4, calculating the probability of the keyword appearing in the converted spam sample and the probability of the keyword appearing in the converted non-spam sample.
c5, taking the probability of the keyword appearing in the converted spam samples as the spam probability corresponding to the keyword, and taking the probability of the keyword appearing in the converted non-spam samples as the non-spam probability corresponding to the keyword.
c6, storing the keywords together with their corresponding spam probabilities and non-spam probabilities, thereby generating the first keyword library.
After the keywords are extracted in step 202, the spam probability and the non-spam probability corresponding to the keywords can be directly searched from the first keyword library.
Step 204, calculating the probability that the information to be identified is junk information according to the junk information probability and the non-junk information probability corresponding to the keywords.
The specific calculation process is described in detail in the following embodiment.
Step 205, comparing the probability that the information to be identified is spam with a preset spam threshold.
Step 206, determining whether the information to be identified is junk information according to the comparison result.
In the embodiment of the present invention, a spam threshold (for example, 0.9) may be preset, and after the probability that the information to be identified is spam is obtained, the probability that the information to be identified is spam may be compared with the spam threshold, and then the determination is performed according to the comparison result.
This step 206 may include:
d1, when the probability that the information to be identified is the spam is not less than the preset spam threshold, determining the information to be identified as the spam.
d2, when the probability that the information to be identified is the spam is less than the preset spam threshold value, determining that the information to be identified is the non-spam.
Preferably, after comparing the probability that the information to be identified is spam with the preset spam threshold, the embodiment of the present invention can also judge the information to be identified by combining with the behavior record of the user. In this case, the step 206 may include:
e1, obtaining the behavior record of the user who issues the information to be identified.
e2, judging whether the information to be identified is junk information according to the comparison result and the behavior record.
The behavior record may include violation behaviors or suspicious behaviors. For example, after the comparison result is obtained, whether the information to be identified is spam may be determined in combination with whether the user has previous violation behaviors or suspicious behaviors. The specific judgment process is discussed in detail in the following embodiment.
The embodiment of the invention can match the variable character strings in the information to be identified by using regular expressions and determine the variable character strings of the same type as the same keyword. Because variable character strings have a higher probability of appearing in spam, including their corresponding keywords in the identification can further improve the identification accuracy. In addition, the embodiment of the invention can also take the behavior record of the user into account, thereby reducing missed judgments and misjudgments of junk information.
Example three:
The method for identifying junk information in the embodiment of the invention can comprise the following processes:
1. Establishing a common keyword library (i.e., the second keyword library in the first and second embodiments)
As information keeps changing, a large number of new words, such as "pan bao", "mobile phone", and "gay", appear in it. A keyword library that is relatively old or has a small vocabulary may not contain such words, so these keywords cannot be extracted from the information to be identified, and junk information built around such words is difficult to identify effectively.
Therefore, in the embodiment of the invention, the number of keywords in the current keyword library can be increased so that more effective keywords can be extracted. The keywords may be entered manually or collected automatically once the server is connected to the network (for example, when the server finds that a certain sequence of adjacent characters appears in large numbers in a new corpus or weblog and is not registered in the word library, that sequence may be assumed to be a new keyword and added to the current keyword library).
Preferably, in the embodiment of the present invention, the common keyword library may be established by the following steps:
A1, preparing a corpus.
First, a corpus containing some of the most recently occurring words may be obtained.
A2, performing word segmentation based on character N-grams, and selecting candidate words.
Because the meaning of a single Chinese character is variable and not specific enough, its meaning only becomes specific within a word and a context. Therefore, the first step of processing the corpus is word segmentation, from which keywords are extracted; the meaning of a keyword is more definite than that of a single character, which makes intelligent recognition easier.
For example, for the six-character corpus "中文新词识别" ("Chinese new word recognition"), segmentation into character 2-grams gives: 中文 / 文新 / 新词 / 词识 / 识别. Segmentation into character N-grams for other values of N is similar; typically 2 ≤ N ≤ 4.
After word segmentation, words that already exist in the current keyword library can be removed, and the remaining words are used as candidate words.
A3, counting the word frequency (the number of times the word appears in all corpora) of the candidate word, and adding the word with the word frequency larger than a certain threshold (such as 10 times) into the word bank to be examined.
A4, manual review.
After the word bank to be examined is obtained, the word bank to be examined can be further examined manually, so that the words in the word bank to be examined are more accurate. After the review, the manually reviewed word bank and the current keyword bank may be merged to be a common keyword bank (i.e., the second keyword bank in the first and second embodiments).
Of course, the embodiment of the present invention may also directly combine the word bank to be examined from step A3 with the current keyword library as the common keyword library without manual review, which is not limited in this embodiment of the present invention.
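Steps A2 and A3 can be sketched as follows. This is only an illustration: the function names are hypothetical, each corpus entry is treated as one unbroken run of characters, and punctuation handling is left out.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """All contiguous n-character substrings of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def candidate_words(corpus, known_words, min_n=2, max_n=4, min_freq=10):
    """Count N-grams absent from the current library; keep the frequent ones."""
    counts = Counter()
    for sentence in corpus:
        for n in range(min_n, max_n + 1):
            counts.update(g for g in char_ngrams(sentence, n) if g not in known_words)
    # Only words whose frequency is larger than the threshold go on for review.
    return {word: freq for word, freq in counts.items() if freq > min_freq}
```

The returned dictionary plays the role of the word bank to be examined, which can then be reviewed manually (step A4) or merged directly into the current keyword library.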
2. Establishing a first keyword library
Analysis of a large amount of junk information shows that junk information often evades review by means of character deformation, mixing traditional and simplified characters, homophones, visually similar characters, aliases or nicknames, and the like. For example, when releasing information, "QQ" is replaced by "ball" or "penguin", the digits "123456789" are replaced by special characters that resemble them, and "one-night stand" is replaced by the homophonic "one-leaf stand". Because a replaced word no longer carries its original meaning, it is not a valid phrase and the common keyword library will not contain it; yet users can still read the originally intended meaning from the deformed words, which makes junk information much harder to identify.
Therefore, for the case that the form conversion exists in the current information, some spam samples and non-spam samples in which the form conversion exists can be collected in the embodiment of the invention, and then the samples are analyzed to establish the first keyword library.
Preferably, in the embodiment of the present invention, the first keyword library may be established by the following steps:
B1, collecting junk information samples and non-junk information samples.
B2, converting the junk information samples and the non-junk information samples respectively.
For example, all traditional Chinese characters are converted into simplified characters, special characters used in place of digits are converted into the common characters 123456789, and so on.
B3, extracting keywords from the converted spam samples and non-spam samples.
First, keywords can be extracted through the above-mentioned general keyword library. That is, if one or more words in the spam sample and the non-spam sample exist in the common keyword library, the one or more words can be used as keywords.
In the embodiment of the present invention, the extracted keywords may also be further manually reviewed. Keywords that did not undergo conversion can be stored as common keywords, and keywords that did undergo conversion can be stored as special keywords; special keywords have a relatively high probability of appearing in junk information. For example, since "QQ", "ball", and "penguin" are all treated as the same special keyword "QQ", the probability that this special keyword appears in spam is greater than that of a common keyword.
Secondly, in the embodiment of the invention, the variable character strings in the junk information samples and the non-junk information samples can be processed.
For example, variable character strings such as URLs, mobile phone numbers, QQ numbers, mailboxes, etc. are all very important key points in spam, and the probability of occurrence in spam is much higher than that in non-spam. But because these character strings are not fixed, they cannot be recognized in the ordinary word segmentation process.
The embodiment of the invention can match regular expressions against the junk information samples and the non-junk information samples, thereby finding the character strings that conform to the format of a URL, mobile phone number, QQ number, mailbox, and so on. The identified variable character strings of the same type are then determined as the same keyword.
For example, suppose matching finds several URLs in the spam samples and non-spam samples: "http://www.xxx.com?1123", "http://www.xxx.com?2321", and "http://www.xxx.com?3412". These are different URL addresses, and if each different URL were treated as a separate keyword, the probability of each would be diluted. Therefore, all URLs are treated as one fixed keyword "_URL"; that is, each URL identified counts as one occurrence of the keyword "_URL", and the occurrence count of "_URL" is increased by one.
Similar processing is performed on mobile phone numbers, mailboxes, and QQ numbers: all mobile phone numbers correspond to the keyword "_mobile phone number", all mailboxes to "_mailbox", all QQ numbers to "_QQ number", and so on. These are not discussed in detail herein.
B4, calculating the spam probability and the non-spam probability corresponding to the keywords.
In the embodiment of the invention, the probability of the keyword appearing in the converted junk information sample can be used as the corresponding junk information probability of the keyword, and the probability of the keyword appearing in the converted non-junk information sample can be used as the corresponding non-junk information probability of the keyword.
For example, there are 4000 spam samples and 4000 non-spam samples, and for the keyword "QQ", 200 spam samples in the 4000 spam samples contain the keyword, so that the spam probability of the "QQ" is 5%; and in 4000 non-spam samples, only 2 samples contain the keyword, so that the non-spam probability of the QQ is 0.05%.
B5, storing the keyword and the spam probability and the non-spam probability corresponding to the keyword, and generating a first keyword library.
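Steps B4 and B5 can be sketched as a counting pass over the samples. The sketch assumes, following the 4000-sample example, that a keyword is counted once per sample that contains it and that both sample sets are non-empty; the function name and the tuple layout of the library are hypothetical.

```python
from collections import Counter


def keyword_probabilities(spam_samples, ham_samples):
    """Map each keyword to (spam probability, non-spam probability).

    Each sample is the collection of keywords already extracted from one
    converted message; a keyword counts once per sample that contains it.
    """
    spam_counts = Counter(k for sample in spam_samples for k in set(sample))
    ham_counts = Counter(k for sample in ham_samples for k in set(sample))
    library = {}
    for keyword in spam_counts.keys() | ham_counts.keys():
        library[keyword] = (
            spam_counts[keyword] / len(spam_samples),
            ham_counts[keyword] / len(ham_samples),
        )
    return library
```

With 200 of 4000 spam samples and 2 of 4000 non-spam samples containing "QQ", the entry for "QQ" comes out as 0.05 and 0.0005, matching the 5% and 0.05% in the example.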
3. Extracting keywords in information to be identified
This is similar to the above process of extracting keywords from the spam samples and non-spam samples: the information to be identified is first converted, and keywords are then extracted from the converted information (both keywords found in the common keyword library and keywords derived from variable character strings). For the specific process, refer to the description above; it is not repeated here.
4. Calculating the probability that the information to be identified is junk information
For a piece of information to be identified, before any statistical analysis it can be assumed to be spam with a probability of 50%. Let S (spam) denote junk information and H (healthy) denote non-junk information, let P(S) be the prior probability that the information to be identified is junk information, and let P(H) be the prior probability that it is non-junk information; then:
P(S)=P(H)=50%
Then, the information to be identified is analyzed and found to contain the keyword "QQ". Let W denote the keyword "QQ", and let P(S|W) denote the probability that the information to be identified is spam given that it contains the keyword W.
Then, according to the conditional probability formula (Bayes' theorem), it can be derived that:
P(S|W) = P(W|S)P(S) / (P(W|S)P(S) + P(W|H)P(H))
where P(W|S) is the spam probability corresponding to the keyword W, and P(W|H) is the non-spam probability corresponding to the keyword W.
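The per-keyword calculation can be written out as a short function. This is a sketch assuming the standard Bayes form P(S|W) = P(W|S)P(S) / (P(W|S)P(S) + P(W|H)P(H)) with P(H) = 1 - P(S); the function and parameter names are hypothetical.

```python
def spam_probability_given_keyword(p_w_given_s: float,
                                   p_w_given_h: float,
                                   p_s: float = 0.5) -> float:
    """P(S|W) from the keyword's spam and non-spam probabilities and the prior."""
    p_h = 1.0 - p_s
    spam_term = p_w_given_s * p_s
    ham_term = p_w_given_h * p_h
    # Undefined when the keyword was seen in neither sample set (both terms 0).
    return spam_term / (spam_term + ham_term)
```

With the figures from the "QQ" example (a spam probability of 5% and a non-spam probability of 0.05%) and a prior of 50%, this gives 0.025 / 0.02525, or about 0.990.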
The above calculation is performed for each keyword in the information to be identified; the N keywords (e.g., N = 15) with the highest P(S|W) are then selected, and their joint probability is calculated.
It should be noted that if a keyword appears for the first time and is not in the first keyword library, P(S|W) can be assumed to be 0.4: spam tends to reuse certain fixed words, so a word that has never appeared before is more likely to be a normal word.
The joint probability is the probability that another event occurs given that several events have all occurred. For example, if W1 and W2 are two different keywords that both appear in a certain piece of information, the probability that this piece of information is spam is the joint probability.
Given that W1 and W2 both appear, there are only two possible outcomes: spam (event E1) or non-spam (event E2), as shown in Table 1:
Event   W1          W2          Junk information
E1      Appears     Appears     Yes
E2      Appears     Appears     No
Table 1
The probabilities corresponding to the entries in Table 1 are shown in Table 2:
Event   W1          W2          Junk information
E1      P(S|W1)     P(S|W2)     P(S)
E2      1-P(S|W1)   1-P(S|W2)   1-P(S)
Table 2
If all of these events are assumed to be independent, the probability P(E1) of event E1 occurring and the probability P(E2) of event E2 occurring can be calculated:
P(E1)=P(S|W1)P(S|W2)P(S)
P(E2)=(1-P(S|W1))(1-P(S|W2))(1-P(S))
Given that either E1 or E2 has occurred, the probability P that the information to be identified is spam is:
P = P(E1) / (P(E1) + P(E2))
namely:
P = P(S|W1)P(S|W2)P(S) / (P(S|W1)P(S|W2)P(S) + (1-P(S|W1))(1-P(S|W2))(1-P(S)))
Substituting P(S) = 0.5 yields:
P = P(S|W1)P(S|W2) / (P(S|W1)P(S|W2) + (1-P(S|W1))(1-P(S|W2)))
When P(S|W1) is denoted as P1 and P(S|W2) is denoted as P2, the formula becomes:
P = P1P2 / (P1P2 + (1-P1)(1-P2))    (Formula 1)
Formula 1 above is the calculation formula of the joint probability.
Extending Formula 1 to the case of N keywords gives the final formula for the probability P that the information to be identified is spam:
P = P1P2...PN / (P1P2...PN + (1-P1)(1-P2)...(1-PN))
Finally, this joint probability is taken as the probability that the information to be identified is junk information.
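The joint probability for N keywords can be sketched as below, assuming the combination rule P = P1P2...PN / (P1P2...PN + (1-P1)(1-P2)...(1-PN)) with a prior of 0.5, the top-N selection, and the default of 0.4 for unseen keywords described above. The function and constant names are hypothetical.

```python
import math

UNSEEN_KEYWORD_PROBABILITY = 0.4  # value assumed in the text for unknown keywords


def joint_spam_probability(per_keyword, top_n=15):
    """Combine the top_n largest P(S|W) values into one probability."""
    top = sorted(per_keyword, reverse=True)[:top_n]
    spam_side = math.prod(top)
    ham_side = math.prod(1.0 - p for p in top)
    # With no keywords both products are 1, which returns the prior of 0.5.
    # Undefined if the list contains both a 0.0 and a 1.0 (both sides vanish).
    return spam_side / (spam_side + ham_side)
```

For example, two keywords with P(S|W) of 0.99 and 0.9 combine to 0.891 / 0.892, or about 0.9989, which is above a threshold of 0.9.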
5. Judging whether the information to be identified is junk information
In the embodiment of the present invention, a spam threshold (for example, 0.9) may be set, and then the probability that the information to be identified is spam is compared with the spam threshold. The following two determination methods can be included:
(1) Making a judgment based only on the comparison result
If the probability that the information to be identified is the junk information is not smaller than a preset junk information threshold value, determining that the information to be identified is the junk information; and if the probability that the information to be identified is the junk information is smaller than a preset junk information threshold value, determining that the information to be identified is the non-junk information.
(2) Making a judgment in conjunction with a user's behavioral record
After the probability that the information to be identified is spam is compared with the spam threshold, the behavior record of the user who issues the information to be identified can be further acquired, and whether the information to be identified is spam is judged according to the comparison result and the behavior record.
Preferably, the judging whether the information to be identified is spam according to the comparison result and the behavior record may include:
C1, when the probability that the information to be identified is spam is not less than (i.e., is greater than or equal to) the spam threshold, judging whether a violation behavior or suspicious behavior exists in the behavior record.
If so, go to step C2; if not, go to step C3.
C2, determining that the information to be identified is spam, and recording the behavior of issuing the information to be identified as a violation behavior in the behavior record.
C3, reducing the probability that the information to be identified is spam, determining that the information to be identified is non-spam, and recording the behavior of issuing the information to be identified as suspicious behavior in the behavior record.
That is, if the probability that the information to be identified is spam is not less than the spam threshold but the user's behavior record shows no abnormality, the probability that the information issued this time is spam may be reduced, and this issuing behavior is recorded as suspicious behavior. Here, user behaviors may include posting comments, uploading an avatar, adding friends, and the like; abnormal behaviors include, for example, issuing 100 comments within 1 minute or posting the same comment content on 20 different posts in succession. The recorded content may include the user's ID, the issuing time, the issued content, the probability of being spam, the processing result, and the like. If this situation recurs many times afterwards, action is taken (for example, prohibiting the user from issuing information, prohibiting the user from logging in, or deleting the user account).
C4, when the probability that the information to be identified is spam is smaller than the spam threshold, judging whether a violation behavior or suspicious behavior exists in the behavior record.
If so, go to step C5; if not, go to step C6.
C5, increasing the probability that the information to be identified is spam, determining that the information to be identified is spam, and recording the behavior of issuing the information to be identified as a violation behavior in the behavior record.
That is, if the probability that the information to be identified is spam is below the spam threshold but the user's behavior record shows multiple violation behaviors or suspicious behaviors, the probability that the information to be identified is spam is increased, the information is considered spam, and the issuing behavior is recorded as a violation behavior.
C6, determining that the information to be identified is non-spam, and recording the behavior of issuing the information to be identified as normal behavior in the behavior record.
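The branching in steps C1 to C6 can be sketched as a single function. This is an illustration under stated assumptions: the `BehaviorRecord` class and its string labels are hypothetical, and because the text does not say by how much the probability is reduced or increased, the sketch leaves the probability untouched and only returns the verdict and updates the record.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BehaviorRecord:
    """Hypothetical per-user log; entries are 'violation', 'suspicious' or 'normal'."""
    entries: List[str] = field(default_factory=list)

    def is_flagged(self) -> bool:
        return any(e in ("violation", "suspicious") for e in self.entries)


def judge(probability: float, record: BehaviorRecord, threshold: float = 0.9) -> bool:
    """Return True when the information is judged to be junk information."""
    flagged = record.is_flagged()            # checked before this behavior is logged
    if probability >= threshold:             # C1
        if flagged:                          # C2
            record.entries.append("violation")
            return True
        record.entries.append("suspicious")  # C3
        return False
    if flagged:                              # C4 leading to C5
        record.entries.append("violation")
        return True
    record.entries.append("normal")          # C6
    return False
```

Read literally, the verdict in all four branches follows the behavior record, while the threshold comparison decides how the issuing behavior is logged; a first high-probability message from a clean account is let through but marked suspicious, so a second one is then judged to be spam.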
According to the embodiment of the invention, converting the information to be identified allows keywords to be identified more accurately from the converted information. In addition, processing variable character strings and taking the user's behavior record into account can further improve the accuracy of junk information identification, reduce the misjudgment rate, reduce harm to users, and improve the user experience.
It should be noted that, for simplicity of description, the above method embodiments are described as a series of acts or combinations of acts, but those skilled in the art will recognize that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or concurrently in accordance with the invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the acts involved are not necessarily required by the invention.
Example four:
Referring to fig. 3, a block diagram of an apparatus for identifying spam according to a fourth embodiment of the present invention is shown. The apparatus may include an information conversion module 301, an information extraction module 302, a probability search module 303, and an information judgment module 304, wherein:
the information conversion module 301 is configured to convert information to be identified according to a preset comparison table;
the information extraction module 302 is configured to extract keywords from the converted information to be identified, wherein variable character strings in the converted information to be identified are identified, and variable character strings of the same type are determined as the same keyword;
the probability search module 303 is configured to search a pre-generated first keyword library for the spam probabilities and non-spam probabilities corresponding to the keywords;
the information judgment module 304 is configured to determine whether the information to be identified is spam according to the spam probabilities and non-spam probabilities corresponding to the keywords.
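The division into four modules can be pictured as a small pipeline object whose stages are supplied from outside. The class name, the callable signatures, and the dictionary layout of the keyword library are all hypothetical; the sketch only shows how the output of each module feeds the next.

```python
from typing import Callable, Dict, List, Tuple


class SpamIdentificationApparatus:
    """Wires four pluggable stages together, mirroring modules 301 to 304."""

    def __init__(
        self,
        convert: Callable[[str], str],                       # module 301
        extract: Callable[[str], List[str]],                 # module 302
        keyword_library: Dict[str, Tuple[float, float]],     # searched by module 303
        judge: Callable[[List[Tuple[float, float]]], bool],  # module 304
    ) -> None:
        self.convert = convert
        self.extract = extract
        self.keyword_library = keyword_library
        self.judge = judge

    def identify(self, text: str) -> bool:
        converted = self.convert(text)
        keywords = self.extract(converted)
        # Module 303: look up (spam probability, non-spam probability) pairs.
        pairs = [self.keyword_library[k] for k in keywords if k in self.keyword_library]
        return self.judge(pairs)
```

Passing the stages in as callables keeps each module replaceable, for example swapping in a different extraction strategy without touching the lookup or the judgment.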
The information extraction module 302 may include:
an identification submodule 3021, configured to identify variable character strings in the converted information to be identified;
a first determining submodule 3022, configured to determine variable character strings of the same type as the same keyword. The identification submodule 3021 may match regular expressions against the converted information to be identified to identify the variable character strings in it.
Preferably, the information extraction module 302 may include:
a keyword judgment submodule, configured to judge whether at least one word identical to a keyword in a pre-generated second keyword library exists in the converted information to be identified, the second keyword library comprising at least one keyword;
a second determining submodule, configured to determine, when the judgment result of the keyword judgment submodule is that such a word exists, the at least one word identical to a keyword in the pre-generated second keyword library as a keyword of the converted information to be identified.
Preferably, the comparison table comprises a traditional Chinese character and simple Chinese character comparison table and a special character and common character comparison table, and
the information conversion module includes:
an acquisition submodule, configured to acquire traditional Chinese characters and special characters in the information to be identified;
a conversion submodule, configured to search for the traditional Chinese characters in the information to be identified in the traditional Chinese character and simple Chinese character comparison table and convert them into simple Chinese characters, and to search for the special characters in the information to be identified in the special character and common character comparison table and convert them into common characters.
Preferably, the apparatus further comprises:
a sample collection module, configured to collect junk information samples and non-junk information samples;
a sample conversion module, configured to convert the junk information samples and the non-junk information samples respectively according to the preset comparison table;
a sample extraction module, configured to extract keywords from the converted junk information samples and non-junk information samples;
a probability calculation module, configured to calculate the probability of the keyword appearing in the converted junk information samples and the probability of the keyword appearing in the converted non-junk information samples, to take the former as the junk information probability corresponding to the keyword, and to take the latter as the non-junk information probability corresponding to the keyword;
a generating module, configured to store the keywords together with their corresponding junk information probabilities and non-junk information probabilities, thereby generating the first keyword library.
Preferably, the information determination module includes:
the calculation submodule is used for calculating the probability that the information to be identified is junk information according to the junk information probability and the non-junk information probability corresponding to the keyword;
the comparison submodule is used for comparing the probability that the information to be identified is the junk information with a preset junk information threshold;
a record obtaining sub-module, configured to obtain a behavior record of a user who issues the information to be identified, where the behavior record includes: violation or suspicious behavior;
and the modification determining submodule is used for determining whether the information to be identified is junk information according to the comparison result and the behavior record.
Preferably, the modification determining submodule includes:
a first behavior judging subunit, configured to, when the probability that the information to be identified is junk information is not less than the junk information threshold, judge whether a violation or suspicious behavior exists in the behavior record;
the first modification determining subunit is used for, when the first behavior judging subunit judges that a violation or suspicious behavior exists, determining that the information to be identified is junk information, and recording the behavior of issuing the information to be identified this time as a violation in the behavior record; and, when the first behavior judging subunit judges that no violation or suspicious behavior exists, reducing the probability that the information to be identified is junk information, determining that the information to be identified is non-junk information, and recording the behavior of issuing the information to be identified this time as suspicious behavior in the behavior record;
the second behavior judging subunit is configured to, when the probability that the information to be identified is junk information is smaller than the junk information threshold, judge whether a violation or suspicious behavior exists in the behavior record;
the second modification determining subunit is used for, when the second behavior judging subunit judges that a violation or suspicious behavior exists, increasing the probability that the information to be identified is junk information, determining that the information to be identified is junk information, and recording the behavior of issuing the information to be identified this time as a violation in the behavior record; and, when the second behavior judging subunit judges that no violation or suspicious behavior exists, determining that the information to be identified is non-junk information, and recording the behavior of issuing the information to be identified this time as normal behavior in the behavior record.
In this embodiment of the invention, the information to be identified is first converted, and keywords are then extracted from the converted information. During extraction, the variable character strings in the information to be identified can be matched, and variable character strings of the same type are determined to be the same keyword. Finally, whether the information to be identified is junk information is judged according to the extracted keywords. This process identifies the keywords in the information more accurately, which improves the accuracy of junk information identification, reduces the misjudgment rate, reduces the harm to the user, and improves the user experience.
Example five:
Referring to FIG. 4, a block diagram of an apparatus for identifying spam according to a fifth embodiment of the present invention is shown. The apparatus may include: a sample collection module 401, a sample conversion module 402, a sample extraction module 403, a probability calculation module 404, a generation module 405, an information conversion module 406, an information extraction module 407, a probability lookup module 408, and an information determination module 409.
Wherein,
a sample collection module 401 for collecting spam samples and non-spam samples;
a sample conversion module 402, configured to convert a spam sample and a non-spam sample according to a preset comparison table, respectively;
a sample extraction module 403, configured to extract keywords from the converted spam samples and non-spam samples;
a probability calculating module 404, configured to calculate a probability that the keyword appears in the converted spam sample and a probability that the keyword appears in the converted non-spam sample, take the probability that the keyword appears in the converted spam sample as a spam probability corresponding to the keyword, and take the probability that the keyword appears in the converted non-spam sample as a non-spam probability corresponding to the keyword;
a generating module 405, configured to store the keyword, and a spam probability and a non-spam probability corresponding to the keyword, and generate a first keyword library;
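The work of modules 403 to 405 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the "probability of the keyword appearing" in a sample set is the fraction of samples that contain the keyword; the specification does not fix the estimator in this passage. The function name `build_keyword_library` and the dictionary layout are hypothetical.

```python
from collections import Counter


def build_keyword_library(spam_samples, ham_samples):
    """Build a first keyword library from keyword lists.

    Each sample is the list of keywords already extracted from one
    converted message. The result maps every keyword to a pair
    (spam probability, non-spam probability).
    Assumption: a probability is the share of samples containing the keyword.
    """
    # Count each keyword at most once per sample.
    spam_counts = Counter(kw for sample in spam_samples for kw in set(sample))
    ham_counts = Counter(kw for sample in ham_samples for kw in set(sample))

    library = {}
    for kw in set(spam_counts) | set(ham_counts):
        spam_p = spam_counts[kw] / len(spam_samples) if spam_samples else 0.0
        ham_p = ham_counts[kw] / len(ham_samples) if ham_samples else 0.0
        library[kw] = (spam_p, ham_p)
    return library


if __name__ == "__main__":
    spam = [["loan", "<URL>"], ["loan"]]
    ham = [["meeting"], ["loan", "meeting"]]
    print(build_keyword_library(spam, ham))
```

Storing both probabilities under one key keeps the later lookup by the probability searching module a single dictionary access.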
the information conversion module 406 is used for converting the information to be identified according to a preset comparison table;
The preset comparison table may include a traditional-simplified Chinese character comparison table and a special-common character comparison table. Other kinds of comparison tables (such as a phonetic-to-character table) may also be included, and the embodiment of the present invention is not limited in this respect.
The information conversion module 406 may include:
the obtaining submodule 4061 is used for obtaining traditional Chinese characters and special characters in the information to be identified;
the conversion submodule 4062 is configured to look up the traditional Chinese characters in the information to be identified in the traditional-simplified Chinese character comparison table and convert them into simplified Chinese characters; and to look up the special characters in the information to be identified in the special-common character comparison table and convert them into common characters.
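The table-driven conversion performed by the information conversion module 406 can be sketched as follows. The two tables hold only a handful of made-up entries for illustration; a real comparison table would be far larger, and the names `TRADITIONAL_TO_SIMPLIFIED`, `SPECIAL_TO_COMMON` and `convert` are hypothetical.

```python
# Illustrative excerpt of a traditional-simplified comparison table.
TRADITIONAL_TO_SIMPLIFIED = {"開": "开", "發": "发", "貸": "贷"}

# Illustrative excerpt of a special-common character comparison table
# (circled digits, full-width digits and symbols).
SPECIAL_TO_COMMON = {"①": "1", "３": "3", "８": "8", "＠": "@"}

# One translation table covering both lookups, built once at import time.
_CONVERSION_TABLE = str.maketrans({**TRADITIONAL_TO_SIMPLIFIED, **SPECIAL_TO_COMMON})


def convert(text):
    """Replace every character found in either table; leave the rest as is."""
    return text.translate(_CONVERSION_TABLE)


if __name__ == "__main__":
    print(convert("代開發票①３８"))
```

Because the same `convert` function can be applied to the samples and to the information to be identified, keywords in the library and keywords in incoming messages end up in the same normalized form.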
An information extraction module 407, configured to extract a keyword from the converted information to be identified;
the information extraction module 407 may include:
the identification submodule 4071 is used for identifying the variable character strings in the converted information to be identified;
in the embodiment of the invention, the identification submodule 4071 can match a regular expression against the converted information to be identified, thereby identifying the variable character strings in the converted information to be identified.
A first determination sub-module 4072, configured to determine variable character strings of the same type as the same keyword;
the keyword judgment sub-module 4073 is configured to judge whether at least one word identical to a keyword in a pre-generated second keyword library exists in the converted information to be identified, where the second keyword library includes at least one keyword;
the second determining sub-module 4074 is configured to, when the keyword judgment sub-module 4073 judges that such a word exists, determine the at least one word that is the same as a keyword in the pre-generated second keyword library as a keyword of the converted information to be identified.
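The extraction performed by submodules 4071 to 4074 can be sketched in Python as below. The three variable string types (URL, mobile phone number, long digit run), their regular expressions, the type tokens such as `<URL>`, and the contents of `SECOND_KEYWORD_LIBRARY` are all assumptions chosen for illustration; the specification only requires that variable character strings of the same type map to the same keyword.

```python
import re

# Hypothetical variable string types; each type maps to one keyword token.
VARIABLE_PATTERNS = [
    ("<URL>", re.compile(r"(?:https?://|www\.)[^\s\u4e00-\u9fff]+", re.IGNORECASE)),
    ("<PHONE>", re.compile(r"(?<!\d)1\d{10}(?!\d)")),
    ("<NUMBER>", re.compile(r"\d{5,}")),
]

# Hypothetical second keyword library: fixed words looked up verbatim.
SECOND_KEYWORD_LIBRARY = {"发票", "贷款", "免费"}


def extract_keywords(converted_text):
    """Return type tokens for variable strings plus matched library words."""
    keywords = []
    remaining = converted_text
    for token, pattern in VARIABLE_PATTERNS:
        if pattern.search(remaining):
            # All strings of this type collapse into the same keyword.
            keywords.append(token)
            # Blank out the matches so a later pattern cannot match them again.
            remaining = pattern.sub(" ", remaining)
    # Sorted only to make the output order deterministic.
    keywords.extend(sorted(w for w in SECOND_KEYWORD_LIBRARY if w in remaining))
    return keywords


if __name__ == "__main__":
    print(extract_keywords("代开发票 联系13800138000 http://a.example/x"))
```

Two messages that differ only in the phone number or link they advertise yield the same keyword list, so the probabilities learned for `<PHONE>` or `<URL>` apply to both.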
A probability searching module 408, configured to search a pre-generated first keyword library for spam probabilities and non-spam probabilities corresponding to the keywords;
and the information judging module 409 is configured to judge whether the information to be identified is spam according to the spam probability and the non-spam probability corresponding to the keyword.
The information determining module 409 may include:
a calculating sub-module 4091, configured to calculate, according to the spam probability and the non-spam probability corresponding to the keyword, a probability that the information to be identified is spam;
a comparison sub-module 4092, configured to compare the probability that the information to be identified is spam with a preset spam threshold;
the record obtaining sub-module 4093 is configured to obtain a behavior record of a user issuing information to be identified, where the behavior record may include: violation or suspicious behavior;
and the modification determination sub-module 4094 is used for determining whether the information to be identified is spam according to the comparison result and the behavior record.
The modification determination sub-module 4094 may include:
the first behavior judgment subunit is used for judging whether a violation or suspicious behavior exists in the behavior record when the probability that the information to be identified is junk information is not smaller than the junk information threshold;
the first modification determining subunit is used for, when the first behavior judgment subunit judges that a violation or suspicious behavior exists, determining that the information to be identified is junk information, and recording the behavior of issuing the information to be identified this time as a violation in the behavior record; and, when the first behavior judgment subunit judges that no violation or suspicious behavior exists, reducing the probability that the information to be identified is junk information, determining that the information to be identified is non-junk information, and recording the behavior of issuing the information to be identified this time as suspicious behavior in the behavior record;
the second behavior judgment subunit is used for judging whether a violation or suspicious behavior exists in the behavior record when the probability that the information to be identified is junk information is smaller than the junk information threshold;
the second modification determining subunit is used for, when the second behavior judgment subunit judges that a violation or suspicious behavior exists, increasing the probability that the information to be identified is junk information, determining that the information to be identified is junk information, and recording the behavior of issuing the information to be identified this time as a violation in the behavior record; and, when the second behavior judgment subunit judges that no violation or suspicious behavior exists, determining that the information to be identified is non-junk information, and recording the behavior of issuing the information to be identified this time as normal behavior in the behavior record.
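The four branches handled by the information determining module 409 can be sketched as follows. Several details are assumptions made only to obtain runnable code: the per-keyword probabilities are combined with a naive-Bayes-style ratio under equal priors, a small `floor` value replaces zero probabilities, and the `threshold` and `adjustment` values are arbitrary. The function names and the string labels in the behavior record are likewise hypothetical.

```python
from math import prod


def spam_probability(keywords, library, floor=0.01):
    """Combine per-keyword probabilities into one message-level probability.

    Assumption: naive-Bayes-style ratio with equal priors; keywords that are
    missing from the library are ignored.
    """
    known = [library[k] for k in keywords if k in library]
    spam_like = prod(max(p_spam, floor) for p_spam, _ in known)
    ham_like = prod(max(p_ham, floor) for _, p_ham in known)
    return spam_like / (spam_like + ham_like)


def decide(probability, history, threshold=0.9, adjustment=0.1):
    """Return (is_spam, adjusted probability) and append to the behavior record."""
    has_bad_history = any(entry in ("violation", "suspicious") for entry in history)
    if probability >= threshold:
        if has_bad_history:
            is_spam, label = True, "violation"
        else:
            # No prior record: lower the probability and only flag as suspicious.
            probability = max(0.0, probability - adjustment)
            is_spam, label = False, "suspicious"
    else:
        if has_bad_history:
            # Prior record exists: raise the probability and treat as spam.
            probability = min(1.0, probability + adjustment)
            is_spam, label = True, "violation"
        else:
            is_spam, label = False, "normal"
    history.append(label)
    return is_spam, probability


if __name__ == "__main__":
    record = []
    print(decide(0.95, record), record)
    print(decide(0.50, record), record)
```

In the usage example the first message from a user without any record is let through but marked suspicious, so a second message from the same user is judged to be spam even though its own probability is below the threshold.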
The embodiment of the invention converts the information to be identified, which avoids the situation in which certain keywords cannot be identified because some words in the information were changed in form when the information was issued. The keywords in the information can therefore be identified more accurately, and missed judgments and misjudgments of junk information are reduced. Secondly, the embodiment of the invention can also use a regular expression to match the variable character strings in the information to be identified, and determine variable character strings of the same type to be the same keyword. Because variable character strings appear in spam with a relatively high probability, identifying the information with the keywords corresponding to the variable character strings further improves the identification accuracy. Thirdly, the embodiment of the invention can also take the behavior record of the user into account during identification, which further reduces missed judgments and misjudgments of junk information.
Since the device embodiment is basically similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the corresponding description of the method embodiment.
The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.
A person skilled in the art will readily appreciate that any combination of the above embodiments is feasible, so any such combination is also an embodiment of the present invention; for reasons of space, the combinations are not described one by one here.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The method and the device for identifying spam provided by the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is intended only to help in understanding the method and the core idea of the invention. Meanwhile, a person skilled in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (12)

1. A method for identifying spam, comprising:
converting information to be identified according to a preset comparison table;
extracting keywords from the converted information to be identified, wherein the step of extracting keywords from the converted information to be identified comprises the following steps: identifying variable character strings in the converted information to be identified, and determining the variable character strings of the same type as the same keyword;
searching a spam probability and a non-spam probability corresponding to the keywords from a pre-generated first keyword library;
determining whether the information to be identified is spam according to the spam probability and the non-spam probability corresponding to the keyword, including: calculating the probability that the information to be identified is junk information according to the junk information probability and the non-junk information probability corresponding to the keyword; comparing the probability that the information to be identified is junk information with a preset junk information threshold value; acquiring a behavior record of a user who issues the information to be identified, wherein the behavior record comprises: violation or suspicious behavior; determining whether the information to be identified is junk information according to the comparison result and the behavior record;
wherein the determining whether the information to be identified is spam according to the comparison result and the behavior record comprises: when the probability that the information to be identified is junk information is not smaller than the junk information threshold, judging whether a violation or suspicious behavior exists in the behavior record.
2. The method according to claim 1, wherein the identifying the variable character string in the converted information to be identified comprises:
matching a regular expression against the converted information to be identified to identify the variable character string in the converted information to be identified.
3. The method according to claim 1, wherein the extracting keywords from the converted information to be identified comprises:
judging whether at least one word identical to a keyword in a pre-generated second keyword library exists in the converted information to be identified, wherein the second keyword library comprises at least one keyword;
and if so, determining the at least one word which is the same as a keyword in the pre-generated second keyword library as a keyword of the converted information to be identified.
4. The method of claim 1, wherein the preset comparison table comprises: a traditional-simplified Chinese character comparison table and a special-common character comparison table,
the converting the information to be identified according to the preset comparison table comprises the following steps:
acquiring traditional Chinese characters and special characters in the information to be identified;
looking up the traditional Chinese characters in the information to be identified in the traditional-simplified Chinese character comparison table, and converting the traditional Chinese characters into simplified Chinese characters;
and looking up the special characters in the information to be identified in the special-common character comparison table, and converting the special characters into common characters.
5. The method of claim 1, wherein the first keyword library is generated by:
collecting junk information samples and non-junk information samples;
respectively converting the junk information samples and the non-junk information samples according to the preset comparison table;
extracting keywords from the converted junk information samples and non-junk information samples;
calculating the probability of the keyword appearing in the converted spam sample and the probability of the keyword appearing in the converted non-spam sample;
taking the probability of the keyword appearing in the converted spam sample as the spam probability corresponding to the keyword, and taking the probability of the keyword appearing in the converted non-spam sample as the non-spam probability corresponding to the keyword;
and storing the keywords and the junk information probability and the non-junk information probability corresponding to the keywords to generate a first keyword library.
6. The method according to claim 1, wherein the determining whether the information to be identified is spam according to the comparison result and the behavior record comprises:
if the probability that the information to be identified is junk information is not smaller than the junk information threshold and a violation or suspicious behavior exists in the behavior record, determining that the information to be identified is junk information, and recording the behavior of issuing the information to be identified this time as a violation in the behavior record;
if the probability that the information to be identified is junk information is not smaller than the junk information threshold and no violation or suspicious behavior exists in the behavior record, reducing the probability that the information to be identified is junk information, determining that the information to be identified is non-junk information, and recording the behavior of issuing the information to be identified this time as suspicious behavior in the behavior record;
when the probability that the information to be identified is junk information is smaller than the junk information threshold, judging whether a violation or suspicious behavior exists in the behavior record;
if the probability that the information to be identified is junk information is smaller than the junk information threshold and a violation or suspicious behavior exists in the behavior record, increasing the probability that the information to be identified is junk information, determining that the information to be identified is junk information, and recording the behavior of issuing the information to be identified this time as a violation in the behavior record;
and if the probability that the information to be identified is junk information is smaller than the junk information threshold and no violation or suspicious behavior exists in the behavior record, determining that the information to be identified is non-junk information, and recording the behavior of issuing the information to be identified this time as normal behavior in the behavior record.
7. An apparatus for identifying spam, comprising:
the information conversion module is used for converting the information to be identified according to a preset comparison table;
the information extraction module is used for extracting keywords from the converted information to be identified, wherein the information extraction module comprises: the recognition submodule is used for recognizing the variable character string in the converted information to be recognized; the first determining submodule is used for determining the variable character strings of the same type as the same keyword;
the probability searching module is used for searching the junk information probability and the non-junk information probability corresponding to the keywords from a pre-generated first keyword library;
an information determining module, configured to determine whether the information to be identified is spam according to the spam probability and the non-spam probability corresponding to the keyword, where the information determining module includes: a calculation submodule, used for calculating the probability that the information to be identified is junk information according to the junk information probability and the non-junk information probability corresponding to the keyword;
the comparison submodule is used for comparing the probability that the information to be identified is the junk information with a preset junk information threshold;
a record obtaining sub-module, configured to obtain a behavior record of a user who issues the information to be identified, where the behavior record includes: violation or suspicious behavior;
the modification determining submodule is used for determining whether the information to be identified is junk information according to the comparison result and the behavior record;
wherein the modification determining submodule includes: a first behavior judgment subunit, configured to judge whether a violation or suspicious behavior exists in the behavior record when the probability that the information to be identified is spam is not smaller than the spam threshold.
8. The apparatus of claim 7,
wherein the recognition sub-module matches a regular expression against the converted information to be identified to recognize the variable character string in the converted information to be identified.
9. The apparatus of claim 7, wherein the information extraction module comprises:
the keyword judgment sub-module is used for judging whether at least one word which is the same as a keyword in a pre-generated second keyword library exists in the converted information to be identified, and the second keyword library comprises at least one keyword;
and the second determining submodule is used for, when the keyword judgment sub-module judges that such a word exists, determining the at least one word which is the same as a keyword in the pre-generated second keyword library as a keyword of the converted information to be identified.
10. The apparatus of claim 7, wherein the preset comparison table comprises: a traditional-simplified Chinese character comparison table and a special-common character comparison table,
the information conversion module includes:
the acquisition submodule is used for acquiring traditional Chinese characters and special characters in the information to be identified;
the conversion submodule is used for looking up the traditional Chinese characters in the information to be identified in the traditional-simplified Chinese character comparison table and converting them into simplified Chinese characters; and looking up the special characters in the information to be identified in the special-common character comparison table and converting them into common characters.
11. The apparatus of claim 7, further comprising:
the sample collection module is used for collecting junk information samples and non-junk information samples;
the sample conversion module is used for converting the junk information samples and the non-junk information samples according to the preset comparison table respectively;
the sample extraction module is used for extracting keywords from the converted junk information samples and non-junk information samples;
the probability calculation module is used for calculating the probability of the keyword appearing in the converted junk information sample and the probability of the keyword appearing in the converted non-junk information sample, taking the probability of the keyword appearing in the converted junk information sample as the junk information probability corresponding to the keyword, and taking the probability of the keyword appearing in the converted non-junk information sample as the non-junk information probability corresponding to the keyword;
and the generating module is used for storing the keywords together with the junk information probability and the non-junk information probability corresponding to the keywords, so as to generate a first keyword library.
12. The apparatus of claim 7, wherein the modification determining submodule further comprises:
the first modification determining subunit is used for, when the first behavior judgment subunit judges that a violation or suspicious behavior exists, determining that the information to be identified is junk information, and recording the behavior of issuing the information to be identified this time as a violation in the behavior record; and, when the first behavior judgment subunit judges that no violation or suspicious behavior exists, reducing the probability that the information to be identified is junk information, determining that the information to be identified is non-junk information, and recording the behavior of issuing the information to be identified this time as suspicious behavior in the behavior record;
the second behavior judgment subunit is configured to, when the probability that the information to be identified is spam is smaller than the spam threshold, judge whether a violation or suspicious behavior exists in the behavior record;
the second modification determining subunit is used for, when the second behavior judgment subunit judges that a violation or suspicious behavior exists, increasing the probability that the information to be identified is junk information, determining that the information to be identified is junk information, and recording the behavior of issuing the information to be identified this time as a violation in the behavior record; and, when the second behavior judgment subunit judges that no violation or suspicious behavior exists, determining that the information to be identified is non-junk information, and recording the behavior of issuing the information to be identified this time as normal behavior in the behavior record.
CN201310156662.3A 2013-04-28 2013-04-28 Method and device for identifying junk information Active CN103313248B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310156662.3A CN103313248B (en) 2013-04-28 2013-04-28 Method and device for identifying junk information

Publications (2)

Publication Number Publication Date
CN103313248A CN103313248A (en) 2013-09-18
CN103313248B true CN103313248B (en) 2017-04-12

Family

ID=49137926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310156662.3A Active CN103313248B (en) 2013-04-28 2013-04-28 Method and device for identifying junk information

Country Status (1)

Country Link
CN (1) CN103313248B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808580B (en) * 2014-12-29 2019-08-13 中国移动通信集团公司 A kind of information determination method and equipment based on prior model
CN105808602B (en) * 2014-12-31 2020-04-21 中国移动通信集团公司 Method and device for detecting junk information
CN104765784A (en) * 2015-03-20 2015-07-08 新浪网技术(中国)有限公司 Key words list maintenance method and system
CN105187408A (en) * 2015-08-17 2015-12-23 北京神州绿盟信息安全科技股份有限公司 Network attack detection method and equipment
CN107229638A (en) * 2016-03-24 2017-10-03 北京搜狗科技发展有限公司 A kind of text message processing method and device
CN105873064A (en) * 2016-03-28 2016-08-17 伍文华 Spam identification system and method of internet APP (Application)
CN105898722B (en) * 2016-03-31 2019-07-26 联想(北京)有限公司 A kind of discrimination method, device and the electronic equipment of improper short message
CN106934008B (en) * 2017-02-15 2020-07-21 北京时间股份有限公司 Junk information identification method and device
CN107135494B (en) * 2017-04-24 2020-06-19 北京小米移动软件有限公司 Spam short message identification method and device
CN111092803A (en) * 2018-10-23 2020-05-01 阿里巴巴集团控股有限公司 Message processing method, device, system and storage medium
CN109525951A (en) * 2018-12-03 2019-03-26 中国联合网络通信集团有限公司 Junk short message processing method, device and equipment
CN113420549B (en) * 2021-07-02 2023-06-13 珠海金山数字网络科技有限公司 Abnormal character string identification method and device
CN113591464B (en) * 2021-07-28 2022-06-10 百度在线网络技术(北京)有限公司 Variant text detection method, model training method, device and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102323929A (en) * 2011-08-23 2012-01-18 上海粱江通信技术有限公司 Method for realizing fuzzy matching of Chinese short message with keyword
CN102801859A (en) * 2012-08-03 2012-11-28 陈伟 Method and device for identifying junk short message, and mobile communication terminal with device
CN103024746A (en) * 2012-12-30 2013-04-03 清华大学 System and method for processing spam short messages for telecommunication operator

Also Published As

Publication number Publication date
CN103313248A (en) 2013-09-18

Similar Documents

Publication Publication Date Title
CN103313248B (en) Method and device for identifying junk information
US8161059B2 (en) Method and apparatus for collecting entity aliases
US8554540B2 (en) Topic map based indexing and searching apparatus
CN102053991B (en) Method and system for multi-language document retrieval
CN103546446B (en) Phishing website detection method, device and terminal
EP4040310A1 (en) Image and text data hierarchical classifiers
CN109508458B (en) Legal entity identification method and device
CN108027814B (en) Stop word recognition method and device
CN106776567B (en) Internet big data analysis and extraction method and system
JP2012510654A (en) System and method for matching entities
CN110929125A (en) Search recall method, apparatus, device and storage medium thereof
GB2509773A (en) Automatic genre determination of web content
CN103294778A (en) Method and system for pushing messages
KR101638535B1 (en) Method of detecting issue pattern associated with user search word, server performing the same and storage medium storing the same
CN112149386A (en) Event extraction method, storage medium and server
CN103646119A (en) Method and device for generating user behavior record
CN111563382A (en) Text information acquisition method and device, storage medium and computer equipment
US20210342393A1 (en) Artificial intelligence for content discovery
CN108345694B (en) Document retrieval method and system based on theme database
CN113051384B (en) User portrait extraction method based on dialogue and related device
CN109660621A (en) Content pushing method and service equipment
CN105512270B (en) Method and device for determining related objects
CN109145261B (en) Method and device for generating label
JP6107003B2 (en) Dictionary updating apparatus, speech recognition system, dictionary updating method, speech recognition method, and computer program
CN107577667B (en) Entity word processing method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent of invention or patent application
CB02 Change of applicant information

Address after: 100085 Beijing, Haidian District, Qinghe Street No. 68, Huarun Colorful City shopping center, Phase II, 13th floor

Applicant after: Xiaomi Technology Co., Ltd.

Address before: 100102 Beijing, Wangjing West Road, Juanshi Tiandi Building, Block A, 12th floor

Applicant before: Beijing Xiaomi Technology Co., Ltd.

GR01 Patent grant