Method and system for improved Simhash algorithm in text deduplication
Technical Field
The invention belongs to the technical field of information processing, and particularly relates to a method and a system for an improved Simhash algorithm in text deduplication.
Background
The current state of the art commonly used in industry is as follows:
in terms of removing redundant data, the Simhash algorithm is currently recognized as the best deduplication algorithm. The algorithm is a locality-sensitive hashing algorithm that performs probabilistic dimension reduction on high-dimensional data, maps the data into fixed-length fingerprints with fewer bits, and then compares the fingerprints for similarity to reflect the degree of similarity between the data. Similarity comparison typically uses the Hamming distance or the edit distance. The Simhash algorithm has the advantages of high processing speed and high result accuracy.
Today, the Simhash algorithm is widely applied in fields such as near-duplicate text detection, redundant-data deduplication, and anomaly detection. Zheng Qinghua et al. proposed a multi-Simhash fingerprint algorithm in which similarity is calculated over multiple fingerprint values through a k-dimensional polyhedron, effectively alleviating the problems of a single fingerprint and serious information loss; Chen Bo, Pan Yongbo et al. added a reduction operation to the Simhash algorithm, subtracting a threshold T from the finally merged result sequence string, thereby improving the accuracy of the Simhash algorithm; Ni S., Qian Q. et al. combined the Simhash algorithm with a CNN for malware detection, converting Simhash fingerprints into grayscale images to improve the malware identification rate and performance.
In summary, the problems of the prior art are as follows:
(1) In the prior art, data processing methods have high processing speed but low result accuracy.
(2) The existing Simhash algorithm is deficient in weight calculation, and the proportion of key feature terms is not reflected in the generated hash fingerprint.
(3) The prior art fails to reflect the distribution information of the document feature vocabulary.
The difficulty of solving the technical problems is as follows:
in order to improve the text deduplication effect and accuracy of the Simhash algorithm and remedy its inability to reflect distribution information, the concept of information entropy is introduced: keywords in a document are weighted by entropy weighting, the weight calculation formula is optimized, and keyword distribution information is added to the hash calculation, thereby optimizing the traditional Simhash algorithm; finally, the feasibility and reasonableness of the algorithm are demonstrated through simulation experiments.
The significance of solving the technical problems is as follows:
the algorithm introduces TF-IDF and information entropy, and adds text distribution information by optimizing the weight and threshold calculations in the Simhash algorithm, so that the finally generated fingerprint reflects the proportion of key information; the correlation between fingerprint information and weights is also analyzed. Simulation experiments show that optimizing the weight calculation effectively improves the performance of the Simhash algorithm; the E-Simhash algorithm outperforms the traditional Simhash algorithm in deduplication rate, recall rate, F-value, and other respects, and achieves good results in text deduplication.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an improved method and system for a Simhash algorithm in text deduplication.
The invention is realized in such a way that a method for an improved Simhash algorithm in text deduplication comprises the following steps: weights are obtained by weighting based on the TF-IDF algorithm and information entropy; the feature words are ordered according to their distribution in the document; and the hash generated for each feature word is XORed with the position of that feature word;
after the improved weight calculation, a weight threshold W_t is introduced and text distribution information is added, so that the finally generated fingerprint reflects the proportion of key information; the correlation between fingerprint information and weights is analyzed.
Further, the method of the improved Simhash algorithm in text deduplication specifically comprises the following steps:
step 1, initialization:
determining the Simhash bit number and the f-dimensional vector space according to the size of the data set and the storage cost, and initializing the f-bit binary number S to 0;
step 2, document preprocessing:
performing word segmentation and stop-word removal on the document to form a set of feature terms M{ };
step 3, weight calculation:
calculating the TF-IDF value and the left and right information entropy of each feature term after word segmentation, using the square mean of the TF-IDF value and the information entropy as the final weight of the feature term, and introducing a threshold W_t to prevent document feature distortion;
step 4, hash calculation:
performing hash calculation on the feature terms from step 2, and introducing a position factor that is XORed with the hash to give the final hash value of each feature term, so that the hash value contains the position information of the feature term; the result is recorded as H{ } = hash( );
step 5, weighted accumulation:
combining the weight of each feature term generated in step 3 with its f-bit hash value generated in step 4: for every bit position, the weight is added to the corresponding dimension of the vector V if the hash bit is 1, and subtracted if it is 0;
step 6, compression transformation:
converting each bit of the finally generated secondary fingerprint vector V (dimensions greater than 0 become 1, the others 0) to generate the f-bit hash fingerprint S of the document.
Further, n keywords are extracted from the document, denoted {p_1, p_2, p_3, ..., p_n}, and the weight of each keyword is W = {w_1, w_2, w_3, ..., w_n}; hash values are generated for the n keywords, giving H = {h_1, h_2, h_3, ..., h_n}; after superposition, a secondary fingerprint F = {f_1, f_2, f_3, ..., f_m} is generated, where m is the number of fingerprint bits; finally, whether each f_i in F is greater than 0 determines the generated Simhash fingerprint S;
if there exists a feature word p_k whose weight satisfies
w_k ≫ w_j, j ∈ [1, n], j ≠ k,
then S is determined by p_k.
Further, after the weight threshold is introduced, weights exceeding the threshold W_t are capped at W_t.
further, the information entropy is:
H(X) = -∑_{x_i ∈ X} P(x_i) log₂ P(x_i)
where X represents the information probability space X = {x_1: P(x_1), x_2: P(x_2), ..., x_n: P(x_n)}, and H(X) represents a measure of the uncertainty of the random variable X.
Further, the information entropy includes the left and right information entropy, with the formulas:
E_L(W) = -∑_{a ∈ A} P(aW|W) log₂ P(aW|W)
E_R(W) = -∑_{b ∈ B} P(Wb|W) log₂ P(Wb|W)
where W represents a word, E_L(W) represents the left entropy of the word, P(aW|W) represents the probability of different words a appearing on the left side of W, the variable a ranging over the vocabulary that combines with W; E_R(W) is the right entropy, defined symmetrically.
Further, the entropy weighting calculation method includes:
averaging the left and right information entropy of the feature word, with H_k(w) representing the entropy information amount of the word, H_k(w) = (E_L(w) + E_R(w)) / 2; the entropy factor H_k is added into the weight calculation formula, and the square mean of the two is taken as the word weight: w_k = sqrt((tfidf(t_k, d_j)² + H_k(w)²) / 2).
The more times a feature word t_k appears in document d_j and the fewer documents of the training set contain it, the larger its information amount and the higher its weight.
Another object of the present invention is to provide a control system implementing the method of the improved Simhash algorithm in text deduplication.
It is another object of the present invention to provide a storage medium for removing redundant data, which implements the method of the improved Simhash algorithm in text deduplication.
In summary, the advantages and positive effects of the invention are:
the method has good effect on removing redundant data, and the Simhash algorithm is a local sensitive hash algorithm and can perform probability dimension reduction on high-dimensional data, map the high-dimensional data into fingerprints with fewer bits and fixed, and then perform similarity comparison on the fingerprints to reflect the similarity degree between the data. Where similarity comparisons typically use hamming distance or edit distance. The Simhash algorithm has the advantages of high processing speed and high result accuracy.
Aiming at the defects of the traditional Simhash algorithm in weight calculation and its failure to consider the distribution information of the document feature vocabulary, the invention optimizes the weight calculation: the square mean of TF-IDF and information entropy is used as the weight of a feature word; a weight threshold is introduced in view of the information distortion caused by excessively large individual weights; and on this basis the position information of the feature words is introduced into the hash calculation, thereby improving the deduplication rate and precision of the Simhash algorithm. Simulation experiments prove that the E-Simhash algorithm is superior to the traditional Simhash algorithm in all respects.
Through simulation experiments, the whole-web news data of Sogou Labs is used as the document set. In the deduplication-rate comparison experiment, the result is shown in fig. 4. Modifying the similarity threshold T, the deduplication rates of the E-Simhash algorithm versus the traditional algorithm are 0.833:0.679, 0.751:0.529, 0.687:0.476 and 0.661:0.451 respectively, all superior to the traditional Simhash algorithm, and the deduplication rates show a decreasing trend as the article variation increases; this simulation result is shown in fig. 5. Finally, in the comparison of precision, recall and F-value shown in fig. 6, the E-Simhash algorithm is superior to the traditional Simhash algorithm in precision 0.963:0.818, recall 0.867:0.621 and F1 value 0.912:0.706. The simulation results show that optimizing the weight calculation effectively improves the performance of the Simhash algorithm; the E-Simhash algorithm outperforms the traditional Simhash algorithm in deduplication rate, recall rate, F-value, and other respects, and achieves good results in text deduplication.
Drawings
Fig. 1 is a flowchart of a Simhash algorithm according to an embodiment of the present invention.
Fig. 2 is a diagram illustrating the effect of location on Simhash according to an embodiment of the present invention.
Fig. 3 is a process diagram of the E-Simhash algorithm according to the embodiment of the present invention.
Fig. 4 is a comparison graph of deduplication rates at different Hamming distances according to an embodiment of the present invention.
Fig. 5 is a comparison graph of deduplication rates under different modification thresholds provided by an embodiment of the present invention.
FIG. 6 is a graph comparing overall performance provided by embodiments of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the prior art, data processing methods have high processing speed but low result accuracy.
To solve the above problems, the present invention will be described in detail with reference to specific analysis.
The method for the improved Simhash algorithm in text deduplication provided by the embodiment of the invention comprises the following steps: weights are obtained by weighting based on the TF-IDF algorithm and information entropy; the feature words are ordered according to their distribution in the document; and the hash generated for each feature word is XORed with the position of that feature word;
after the improved weight calculation, a weight threshold W_t is introduced and text distribution information is added, so that the finally generated fingerprint reflects the proportion of key information; the correlation between fingerprint information and weights is analyzed.
The invention is further described below in conjunction with the definition analysis.
1. Analysis of Simhash algorithm
The SimHash algorithm is defined by the principle that, for two given variables x and y, the hash function h satisfies the following equation:
Pr_{h ∈ F}(h(x) = h(y)) = sim(x, y)    (1)
where sim(x, y) is a similarity function; generally, the similarity of the variables x and y is expressed by the Jaccard coefficient, so that sim(x, y) is expressed as:
sim(x, y) = |x ∩ y| / |x ∪ y|    (2)
Here h belongs to a hash function family F, which needs to satisfy the following conditions:
1) if d(x, y) ≤ d_1, then Pr_{h ∈ F}(h(x) = h(y)) ≥ p_1;
2) if d(x, y) ≥ d_2, then Pr_{h ∈ F}(h(x) = h(y)) ≤ p_2.
F is then called a (d_1, d_2, p_1, p_2)-sensitive hash function family, where d(x, y) represents the distance between the variables x and y. Colloquially, if x and y are similar enough, the probability that they map to the same hash value is sufficiently large; otherwise, the probability that the hash values are equal is sufficiently small.
The biggest difference between a traditional hash function and the Simhash function is locality sensitivity: if slight local modifications are made to the input data, a traditional hash function produces a completely different result, whereas the Simhash results remain very similar; the similarity of the fingerprints generated by the Simhash function can therefore express the degree of similarity between the source data.
2. Simhash algorithm flow:
The Simhash algorithm first defines an f-dimensional space, then defines a vector corresponding to each feature in that space, and combines all vectors with their weights in a weighted summation to obtain a sum vector as the result. Finally, the result is further compressed and converted by the following rule: an f-bit signature is obtained in which, if the value of a vector dimension is greater than 0, the corresponding signature bit is set to 1, and otherwise to 0. Through this conversion, the obtained signature retains the sign of the vector's value in each dimension.
The Simhash algorithm flow chart is shown in fig. 1. The Simhash algorithm comprises the following specific steps:
step 1: initialization
And determining a Simhash bit number and an f-dimensional vector space according to the size of the data set and the storage cost, and initializing f-bit binary numbers s to be 0.
Step 2: document pre-processing
This step mainly comprises two parts. The first is word segmentation: finding the feature words of the document, removing stop words, and so on. The second is weight assignment; in the traditional algorithm, weight calculation is generally ignored here and the weight is set to 1.
Step 3: Generating a hash value
Calculate the f-bit hash value of each feature word from step 2 using a traditional hash algorithm, and perform the following operation: for every bit of the hash, add the weight of the feature word to the corresponding dimension of the vector V if the bit is 1, and subtract the weight if the bit is 0.
Step 4: Compression transformation
For the finally generated vector V, each bit is converted: dimensions greater than 0 become 1, and the others become 0.
Step 5: Fingerprint generation: output the final signature S as the fingerprint of the document; similarity is then calculated using the Hamming distance or the edit distance.
Step 6: Distance calculation: similarity calculation in the Simhash algorithm uses the Hamming distance, which measures the similarity between two documents by counting the number of bits in which their fingerprints differ. The greater the Hamming distance, the lower the similarity of the two strings, and vice versa. For binary strings, the Hamming distance of two values can be calculated using an XOR operation.
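For concreteness, the traditional Simhash flow described in the steps above can be sketched in Python as follows (an illustrative sketch only; the function name, the choice of MD5 as the underlying hash, and the 64-bit default are assumptions of this sketch, not prescribed by the invention):

    import hashlib

    def simhash(features, f=64):
        # features: (word, weight) pairs; in the traditional algorithm weight = 1
        v = [0] * f  # f-dimensional accumulator, initialized to 0 (step 1)
        for word, weight in features:
            # f-bit hash of the feature word (step 3)
            h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & ((1 << f) - 1)
            for i in range(f):
                # add the weight where the hash bit is 1, subtract it where it is 0
                v[i] += weight if (h >> i) & 1 else -weight
        # compression (step 4): dimensions greater than 0 become 1, others 0
        return sum(1 << i for i in range(f) if v[i] > 0)

    fingerprint = simhash([("simhash", 1), ("text", 1), ("deduplication", 1)])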
The present invention will be further described with reference to the following examples.
Example 1:
Let a and b be two binary numbers, where a = 00110 and b = 01110. The two binary numbers differ only in the second bit, so Hamming(a, b) = 1. The distance can also be obtained with an XOR operation: a XOR b = 01000, and counting the 1s in the XOR result gives a single 1, so the Hamming distance is 1.
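This XOR-based distance computation can be sketched in Python (illustrative only):

    def hamming(a, b):
        # XOR leaves a 1 exactly where the two fingerprints differ;
        # counting those 1s gives the Hamming distance
        return bin(a ^ b).count("1")

    assert hamming(0b00110, 0b01110) == 1  # the a, b example above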
In weight calculation, the traditional Simhash algorithm usually sets the weight to 1 or to the number of occurrences of the feature word, which easily causes information loss and reduces the accuracy of the final Simhash fingerprint. It can also be seen that the Simhash algorithm does not reflect word-distribution information: adjusting the positions of key feature words does not affect the finally generated Simhash fingerprint.
As shown in fig. 2, when the positions of two keywords are exchanged, the resulting meanings may differ, yet the fingerprints generated by the traditional Simhash algorithm are the same.
Example 2
In order to improve the text deduplication effect and accuracy of the Simhash algorithm and remedy its inability to reflect distribution information, a Simhash algorithm based on information-entropy weighting (E-Simhash for short) is proposed. The algorithm introduces TF-IDF and information entropy, and adds text distribution information by optimizing the weight and threshold calculations in the Simhash algorithm, so that the finally generated fingerprint reflects the proportion of key information; the correlation between fingerprint information and weights is analyzed.
Simulation experiments show that optimizing the weight calculation effectively improves the performance of the Simhash algorithm; the E-Simhash algorithm outperforms the traditional Simhash algorithm in deduplication rate, recall rate, F-value, and other respects, and achieves good results in text deduplication.
In the present invention, (1) term frequency-inverse document frequency includes:
The term frequency-inverse document frequency (TF-IDF) algorithm is a commonly used text-feature weight calculation method. The TF-IDF value of a feature word t_k in document d_j is denoted tfidf(t_k, d_j) and is defined as follows:
Definition 2: the frequency tf(t_k, d_j) of feature word t_k in document d_j is
tf(t_k, d_j) = n_{j,k} / ∑_i n_{j,i}    (3)
where n_{j,k} represents the number of occurrences of feature word t_k in document d_j, and ∑_i n_{j,i} represents the number of all feature words in document d_j.
Definition 3: the inverse document frequency idf(t_k) is a coefficient that weighs the importance of the feature word, defined as:
idf(t_k) = log(|D| / |{j : t_k ∈ d_j}|)    (4)
where |{j : t_k ∈ d_j}| is the number of documents containing the feature word t_k, and |D| is the total number of documents in the corpus.
Definition 4: the TF-IDF function; the term weight of the feature word is defined as:
w_k = tfidf(t_k, d_j) = tf(t_k, d_j) × idf(t_k)    (5)
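A minimal sketch of definitions 2-4 (the names and the list-of-tokens document representation are illustrative assumptions, not part of the patent):

    import math

    def tf_idf(term, doc, corpus):
        # doc: list of feature words of one document; corpus: list of such documents
        tf = doc.count(term) / len(doc)                   # equation (3): n_{j,k} / sum_i n_{j,i}
        df = sum(1 for d in corpus if term in d)          # |{j : t_k in d_j}|
        idf = math.log(len(corpus) / df) if df else 0.0   # equation (4)
        return tf * idf                                   # equation (5)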
in the present invention, (2) the information entropy includes:
the entropy of information represents a measure of the uncertainty of the result before a random event occurs, and the amount of information one gets from the event after the random event occurs.
According to the definition of information entropy:
H(X) = -∑_{x_i ∈ X} P(x_i) log₂ P(x_i)    (6)
where X represents the information probability space X = {x_1: P(x_1), x_2: P(x_2), ..., x_n: P(x_n)}, and H(X) represents a measure of the uncertainty of the random variable X.
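Equation (6) translates directly into code (a minimal sketch):

    import math

    def entropy(probabilities):
        # H(X) = -sum P(x_i) * log2 P(x_i); terms with P(x_i) = 0 contribute nothing
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    print(entropy([0.5, 0.5]))  # 1.0 bit: maximal uncertainty over two outcomes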
In the invention, (3) left and right information entropy
Left-right entropy refers to the entropy of the left boundary and the entropy of the right boundary of a multi-word expression. The formulas for the left and right entropy are as follows:
E_L(W) = -∑_{a ∈ A} P(aW|W) log₂ P(aW|W)    (7)
E_R(W) = -∑_{b ∈ B} P(Wb|W) log₂ P(Wb|W)    (8)
where W represents a word, E_L(W) represents the left entropy of the word, P(aW|W) represents the probability of different words a appearing on the left side of W, the variable a ranging over the vocabulary that combines with W; E_R(W) is the right entropy, defined symmetrically.
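Left and right entropy can be estimated from a tokenized corpus roughly as follows (a sketch; the neighbor-counting scheme is one reading of equations (7)-(8), not the patent's verbatim procedure):

    import math
    from collections import Counter

    def left_right_entropy(word, tokenized_docs):
        left, right = Counter(), Counter()
        for tokens in tokenized_docs:
            for i, t in enumerate(tokens):
                if t == word:
                    if i > 0:
                        left[tokens[i - 1]] += 1   # word a appearing to the left of W
                    if i < len(tokens) - 1:
                        right[tokens[i + 1]] += 1  # word b appearing to the right of W

        def h(counter):
            total = sum(counter.values())
            if total == 0:
                return 0.0
            return -sum(c / total * math.log2(c / total) for c in counter.values())

        return h(left), h(right)  # (E_L(W), E_R(W))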
In the present invention, (4) the entropy weighting calculation method includes:
the invention adopts an entropy weighting calculation method
The entropy of the left and right information of the feature word is averaged. By Hk(w) to represent the amount of entropy information for the word. Entropy factor HkAdding the weight value into a weight value calculation formula, taking the square average of the two as the word weight, and taking the following steps:
the physical meaning of the above formula is: characteristic word tkIn document djThe more times of occurrence in the training set, the more the feature word appears in the training setThe fewer documents of (a), and the greater the amount of information thereof, the higher the weight thereof.
In the present invention, the Simhash algorithm (E-Simhash) based on entropy weighting specifically includes:
Weights are obtained by weighting with information entropy on the basis of the TF-IDF algorithm; the feature words are ordered according to their distribution in the document; and the hash generated for each feature word is XORed with its position.
However, after the improved weight calculation, some feature-word weights may become too large owing to factors such as an incomplete training set, which ultimately reduces precision; for this reason the weight threshold W_t is introduced. The following demonstrates the problem caused by uneven weights.
Let the n keywords extracted from a document be {p_1, p_2, p_3, ..., p_n} and the weight of each keyword W = {w_1, w_2, w_3, ..., w_n}. Hash values are generated for the n keywords, giving H = {h_1, h_2, h_3, ..., h_n}; after superposition, a secondary fingerprint F = {f_1, f_2, f_3, ..., f_m} is generated, where m is the number of fingerprint bits; finally, whether each f_i in F is greater than 0 yields the Simhash fingerprint S.
If there exists a certain feature word p_k whose weight satisfies
w_k ≫ w_j, j ∈ [1, n], j ≠ k    (11)
then S is mainly determined by p_k. Proof:
Let h_i = {a_i1, a_i2, ..., a_im}, where a_ij is a binary variable; mapping each bit to ±1 (a 1 stays 1 and a 0 becomes -1), the j-th dimension of the secondary fingerprint is
f_j = ∑_{i=1}^{n} w_i · a_ij    (12)
Extracting w_k, there is
f_j = w_k (a_kj + ∑_{i≠k} (w_i / w_k) · a_ij)    (13)
Since w_k ≫ w_j, each w_i / w_k ≈ 0, therefore:
∑_{i≠k} (w_i / w_k) · a_ij ≈ 0    (14)
so at this time:
f_j ≈ w_k · a_kj    (15)
Finally, F is mainly correlated with p_k, which completes the proof.
The above proof also reflects the effect of the weights on the Simhash fingerprint.
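The effect is easy to check numerically with the simhash() sketch given earlier (an illustrative demonstration, not part of the patent):

    # one dominant weight drives the sign of every accumulator dimension:
    # 100*(±1) plus three terms of ±1 always keeps the sign of the dominant term
    features = [("dominant", 100), ("minor_a", 1), ("minor_b", 1), ("minor_c", 1)]
    print(simhash(features) == simhash([("dominant", 100)]))  # True: S is determined by p_k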
After the weight threshold is introduced, the weight is calculated as shown in equation (16): weights exceeding the threshold are capped, i.e.
w_k = W_t if w_k > W_t, and w_k otherwise    (16)
in summary, the flow of the E-Simhash algorithm is shown in FIG. 3.
The E-Simhash algorithm differs from the traditional Simhash algorithm in three points: first, information entropy is introduced on the basis of TF-IDF to calculate the feature-word weights, and the square mean of the two is used as the final weight of the feature word; second, to avoid the fingerprint distortion caused by excessively high weights, a weight threshold is introduced, calculated as in equation (16); third, when the feature-word hash is generated it is XORed with the position of the feature word, so that the hash contains the position-distribution information of the document.
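A compact sketch of the resulting E-Simhash fingerprint generation (illustrative only; encoding the position factor as the word's rank in the document and using MD5 as the base hash are assumptions of this sketch, and the weights are assumed to come from entropy_weight() above):

    import hashlib

    def e_simhash(weighted_features, f=64):
        # weighted_features: (word, weight) pairs ordered by position in the document
        v = [0] * f
        mask = (1 << f) - 1
        for position, (word, weight) in enumerate(weighted_features, start=1):
            h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & mask
            h ^= position  # XOR the position factor into the hash (third difference)
            for i in range(f):
                v[i] += weight if (h >> i) & 1 else -weight
        return sum(1 << i for i in range(f) if v[i] > 0)  # sign compression as before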
The present invention is further described below with reference to specific experiments and simulations.
Simulation experiment and analysis: the invention mainly simulates a real application scenario to verify whether the performance of the E-Simhash algorithm is superior to that of the traditional Simhash algorithm.
Experimental environment and data set:
The experimental environment was deployed on a desktop computer with the following machine parameters:
Table 1: Experimental environment parameters
The data set comes from the 2012 edition of the whole-web news data of Sogou Labs, consisting of categorized news in nearly 20 columns from multiple news sites; items with fewer than 800 characters were eliminated, and 1565 news articles were randomly selected from the categorized news for the subsequent experiments.
First, a number of news articles are randomly selected from the 1565 according to a modification proportion and subjected to random operations such as modification, deletion, transposition and replacement, with the similarity between the modified article and the original controlled to a certain threshold T, generating a sample set to be tested; the traditional Simhash algorithm is then compared with the algorithm of this patent on that sample set, and the relevant experimental indexes are counted.
Analysis of Experimental results
In the experimental results, four indexes are commonly used for evaluation: the deduplication rate, precision, recall, and F-value. The deduplication rate is the ratio of the number of correctly classified samples to the total number of samples; in this experiment it is the ratio of the number of articles correctly predicted as homologous to the total number of articles.
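For reference, the standard form of the last three indexes can be sketched as follows (the patent gives no code; tp, fp, and fn denote correctly detected, falsely detected, and missed homologous articles):

    def evaluation(tp, fp, fn):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1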
The invention is further described below in connection with specific experiments.
Experiment 1: deduplication rate comparison:
1162 news articles are randomly selected from the 1565 and randomly modified; different Hamming distances are selected and the accuracy of the two algorithms is compared. In this test T = 15%, i.e., each article retains no more than 15% modification; the fingerprint length is 128 bits and the word-weight threshold W_t = 90. The results are shown in fig. 4 below.
The experimental results show that the E-Simhash algorithm has a higher deduplication rate when the Hamming distance is greater than 2. In practical applications the Hamming distance is generally about 10, so the deduplication effect of the E-Simhash algorithm is better.
Experiment 2: comparison under modified thresholds T:
In this experiment the similarity threshold T of the text is modified to 5%, 10%, 15% and 20%, with the Hamming distance set to 10 (i.e., two texts are judged similar when their distance is below 10), and the deduplication rates of the two algorithms are compared. As shown in fig. 5, the deduplication rates of the E-Simhash algorithm versus the traditional Simhash algorithm are 0.833:0.679, 0.751:0.529, 0.687:0.476 and 0.661:0.451 respectively, and all show a decreasing trend as the article modification increases. The experimental results show that under different modification thresholds T the E-Simhash algorithm is superior to the traditional Simhash algorithm.
Experiment 3: precision, recall and F-value comparison:
In this experiment, an article is randomly selected from the news set and randomly modified while guaranteeing 90% similarity with the original article, and the precision, recall and F1 value of the traditional Simhash-fingerprint-based algorithm and the E-Simhash algorithm are compared, with the Hamming distance set to 10. The experiment was performed 100 times and the average taken as the final result, shown in fig. 6. According to the experimental data, the E-Simhash algorithm is superior to the traditional Simhash algorithm in precision 0.963:0.818, recall 0.867:0.621 and F1 value 0.912:0.706. The results show that the precision, recall and F-value of the E-Simhash algorithm are greatly improved over the ordinary Simhash algorithm, which proves its superiority.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.