CN118070810B - Text duplicate checking method based on Simhash algorithm - Google Patents
Text duplicate checking method based on Simhash algorithm
- Publication number
- CN118070810B (application CN202410328017.3A)
- Authority
- CN
- China
- Prior art keywords
- word
- semantic
- core
- obtaining
- text data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Abstract
The invention relates to the technical field of text processing, and in particular to a text duplicate checking method based on the Simhash algorithm, comprising the following steps: collecting text data for duplicate checking and obtaining the semantic vector of each word; obtaining the semantic vector and semantic relevance matrix of each sentence from the semantic vectors of its words, and combining these with the number of times each word occurs in the text data to obtain an association core degree; obtaining a core candidate word list from the text data and, combined with the association core degree, a semantic recognition degree; obtaining a semantic recognition correlation matrix from the semantic recognition degrees; obtaining the semantic representative weight of each word in the core candidate word list from the semantic recognition correlation matrix; obtaining the hash value of the text data from the semantic representative weights; and obtaining the hash value of the text to be checked and combining it with the hash value of the text data to obtain the duplicate-checking result. The method improves the precision of text duplicate checking based on the Simhash algorithm.
Description
Technical Field
The application relates to the technical field of text processing, in particular to a text duplicate checking method based on a Simhash algorithm.
Background
Text duplicate checking refers to identifying and comparing the similarity between texts, and is generally used in scenarios such as detecting plagiarism, copyright infringement, and data deduplication. For such tasks, a hash algorithm is typically used to convert each text into a fixed-length hash value, which makes it convenient to compare the similarity between texts and to store the data. The simhash algorithm is a common text duplicate checking method based on locality-sensitive hashing: its core idea is to convert each text into a fixed-length binary code, called a fingerprint, and then judge the similarity between texts by comparing the Hamming distance between their fingerprints.
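As a minimal illustrative sketch of that background idea (not the patent's own method), the following Python code builds a classic simhash-style fingerprint from hypothetical pre-weighted features — the feature names and weights here are made up for illustration — and compares fingerprints by Hamming distance:

```python
import hashlib

def simhash(weighted_features, bits=64):
    """Classic simhash: each feature's hash votes +weight/-weight per bit;
    the sign of each accumulated vote becomes one fingerprint bit."""
    votes = [0] * bits
    for feature, weight in weighted_features.items():
        h = int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two texts sharing most weighted features tend to receive fingerprints with a small Hamming distance, which is what makes the comparison cheap at scale.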
In the simhash algorithm, features are extracted from an article by obtaining its keywords via TF-IDF (term frequency–inverse document frequency); the extracted features are then used to express the whole article as a fixed-length binary code, and similarity is determined by comparing the Hamming distance between the codes of two articles. Although this compresses an article into a fixed-length code, obtaining keywords with TF-IDF considers only word frequency, which weakens the keywords' ability to characterize the article; similarity detection therefore performs poorly and the precision of text duplicate checking is low.
Disclosure of Invention
In order to solve the technical problems, the invention provides a text duplication checking method based on a Simhash algorithm, which aims to solve the existing problems.
The text duplication checking method based on the Simhash algorithm adopts the following technical scheme:
the embodiment of the invention provides a text duplication checking method based on a Simhash algorithm, which comprises the following steps:
Collecting text data for duplicate checking;
Acquiring the semantic vector of each word in the text data; acquiring the semantic relevance matrix of each sentence from the semantic vectors of all words it contains; taking the mean of the semantic vectors of all words in a sentence as the semantic vector of the sentence; obtaining the association core degree of each word in the sentence from the semantic vectors, the semantic relevance matrix, and the number of times the word occurs in the text data; acquiring a core candidate word list from the text data and, combined with the association core degree, the semantic recognition degree of each word in that list; obtaining the semantic recognition probability from the semantic recognition degree of each word in the core candidate word list; obtaining the semantic recognition co-occurrence probability from the semantic recognition degrees and, combined with the semantic recognition probability, the semantic recognition correlation matrix; obtaining the semantic representative weight of each word in the core candidate word list from the semantic recognition correlation matrix; obtaining the core words of the text data from the semantic representative weights, and obtaining the hash code of each core word;
taking the accumulated result of the hash codes of all core words as the hash value of the text data; obtaining the text to be checked and computing its hash value by the same method used for the hash value of the text data; and obtaining the duplicate-checking result from the hash values of the text data and of the text to be checked.
Further, acquiring the semantic vector of each word in the text data includes:
a pre-trained BERT language model is used on the text data to obtain semantic vectors for each word in the text data.
Further, obtaining the semantic relevance matrix of each sentence from the semantic vectors of all words contained in each sentence of the text data includes:
for each sentence in the text data, cosine similarity between the semantic vector of each word in the sentence and the semantic vector of other words is used as each element in the semantic correlation matrix of the sentence.
Further, obtaining the association core degree of each word in a sentence from the semantic vectors, the semantic relevance matrix, and the number of times the word occurs in the text data includes:
for each sentence, calculating the cosine similarity between the semantic vector of the sentence and the semantic vector of the sentence after removing its i-th word; taking the difference between 2 and this cosine similarity as the semantic strippability of the i-th word in the sentence;
recording the number of times the j-th word of the sentence occurs in the text data as the word frequency of the j-th word; calculating the sum of the element in row i, column j of the sentence's semantic relevance matrix and 1; taking the product of this sum and the word frequency of the j-th word; calculating the ratio of this product to the semantic strippability of the i-th word; and taking the sum of these ratios over all words in the sentence as the association core degree of the i-th word in the sentence.
Further, obtaining the core candidate word list according to the text data, and obtaining the semantic recognition degree of each word in it by combining the association core degree, includes:
Arranging all words in the text data in descending order of word frequency to obtain the word-frequency sequence of the text data; recording the list formed by a preset number of leading words in this sequence as the core candidate word list;
For the m-th word in the core candidate word list, calculating its TF-IDF value in the text data as in the simhash algorithm; recording the sentences of the text data that contain the m-th word as its associated sentences; calculating the mean of the association core degrees of the m-th word over all its associated sentences; and taking the product of this mean and the TF-IDF value as the semantic recognition degree of the m-th word in the core candidate word list.
Further, obtaining the semantic recognition probability includes:
And taking the product of the word frequency and the semantic recognition degree of each word in the core candidate word list as the semantic recognition probability of each word in the core candidate word list.
Further, obtaining the semantic recognition co-occurrence probability from the semantic recognition degree of each word in the core candidate word list, and combining it with the semantic recognition probability to obtain the semantic recognition correlation matrix, includes:
calculating the sum of the semantic recognition degree of the t word and the semantic recognition degree of the v word in the core candidate word list; acquiring the number of sentences containing the t word and the v word in the core candidate word list; taking the product of the sum and the number as the semantic recognition co-occurrence probability between the t word and the v word in the core candidate word list;
Calculating the product of the semantic recognition probabilities of the t-th and v-th words in the core candidate word list, recorded as a first product; obtaining the ratio of the semantic recognition co-occurrence probability to the first product; taking the base-2 logarithm of this ratio; adding 1 to the result, recorded as a first sum; taking the product of the first sum and the semantic recognition co-occurrence probability as the semantic recognition correlation between the t-th and v-th words in the core candidate word list; and taking this semantic recognition correlation as the element in row t, column v of the semantic recognition correlation matrix.
Further, obtaining the semantic representative weight of each word in the core candidate word list according to the semantic recognition correlation matrix includes:
Taking the base-2 logarithm of the semantic recognition correlation, recorded as a first logarithm; obtaining the negative of the product of the first logarithm and the semantic recognition correlation;
and taking, for the t-th word, the sum of these negatives over all words in the core candidate word list as the semantic representative weight of the t-th word in the core candidate word list.
Further, obtaining the core words of the text data according to the semantic representative weights and obtaining their hash codes includes:
Arranging all words in the core candidate word list in descending order of semantic representative weight to obtain the semantic representative weight sequence; taking the first preset number of words in this sequence as the core words of the text data; and applying a hash algorithm to each core word to obtain its hash code.
Further, obtaining the duplicate-checking result according to the hash values of the text data and of the text to be checked includes:
Calculating the Hamming distance between the hash value of the text data and the hash value of the text to be checked; if the Hamming distance is smaller than a preset repetition threshold, the text to be checked is judged to be duplicate content.
The invention has at least the following beneficial effects:
According to the method, core words that can represent an article are obtained by analyzing the semantic information in the article and the spatial distribution of its words, and are converted into binary codes through hash coding to calculate the degree of repetition between articles. Specifically, an association core degree is first obtained from the semantic vector of each word in the article's text data, reflecting the importance of each word within its sentence. The differences among the association core degrees of a word that appears several times in the article are then analyzed, and its semantic recognition degree is calculated by combining its word frequency, reflecting the word's ability to characterize the article; the core candidate word list of the article is thereby determined preliminarily. To further extract core words that represent the article's central content, a semantic recognition correlation matrix is obtained from the semantic recognition degrees of the words in the core candidate word list, yielding the semantic representative weights; finally, the core words that represent the article are selected according to the semantic representative weights and converted into binary codes for duplicate comparison. The method thus captures the distribution of semantics across the article and its words and assesses the importance of each word more comprehensively, so that more representative words are selected as core words to participate in the calculation, improving the precision of text duplicate checking with the simhash algorithm.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
Fig. 1 is a step flow chart of a text duplication checking method based on Simhash algorithm provided by the invention;
FIG. 2 is a flow chart of semantic representative weight acquisition.
Detailed Description
In order to further describe the technical means and effects adopted by the invention to achieve its intended aim, the specific implementation, structure, features and effects of the text duplicate checking method based on the Simhash algorithm are described in detail below with reference to the accompanying drawings and preferred embodiments. In the following description, different occurrences of "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The following specifically describes a specific scheme of the text duplication checking method based on the Simhash algorithm provided by the invention with reference to the accompanying drawings.
The invention provides a text duplication checking method based on a Simhash algorithm, in particular to a text duplication checking method based on a Simhash algorithm, referring to fig. 1, comprising the following steps:
and S001, processing the acquired articles to obtain processed text data.
In this embodiment, articles are obtained from the arXiv dataset for academic paper review. Before the text content is processed, word segmentation is generally required; here the jieba Chinese word segmentation tool is used to divide the sentences of the article into individual words. The jieba tool is a well-known technique and is not described in detail in this embodiment.
Since the simhash algorithm determines the keywords representing an article according to the importance of its words, stop words with no practical meaning must be removed in the preprocessing stage. Here, stop words are removed from the articles obtained from the arXiv dataset using the Harbin Institute of Technology (HIT) stopword list, yielding the processed text data. The simhash algorithm itself is a well-known technique and is not repeated here.
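The preprocessing step can be sketched as follows. This is a toy stand-in: the real pipeline uses jieba segmentation and the HIT stopword list, whereas here whitespace tokenization and a tiny made-up English stopword set are assumed purely for illustration:

```python
# Hypothetical mini stopword list (the embodiment uses the HIT list instead).
STOPWORDS = {"the", "a", "of", "is", "and", "to", "in"}

def preprocess(sentences):
    """Tokenize each sentence naively on whitespace and drop stopwords,
    returning one word list per sentence."""
    return [[w for w in s.lower().split() if w not in STOPWORDS]
            for s in sentences]
```

The output — a list of stopword-free word lists, one per sentence — is the "processed text data" consumed by the later steps.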
Step S002, obtaining semantic recognition degree according to semantic vectors of words in the text data, preliminarily determining a core candidate word list, and obtaining a semantic recognition correlation matrix according to the semantic recognition degree; and obtaining semantic representative weights of the words in the core candidate word list according to the semantic recognition correlation matrix.
In the original simhash algorithm, the TF-IDF (term frequency–inverse document frequency) value is typically used as the weight of each word in the article, so that important vocabulary is highlighted as the article's core words. However, determining core words only from word-frequency distribution ignores the spatial and semantic characteristics of how words are distributed in the article. In this embodiment, therefore, semantic representative weights are calculated from the article's semantic information and the distribution characteristics of its words; the specific flow for obtaining the semantic representative weights is shown in fig. 2.
First, the semantic information of each word is acquired, and its semantic recognition degree is then calculated. To obtain the semantic recognition degree of each word in an article, the semantic vector of each word is needed first; here each word in the text data is converted into a semantic vector by an existing pre-trained BERT language model. Specifically, the sentences of the text data are input into a public pre-trained BERT model, and a semantic vector representing the semantic information of each word is obtained from its embedding layer. The pre-trained BERT language model is a well-known technique and is not described further.
For a sentence W = {w_1, w_2, …, w_{N_1}} in the text data, where N_1 is the number of words in the sentence, the semantic relevance matrix A is constructed as follows:

A_{i,j} = cos(Em_{w_i}, Em_{w_j})

where A_{i,j} denotes the element in row i, column j of the semantic relevance matrix of sentence W and reflects the semantic relevance between the i-th and j-th words in the sentence; cos() denotes the cosine similarity function, whose calculation is well known and is not described in detail; Em_{w_i} denotes the semantic vector of the i-th word w_i in sentence W; and Em_{w_j} denotes the semantic vector of the j-th word w_j in sentence W.

The semantic vector of each word in a sentence encodes information such as the word's meaning and grammatical role, so the cosine similarity between two such vectors represents the semantic relevance between the two words: the larger the cosine value, the greater the semantic relevance between them, and vice versa.
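This construction can be sketched in a few lines of Python. Toy two-dimensional vectors stand in for real BERT embeddings (an assumption purely for illustration; actual embeddings are high-dimensional):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def relevance_matrix(vectors):
    """A[i][j] = cos(Em_{w_i}, Em_{w_j}) for the word vectors of one sentence."""
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
```

Identical vectors yield an entry of 1 and orthogonal vectors an entry of 0, matching the interpretation of A_{i,j} as semantic relevance.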
Then, using the semantic relevance matrix A of sentence W, the association core degree of each word in W is calculated as follows:

B_i = Σ_{j=1}^{N_1} ( (A_{i,j} + 1) · F_j ) / q_i

q_i = 2 − cos(Em_W, Em_{W−i})

where B_i denotes the association core degree of the i-th word in sentence W; N_1 denotes the number of words in sentence W; A_{i,j} denotes the element in row i, column j of the semantic relevance matrix of sentence W, i.e. the semantic relevance between the i-th and j-th words; F_j denotes the word frequency of the j-th word in sentence W, word frequency being the number of times the word occurs in the text data; q_i denotes the semantic strippability of the i-th word in sentence W; cos() denotes the cosine similarity function; Em_W denotes the semantic vector of sentence W, obtained by averaging the semantic vectors of all its words; and Em_{W−i} denotes the semantic vector of sentence W after removing its i-th word, likewise obtained by averaging the remaining word vectors.
When the semantic relevance between the i-th word and the other words in the sentence is higher and the corresponding word frequencies are higher, the words associated with the i-th word are more important, i.e. the association core degree of the i-th word in the sentence is higher; otherwise, the associated words are less important and the association core degree is lower. Likewise, the smaller the semantic strippability q_i of the i-th word, the larger its association core degree; conversely, the larger the semantic strippability, the smaller the association core degree.
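A self-contained sketch of the association core degree, following the formula described in the claims (the toy vectors and frequencies are made-up inputs for illustration):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    """Sentence vector = element-wise mean of its word vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def association_core(vectors, freqs, A):
    """B_i = sum_j (A[i][j] + 1) * F_j / q_i, with semantic strippability
    q_i = 2 - cos(Em_W, Em_{W without word i})."""
    em_w = mean_vector(vectors)
    degrees = []
    for i in range(len(vectors)):
        rest = [v for k, v in enumerate(vectors) if k != i]
        q_i = 2 - cosine(em_w, mean_vector(rest))
        degrees.append(sum((A[i][j] + 1) * freqs[j]
                           for j in range(len(vectors))) / q_i)
    return degrees
```

Since A_{i,j} ≥ −1 and cosine similarity is at most 1, every q_i is at least 1 and every B_i is non-negative.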
Thus, the association core degree of each word in each sentence in the article can be obtained.
Since the same word may appear many times in the article, with a different association core degree at each occurrence, the final association core degree of each word must be determined, and the semantic recognition degree of the word further calculated by combining its word frequency. All words in the text data are arranged in descending order of word frequency to obtain the word-frequency sequence of the text data, and the list formed by the first 20% of words in this sequence is taken as the core candidate word list, recorded as T = {T_1, T_2, …, T_{N_2}}, where N_2 denotes the number of words contained in the core candidate word list. For the m-th word in the core candidate word list, the sentences of the text data that contain it are recorded as its associated sentences. The semantic recognition degree is calculated as follows:

D_m = TFIDF_m · (1 / K_m) · Σ_{k=1}^{K_m} B_m^k

where D_m denotes the semantic recognition degree of the m-th word in the core candidate word list of the text data; K_m denotes the number of associated sentences of the m-th word; B_m^k denotes the association core degree of the m-th word in its k-th associated sentence; and TFIDF_m denotes the TF-IDF value of the m-th word over the whole text data, whose calculation in the simhash algorithm is well known and is not described here.
When the mean association core degree of a word in the core candidate word list is larger, the word characterizes the article more strongly and is more representative, i.e. its semantic recognition degree is higher; otherwise, its characterization is weaker and its semantic recognition degree is lower. Similarly, a higher TF-IDF value of a word in the core candidate word list indicates stronger representativeness and a higher semantic recognition degree, while a lower TF-IDF value indicates the opposite.
So far, the semantic recognition degrees of all words in the core candidate word list have been obtained.
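The candidate-list construction and this score can be sketched together as follows; the 20% cut-off is from the embodiment, while the example inputs are hypothetical:

```python
from collections import Counter

def core_candidates(words, top_ratio=0.2):
    """Top 20% of distinct words by frequency form the core candidate list."""
    ranked = [w for w, _ in Counter(words).most_common()]
    k = max(1, int(len(ranked) * top_ratio))
    return ranked[:k]

def semantic_recognition(core_degrees, tfidf):
    """D_m = TF-IDF_m * mean association core degree over the word's
    associated sentences (the sentences that contain it)."""
    return tfidf * sum(core_degrees) / len(core_degrees)
```

`core_degrees` would hold one association core degree per associated sentence, computed as in the previous step.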
Further, the semantic recognition correlation matrix of the text data is calculated from the semantic recognition degrees to determine the final representative words of the text data. So that the final representative words can characterize the article's subject more comprehensively, the semantic recognition correlation between every pair of words in the core candidate word list is calculated by drawing on the idea of pointwise mutual information (PMI), finally yielding the semantic recognition correlation matrix C.
If the final core words were selected from the core candidate word list by semantic recognition degree alone, the chosen representative words might be closely related paraphrases that all express the same thing, so the article's subject could not be represented comprehensively. The semantic recognition correlation between words in the core candidate word list is therefore used to distinguish words with high mutual correlation. The specific calculation of the semantic recognition correlation matrix is as follows:
C_{t,v} = P(T_t, T_v) · ( 1 + log2( P(T_t, T_v) / ( P(T_t) · P(T_v) ) ) )

P(T_t, T_v) = (D_t + D_v) · count(T_t, T_v)

P(T_t) = D_t · count(T_t)

where C_{t,v} denotes the element in row t, column v of the semantic recognition correlation matrix C, i.e. the semantic recognition correlation between the t-th and v-th words in the core candidate word list; P(T_t, T_v) denotes the semantic recognition co-occurrence probability between the t-th and v-th words; P(T_t) and P(T_v) denote the semantic recognition probabilities of the t-th and v-th words; log2() denotes the base-2 logarithm; D_t and D_v denote the semantic recognition degrees of the t-th and v-th words; count(T_t, T_v) denotes the number of sentences containing both the t-th and v-th words; and count(T_t) denotes the word frequency of the t-th word in the core candidate word list.
When the semantic identification co-occurrence probability between two words in the core candidate word list is larger, the stronger the correlation between the two words is indicated, the larger the semantic identification correlation is, otherwise, the weaker the correlation between the two words is indicated, and the smaller the semantic identification correlation is; and when the ratio of the co-occurrence probability of the semantic recognition to the product of the respective semantic recognition probabilities of the two words in the core candidate word list is larger, the more relevant the two words are indicated, namely the more relevant the semantic recognition is, and otherwise, the less relevant the two words are indicated, namely the less relevant the semantic recognition is.
So far, the semantic recognition correlation matrix C reflecting the semantic recognition correlation among all words in the core candidate vocabulary of the text data can be obtained.
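One matrix entry can be sketched as a direct transcription of the formulas in the claims; the numeric inputs in the check below are made up for illustration:

```python
import math

def recognition_correlation(d_t, d_v, count_t, count_v, co_count):
    """C_{t,v} = P(t,v) * (1 + log2(P(t,v) / (P(t) * P(v)))), with
    P(t) = count(t) * D_t and P(t,v) = (D_t + D_v) * co-occurrence count."""
    p_t = count_t * d_t
    p_v = count_v * d_v
    p_tv = (d_t + d_v) * co_count
    return p_tv * (1 + math.log2(p_tv / (p_t * p_v)))
```

With D_t = D_v = 1, word frequencies of 2, and 2 co-occurring sentences, P(t) = P(v) = 2 and P(t,v) = 4, so the log term vanishes and C_{t,v} = 4.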
Next, the semantic representative weight α is calculated from the semantic recognition correlation matrix C as follows:

α_t = Σ_{v=1}^{N_2} ( − C_{t,v} · log2( C_{t,v} ) )

where α_t denotes the semantic representative weight of the t-th word in the core candidate word list; N_2 denotes the number of words contained in the core candidate word list; log2() denotes the base-2 logarithm; and C_{t,v} denotes the element in row t, column v of the semantic recognition correlation matrix C.
When the sum of the semantic recognition correlations between the t-th word and all other words in the core candidate word list is larger, the word characterizes the article less independently and its semantic representative weight is smaller; conversely, the more independently a word characterizes the article, the larger its semantic representative weight.
So far, the semantic representative weight of each word in the core candidate vocabulary can be obtained.
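The entropy-style weight can be sketched per row of C; the toy row used in the check below assumes correlation values in (0, 1), where each summand is positive (values above 1 would contribute negatively):

```python
import math

def representative_weight(c_row):
    """alpha_t = sum_v -C[t][v] * log2(C[t][v]): the smaller a word's
    summed correlation with the other candidates, the larger its weight."""
    return sum(-c * math.log2(c) for c in c_row)
```

For a row [0.5, 0.5], each term is −0.5 · log2(0.5) = 0.5, giving a weight of 1.0.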
Step S003, determining the hash value of the text data through the semantic representative weight of each word in the core candidate word list, and combining the hash value of the text to be checked to obtain a check result.
When the simhash algorithm is adopted for text duplicate checking, chiefly in the face of large amounts of text data, the degree of repetition between two articles can be calculated rapidly. The conventional practice is to compute the TF-IDF value of each word of an article, select the words with higher scores as the article's core words, and map them into corresponding binary codes through a hash algorithm.
To select words that better match the core of the article, all words in the core candidate word list are arranged in descending order of semantic representative weight to obtain the semantic representative weight sequence, and the first 5 words of this sequence are selected as the core words of the text data. A corresponding hash code, here a fixed-size binary code of 64 bits, is then generated for each core word by a hash algorithm, and the accumulated result of the hash codes of all core words is taken as the hash value of the text data. Obtaining hash codes with a hash algorithm is well known and is not described in detail.
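The description says only that the per-word hash codes are "accumulated" into the text's hash value; a simhash-style signed bit-vote is one plausible reading, sketched here under that assumption (SHA-256 truncated to 64 bits stands in for the unspecified hash algorithm):

```python
import hashlib

def word_hash(word, bits=64):
    """64-bit hash code for one core word (SHA-256 truncated; an assumption)."""
    h = int(hashlib.sha256(word.encode("utf-8")).hexdigest(), 16)
    return h & ((1 << bits) - 1)

def text_hash(core_words, bits=64):
    """Accumulate per-word hash codes simhash-style: each bit position
    votes +1/-1 per word, and the sign of the vote becomes the bit."""
    votes = [0] * bits
    for w in core_words:
        h = word_hash(w, bits)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(bits):
        if votes[i] > 0:
            fp |= 1 << i
    return fp
```

With a single core word the vote reduces to that word's own hash code, which gives a quick sanity check of the accumulation.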
The text of the article to be checked is acquired and recorded as the text to be checked, and its hash value is obtained by the same method as the hash value of the text data. Finally, the similarity between the articles is judged by calculating the Hamming distance between the hash value of the text data and the hash value of the text to be checked; if the Hamming distance is smaller than the repetition threshold 3, the content of the article to be checked is considered to repeat the arXiv data set.
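The final comparison reduces to counting the differing bits between the two 64-bit hash values; a small sketch using the repetition threshold of 3 from the description (names are illustrative):

```python
def hamming_distance(h1, h2):
    # number of bit positions in which the two hash values differ
    return bin(h1 ^ h2).count("1")

def is_duplicate(h1, h2, threshold=3):
    # the description treats a distance below the threshold as repeated content
    return hamming_distance(h1, h2) < threshold
```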
It should be noted that the sequence of the embodiments of the present invention is only for description and does not represent their relative merits; the foregoing description is directed to specific embodiments of this specification. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results; in some embodiments, multitasking and parallel processing are also possible and may be advantageous.
In this specification, each embodiment is described in a progressive manner, and the same or similar parts of each embodiment are referred to each other, and each embodiment mainly describes differences from other embodiments.
The above embodiments are only for illustrating the technical solution of the present application, not for limiting it; where the technical solutions described in the foregoing embodiments are modified, or some of their technical features are replaced equivalently, such that the essence of the corresponding technical solutions does not deviate from the scope of the technical solutions of the embodiments of the present application, they are all included in the protection scope of the present application.
Claims (4)
1. A text duplicate checking method based on the Simhash algorithm, characterized by comprising the following steps:
collecting text data for text duplicate checking;
acquiring a semantic vector of each word in the text data; obtaining a semantic correlation matrix of each sentence according to the semantic vectors of all words contained in each sentence in the text data; taking the mean value of the semantic vectors of all words in a sentence as the semantic vector of the sentence; obtaining the association core degree of each word in the sentence according to the semantic vector, the semantic correlation matrix and the number of occurrences of the word in the text data; acquiring a core candidate word list according to the text data, and obtaining the semantic recognition degree of each word in the core candidate word list by combining the association core degree; obtaining a semantic recognition probability according to the semantic recognition degree of each word in the core candidate word list; obtaining a semantic recognition co-occurrence probability according to the semantic recognition degree of each word in the core candidate word list, and obtaining a semantic recognition correlation matrix by combining the semantic recognition probability; obtaining a semantic representative weight of each word in the core candidate word list according to the semantic recognition correlation matrix; obtaining core words of the text data according to the semantic representative weights, and obtaining hash codes of the core words;
taking the accumulated result of the hash codes of all core words as the hash value of the text data; obtaining a text to be checked, and obtaining the hash value of the text to be checked by the same method as the hash value of the text data; obtaining a duplicate checking result of the text to be checked according to the hash value of the text data and the hash value of the text to be checked;
the obtaining the semantic correlation matrix of each sentence according to the semantic vectors of all words contained in each sentence in the text data comprises:
for each sentence in the text data, taking the cosine similarity between the semantic vector of each word in the sentence and the semantic vectors of the other words as the elements of the semantic correlation matrix of the sentence;
the obtaining the association core degree of each word in the sentence according to the semantic vector, the semantic correlation matrix and the number of occurrences of the word in the text data comprises the following steps:
for each sentence, calculating the cosine similarity between the semantic vector of the sentence and the semantic vector of the sentence after the ith word is removed; taking the difference between 2 and the cosine similarity as the semantic strippable degree of the ith word in the sentence;
recording the number of occurrences of the jth word of the sentence in the text data as the word frequency of the jth word in the sentence; calculating the sum of the element in the ith row and jth column of the semantic correlation matrix of the sentence and 1; obtaining the product of the sum and the word frequency of the jth word; calculating the ratio of the product to the semantic strippable degree of the ith word; taking the sum of the ratios of the ith word over all words in the sentence as the association core degree of the ith word in the sentence;
the obtaining the core candidate word list according to the text data and obtaining the semantic recognition degree of each word in the core candidate word list by combining the association core degree comprises the following steps:
arranging all words in the text data in descending order of word frequency to obtain the word frequency sequence of the text data; recording the list formed by a preset number of leading words in the word frequency sequence of the text data as the core candidate word list;
for the mth word in the core candidate word list, calculating the TF-IDF value of the mth word in the text data; recording the sentences containing the mth word in the text data as the associated sentences of the mth word in the core candidate word list; calculating the mean value of the association core degrees of the mth word in all its associated sentences; taking the product of the mean value and the TF-IDF value as the semantic recognition degree of the mth word in the core candidate word list;
the obtaining the semantic recognition probability comprises:
taking the product of the word frequency and the semantic recognition degree of each word in the core candidate word list as the semantic recognition probability of each word in the core candidate word list;
the obtaining the semantic recognition co-occurrence probability according to the semantic recognition degree of each word in the core candidate word list and obtaining the semantic recognition correlation matrix by combining the semantic recognition probability comprises the following steps:
calculating the sum of the semantic recognition degree of the tth word and the semantic recognition degree of the vth word in the core candidate word list; acquiring the number of sentences containing both the tth word and the vth word of the core candidate word list; taking the product of the sum and the number as the semantic recognition co-occurrence probability between the tth word and the vth word in the core candidate word list;
calculating the product of the semantic recognition probability of the tth word and the semantic recognition probability of the vth word in the core candidate word list, and recording it as a first product; acquiring the ratio of the semantic recognition co-occurrence probability to the first product; obtaining the base-2 logarithm of the ratio; obtaining the sum of the result of the logarithm and 1, and recording it as a first sum; taking the product of the first sum and the semantic recognition co-occurrence probability as the semantic recognition correlation between the tth word and the vth word in the core candidate word list; taking the semantic recognition correlation as the element in the tth row and vth column of the semantic recognition correlation matrix;
the obtaining the semantic representative weight of each word in the core candidate word list according to the semantic recognition correlation matrix comprises the following steps:
obtaining the base-2 logarithm of the semantic recognition correlation, and recording it as a first logarithm; obtaining the opposite number of the product of the first logarithm and the semantic recognition correlation;
taking the sum of the opposite numbers of the tth word with respect to all words in the core candidate word list as the semantic representative weight of the tth word in the core candidate word list.
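The per-sentence computation in claim 1 (cosine-similarity correlation matrix, semantic strippable degree, association core degree) can be sketched as follows; a sentence is assumed to contain at least two words, the word vectors are assumed given (e.g. from the BERT model of claim 2), and all names are illustrative:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def association_core_degrees(word_vectors, word_freqs):
    """For each word i of one sentence:
      strip_i  = 2 - cos(sentence vector, sentence vector with word i removed)
      degree_i = sum_j (R[i][j] + 1) * tf_j / strip_i
    where R is the pairwise cosine-similarity matrix and tf_j the word frequency."""
    n = len(word_vectors)
    R = [[cosine(word_vectors[i], word_vectors[j]) for j in range(n)] for i in range(n)]
    sentence_vec = mean_vector(word_vectors)  # sentence vector = mean of word vectors
    degrees = []
    for i in range(n):
        rest = [v for k, v in enumerate(word_vectors) if k != i]
        strip = 2 - cosine(sentence_vec, mean_vector(rest))  # always positive
        degrees.append(sum((R[i][j] + 1) * word_freqs[j] for j in range(n)) / strip)
    return degrees
```

Because cosine similarity lies in [-1, 1], the strippable degree lies in [1, 3] and never divides by zero.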
2. The text duplication checking method based on Simhash algorithm as claimed in claim 1, wherein the obtaining the semantic vector of each word in the text data comprises:
a pre-trained BERT language model is used on the text data to obtain semantic vectors for each word in the text data.
3. The text duplication checking method based on Simhash algorithm as claimed in claim 1, wherein the obtaining the core words of the text data according to the semantic representativeness weights and obtaining the hash codes of the core words includes:
arranging all words in the core candidate word list in descending order of semantic representative weight to obtain a semantic representative weight sequence; taking the first preset number of words in the semantic representative weight sequence as the core words of the text data; applying a hash algorithm to each core word of the text data to obtain the hash code of each core word.
4. The text duplication checking method based on Simhash algorithm as claimed in claim 1, wherein the obtaining the duplication checking result of the text to be checked according to the hash value of the text data and the text to be checked includes:
calculating the Hamming distance between the hash value of the text data and the hash value of the text to be checked; and if the Hamming distance is smaller than a preset repetition threshold, determining that the text to be checked is repeated content.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410328017.3A CN118070810B (en) | 2024-03-21 | 2024-03-21 | Text duplicate checking method based on Simhash algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118070810A CN118070810A (en) | 2024-05-24 |
CN118070810B true CN118070810B (en) | 2024-07-26 |
Family
ID=91098991
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649749A (en) * | 2016-12-26 | 2017-05-10 | 浙江传媒学院 | Chinese voice bit characteristic-based text duplication checking method |
CN107908622A (en) * | 2017-11-22 | 2018-04-13 | 昆明理工大学 | A kind of transcription comparison method based on synonymous conjunctive word |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112926339B (en) * | 2021-03-09 | 2024-02-09 | 北京小米移动软件有限公司 | Text similarity determination method, system, storage medium and electronic equipment |
CN115759082A (en) * | 2022-11-17 | 2023-03-07 | 浙江浙里信征信有限公司 | Text duplicate checking method and device based on improved Simhash algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||