CN116226681B

CN116226681B - Text similarity judging method and device, computer equipment and storage medium

Info

Publication number: CN116226681B
Application number: CN202310151312.1A
Authority: CN
Inventors: 马俊霖
Original assignee: Beijing Maxtech Co ltd
Current assignee: Beijing Maxtech Co ltd
Priority date: 2023-02-22
Filing date: 2023-02-22
Publication date: 2023-11-28
Anticipated expiration: 2043-02-22
Also published as: CN116226681A

Abstract

The application discloses a text similarity judging method and device, computer equipment and a storage medium. The method comprises the following steps: acquiring a target text and a comparison text set and cleaning the data; segmenting words; extracting key words and splicing them to generate spliced texts; calculating simhash values and MD5 fingerprints of the target spliced text and all comparison spliced texts; bucketing all comparison spliced texts according to text length; determining the bucket corresponding to the target spliced text according to its text length; calculating the Hamming distance between the target spliced text and each comparison spliced text in the corresponding bucket; and, if the Hamming distance between a comparison spliced text and the target spliced text is smaller than a preset Hamming distance threshold, judging the comparison spliced text to be similar to the target spliced text. The application reduces the number of comparisons and improves comparison efficiency; it also prevents the target text from being judged similar to a longer text that merely quotes the target text while having entirely different main content, which improves judgment accuracy.

Description

Text similarity judging method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of information processing technologies, and in particular, to a text similarity determination method, a text similarity determination device, a computer device, and a storage medium.

Background

With the advent of the big data era, the volume of data is growing rapidly and occupies more and more space, which creates serious storage problems. Studies have found that redundant data accounts for more than sixty percent of stored data, and this proportion is expected to keep increasing. Redundant data reduces the efficiency with which users retrieve and query data, wastes a large amount of storage resources, and produces piles of identical or near-identical retrieval results that users do not want to see. Data crawled from the network by data mining developers faces the same problems of repetition and redundancy. Document similarity detection and deduplication have therefore become important research subjects at home and abroad.

The earliest text deduplication technique extracted a passage of the text content, computed cosine values, and then judged text similarity; its obvious disadvantages were low efficiency and inaccurate judgments. An improved similarity judging method appeared later: the text is segmented into words, the TF-IDF values are obtained, the top-ranked words are selected according to the length of the text, the words are sorted and combined, an MD5 fingerprint is generated, and finally similar items are aggregated in es. Compared with the earlier method, efficiency is clearly improved and resource usage is small, but accuracy is not noticeably improved.

Disclosure of Invention

Based on the above, a text similarity judging method, a device, a computer device and a storage medium are provided for solving the technical problem that the accuracy of the existing similarity judging method is low.

In order to achieve the above object, the present application provides the following technical solutions:

In a first aspect, a text similarity determination method includes:

S1, acquiring a target text and a comparison text set, and performing data cleaning on the target text and on each comparison text in the comparison text set;

S2, segmenting the target text and each comparison text in the comparison text set into words, extracting the key words of the target text and splicing them to generate a target spliced text, and extracting the key words of each comparison text in the comparison text set and splicing them to generate a comparison spliced text;

S3, calculating the simhash values and MD5 fingerprints of the target spliced text and of all comparison spliced texts;

S4, bucketing all comparison spliced texts according to text length, and storing the simhash values and MD5 fingerprints of all comparison spliced texts in redis by bucket;

S5, determining the bucket corresponding to the target spliced text according to the text length of the target spliced text;

S6, acquiring all comparison spliced texts in the bucket corresponding to the target spliced text, and calculating the Hamming distance between the target spliced text and each of them;

S7, if the Hamming distance between a comparison spliced text and the target spliced text is smaller than a preset Hamming distance threshold, judging that the comparison spliced text is similar to the target spliced text, and assigning the MD5 fingerprint of the comparison spliced text to the target spliced text.

Optionally, step S1 further includes:

performing summary extraction on the cleaned target text and on each cleaned comparison text in the comparison text set;

and simplifying the target text and each comparison text in the comparison text set by using a relation tree.

Further optionally, the summary extraction and the word segmentation are performed using HanLP.

Optionally, the key words include subjects, predicates, objects, and regional words.

Optionally, step S2 further includes:

after word segmentation of the target text or of each comparison text in the comparison text set is completed, removing the meaningless words.

Optionally, the preset hamming distance threshold is 5.

Optionally, step S7 further includes:

if the Hamming distance between the comparison spliced text and the target spliced text is not smaller than the preset Hamming distance threshold, the simhash value and the MD5 fingerprint of the target spliced text are stored under a corresponding bucket in redis.

In a second aspect, a text similarity determination apparatus includes:

the text acquisition module is used for acquiring a target text and a comparison text set, and performing data cleaning on the target text and on each comparison text in the comparison text set;

the text word segmentation module is used for segmenting the target text and each comparison text in the comparison text set into words, extracting the key words of the target text and splicing them to generate a target spliced text, and extracting the key words of each comparison text in the comparison text set and splicing them to generate a comparison spliced text;

the computing module is used for calculating the simhash values and MD5 fingerprints of the target spliced text and of all comparison spliced texts;

the data bucketing module is used for bucketing all comparison spliced texts according to text length, and storing the simhash values and MD5 fingerprints of all comparison spliced texts in redis by bucket;

the judging module is used for determining the bucket corresponding to the target spliced text according to the text length of the target spliced text;

the Hamming distance calculation module is used for acquiring all comparison spliced texts in the bucket corresponding to the target spliced text and calculating the Hamming distance between the target spliced text and each of them;

and the similarity judging module is used for judging that a comparison spliced text is similar to the target spliced text if the Hamming distance between them is smaller than the preset Hamming distance threshold, and assigning the MD5 fingerprint of the comparison spliced text to the target spliced text.

In a third aspect, a computer device comprises a memory storing a computer program and a processor implementing the steps of the method of any of the first aspects when the computer program is executed.

In a fourth aspect, a computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of the first aspects.

The application has at least the following beneficial effects:

In the text similarity judging method provided by the embodiment of the application, the target text and the comparison text set are cleaned and segmented into words, key words are extracted and spliced to generate spliced texts, and the simhash value and MD5 fingerprint of each spliced text are calculated. The comparison spliced texts are bucketed according to text length, the bucket corresponding to the target spliced text is determined according to its text length, and the Hamming distance between the target spliced text and each comparison spliced text in that bucket is calculated; if the Hamming distance between a comparison spliced text and the target spliced text is smaller than the preset Hamming distance threshold, the comparison spliced text is judged to be similar to the target spliced text. Because all comparison spliced texts are bucketed by text length before similarity is judged by Hamming distance, the target spliced text is compared only with the comparison spliced texts in its own bucket. On the one hand, this reduces the number of comparisons and improves comparison efficiency; on the other hand, it avoids comparing the target text with all texts, so the target text is not judged similar to a longer text that merely quotes it and whose main content is different, which improves judgment accuracy.

Drawings

FIG. 1 is a flow chart of a text similarity determination method according to an embodiment of the present application;

FIG. 2 is a simplified flow chart of a text similarity determination method according to an embodiment of the present application;

FIG. 3 is a block diagram of a text similarity determination apparatus according to an embodiment of the present application;

FIG. 4 is an internal structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

In one embodiment, as shown in fig. 1, there is provided a text similarity determination method, including the steps of:

S1, acquiring a target text and a comparison text set, and performing data cleaning on the target text and on each comparison text in the comparison text set.

After each text is input, it is first cleaned: special symbols, spaces, line breaks and other content that would affect word segmentation are removed.
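The cleaning step can be sketched as follows. This is only an illustration: the description does not list which symbols count as "special", so the character classes below and the name `clean_text` are assumptions, and the sketch targets Chinese text, where removing all whitespace is harmless.

```python
import re

# Assumption: whitespace covers spaces, tabs and line breaks (including the
# full-width space), and a "special symbol" is anything that is neither a
# word character nor common Chinese/Latin punctuation.
_WHITESPACE = re.compile(r"\s+")
_SPECIAL = re.compile(r"[^\w，。！？、；：,.!?;:]")


def clean_text(text: str) -> str:
    """Strip whitespace and special symbols before word segmentation."""
    text = _WHITESPACE.sub("", text)
    return _SPECIAL.sub("", text)
```

For text in languages that separate words with spaces, whitespace would need to be collapsed rather than removed.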

Further, step S1 further includes:

HanLP is used to extract a summary from the cleaned data based on the key information of the text, and a relation tree is used to simplify the text. The relation tree mainly records the correspondence between near-synonyms and between words with the same meaning.
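The simplification by synonym correspondence might look like the sketch below. The description does not specify the structure of the relation tree, so a flat lookup table stands in for it here; the table entries and the name `simplify` are hypothetical.

```python
from typing import Dict

# Hypothetical synonym table: each variant maps to one canonical word.
# A flat dict is used here in place of the relation tree.
SYNONYMS: Dict[str, str] = {
    "帝都": "北京",
    "首都北京": "北京",
}


def simplify(text: str, synonyms: Dict[str, str] = SYNONYMS) -> str:
    """Replace each known variant with its canonical word."""
    # Longer variants are replaced first so that a short variant cannot
    # break up a longer one that contains it.
    for variant in sorted(synonyms, key=len, reverse=True):
        text = text.replace(variant, synonyms[variant])
    return text
```

Normalising synonyms before hashing means that two texts differing only in word choice are more likely to produce the same spliced text.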

S2, segmenting the target text and each comparison text in the comparison text set into words, extracting the key words of the target text and splicing them to generate a target spliced text, and extracting the key words of each comparison text in the comparison text set and splicing them to generate a comparison spliced text.

In addition, after word segmentation of the target text or of each comparison text is completed, the meaningless words are removed. That is, HanLP is used to perform word segmentation; after segmentation, meaningless adverbs, interjections and the like are removed, and the key words such as subjects, predicates, objects and regional words are then taken out and spliced to generate the spliced text.

In other words, each original text undergoes word segmentation and key-word merging, so spliced texts and original texts correspond one to one: each original text has a corresponding spliced text.
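The filtering and splicing can be illustrated on tokens that have already been segmented and tagged. The sketch below does not call HanLP; it assumes the segmenter returns `(word, tag)` pairs, and it approximates "subjects, predicates, objects and regional words" with part-of-speech tags, which is a simplification. The tag names and `splice_key_words` are assumptions.

```python
from typing import Iterable, Tuple

# Assumed tag set: "n" noun, "v" verb, "ns" place name, "d" adverb,
# "e" interjection, "t" time word. A real segmenter may use other tags.
KEEP_TAGS = {"n", "v", "ns"}


def splice_key_words(tokens: Iterable[Tuple[str, str]]) -> str:
    """Keep words whose tag is in KEEP_TAGS and join them in order."""
    return "".join(word for word, tag in tokens if tag in KEEP_TAGS)


tokens = [("北京", "ns"), ("今天", "t"), ("竟然", "d"),
          ("下", "v"), ("暴雨", "n"), ("啊", "e")]
spliced = splice_key_words(tokens)
```

Identifying true subjects, predicates and objects would require syntactic analysis rather than tag filtering.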

S3, respectively calculating simhash values and MD5 fingerprints of the target spliced text and all comparison spliced texts.
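A minimal sketch of step S3 follows. The description does not state which features or weights feed the simhash, so this version assumes character bigrams with equal weight and a 64-bit fingerprint; the helper names are hypothetical.

```python
import hashlib

BITS = 64  # assumed fingerprint width


def _hash64(feature: str) -> int:
    """Map a feature to a 64-bit integer via the first 8 bytes of its MD5."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str) -> int:
    """Equal-weight simhash over character bigrams (an assumed feature set)."""
    features = [text[i:i + 2] for i in range(max(len(text) - 1, 1))]
    counts = [0] * BITS
    for feature in features:
        h = _hash64(feature)
        for bit in range(BITS):
            # Each feature votes +1 or -1 on every bit position.
            counts[bit] += 1 if (h >> bit) & 1 else -1
    value = 0
    for bit, count in enumerate(counts):
        if count > 0:
            value |= 1 << bit
    return value


def md5_fingerprint(text: str) -> str:
    """Hex MD5 digest of the spliced text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

The simhash is locality-sensitive, so similar spliced texts differ in few bits, while the MD5 fingerprint serves as an exact identifier that similar texts can share once one of them is assigned the other's value.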

S4, bucketing all comparison spliced texts according to text length, and storing the simhash values and MD5 fingerprints of all comparison spliced texts in redis by bucket.

The simhash value and the MD5 fingerprint of each spliced text are calculated, and all comparison spliced texts are bucketed according to text length.

In general, similar texts also have similar lengths. Bucketing classifies the comparison spliced texts by text length, so that comparison spliced texts of similar length fall into the same bucket. The number of buckets and the text length range of each bucket may be predefined. For example, one bucket may be defined for texts 50-200 words long and another for texts 200-350 words long. The specific definition of the buckets may be determined according to the actual situation. After bucketing, each bucket holds the comparison spliced texts whose lengths fall within that bucket's range.
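Length-based bucketing can be sketched as below, using the 50-200 and 200-350 ranges from the example. Several details are assumptions: the ranges are treated as half-open so that a length of 200 falls in the second bucket, texts outside every range are skipped, the key format is invented, and an in-memory dict stands in for redis.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Bucket boundaries from the example: [50, 200) and [200, 350).
BOUNDS = [50, 200, 350]


def bucket_for_length(length: int) -> Optional[str]:
    """Return the bucket key for a text length, or None if out of range."""
    i = bisect_right(BOUNDS, length)
    if i == 0 or i == len(BOUNDS):
        return None
    return f"bucket:{BOUNDS[i - 1]}-{BOUNDS[i]}"


def build_buckets(
    records: Iterable[Tuple[int, int, str]]
) -> Dict[str, List[Tuple[int, str]]]:
    """Group (length, simhash, md5) records into buckets of (simhash, md5)."""
    buckets: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for length, simhash_value, md5 in records:
        key = bucket_for_length(length)
        if key is not None:
            buckets[key].append((simhash_value, md5))
    return buckets
```

The same `bucket_for_length` function can be applied to the target spliced text to find the bucket it should be compared against.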

S5, determining the bucket corresponding to the target spliced text according to the text length of the target spliced text.

Each bucket corresponds to a range of text lengths. The length range into which the target spliced text falls can be determined from its text length, and the corresponding bucket follows from that range.

S6, acquiring all comparison spliced texts in the bucket corresponding to the target spliced text, and calculating the Hamming distance between the target spliced text and each of them.

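S7, if the Hamming distance between a comparison spliced text and the target spliced text is smaller than the preset Hamming distance threshold, judging that the comparison spliced text is similar to the target spliced text, and assigning the MD5 fingerprint of the comparison spliced text to the target spliced text.

Each comparison spliced text corresponds to a comparison text and the target spliced text corresponds to the target text, so judging a comparison spliced text to be similar to the target spliced text means judging that a comparison text in the comparison text set is similar to the target text.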

The preset Hamming distance threshold may be, but is not limited to being, set to 5; a smaller Hamming distance indicates that a comparison spliced text is more similar to the target spliced text. If several comparison spliced texts have a Hamming distance to the target spliced text below the preset threshold, the one with the smallest Hamming distance is judged to be the similar text of the target spliced text, and its MD5 fingerprint is assigned to the target spliced text.
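The comparison can be sketched as follows, assuming simhash values are held as Python integers and each bucket entry is a `(simhash, md5)` pair. The function names are hypothetical; the threshold of 5 and the strict "smaller than" test follow the description.

```python
from typing import Iterable, Optional, Tuple


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two simhash values differ."""
    return bin(a ^ b).count("1")


def best_match(
    target: int,
    candidates: Iterable[Tuple[int, str]],
    threshold: int = 5,
) -> Optional[str]:
    """Return the MD5 of the closest candidate whose distance is below the
    threshold, or None if no candidate qualifies."""
    best_md5: Optional[str] = None
    best_distance = threshold
    for simhash_value, md5 in candidates:
        distance = hamming(target, simhash_value)
        if distance < best_distance:
            best_md5, best_distance = md5, distance
    return best_md5
```

When two candidates are equally close, this sketch keeps the first one encountered; the description does not say how ties are resolved.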

Further, if the Hamming distance between the comparison spliced text and the target spliced text is not smaller than the preset Hamming distance threshold, the simhash value and the MD5 fingerprint of the target spliced text are stored under a corresponding bucket in redis.

In general, the data in the bucket is retrieved according to the bucket position, and similarity is then judged by Hamming distance. If, after all comparisons are completed, a Hamming distance is smaller than the preset threshold, the corresponding MD5 fingerprint is taken and assigned to the current target text. If no Hamming distance is smaller than the preset threshold, the simhash value and MD5 fingerprint of the target text are stored in the corresponding bucket in redis so that later target texts can be compared against them; this operation amounts to expanding the data in the bucket, that is, expanding the comparison text set.
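The assign-or-store decision can be sketched as one function. An in-memory dict stands in for redis here, and the bucket key, the `Store` type and the name `assign_fingerprint` are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# In-memory stand-in for redis: bucket key -> list of (simhash, md5).
Store = Dict[str, List[Tuple[int, str]]]


def assign_fingerprint(
    store: Store,
    bucket: str,
    target_simhash: int,
    target_md5: str,
    threshold: int = 5,
) -> str:
    """Return the MD5 the target should carry; store the target if new."""
    entries = store.setdefault(bucket, [])
    best_md5: Optional[str] = None
    best_distance = threshold
    for simhash_value, md5 in entries:
        distance = bin(target_simhash ^ simhash_value).count("1")
        if distance < best_distance:
            best_md5, best_distance = md5, distance
    if best_md5 is not None:
        # A similar text exists: reuse its fingerprint.
        return best_md5
    # No similar text: the target joins the bucket for later comparisons.
    entries.append((target_simhash, target_md5))
    return target_md5
```

Because unmatched targets are appended to their bucket, the comparison set grows as new texts arrive, and texts judged similar end up sharing one MD5 fingerprint.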

The method uses the simhash algorithm: the text is first cleaned, a summary is extracted, words are segmented and combined according to word vectors, and the simhash value and MD5 fingerprint of the combined data are obtained and stored in redis. Finally, Hamming distances are compared, and the MD5 fingerprint of the similar text is used.

Another flow chart of the method can be seen in fig. 2.

In the above text similarity judging method, the target text and the comparison text set are cleaned, summaries are extracted, words are segmented, and key words are extracted and spliced to generate spliced texts; the simhash value and MD5 fingerprint of each spliced text are then calculated. The comparison spliced texts are bucketed according to text length, the bucket corresponding to the target spliced text is determined according to its text length, and the Hamming distance between the target spliced text and each comparison spliced text in that bucket is calculated; if the Hamming distance between a comparison spliced text and the target spliced text is smaller than the preset Hamming distance threshold, the comparison spliced text is judged to be similar to the target spliced text. Because all comparison spliced texts are bucketed by text length before similarity is judged by Hamming distance, the target spliced text is compared only with the comparison spliced texts in its own bucket, that is, only with comparison texts of similar length. On the one hand, this reduces the number of comparisons and improves comparison efficiency. On the other hand, because similar texts generally have similar lengths, it avoids comparing the target text with all texts, so the target text is not judged similar to a longer text that merely quotes it and whose main content is different, which improves judgment accuracy.

In summary, the text similarity judging method provided by the embodiment of the application is accurate and efficient. Feedback from the product side also indicates that the aggregation effect is greatly improved.

It should be understood that, although the steps in the flowcharts of FIGS. 1-2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution of the steps is not strictly limited to that order, and the steps may be executed in other orders. Moreover, at least some of the steps in FIGS. 1-2 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 3, there is provided a text similarity determination apparatus including the following program modules:

the text acquisition module 301 is configured to acquire a target text and a comparison text set, and perform data cleaning on each comparison text in the target text and the comparison text set;

the text word segmentation module 302 is configured to segment each comparison text in the target text and the comparison text set, extract key vocabularies of the target text, splice the key vocabularies to generate a target spliced text, and extract corresponding key vocabularies of each comparison text in the comparison text set, splice the key vocabularies to generate a comparison spliced text;

the calculation module 303 is configured to calculate simhash values and MD5 fingerprints of the target spliced text and all comparison spliced texts respectively;

the data bucketing module 304 is configured to bucket all comparison spliced texts according to text length, and store the simhash values and MD5 fingerprints of all comparison spliced texts in redis by bucket;

a judging module 305, configured to determine, according to the text length of the target spliced text, the bucket corresponding to the target spliced text;

the Hamming distance calculating module 306 is configured to acquire all comparison spliced texts in the bucket corresponding to the target spliced text, and calculate the Hamming distance between the target spliced text and each of them;

and the similarity determination module 307 is configured to determine that the comparison spliced text is similar to the target spliced text if the hamming distance between the comparison spliced text and the target spliced text is less than the preset hamming distance threshold, and assign an MD5 fingerprint of the comparison spliced text to the target spliced text.

For the specific limitations of the text similarity determination apparatus, reference may be made to the limitations of the text similarity determination method above, which are not repeated here. Each module of the apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the corresponding operations.

In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 4. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a text similarity determination method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.

It will be appreciated by persons skilled in the art that the architecture shown in fig. 4 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting as to the computer device to which the present inventive arrangements are applicable, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

In one embodiment, a computer device is provided, including a memory and a processor, the memory having stored therein a computer program; when executing the computer program, the processor implements all or part of the flow of the methods of the embodiments described above.

In one embodiment, a computer-readable storage medium having a computer program stored thereon is provided; when executed by a processor, the computer program implements all or part of the flow of the methods of the embodiments described above.

Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer-readable storage medium, and that the program, when executed, may include the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of this description.

The above examples illustrate only a few embodiments of the application. They are described specifically and in detail, but are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and modifications without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. A text similarity determination method, characterized by comprising:

S4, bucketing all the comparison spliced texts according to their text lengths, that is, classifying all the comparison spliced texts so that comparison spliced texts of similar text length are placed in the same bucket; and, according to the bucketing results of all the comparison spliced texts, storing the simhash values and MD5 fingerprints of all the comparison spliced texts in different buckets in redis;

2. The text similarity determination method according to claim 1, wherein step S1 further comprises:

3. The text similarity determination method according to claim 2, characterized in that the summary extraction and the word segmentation are performed using HanLP.

4. The text similarity determination method according to claim 1, wherein the key words include subjects, predicates, objects, and regional words.

5. The text similarity determination method according to claim 1, wherein step S2 further comprises:

6. The text similarity determination method according to claim 1, wherein the preset hamming distance threshold is 5.

7. The text similarity determination method according to claim 1, wherein step S7 further comprises:

8. A text similarity determination apparatus, comprising:

the data bucketing module is used for bucketing all the comparison spliced texts according to their text lengths, that is, classifying all the comparison spliced texts so that comparison spliced texts of similar text length are placed in the same bucket; and, according to the bucketing results of all the comparison spliced texts, storing the simhash values and MD5 fingerprints of all the comparison spliced texts in different buckets in redis;

9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.

10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.