CN118260385A - Thesis duplicate checking system and method based on text feature extraction technology - Google Patents
- Publication number
- CN118260385A CN118260385A CN202410443102.4A CN202410443102A CN118260385A CN 118260385 A CN118260385 A CN 118260385A CN 202410443102 A CN202410443102 A CN 202410443102A CN 118260385 A CN118260385 A CN 118260385A
- Authority
- CN
- China
- Prior art keywords
- paper
- svm model
- solution
- training
- document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/10—Machine learning using kernel methods, e.g. support vector machines [SVM]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
Abstract
The invention discloses a paper duplicate checking system and method based on text feature extraction technology, which relate to the technical field of data processing.
Description
Technical Field
The invention belongs to the technical field of electronic digital data processing, and particularly relates to a paper duplicate checking system and method based on a text feature extraction technology.
Background
Paper duplicate checking refers to comparing and analyzing a paper against existing academic documents, network resources and other related materials by using special software or algorithms, so as to detect the parts of the paper that are similar to or repeat content already published by others. The main purpose of duplicate checking is to prevent academic misconduct, especially plagiarism. By checking a thesis, the originality and academic integrity of academic research can be ensured and the healthy development of academia promoted. Duplicate checking typically compares the similarity and repetition of text: in general, the duplicate checking system compares the paper to be detected with the documents in a database, detects similar paragraphs, sentences or words, and gives a similarity matching result. Duplicate checking is a very important link in academia and educational institutions; it helps maintain the seriousness and reliability of academia and protects the efforts and achievements of researchers.
Chinese patent application CN108897781A discloses a paper graph duplication checking system comprising a paper database and an acquisition module. The acquisition module acquires the graphs contained in each paper in the paper database, extracts the blank closed region of each graph and then the edge contour of that region, repeats the extraction operation to obtain the region contour corresponding to each graph, and establishes a region contour database. The acquisition module also acquires the target graph in the target paper, extracts its region contour by the same method, compares it with all region contours in the region contour database and calculates the similarity; if the similarity is higher than 0.6, the similarity is marked near the target graph in the target paper. The method can quickly check the graphs of a target paper with relatively high accuracy.
Existing paper duplicate checking tools mainly detect whether similar parts exist in a paper by comparing it with the contents of existing document databases and the Internet, and therefore focus on detecting textual similarity. However, even if two articles share no similar sentences or paragraphs, their research field, the problem to be solved and the way of solving it may carry the same meaning. Existing duplicate checking tools therefore cannot fully determine whether true duplication exists, and cannot achieve a true duplicate checking effect.
Disclosure of Invention
Aiming at the problems in the related art, the invention provides a paper duplicate checking system and method based on a text feature extraction technology, so as to overcome the technical problems in the prior art.
In order to solve the technical problems, the invention is realized by the following technical scheme:
The invention discloses a paper duplicate checking method based on a text feature extraction technology, which comprises the following steps:
S1, inputting the paper to be checked, and performing a word segmentation operation on it to obtain a paper word segmentation set; setting a database document set, and querying the database according to the paper word segmentation set to obtain the words of the paper word segmentation set contained in each document of the database document set, thereby forming a document-contains-word-segmentation matrix;
S2, setting weight values for the corresponding words according to the paper word segmentation set to obtain a word segmentation weight set; obtaining a document-contains-word-segmentation-weight matrix according to the word segmentation weight set; calculating the repetition degree between each document and the paper to be checked according to that matrix to obtain a document repetition degree set; setting a repetition degree threshold and comparing the document repetition degree set against it; when a document repetition degree in the set is greater than or equal to the threshold, judging that the paper to be checked repeats a paper in the database document set; otherwise, constructing an initial field SVM model, an initial problem SVM model and an initial solution SVM model, and performing primary optimization and secondary optimization to obtain a secondarily optimized field SVM model, a secondarily optimized problem SVM model and a secondarily optimized solution SVM model;
S3, classifying the paper to be checked and the database document set with the secondarily optimized field SVM model, problem SVM model and solution SVM model to obtain a first classification result set and a second classification result set respectively;
S4, comparing the first classification result set with the second classification result set, and judging whether the paper to be checked is duplicated according to the comparison result.
Firstly, a word segmentation operation is performed on the paper to be checked to obtain a paper word segmentation set, so that the information in the paper can be segmented and queried against the database document set; the queried documents thus have a certain similarity in content to the paper to be checked. Because each word contributes differently to the paper, the weight of each word in the paper word segmentation set is determined to obtain a word segmentation weight set before the repetition degree is calculated, making the calculated repetition degree more accurate. When the repetition degree exceeds the repetition degree threshold, the paper to be checked is judged to duplicate documents in the database document set; when it does not, further analysis and judgment are needed, because textual repetition alone cannot accurately reflect whether the paper is duplicated, and this further analysis improves the accuracy of duplicate checking. By constructing an initial field SVM model, an initial problem SVM model and an initial solution SVM model and performing primary and secondary optimization, secondarily optimized field, problem and solution SVM models are obtained, which allow each paper to be deeply analyzed against the database documents in terms of research field, problem to be solved and solution. When all three aspects repeat another paper, the paper to be checked is judged to be duplicated; otherwise it is not. Through the above scheme, a deep duplicate checking operation can be performed on the paper, so that the result better reflects its true condition and is more accurate and of greater reference value.
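The two-stage flow of S1–S4 can be sketched as follows. This is a minimal illustration only: the segmentation, weighting and classifier stubs below are placeholders for the patent's Chinese character index table, word weights and twice-optimized SVM models, and all function names are ours, not the patent's.

```python
def segment(text):
    # Placeholder segmentation: whitespace split. The patent instead matches
    # characters against a Chinese character index table (S11).
    return set(text.split())

def repetition(paper_words, doc_words, weights=None):
    # Weighted overlap between the paper's words and one database document;
    # a plausible reading of the repetition degree of S2, not the exact formula.
    weights = weights or {w: 1.0 for w in paper_words}
    hit = sum(weights.get(w, 0.0) for w in paper_words & doc_words)
    total = sum(weights.values())
    return hit / total if total else 0.0

def check_duplicate(paper, database_docs, threshold, classify):
    words = segment(paper)
    # Stage 1 (S1-S2): textual repetition against every database document.
    if any(repetition(words, segment(d)) >= threshold for d in database_docs):
        return True
    # Stage 2 (S3-S4): compare (field, problem, solution) classifications.
    labels = classify(paper)
    return any(classify(d) == labels for d in database_docs)
```

Here `classify` stands in for the three secondarily optimized SVM models together: a paper is flagged only when all three classification results match those of some database document.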
Preferably, the step S1 includes the steps of:
S11, performing a word segmentation operation on the paper to be checked to obtain a paper word segmentation set c″ = {c″1, c″2, ..., c″i, ..., c″c′};
S12, setting a database document set c = {c1, c2, ..., ci, ..., cc‴}, where ci represents the i-th document in the database and c‴ represents the total number of documents in the database; querying the database according to the paper word segmentation set c″ to obtain the words of c″ contained in each document of the database document set c, so as to form a document-contains-word-segmentation matrix d, as follows:
where dij represents the j-th word of the paper word segmentation set contained in the i-th document of the database document set c;
By performing the word segmentation operation on the paper to be checked, a data basis is provided for the subsequent first round of duplicate checking.
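Under the reading that d is a binary incidence matrix over documents and paper words (the matrix itself did not survive translation, so this is our assumption), building it can be sketched as:

```python
def contains_matrix(paper_words, database_docs):
    # d[i][j] = 1 if the j-th word of the paper word segmentation set
    # appears in the i-th database document, else 0.
    return [[1 if w in doc else 0 for w in paper_words]
            for doc in database_docs]

paper_words = ["feature", "kernel", "colony"]           # toy stand-in for c''
database_docs = [{"kernel", "feature"}, {"colony"}]     # toy stand-in for c
d = contains_matrix(paper_words, database_docs)
```

Each row of `d` then feeds the weighted repetition degree calculation of S2.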
Preferably, the step S11 includes the steps of:
S111, dividing the paper to be checked into paragraphs to obtain a paper paragraph set a = {a1, a2, ..., ai, ..., aa′}, where ai represents the i-th paragraph obtained by dividing the paper and a′ represents the total number of paragraphs; converting each paragraph in the paper paragraph set a into a character string to obtain a paragraph character string set a″ = {a″1, a″2, ..., a″i, ..., a″a′}, where a″i represents the character string corresponding to the i-th paragraph, a″ij represents the j-th character in a″i, and a‴i represents the total number of characters in a″i;
S112, constructing a Chinese character index table; querying according to the index table whether a″i0 is a single-character word; when a″i0 is a single-character word, taking a″i0 as a word and then querying and judging a″i1; when a″i0 is not a single-character word, querying all words prefixed with a″i0 according to the index table to obtain a first word set, where Bj represents the j-th word prefixed with a″i0 and the size of the set is the total number of words prefixed with a″i0;
S113, forming the character string a″i0a″i1 from a″i0 and the character a″i1 immediately following it, and matching it against the first two characters of all words in the first word set; if there exists a word whose first two characters are the same as a″i0a″i1 and whose total length is the same as that of a″i0a″i1, taking a″i0a″i1 as a word; otherwise, taking a″i0 as a word;
S114, repeating the operations of S112 and S113 for each paragraph character string in the set a″ to obtain a paragraph word segmentation matrix b″, as follows:
where b″ij represents the j-th word obtained after segmenting a″i, and b‴i represents the total number of words obtained after segmenting a″i;
S115, performing a de-duplication operation on the words in the paragraph word segmentation matrix b″ to obtain a de-duplicated paragraph word segmentation matrix; merging the de-duplicated paragraph word segmentation matrices to obtain the paper word segmentation set c″ = {c″1, c″2, ..., c″i, ..., c″c′}, where c′ represents the total number of words in the paper word segmentation set;
By comparing and matching each character of the paper to be checked against the Chinese character index table, the paper to be checked can be accurately segmented, providing data support for the subsequent queries.
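The prefix-lookup procedure of S112–S113 resembles dictionary-based forward maximum matching. A toy sketch (the dictionary here stands in for the Chinese character index table, and the two-character limit of S113 is generalized to a `max_len` parameter of our own):

```python
def forward_max_match(text, dictionary, max_len=4):
    # Scan left to right; at each position take the longest dictionary word
    # starting there, falling back to a single character (cf. S112-S113).
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words
```

For example, `forward_max_match("论文查重系统", {"论文", "查重", "系统"})` yields `["论文", "查重", "系统"]`.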
Preferably, the step S2 includes the steps of:
S21, setting weight values for the corresponding words according to the paper word segmentation set c″ to obtain a word segmentation weight set d‴ = {d‴1, d‴2, ..., d‴i, ..., d‴c′}; obtaining a document-contains-word-segmentation-weight matrix e′ according to the word segmentation weight set d‴ and the document-contains-word-segmentation matrix d; calculating the repetition degree between each document and the paper to be checked according to e′ to obtain a document repetition degree set;
S22, setting a repetition degree threshold d′ and performing comparison analysis between the document repetition degree set and d′; when a document repetition degree in the set is greater than or equal to d′, judging that the paper to be checked repeats a paper in the database document set; otherwise, constructing an initial field SVM model, an initial problem SVM model and an initial solution SVM model, and performing primary optimization and secondary optimization to obtain a secondarily optimized field SVM model, a secondarily optimized problem SVM model and a secondarily optimized solution SVM model;
Because the importance of each word in a paper differs, the weight of each word in the paper word segmentation set is determined to obtain the word segmentation weight set before the repetition degree of the paper to be checked is calculated, so that the calculated repetition degree, and hence the judgment, is more accurate.
Preferably, the step S21 includes the steps of:
S211, setting a weight value for each word according to the paper word segmentation set c″ to obtain the word segmentation weight set d‴ = {d‴1, d‴2, ..., d‴i, ..., d‴c′}, where d‴i represents the weight value of the i-th word in the paper word segmentation set c″;
S212, obtaining the document-contains-word-segmentation-weight matrix e′ according to the word segmentation weight set d‴ and the document-contains-word-segmentation matrix d, as follows:
where e′ij represents the weight value of the j-th word of the paper word segmentation set contained in the i-th document of the database document set c;
S213, calculating the repetition degree between each document and the paper to be checked according to the matrix e′ to obtain a document repetition degree set e = {e1, e2, ..., ei, ..., ec‴}, where ei represents the repetition degree between the i-th document and the paper to be checked; the calculation formula is as follows:
By calculating the repetition degree between the paper to be checked and each document according to the weight of each word and the words contained in the database document set, the calculation result is more accurate and better reflects the real repetition between the paper to be checked and the database document set.
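The repetition degree formula itself is not reproduced in this translation. One plausible form consistent with S211–S213 sums the weights of the paper words a document contains and normalizes by the total weight; the normalization is our assumption, not the patent's stated formula:

```python
def repetition_degree(weights, contains_row):
    # weights: the segmentation weight set d''' (one weight per paper word);
    # contains_row: one row of the document-contains-word matrix d (0/1 entries).
    # Normalized weighted sum -- an assumed reading of the missing formula.
    hit = sum(w * c for w, c in zip(weights, contains_row))
    total = sum(weights)
    return hit / total if total else 0.0

# toy example: a document containing the 1st and 3rd paper words
e_i = repetition_degree([0.5, 0.3, 0.2], [1, 0, 1])
```

A document containing only high-weight words thus scores higher than one containing the same number of low-weight words, which matches the motivation given above.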
Preferably, in S22, the initial field SVM model, the initial problem SVM model and the initial solution SVM model are constructed, and primary optimization and secondary optimization are performed to obtain the secondarily optimized field SVM model, problem SVM model and solution SVM model; the method specifically comprises the following steps:
S221, randomly collecting a paper training sample set e″ = {e″1, e″2, ..., e″i, ..., e″e‴} and a paper test sample set h″ = {h″1, h″2, ..., h″i, ..., h″h‴}, where e″i represents the i-th paper training sample, e‴ represents the total number of paper training samples, h″i represents the i-th paper test sample, and h‴ represents the total number of paper test samples;
Randomly selecting a proportion f% of samples from the paper training sample set e″ and the paper test sample set h″ respectively as the field training sample set and the field test sample set, a proportion f′% respectively as the problem training sample set and the problem test sample set, and a proportion g′% respectively as the solution training sample set and the solution test sample set; the i-th elements of these six sets are the i-th field training sample, field test sample, problem training sample, problem test sample, solution training sample and solution test sample respectively, and their sizes are the total numbers of field training samples, field test samples, problem training samples, problem test samples, solution training samples and solution test samples respectively;
Setting a corresponding field training sample label set according to the field training sample set, a corresponding problem training sample label set according to the problem training sample set, and a corresponding solution training sample label set according to the solution training sample set;
Setting a corresponding field test sample label set according to the field test sample set, a corresponding problem test sample label set according to the problem test sample set, and a corresponding solution test sample label set according to the solution test sample set;
S222, constructing an initial field SVM model, an initial problem SVM model and an initial solution SVM model, and setting the kernel function of each of the three models to the Gaussian kernel function, whose expression is as follows:
m‴(n, n′) = exp(−α‖n − n′‖²)
where n and n′ represent two different data sample points and α represents the kernel parameter;
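The Gaussian (RBF) kernel of S222 can be written directly:

```python
import math

def gaussian_kernel(n, n_prime, alpha):
    # m'''(n, n') = exp(-alpha * ||n - n'||^2), the kernel function from S222
    squared_distance = sum((a - b) ** 2 for a, b in zip(n, n_prime))
    return math.exp(-alpha * squared_distance)
```

A larger α narrows the kernel, making the SVM decision boundary more local to the training samples.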
S223, setting a first field training error threshold n′, a first problem training error threshold n″ and a first solution training error threshold o; inputting the field training sample set and the field training sample label set into the initial field SVM model for training, so as to optimize the penalty factor parameter β and the distribution decision parameter χ after the data in the initial field SVM model are mapped to the new feature space; stopping training when the training error of the field training sample set and the field training sample label set in the initial field SVM model is smaller than the first field training error threshold n′, obtaining the once-optimized β, denoted δ, the once-optimized χ, denoted ε, and the once-optimized field SVM model;
Inputting the problem training sample set and the problem training sample label set into the initial problem SVM model for training, so as to optimize the penalty factor parameter β′ and the distribution decision parameter χ′ after the data in the initial problem SVM model are mapped to the new feature space; stopping training when the training error is smaller than the first problem training error threshold n″, obtaining the once-optimized β′, denoted δ′, the once-optimized χ′, denoted ε′, and the once-optimized problem SVM model;
Inputting the solution training sample set and the solution training sample label set into the initial solution SVM model for training, so as to optimize the penalty factor parameter β″ and the distribution decision parameter χ″ after the data are mapped to the new feature space; stopping training when the training error is smaller than the first solution training error threshold o, obtaining the once-optimized β″, denoted δ″, the once-optimized χ″, denoted ε″, and the once-optimized solution SVM model;
S224, inputting the field test sample set and the field test sample label set into the once-optimized field SVM model, and performing secondary optimization on the once-optimized field SVM model to obtain the secondarily optimized field SVM model;
Inputting the problem test sample set and the problem test sample label set into the once-optimized problem SVM model, and performing secondary optimization on it to obtain the secondarily optimized problem SVM model;
Inputting the solution test sample set and the solution test sample label set into the once-optimized solution SVM model, and performing secondary optimization on it to obtain the secondarily optimized solution SVM model;
By collecting the paper training sample set and the paper test sample set and drawing three subsets from each, training data and test data are obtained for classifying the research field, the problem to be solved and the way of solving it respectively, providing data support for the subsequent primary and secondary optimization of the three SVM models so as to obtain more accurate SVM models.
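The sample drawing of S221 (proportions f%, f′% and g′% of the paper samples for the field, problem and solution models) can be sketched as follows; the fractions and the `draw_subsets` helper are illustrative, not the patent's:

```python
import random

def draw_subsets(samples, fractions, seed=0):
    # Draw one random subset per fraction; subsets may overlap, since each is
    # drawn independently from the full paper sample set (cf. S221).
    rng = random.Random(seed)
    return [rng.sample(samples, round(len(samples) * f)) for f in fractions]

papers = list(range(100))                      # stand-ins for paper samples
field_set, problem_set, solution_set = draw_subsets(papers, [0.3, 0.3, 0.4])
```

The same call, applied to the paper test sample set, yields the three test sample sets.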
Preferably, the step S224 includes the steps of:
S2241, constructing a first ant colony population o′ = {o′1, o′2, ..., o′i, ..., o′p1}, a second ant colony population o″ = {o″1, o″2, ..., o″i, ..., o″p2} and a third ant colony population o‴ = {o‴1, o‴2, ..., o‴i, ..., o‴p3}, where o′i, o″i and o‴i represent the i-th ant in the first, second and third ant colony populations respectively, and p1, p2 and p3 represent the sizes of the first, second and third ant colony populations respectively;
Simultaneously setting a first maximum iteration number p′, a second maximum iteration number p″ and a third maximum iteration number p‴, and setting a first current iteration number r, a second current iteration number r′ and a third current iteration number r″;
Setting the importance factor of the pheromone to r‴ and the importance factor of the heuristic function to s;
S2242, setting the initial position of 50% of the ants in the first ant colony population to δ and that of the other 50% to ε; setting the initial position of 50% of the ants in the second ant colony population to δ′ and that of the other 50% to ε′; setting the initial position of 50% of the ants in the third ant colony population to δ″ and that of the other 50% to ε″;
S2243, setting the initial concentration of the pheromone released by each individual ant in the first, second and third ant colony populations to s′;
The calculation formula of the movement probability of the first ant colony population in each iteration is set as follows:
where s″1(r) represents the movement probability of the first ant colony population in the r-th iteration, and s‴1r represents the average error value obtained in the r-th iteration when the once-optimized field SVM model is used to classify the field test sample set and the field test sample label set;
The calculation formula of the movement probability of the second ant colony population in each iteration is set as follows:
where s″2(r′) represents the movement probability of the second ant colony population in the r′-th iteration, and s‴2r′ represents the average error value obtained in the r′-th iteration when the once-optimized problem SVM model is used to classify the problem test sample set and the problem test sample label set;
The calculation formula of the movement probability of the third ant colony population in each iteration is set as follows:
where s″3(r″) represents the movement probability of the third ant colony population in the r″-th iteration, and s‴3r″ represents the average error value obtained in the r″-th iteration when the once-optimized solution SVM model is used to classify the solution test sample set and the solution test sample label set;
S2244, starting the iteration process for the first, second and third ant colony populations according to their respective movement probability calculation formulas; the method comprises the following steps:
The first ant colony population obtains a first optimal position and a second optimal position in each iteration process, wherein the first optimal position represents an optimal value of a penalty factor parameter in the field SVM model, and the second optimal position represents an optimal value of a distribution decision parameter after data in the field SVM model is mapped to a new feature space; the second ant colony population obtains a third optimal position and a fourth optimal position in each iteration process, wherein the third optimal position represents an optimal value of a penalty factor parameter in the problem SVM model, and the fourth optimal position represents an optimal value of a distribution decision parameter after data in the problem SVM model is mapped to a new feature space; the third ant colony population obtains a fifth optimal position and a sixth optimal position in each iteration process, wherein the fifth optimal position represents an optimal value of a penalty factor parameter in the solution SVM model, and the sixth optimal position represents an optimal value of a distribution decision parameter after data are mapped to a new feature space in the solution SVM model;
S2245, when r ≥ p′, the first ant colony population stops iterating, obtaining the first final penalty factor parameter t and the first final distribution decision parameter u after the data are mapped to the new feature space;
when r′ ≥ p″, the second ant colony population stops iterating, obtaining the second final penalty factor parameter t′ and the second final distribution decision parameter u′ after the data are mapped to the new feature space;
when r″ ≥ p‴, the third ant colony population stops iterating, obtaining the third final penalty factor parameter t″ and the third final distribution decision parameter u″ after the data are mapped to the new feature space;
S2246, taking the first final penalty factor parameter t and the first final distribution decision parameter u after the data are mapped to the new feature space as the parameters of the once-optimized field SVM model to obtain the secondarily optimized field SVM model;
taking the second final penalty factor parameter t′ and the second final distribution decision parameter u′ as the parameters of the once-optimized problem SVM model to obtain the secondarily optimized problem SVM model;
taking the third final penalty factor parameter t″ and the third final distribution decision parameter u″ as the parameters of the once-optimized solution SVM model to obtain the secondarily optimized solution SVM model;
the method has the advantage that the ant colony algorithm is used to optimize the penalty factor parameters and the distribution decision parameters of the data mapped to the new feature space in the SVM models, which improves the optimization speed of the SVM, makes the optimization process more accurate, and makes it less likely to fall into a local optimum.
Preferably, the step S3 includes the steps of:
S31, inputting each document in the database document set c = {c1, c2, ..., ci, ..., cc‴} into the secondarily optimized field SVM model, the secondarily optimized problem SVM model and the secondarily optimized solution SVM model respectively for classification, obtaining a first classification result set t‴ = {t‴1, t‴2, ..., t‴i, ..., t‴c‴}, where t‴i represents the result set of classifying the i-th document in the database document set, t‴i = {t‴i1, t‴i2, t‴i3}, and t‴i1, t‴i2 and t‴i3 represent the classification results of the i-th document in the secondarily optimized field SVM model, the secondarily optimized problem SVM model and the secondarily optimized solution SVM model, respectively;
S32, inputting the paper to be checked into the secondarily optimized field SVM model, the secondarily optimized problem SVM model and the secondarily optimized solution SVM model respectively for classification, obtaining a second classification result set u‴ = {u‴1, u‴2, u‴3};
and classifying each document in the document set of the database and the papers to be checked by adopting the optimized SVM model, so that the classification is more accurate, and an accurate basis is provided for the subsequent comparison analysis.
Preferably, the process of S4 is as follows:
Comparing the first classification result set t‴ = {t‴1, t‴2, ..., t‴i, ..., t‴c‴} with the second classification result set u‴ = {u‴1, u‴2, u‴3}; if there exists t‴i = {t‴i1, t‴i2, t‴i3} in the first classification result set such that t‴i1 = u‴1, t‴i2 = u‴2 and t‴i3 = u‴3, judging that the paper to be checked is duplicated with a paper in the database; otherwise, judging that the paper to be checked is not duplicated with any paper in the database;
the comparison covers three aspects, namely the field of the paper, the problem the paper solves and the solution the paper adopts, so the comparison is more comprehensive and the comparison result is more accurate.
The paper duplicate checking system based on the text feature extraction technology comprises a paper word segmentation module, a document duplicate calculation module, a first paper duplicate judgment module, an SVM model primary optimization module, an SVM model secondary optimization module, a paper classification module and a second paper duplicate judgment module;
The paper word segmentation module is used for carrying out word segmentation operation on the to-be-checked repeated paper;
the document repetition degree calculating module calculates the repetition degree between the document and the repeated paper to be checked according to the document containing word segmentation weight matrix to obtain a document repetition degree set;
the first paper repetition judging module is used for comparing and analyzing according to the document repetition set and the repetition threshold value to judge whether the paper to be checked is repeated or not;
the SVM model primary optimization module is used for primary optimization of an initial field SVM model, an initial problem SVM model and an initial solution SVM model;
the SVM model secondary optimization module is used for carrying out secondary optimization on the field SVM model after primary optimization, the problem SVM model after primary optimization and the solution SVM model after primary optimization;
The paper classification module is used for classifying the duplicate paper to be checked and the database document set by adopting the field SVM model after secondary optimization, the problem SVM model after secondary optimization and the solution SVM model after secondary optimization;
the second paper repetition judging module is used for comparing the first classification result set with the second classification result set and judging whether the paper to be checked is repeated or not according to the comparison result.
The invention has the following beneficial effects:
1. The invention provides a paper word segmentation module, a document repetition degree calculation module, a first paper repetition judgment module, an SVM model primary optimization module, an SVM model secondary optimization module, a paper classification module and a second paper repetition judgment module. First, a word segmentation operation is performed on the paper to be checked to obtain a paper word segmentation set, so that the information in the paper can be split into words and queried against the database document set; the documents retrieved in this way have a certain similarity in content to the paper to be checked. Because each term in a paper differs in importance, a weight is determined for each term in the paper word segmentation set to obtain a term weight set, and the repetition degree of the paper to be checked is then calculated on this basis, making the calculated repetition degree more accurate. By constructing an initial field SVM model, an initial problem SVM model and an initial solution SVM model and performing primary and secondary optimization, a secondarily optimized field SVM model, a secondarily optimized problem SVM model and a secondarily optimized solution SVM model are obtained. These allow a deep analysis of each paper in the database document set in terms of its field, the problem it solves and the solution it adopts; only when all three aspects coincide with another paper is the paper to be checked judged duplicated, and otherwise it is judged not duplicated. Through this duplicate checking scheme, a deep duplicate check can be performed on papers, so the result better reflects the real situation of the paper and is more accurate and of greater reference value.
2. According to the invention, each character of the paper to be checked is compared and matched against the Chinese character index table, so that the paper can be segmented into words accurately, providing data support for the subsequent queries. Meanwhile, by collecting a paper training sample set and a paper test sample set and dividing each of them into three parts, namely training data for classifying the field aspect, the problem aspect and the solution aspect, and test data for classifying the field aspect, the problem aspect and the solution aspect, data support is provided for the subsequent primary and secondary optimization of the three SVM models.
3. According to the invention, the penalty factor parameters and the distribution decision parameters of the data in the SVM model are mapped to the new feature space by adopting the ant colony algorithm, so that the SVM optimization speed is improved, and meanwhile, the optimization process is more accurate and is not easy to fall into the local optimal condition.
Of course, it is not necessary for any one product to practice the invention to achieve all of the advantages set forth above at the same time.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the invention, the drawings needed for the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the invention, and a person skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 is a schematic diagram of a document duplication checking system for checking a document based on text feature extraction technology.
Detailed Description
The following description of the technical solutions in the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, based on the embodiments in the invention, which a person of ordinary skill in the art would obtain without inventive faculty, are within the scope of the invention.
In the description of the present invention, it should be understood that the terms "open," "upper," "lower," "top," "middle," "inner," and the like indicate an orientation or positional relationship, merely for convenience of description and to simplify the description, and do not indicate or imply that the components or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the invention.
The invention discloses a paper duplicate checking method based on a text feature extraction technology, which comprises the following steps:
S1, inputting a to-be-checked repeated paper, and performing word segmentation operation on the to-be-checked repeated paper to obtain a paper word segmentation set; setting a database document set, and inquiring in a database according to the document word segmentation set to obtain words contained in each document in the database document set in the paper word segmentation set, so as to form a document containing word segmentation matrix;
the step S1 comprises the following steps:
S11, performing word segmentation on the paper to be checked to obtain a paper word segmentation set c″ = {c″1, c″2, ..., c″i, ..., c″c′};
the step S11 comprises the following steps:
S111, dividing the paper to be checked into paragraphs to obtain a paper paragraph set a = {a1, a2, ..., ai, ..., aa′}, where ai represents the i-th paragraph of the paper to be checked and a′ represents the total number of paragraphs into which the paper is divided; converting each paragraph in the paper paragraph set a into a character string to obtain a paragraph character string set a″ = {a″1, a″2, ..., a″i, ..., a″a′}, where a″i represents the character string corresponding to the i-th paragraph, a″ij represents the j-th character in a″i, and a‴i represents the total number of characters in a″i;
S112, constructing a Chinese character index table; querying according to the Chinese character index table whether a″i0 is a single-character word; when a″i0 is a single-character word, forming a″i0 into a word and then querying and judging a″i1; when a″i0 is not a single-character word, querying all words prefixed with a″i0 according to the Chinese character index table to obtain a first word set, where bj represents the j-th word prefixed with a″i0 and b′ represents the total number of words prefixed with a″i0;
S113, forming the character string a″i0a″i1 from a″i0 and the next character a″i1, and matching it against the first two characters of every word in the first word set; if there exists a word whose first two characters are the same as a″i0a″i1 and whose total length equals that of a″i0a″i1, forming a″i0a″i1 into a word; otherwise, taking a″i0 as a word;
S114, repeating the operations in S112 and S113 for each paragraph character string in the paragraph character string set a″ = {a″1, a″2, ..., a″i, ..., a″a′} to obtain a paragraph word segmentation matrix b″, where b″ij represents the j-th word obtained after segmenting a″i, and b‴i represents the total number of words obtained after segmenting a″i;
S115, performing a de-duplication operation on the words in the paragraph word segmentation matrix b″ to obtain a de-duplicated paragraph word segmentation matrix; merging the de-duplicated paragraph word segmentation matrices to obtain the paper word segmentation set c″ = {c″1, c″2, ..., c″i, ..., c″c′}, where c″i represents the i-th word in the paper word segmentation set and c′ represents the total number of words in the paper word segmentation set;
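The segmentation steps S111 to S115 can be sketched as follows; this is a minimal illustration assuming a toy word index (the `TOY_INDEX` words, function names and the greedy forward-matching strategy are illustrative choices, not the patent's exact matching rule):

```python
# Illustrative sketch of S111–S115: characters are matched against a
# (hypothetical) Chinese word index, multi-character words are preferred
# over single characters, and the merged result is de-duplicated (S115).

TOY_INDEX = {"文本", "特征", "提取", "文", "本"}  # hypothetical word index

def segment_paragraph(text, index, max_len=4):
    """Greedy forward maximum matching against the word index."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if cand in index or length == 1:  # fall back to a single character
                words.append(cand)
                i += length
                break
    return words

def paper_word_set(paragraphs, index):
    """Segment every paragraph string, then merge and de-duplicate."""
    seen, result = set(), []
    for p in paragraphs:
        for w in segment_paragraph(p, index):
            if w not in seen:
                seen.add(w)
                result.append(w)
    return result
```

For example, `paper_word_set(["文本特征", "特征提取"], TOY_INDEX)` yields each word once, in first-seen order.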
S12, setting a database document set c = {c1, c2, ..., ci, ..., cc‴}, where ci represents the i-th document in the database and c‴ represents the total number of documents in the database; querying the database according to the paper word segmentation set c″ = {c″1, c″2, ..., c″i, ..., c″c′} to obtain, for each document in the database document set c, the words of the paper word segmentation set contained in that document, forming a document containing word segmentation matrix d, where dij represents the j-th word of the paper word segmentation set contained in the i-th document of the database document set c;
S2, setting weight values of corresponding word segmentation according to the word segmentation set of the paper to obtain a word segmentation weight set, obtaining a document containing word segmentation weight matrix according to the word segmentation weight set, and calculating the repeatability between the document and the heavy paper to be checked according to the document containing word segmentation weight matrix to obtain a document repeatability set; setting a repeatability threshold, comparing and analyzing according to the document repeatability set and the repeatability threshold, and judging that the papers to be checked and the papers in the database document set are repeated when the document repeatability in the document repeatability set is larger than or equal to the repeatability threshold; otherwise, constructing an initial field SVM model, an initial problem SVM model and an initial solution SVM model, and performing primary optimization and secondary optimization to obtain a secondary optimized field SVM model, a secondary optimized problem SVM model and a secondary optimized solution SVM model;
The step S2 comprises the following steps:
S21, setting weight values for the corresponding words according to the paper word segmentation set c″ = {c″1, c″2, ..., c″i, ..., c″c′} to obtain a word segmentation weight set d‴ = {d‴1, d‴2, ..., d‴i, ..., d‴c′}; obtaining a document containing word segmentation weight matrix e′ according to the word segmentation weight set d‴ and the document containing word segmentation matrix d; calculating the repetition degree between each document and the paper to be checked according to the document containing word segmentation weight matrix e′ to obtain a document repetition degree set;
the step S21 comprises the following steps:
S211, setting a weight value for each corresponding word according to the paper word segmentation set c″ = {c″1, c″2, ..., c″i, ..., c″c′} to obtain a word segmentation weight set d‴ = {d‴1, d‴2, ..., d‴i, ..., d‴c′}, where d‴i represents the weight value of the i-th word in the paper word segmentation set;
S212, obtaining a document containing word segmentation weight matrix e′ according to the word segmentation weight set d‴ = {d‴1, d‴2, ..., d‴i, ..., d‴c′} and the document containing word segmentation matrix d, where e′ij represents the weight value of the j-th word of the paper word segmentation set contained in the i-th document of the database document set c;
S213, calculating the repetition degree between each document and the paper to be checked according to the document containing word segmentation weight matrix e′ to obtain a document repetition degree set e = {e1, e2, ..., ei, ..., ec‴}, where ei represents the repetition degree between the i-th document and the paper to be checked;
S22, setting a repetition degree threshold d′ and comparing the document repetition degree set against d′; when a document repetition degree in the set is greater than or equal to d′, judging that the paper to be checked is duplicated with a paper in the database document set; otherwise, constructing an initial field SVM model, an initial problem SVM model and an initial solution SVM model, and performing primary optimization and secondary optimization to obtain a secondarily optimized field SVM model, a secondarily optimized problem SVM model and a secondarily optimized solution SVM model;
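The steps S21 and S22 can be sketched as follows. Note that the patent's exact repetition degree formula is not reproduced in this text, so the sketch assumes one plausible form: the repetition degree of a document is the total weight of the paper's words that the document contains, divided by the total weight of all words in the paper.

```python
# Sketch of S21–S22 under an ASSUMED repetition formula (the original
# formula image is not available): repetition degree = contained weight
# divided by total weight of the paper's word segmentation set.

def repetition_degrees(word_weights, documents):
    """word_weights: {word: weight} for the paper under check (set d''').
    documents: one set of contained words per database document (matrix d rows)."""
    total = sum(word_weights.values())
    return [sum(w for word, w in word_weights.items() if word in doc) / total
            for doc in documents]

def is_duplicate(degrees, threshold):
    """S22: duplicated if any document's repetition degree reaches the threshold d'."""
    return any(e >= threshold for e in degrees)
```

With word weights `{"a": 2.0, "b": 1.0, "c": 1.0}` and a document containing `{"a", "b"}`, the repetition degree is 3.0 / 4.0 = 0.75.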
S22, constructing an initial field SVM model, an initial problem SVM model and an initial solution SVM model, and performing primary optimization and secondary optimization to obtain a field SVM model after secondary optimization, a problem SVM model after secondary optimization and a solution SVM model after secondary optimization; the method specifically comprises the following steps:
S221, randomly collecting a paper training sample set e″ = {e″1, e″2, ..., e″i, ..., e″e‴} and a paper test sample set h″ = {h″1, h″2, ..., h″i, ..., h″h‴}, where e″i represents the i-th paper training sample in the paper training sample set, e‴ represents the total number of paper training samples in the paper training sample set, and h″i represents the i-th paper test sample in the paper test sample set;
randomly selecting a proportion f% of paper training samples and paper test samples from the paper training sample set e″ and the paper test sample set h″ respectively as a field training sample set and a field test sample set; then selecting a proportion f′% of paper training samples and paper test samples from the paper training sample set e″ and the paper test sample set h″ respectively as a problem training sample set and a problem test sample set; and then selecting a proportion g′% of paper training samples and paper test samples from the paper training sample set e″ and the paper test sample set h″ respectively as a solution training sample set and a solution test sample set; the elements of these sets denote, respectively, the i-th field training sample, the i-th field test sample, the i-th problem training sample, the i-th problem test sample, the i-th solution training sample and the i-th solution test sample, and the corresponding set sizes denote the total numbers of field training samples, field test samples, problem training samples, problem test samples, solution training samples and solution test samples;
setting a corresponding field training sample label set according to the field training sample set, a corresponding problem training sample label set according to the problem training sample set, and a corresponding solution training sample label set according to the solution training sample set;
setting a corresponding field test sample label set according to the field test sample set, a corresponding problem test sample label set according to the problem test sample set, and a corresponding solution test sample label set according to the solution test sample set;
S222, constructing an initial field SVM model, an initial problem SVM model and an initial solution SVM model, and setting the kernel function of each of the three initial models to the Gaussian kernel function, whose expression is:
m′(n, n′) = exp(−α‖n − n′‖²)
where n and n′ represent two different data sample points and α represents the kernel parameter;
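The Gaussian kernel set in S222 can be written directly from its expression; a minimal sketch (function name and vector representation are illustrative):

```python
import math

def gaussian_kernel(n, n_prime, alpha):
    """m'(n, n') = exp(-alpha * ||n - n'||^2), the kernel function of S222."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(n, n_prime))
    return math.exp(-alpha * sq_dist)
```

The kernel value is 1 for identical points and decays toward 0 as the points move apart, with the kernel parameter α controlling how fast.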
S223, setting a first field training error threshold n′, a first problem training error threshold n″ and a first solution training error threshold o; inputting the field training sample set and the field training sample label set into the initial field SVM model for training, so as to optimize the penalty factor parameter β and the distribution decision parameter χ after the data in the initial field SVM model are mapped to the new feature space; stopping training when the training error of the field training sample set and the field training sample label set on the initial field SVM model is smaller than the first field training error threshold n′, obtaining the once-optimized β, denoted δ, the once-optimized χ, denoted ε, and the once-optimized field SVM model;
inputting the problem training sample set and the problem training sample label set into the initial problem SVM model for training, so as to optimize the penalty factor parameter β′ and the distribution decision parameter χ′ after the data in the initial problem SVM model are mapped to the new feature space; stopping training when the training error of the problem training sample set and the problem training sample label set on the initial problem SVM model is smaller than the first problem training error threshold n″, obtaining the once-optimized β′, denoted δ′, the once-optimized χ′, denoted ε′, and the once-optimized problem SVM model;
inputting the solution training sample set and the solution training sample label set into the initial solution SVM model for training, so as to optimize the penalty factor parameter β″ and the distribution decision parameter χ″ after the data are mapped to the new feature space; stopping training when the training error of the solution training sample set and the solution training sample label set on the initial solution SVM model is smaller than the first solution training error threshold o, obtaining the once-optimized β″, denoted δ″, the once-optimized χ″, denoted ε″, and the once-optimized solution SVM model;
S224, inputting the field test sample set and the field test sample label set into the once-optimized field SVM model, and performing secondary optimization on the once-optimized field SVM model to obtain the secondarily optimized field SVM model;
inputting the problem test sample set and the problem test sample label set into the once-optimized problem SVM model, and performing secondary optimization on the once-optimized problem SVM model to obtain the secondarily optimized problem SVM model;
inputting the solution test sample set and the solution test sample label set into the once-optimized solution SVM model, and performing secondary optimization on the once-optimized solution SVM model to obtain the secondarily optimized solution SVM model;
The step S224 includes the steps of:
S2241, constructing a first ant colony population, a second ant colony population and a third ant colony population, where o′i, o″i and o‴i represent the i-th ant in the first, second and third ant colony populations respectively, and p1, p2 and p3 represent the sizes of the first, second and third ant colony populations respectively;
simultaneously setting a first maximum iteration number p′, a second maximum iteration number p″ and a third maximum iteration number p‴; setting a first current iteration number r, a second current iteration number r′ and a third current iteration number r″;
Setting the importance factor of the pheromone as r' and the importance factor of the heuristic function as s;
S2242, setting the initial position of 50% of the ants in the first ant colony population to δ and the initial position of the other 50% to ε; setting the initial position of 50% of the ants in the second ant colony population to δ′ and the initial position of the other 50% to ε′; setting the initial position of 50% of the ants in the third ant colony population to δ″ and the initial position of the other 50% to ε″;
S2243, setting the initial concentration of the pheromone released by each ant individual in the first, second and third ant colony populations to s′;
the calculation formula of the movement probability of the first ant colony group in each iteration is set as follows:
where s′1(r) represents the movement probability of the first ant colony population in the r-th iteration, and s″1r represents the average error value obtained when the first ant colony population, in the r-th iteration, classifies the field test sample set and the field test sample label set using the once-optimized field SVM model;
the calculation formula of the movement probability of the second ant colony group in each iteration is set as follows:
where s′2(r′) represents the movement probability of the second ant colony population in the r′-th iteration, and s″2r′ represents the average error value obtained when the second ant colony population, in the r′-th iteration, classifies the problem test sample set and the problem test sample label set using the once-optimized problem SVM model;
the calculation formula of the movement probability of the third ant colony group in each iteration is set as follows:
s "3 (r") represents the probability of movement of the third ant colony population in the r "round of iteration, s'" 3r″ represents the average error value of the third ant colony population in the r "round of iteration for classifying the solution test sample set and the solution test sample tag set using the once optimized solution SVM model;
S2244, starting an iteration process for the first ant colony, the second ant colony and the third ant colony according to the calculation formula of the movement probability of the first ant colony in each iteration, the calculation formula of the movement probability of the second ant colony in each iteration and the calculation formula of the movement probability of the third ant colony in each iteration; the method comprises the following steps:
The first ant colony population obtains a first optimal position and a second optimal position in each iteration process, wherein the first optimal position represents an optimal value of a penalty factor parameter in the field SVM model, and the second optimal position represents an optimal value of a distribution decision parameter after data in the field SVM model is mapped to a new feature space; the second ant colony population obtains a third optimal position and a fourth optimal position in each iteration process, wherein the third optimal position represents an optimal value of a penalty factor parameter in the problem SVM model, and the fourth optimal position represents an optimal value of a distribution decision parameter after data in the problem SVM model is mapped to a new feature space; the third ant colony population obtains a fifth optimal position and a sixth optimal position in each iteration process, wherein the fifth optimal position represents an optimal value of a penalty factor parameter in the solution SVM model, and the sixth optimal position represents an optimal value of a distribution decision parameter after data are mapped to a new feature space in the solution SVM model;
S2245, when r ≥ p′, the first ant colony population stops iterating, obtaining the first final penalty factor parameter t and the first final distribution decision parameter u after the data are mapped to the new feature space;
when r′ ≥ p″, the second ant colony population stops iterating, obtaining the second final penalty factor parameter t′ and the second final distribution decision parameter u′ after the data are mapped to the new feature space;
when r″ ≥ p‴, the third ant colony population stops iterating, obtaining the third final penalty factor parameter t″ and the third final distribution decision parameter u″ after the data are mapped to the new feature space;
S2246, taking the first final penalty factor parameter t and the first final distribution decision parameter u after the data are mapped to the new feature space as the parameters of the once-optimized field SVM model to obtain the secondarily optimized field SVM model;
taking the second final penalty factor parameter t′ and the second final distribution decision parameter u′ as the parameters of the once-optimized problem SVM model to obtain the secondarily optimized problem SVM model;
taking the third final penalty factor parameter t″ and the third final distribution decision parameter u″ as the parameters of the once-optimized solution SVM model to obtain the secondarily optimized solution SVM model;
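The secondary optimization in S2241 to S2246 can be sketched as a pheromone-guided search over candidate (penalty factor, distribution decision) parameter pairs. This is an illustrative toy, not the patent's exact scheme: the movement probability and pheromone update formulas are not reproduced in this text, and `error` is a stand-in for the once-optimized SVM's average classification error on the test set.

```python
import random

def aco_optimize(error, candidates, n_ants=10, max_iter=20, rho=0.5, seed=0):
    """Toy ant-colony search over candidate (t, u) parameter pairs.

    error(t, u): stand-in for the once-optimized SVM's average test error.
    Pheromone accumulates on low-error candidates; ants move with probability
    proportional to pheromone (the patent's exact rules are not reproduced)."""
    rng = random.Random(seed)
    tau = [1.0] * len(candidates)            # pheromone per candidate position
    best, best_err = None, float("inf")
    for _ in range(max_iter):
        for _ in range(n_ants):
            # roulette-wheel selection proportional to pheromone
            total = sum(tau)
            r, acc, idx = rng.random() * total, 0.0, 0
            for i, t in enumerate(tau):
                acc += t
                if r <= acc:
                    idx = i
                    break
            e = error(*candidates[idx])
            # evaporate on the visited position, deposit inversely to error
            tau[idx] = (1 - rho) * tau[idx] + 1.0 / (1e-9 + e)
            if e < best_err:
                best, best_err = candidates[idx], e
    return best, best_err
```

The returned pair plays the role of the final penalty factor parameter t and distribution decision parameter u handed to S2246.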
S3, classifying the duplicate paper to be checked and the database document set by adopting the field SVM model after secondary optimization, the problem SVM model after secondary optimization and the solution SVM model after secondary optimization to respectively obtain a first classification result set and a second classification result set;
the step S3 comprises the following steps:
S31, inputting each document in the database document set c= { c 1,c2,...,ci,...,cc″′ } into a secondarily optimized field SVM model, a secondarily optimized problem SVM model and a secondarily optimized solution SVM model respectively for classification operation to obtain a first classification result set t '= { t' 1,t″′2,...,t″′i,...,t″′c″′ ', where t' "i represents a result set of classifying an i-th document in the database document set, and t '" i={t″′i1,t″′i2,t″′i3},t″′i1、t″′i2 and t' "i3 represent classification results of the i-th document in the database document set in the secondarily optimized domain SVM model, the secondarily optimized problem SVM model, and the secondarily optimized solution SVM model, respectively;
S32, inputting the paper to be checked into the twice-optimized field SVM model, the twice-optimized problem SVM model and the twice-optimized solution SVM model respectively for classification to obtain a second classification result set u‴ = {u‴_1, u‴_2, u‴_3};
S4, comparing the first classification result set with the second classification result set, and judging whether the paper to be checked is repeated or not according to the comparison result;
The process of S4 is as follows:
Comparing the first classification result set t‴ = {t‴_1, t‴_2, ..., t‴_i, ..., t‴_c‴} with the second classification result set u‴ = {u‴_1, u‴_2, u‴_3}; if there exists t‴_i = {t‴_i1, t‴_i2, t‴_i3} in the first classification result set such that t‴_i1 = u‴_1, t‴_i2 = u‴_2 and t‴_i3 = u‴_3, judging that the paper to be checked duplicates a paper in the database; otherwise, judging that the paper to be checked does not duplicate any paper in the database.
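The comparison in S4 reduces to checking whether any database document shares all three (field, problem, solution) labels with the paper being checked. A minimal sketch in Python; the label values below are illustrative placeholders, not labels from the patent:

```python
def is_duplicate(first_results, second_result):
    """True if any database document's (field, problem, solution)
    classification triple exactly matches that of the checked paper."""
    return any(tuple(t) == tuple(second_result) for t in first_results)

first = [("cs", "classification", "svm"), ("bio", "clustering", "kmeans")]
print(is_duplicate(first, ("cs", "classification", "svm")))   # True
print(is_duplicate(first, ("cs", "clustering", "svm")))       # False
```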
The paper duplicate checking system based on the text feature extraction technology comprises a paper word segmentation module, a document repetition degree calculation module, a first paper repetition judgment module, an SVM model primary optimization module, an SVM model secondary optimization module, a paper classification module and a second paper repetition judgment module;
The paper word segmentation module is used for performing a word segmentation operation on the paper to be checked;
The document repetition degree calculation module calculates the repetition degree between each document and the paper to be checked according to the document-contains-word-segmentation weight matrix to obtain a document repetition degree set;
The first paper repetition judgment module compares the document repetition degree set against the repetition degree threshold to judge whether the paper to be checked is a duplicate;
The SVM model primary optimization module performs primary optimization on the initial field SVM model, the initial problem SVM model and the initial solution SVM model;
The SVM model secondary optimization module performs secondary optimization on the once-optimized field SVM model, the once-optimized problem SVM model and the once-optimized solution SVM model;
The paper classification module classifies the paper to be checked and the database document set using the twice-optimized field SVM model, the twice-optimized problem SVM model and the twice-optimized solution SVM model;
The second paper repetition judgment module compares the first classification result set with the second classification result set and judges whether the paper to be checked is a duplicate according to the comparison result.
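The seven modules can be pictured as the following skeleton. Every callable wired into it is a placeholder standing in for the corresponding module, not code from the patent itself:

```python
class PaperDuplicateChecker:
    """Wires the modules described above into one check() pipeline."""

    def __init__(self, segmenter, repetition_fn, classifiers, threshold):
        self.segmenter = segmenter          # paper word segmentation module
        self.repetition_fn = repetition_fn  # document repetition degree calculation module
        self.classifiers = classifiers      # stand-ins for the three twice-optimized SVM models
        self.threshold = threshold          # repetition degree threshold

    def check(self, paper, documents):
        words = self.segmenter(paper)
        # first paper repetition judgment: weighted repetition degree
        if any(self.repetition_fn(words, doc) >= self.threshold for doc in documents):
            return True
        # second paper repetition judgment: label-triple comparison
        labels = [clf(paper) for clf in self.classifiers]
        return any([clf(doc) for clf in self.classifiers] == labels
                   for doc in documents)

# toy wiring: word-overlap repetition degree, constant classifiers
checker = PaperDuplicateChecker(
    segmenter=lambda text: set(text.split()),
    repetition_fn=lambda words, doc: len(words & set(doc.split())) / max(len(words), 1),
    classifiers=[lambda t: "field", lambda t: "problem", lambda t: "solution"],
    threshold=0.8,
)
print(checker.check("svm kernel paper", ["svm kernel paper"]))  # True
```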
In the description of the present specification, the descriptions of the terms "one embodiment," "example," "specific example," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above disclosed preferred embodiments of the invention are merely intended to help illustrate the invention. The preferred embodiments are not exhaustive or to limit the invention to the precise form disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best understand and utilize the invention.
Claims (10)
1. The paper duplicate checking method based on the text feature extraction technology is characterized by comprising the following steps of:
S1, inputting a paper to be checked, and performing a word segmentation operation on it to obtain a paper word segmentation set; setting a database document set, and querying the database according to the paper word segmentation set to obtain, for each document in the database document set, the words of the paper word segmentation set it contains, thereby forming a document-contains-word-segmentation matrix;
S2, setting weight values for the corresponding words according to the paper word segmentation set to obtain a word segmentation weight set, obtaining a document-contains-word-segmentation weight matrix according to the word segmentation weight set, and calculating the repetition degree between each document and the paper to be checked according to this weight matrix to obtain a document repetition degree set; setting a repetition degree threshold and comparing the document repetition degree set against it: when any document repetition degree in the set is greater than or equal to the threshold, judging that the paper to be checked duplicates a paper in the database document set; otherwise, constructing an initial field SVM model, an initial problem SVM model and an initial solution SVM model, and performing primary and secondary optimization to obtain a twice-optimized field SVM model, a twice-optimized problem SVM model and a twice-optimized solution SVM model;
S3, classifying the paper to be checked and the database document set by adopting the twice-optimized field SVM model, the twice-optimized problem SVM model and the twice-optimized solution SVM model to obtain a first classification result set and a second classification result set respectively;
S4, comparing the first classification result set with the second classification result set, and judging whether the paper to be checked is repeated or not according to the comparison result.
2. The method for paper duplicate checking based on text feature extraction technology according to claim 1, wherein said S1 comprises the steps of:
S11, performing a word segmentation operation on the paper to be checked to obtain a paper word segmentation set c″ = {c″_1, c″_2, ..., c″_i, ..., c″_c′};
S12, setting a database document set c = {c_1, c_2, ..., c_i, ..., c_c‴}, where c_i represents the i-th document in the database and c‴ represents the total number of documents in the database; querying the database according to the paper word segmentation set c″ = {c″_1, c″_2, ..., c″_i, ..., c″_c′} to obtain, for each document in the database document set, the words of the paper word segmentation set it contains, thereby forming the document-contains-word-segmentation matrix d, as follows:
where d_ij denotes the j-th word of the paper word segmentation set contained in the i-th document of the database document set c = {c_1, c_2, ..., c_i, ..., c_c‴}.
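Matrix d of S12 simply records, per database document, which words of the paper word segmentation set it contains. A sketch under the assumption that documents are modeled as pre-tokenized word sets; a real system would query the database index instead:

```python
def build_contains_matrix(documents, paper_words):
    """Row i lists the paper segmentation words contained in document i."""
    return [[w for w in paper_words if w in doc] for doc in documents]

docs = [{"svm", "kernel", "ant"}, {"text", "kernel"}]
words = ["svm", "kernel", "text"]
print(build_contains_matrix(docs, words))  # [['svm', 'kernel'], ['kernel', 'text']]
```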
3. The method for paper duplicate checking based on text feature extraction technology according to claim 2, wherein said S11 comprises the steps of:
S111, dividing the paper to be checked into paragraphs to obtain a paper paragraph set a = {a_1, a_2, ..., a_i, ..., a_a′}, where a_i represents the i-th paragraph of the paper to be checked and a′ represents the total number of paragraphs; converting each paragraph in the paper paragraph set into a character string to obtain a paragraph character string set a″ = {a″_1, a″_2, ..., a″_i, ..., a″_a′}, where a″_i represents the character string corresponding to the i-th paragraph, a″_ij represents the j-th character in a″_i, and a‴_i represents the total number of characters in a″_i;
S112, constructing a Chinese character index table and querying whether a″_i0 is a single-character word according to it; when a″_i0 is a single-character word, forming a″_i0 into a word and then querying and judging a″_i1; when a″_i0 is not a single-character word, querying all words prefixed with a″_i0 according to the Chinese character index table to obtain a first word set, where b_j represents the j-th word prefixed with a″_i0 and the size of the set is the total number of words prefixed with a″_i0;
S113, forming the character string a″_i0 a″_i1 from a″_i0 and the next character a″_i1, and matching it against the first two characters of every word in the first word set; if there exists a word whose first two characters are the same as a″_i0 a″_i1 and whose total length equals that of a″_i0 a″_i1, forming a″_i0 a″_i1 into a word; otherwise, forming a″_i0 into a word;
S114, repeating the operations of S112 and S113 for each paragraph character string in the paragraph character string set a″ = {a″_1, a″_2, ..., a″_i, ..., a″_a′} to obtain a paragraph word segmentation matrix b″, as follows:
where b″_ij represents the j-th word obtained after segmenting a″_i, and b‴_i represents the total number of words obtained after segmenting a″_i;
S115, performing a de-duplication operation on the words in the paragraph word segmentation matrix b″ to obtain a de-duplicated paragraph word segmentation matrix; and merging the de-duplicated paragraph word segmentation matrices to obtain the paper word segmentation set c″ = {c″_1, c″_2, ..., c″_i, ..., c″_c′}, where c″_i represents the i-th word and c′ represents the total number of words in the paper word segmentation set.
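Steps S112–S113 describe a prefix-index, longest-match segmentation. The sketch below implements greedy longest dictionary match, which captures the same idea; the toy dictionary and the omission of the prefix-index lookup are simplifying assumptions:

```python
def segment(text, dictionary):
    """Greedy longest dictionary match; single characters fall through as-is."""
    words, i = [], 0
    while i < len(text):
        best = text[i]                       # default: the single character
        for j in range(i + 2, len(text) + 1):
            if text[i:j] in dictionary:      # a longer dictionary word wins
                best = text[i:j]
        words.append(best)
        i += len(best)
    return words

vocab = {"机器", "学习", "机器学习"}
print(segment("机器学习好", vocab))  # ['机器学习', '好']
# S115's de-duplication then reduces the segments to a word set:
print(sorted(set(segment("机器学习机器", vocab))))  # ['机器', '机器学习']
```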
4. The method for paper duplicate checking based on text feature extraction technology according to claim 1, wherein said S2 comprises the steps of:
S21, setting weight values for the corresponding words according to the paper word segmentation set c″ = {c″_1, c″_2, ..., c″_i, ..., c″_c′} to obtain a word segmentation weight set d‴ = {d‴_1, d‴_2, ..., d‴_i, ..., d‴_c′}, obtaining the document-contains-word-segmentation weight matrix e′ according to the word segmentation weight set and the document-contains-word-segmentation matrix d, and calculating the repetition degree between each document and the paper to be checked according to e′ to obtain a document repetition degree set;
S22, setting a repetition degree threshold d′ and comparing the document repetition degree set against it: when any document repetition degree in the set is greater than or equal to d′, judging that the paper to be checked duplicates a paper in the database document set; otherwise, constructing an initial field SVM model, an initial problem SVM model and an initial solution SVM model, and performing primary and secondary optimization to obtain a twice-optimized field SVM model, a twice-optimized problem SVM model and a twice-optimized solution SVM model.
5. The method for paper duplicate checking based on text feature extraction technology according to claim 4, wherein said S21 comprises the steps of:
S211, setting weight values for the corresponding words according to the paper word segmentation set c″ = {c″_1, c″_2, ..., c″_i, ..., c″_c′} to obtain the word segmentation weight set d‴ = {d‴_1, d‴_2, ..., d‴_i, ..., d‴_c′}, where d‴_i represents the weight value of the i-th word in the paper word segmentation set;
S212, obtaining the document-contains-word-segmentation weight matrix e′ according to the word segmentation weight set d‴ and the document-contains-word-segmentation matrix d, as follows:
where e′_ij represents the weight value of the j-th word contained in the i-th document of the database document set c = {c_1, c_2, ..., c_i, ..., c_c‴};
S213, calculating the repetition degree between each document and the paper to be checked according to the document-contains-word-segmentation weight matrix e′ to obtain a document repetition degree set e = {e_1, e_2, ..., e_i, ..., e_c‴}, where e_i represents the repetition degree between the i-th document and the paper to be checked; the calculation formula is as follows:
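The repetition degree formula itself is not reproduced in this text. One natural reading, consistent with the surrounding definitions, is the weight mass of the paper's words found in a document divided by the total weight mass of the paper word segmentation set. This is an assumption, not the patent's exact formula:

```python
def repetition_degree(doc_words, paper_weights):
    """Share of the paper's segmentation weight mass covered by the document
    (assumed form of e_i; the patent's exact formula is not shown here)."""
    shared = sum(w for word, w in paper_weights.items() if word in doc_words)
    total = sum(paper_weights.values())
    return shared / total if total else 0.0

weights = {"svm": 0.5, "kernel": 0.25, "text": 0.25}
print(repetition_degree({"svm", "kernel"}, weights))  # 0.75
```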
6. The method for paper duplicate checking based on text feature extraction technology according to claim 4, wherein in S22, an initial field SVM model, an initial problem SVM model and an initial solution SVM model are constructed, and primary and secondary optimization are performed to obtain a twice-optimized field SVM model, a twice-optimized problem SVM model and a twice-optimized solution SVM model, specifically comprising the following steps:
S221, randomly collecting a paper training sample set e″ = {e″_1, e″_2, ..., e″_i, ..., e″_e‴} and a paper test sample set h″ = {h″_1, h″_2, ..., h″_i, ..., h″_h‴}, where e″_i represents the i-th paper training sample, e‴ represents the total number of paper training samples, h″_i represents the i-th paper test sample, and h‴ represents the total number of paper test samples;
Randomly selecting a proportion of f% of the paper training samples and paper test samples from the paper training sample set and the paper test sample set respectively as the field training sample set and the field test sample set; then selecting a proportion of f′% of the paper training samples and paper test samples respectively as the problem training sample set and the problem test sample set; and then selecting a proportion of g′% of the paper training samples and paper test samples respectively as the solution training sample set and the solution test sample set; the i-th field training sample, field test sample, problem training sample, problem test sample, solution training sample and solution test sample, and the total numbers of samples in the six sets, are denoted accordingly;
Setting a corresponding field training sample label set according to the field training sample set, a corresponding problem training sample label set according to the problem training sample set, and a corresponding solution training sample label set according to the solution training sample set;
Setting a corresponding field test sample label set according to the field test sample set, a corresponding problem test sample label set according to the problem test sample set, and a corresponding solution test sample label set according to the solution test sample set;
S222, constructing an initial field SVM model, an initial problem SVM model and an initial solution SVM model, and setting the kernel function of each of the three models to the Gaussian kernel, with the function expression as follows:
m‴(n, n′) = exp(−α‖n − n′‖²)
where n and n′ represent two different data sample points, and α represents the kernel parameter;
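The Gaussian kernel above is straightforward to compute directly:

```python
import math

def gaussian_kernel(n, n2, alpha):
    """m'''(n, n') = exp(-alpha * ||n - n'||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(n, n2))
    return math.exp(-alpha * sq_dist)

print(gaussian_kernel([0.0, 0.0], [0.0, 0.0], 0.5))            # 1.0 for identical points
print(round(gaussian_kernel([1.0, 0.0], [0.0, 0.0], 0.5), 4))  # exp(-0.5) ≈ 0.6065
```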
S223, setting a first field training error threshold n′, a first problem training error threshold n″ and a first solution training error threshold o; inputting the field training sample set and the field training sample label set into the initial field SVM model for training, so as to optimize the penalty factor parameter β and the distribution decision parameter χ after the data in the initial field SVM model are mapped to the new feature space; stopping training when the training error of the field training sample set and the field training sample label set in the initial field SVM model is smaller than the first field training error threshold n′, obtaining the once-optimized β, denoted δ, the once-optimized χ, denoted ε, and the once-optimized field SVM model;
Inputting the problem training sample set and the problem training sample label set into the initial problem SVM model for training, so as to optimize the penalty factor parameter β′ and the distribution decision parameter χ′ after the data are mapped to the new feature space; stopping training when the training error is smaller than the first problem training error threshold n″, obtaining the once-optimized β′, denoted δ′, the once-optimized χ′, denoted ε′, and the once-optimized problem SVM model;
Inputting the solution training sample set and the solution training sample label set into the initial solution SVM model for training, so as to optimize the penalty factor parameter β″ and the distribution decision parameter χ″ after the data are mapped to the new feature space; stopping training when the training error is smaller than the first solution training error threshold o, obtaining the once-optimized β″, denoted δ″, the once-optimized χ″, denoted ε″, and the once-optimized solution SVM model;
S224, inputting the field test sample set and the field test sample label set into the once-optimized field SVM model, and performing secondary optimization on it to obtain the twice-optimized field SVM model;
Inputting the problem test sample set and the problem test sample label set into the once-optimized problem SVM model, and performing secondary optimization on it to obtain the twice-optimized problem SVM model;
Inputting the solution test sample set and the solution test sample label set into the once-optimized solution SVM model, and performing secondary optimization on it to obtain the twice-optimized solution SVM model.
7. The method for paper duplicate checking based on text feature extraction technology of claim 6, wherein said S224 comprises the steps of:
S2241, constructing a first ant colony population, a second ant colony population and a third ant colony population, where o′_i, o″_i and o‴_i represent the i-th ant in the first, second and third ant colony populations respectively, and p_1, p_2 and p_3 represent the sizes of the first, second and third ant colony populations respectively;
Setting a first maximum iteration number p′, a second maximum iteration number p″ and a third maximum iteration number p‴; setting a first current iteration number r, a second current iteration number r′ and a third current iteration number r″;
Setting the importance factor of the pheromone as r' and the importance factor of the heuristic function as s;
S2242, setting the initial position of 50% of the ants in the first ant colony population to δ and the initial position of the other 50% to ε; setting the initial position of 50% of the ants in the second ant colony population to δ′ and the initial position of the other 50% to ε′; setting the initial position of 50% of the ants in the third ant colony population to δ″ and the initial position of the other 50% to ε″;
S2243, setting the initial concentration of the pheromone released by each individual ant in the first, second and third ant colony populations to s′;
the calculation formula of the movement probability of the first ant colony group in each iteration is set as follows:
where s′_1(r) represents the movement probability of the first ant colony population in the r-th iteration, and s″_1r represents the average error value obtained when the first ant colony population, in the r-th iteration, classifies the field test sample set and the field test sample label set with the once-optimized field SVM model;
the calculation formula of the movement probability of the second ant colony group in each iteration is set as follows:
where s′_2(r′) represents the movement probability of the second ant colony population in the r′-th iteration, and s″_2r′ represents the average error value obtained when the second ant colony population, in the r′-th iteration, classifies the problem test sample set and the problem test sample label set with the once-optimized problem SVM model;
the calculation formula of the movement probability of the third ant colony group in each iteration is set as follows:
where s′_3(r″) represents the movement probability of the third ant colony population in the r″-th iteration, and s″_3r″ represents the average error value obtained when the third ant colony population, in the r″-th iteration, classifies the solution test sample set and the solution test sample label set with the once-optimized solution SVM model;
S2244, starting an iteration process for the first ant colony, the second ant colony and the third ant colony according to the calculation formula of the movement probability of the first ant colony in each iteration, the calculation formula of the movement probability of the second ant colony in each iteration and the calculation formula of the movement probability of the third ant colony in each iteration; the method comprises the following steps:
The first ant colony population obtains a first optimal position and a second optimal position in each iteration process, wherein the first optimal position represents an optimal value of a penalty factor parameter in the field SVM model, and the second optimal position represents an optimal value of a distribution decision parameter after data in the field SVM model is mapped to a new feature space; the second ant colony population obtains a third optimal position and a fourth optimal position in each iteration process, wherein the third optimal position represents an optimal value of a penalty factor parameter in the problem SVM model, and the fourth optimal position represents an optimal value of a distribution decision parameter after data in the problem SVM model is mapped to a new feature space; the third ant colony population obtains a fifth optimal position and a sixth optimal position in each iteration process, wherein the fifth optimal position represents an optimal value of a penalty factor parameter in the solution SVM model, and the sixth optimal position represents an optimal value of a distribution decision parameter after data are mapped to a new feature space in the solution SVM model;
S2245, when r ≥ p′, the first ant colony population stops iterating, obtaining a first final penalty factor parameter t and a first final distribution decision parameter u after the data are mapped to the new feature space;
when r′ ≥ p″, the second ant colony population stops iterating, obtaining a second final penalty factor parameter t′ and a second final distribution decision parameter u′ after the data are mapped to the new feature space;
when r″ ≥ p‴, the third ant colony population stops iterating, obtaining a third final penalty factor parameter t″ and a third final distribution decision parameter u″ after the data are mapped to the new feature space;
S2246, taking the first final penalty factor parameter t and the first final distribution decision parameter u as the parameters of the once-optimized field SVM model to obtain the twice-optimized field SVM model;
Taking the second final penalty factor parameter t′ and the second final distribution decision parameter u′ as the parameters of the once-optimized problem SVM model to obtain the twice-optimized problem SVM model;
And taking the third final penalty factor parameter t″ and the third final distribution decision parameter u″ as the parameters of the once-optimized solution SVM model to obtain the twice-optimized solution SVM model.
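Steps S2241–S2246 amount to a population-based search over the (penalty factor, distribution decision) pair, seeded at the once-optimized values and scored by test-set error. The movement-probability and pheromone formulas are not reproduced in this text, so the sketch below substitutes a plain random-neighbourhood move; it illustrates the structure of the secondary optimization, not the patent's exact update rule:

```python
import random

def aco_tune(error_fn, start, iters=50, ants=20, seed=0):
    """Keep the (t, u) pair with the lowest error seen by any ant."""
    rng = random.Random(seed)
    best, best_err = start, error_fn(*start)
    positions = [start] * ants               # the 50/50 split of start positions is omitted
    for _ in range(iters):                   # stop at the maximum iteration count
        positions = [(max(1e-6, c + rng.uniform(-0.2, 0.2)),
                      max(1e-6, g + rng.uniform(-0.2, 0.2)))
                     for c, g in positions]
        for pos in positions:
            err = error_fn(*pos)
            if err < best_err:
                best, best_err = pos, err
    return best                              # final (t, u) handed to the SVM model

# toy error surface whose minimum sits at C = 2, gamma = 0.5
t, u = aco_tune(lambda c, g: (c - 2) ** 2 + (g - 0.5) ** 2, start=(1.0, 1.0))
print((t - 2) ** 2 + (u - 0.5) ** 2 <= 1.25)  # True: never worse than the start
```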
8. The method for paper duplicate checking based on text feature extraction technology according to claim 1, wherein said S3 comprises the steps of:
S31, inputting each document in the database document set c = {c_1, c_2, ..., c_i, ..., c_c‴} into the twice-optimized field SVM model, the twice-optimized problem SVM model and the twice-optimized solution SVM model respectively for classification to obtain a first classification result set t‴ = {t‴_1, t‴_2, ..., t‴_i, ..., t‴_c‴}, where t‴_i represents the result set obtained by classifying the i-th document in the database document set, t‴_i = {t‴_i1, t‴_i2, t‴_i3}, and t‴_i1, t‴_i2 and t‴_i3 represent the classification results of the i-th document in the twice-optimized field SVM model, the twice-optimized problem SVM model and the twice-optimized solution SVM model, respectively;
S32, inputting the paper to be checked into the twice-optimized field SVM model, the twice-optimized problem SVM model and the twice-optimized solution SVM model respectively for classification to obtain a second classification result set u‴ = {u‴_1, u‴_2, u‴_3}.
9. The method for paper duplicate checking based on text feature extraction technology of claim 1, wherein the process of S4 is as follows:
Comparing the first classification result set t‴ = {t‴_1, t‴_2, ..., t‴_i, ..., t‴_c‴} with the second classification result set u‴ = {u‴_1, u‴_2, u‴_3}; if there exists t‴_i = {t‴_i1, t‴_i2, t‴_i3} in the first classification result set such that t‴_i1 = u‴_1, t‴_i2 = u‴_2 and t‴_i3 = u‴_3, judging that the paper to be checked duplicates a paper in the database; otherwise, judging that the paper to be checked does not duplicate any paper in the database.
10. A system for implementing the paper duplicate checking method based on text feature extraction technology of any one of claims 1-9, characterized in that: the system comprises a paper word segmentation module, a document repetition degree calculation module, a first paper repetition judgment module, an SVM model primary optimization module, an SVM model secondary optimization module, a paper classification module and a second paper repetition judgment module;
The paper word segmentation module is used for performing a word segmentation operation on the paper to be checked;
The document repetition degree calculation module calculates the repetition degree between each document and the paper to be checked according to the document-contains-word-segmentation weight matrix to obtain a document repetition degree set;
The first paper repetition judgment module compares the document repetition degree set against the repetition degree threshold to judge whether the paper to be checked is a duplicate;
The SVM model primary optimization module performs primary optimization on the initial field SVM model, the initial problem SVM model and the initial solution SVM model;
The SVM model secondary optimization module performs secondary optimization on the once-optimized field SVM model, the once-optimized problem SVM model and the once-optimized solution SVM model;
The paper classification module classifies the paper to be checked and the database document set using the twice-optimized field SVM model, the twice-optimized problem SVM model and the twice-optimized solution SVM model;
The second paper repetition judgment module compares the first classification result set with the second classification result set and judges whether the paper to be checked is a duplicate according to the comparison result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410443102.4A CN118260385A (en) | 2024-04-12 | 2024-04-12 | Thesis duplicate checking system and method based on text feature extraction technology |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118260385A true CN118260385A (en) | 2024-06-28 |
Family
ID=91611007
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410443102.4A Pending CN118260385A (en) | 2024-04-12 | 2024-04-12 | Thesis duplicate checking system and method based on text feature extraction technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118260385A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101021838A (en) * | 2007-03-02 | 2007-08-22 | 华为技术有限公司 | Text handling method and system |
CN101290626A (en) * | 2008-06-12 | 2008-10-22 | 昆明理工大学 | Text categorization feature selection and weight computation method based on field knowledge |
CN107992470A (en) * | 2017-11-08 | 2018-05-04 | 中国科学院计算机网络信息中心 | A kind of text duplicate checking method and system based on similarity |
CN111539196A (en) * | 2020-04-15 | 2020-08-14 | 京东方科技集团股份有限公司 | Text duplicate checking method and device, text management system and electronic equipment |
CN117763106A (en) * | 2023-12-11 | 2024-03-26 | 中国科学院文献情报中心 | Document duplicate checking method and device, storage medium and electronic equipment |
-
2024
- 2024-04-12 CN CN202410443102.4A patent/CN118260385A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101021838A (en) * | 2007-03-02 | 2007-08-22 | 华为技术有限公司 | Text handling method and system |
CN101290626A (en) * | 2008-06-12 | 2008-10-22 | 昆明理工大学 | Text categorization feature selection and weight computation method based on field knowledge |
CN107992470A (en) * | 2017-11-08 | 2018-05-04 | 中国科学院计算机网络信息中心 | A kind of text duplicate checking method and system based on similarity |
CN111539196A (en) * | 2020-04-15 | 2020-08-14 | 京东方科技集团股份有限公司 | Text duplicate checking method and device, text management system and electronic equipment |
CN117763106A (en) * | 2023-12-11 | 2024-03-26 | 中国科学院文献情报中心 | Document duplicate checking method and device, storage medium and electronic equipment |
Non-Patent Citations (2)
Title |
---|
Luo Xin: "A Comparative Study of Chinese Text Classification Models Based on Particle Swarm Intelligence", Journal of Library and Information Sciences in Agriculture, no. 04, 16 April 2018 (2018-04-16), pages 18 - 22 * |
Luo Xin; Wang Zhaoli; Lu Yonghe: "Research on Text Classification Based on Ant Colony Intelligence Algorithm", Library and Information Service, no. 02, 20 January 2011 (2011-01-20), pages 1 - 3 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107491531B (en) | Sentiment classification method for Chinese online comments based on an ensemble learning framework | |
TWI424325B (en) | Systems and methods for organizing collective social intelligence information using an organic object data model | |
CN112163424B (en) | Data labeling method, device, equipment and medium | |
CN111914099B (en) | Intelligent question-answering method, system, device and medium of traffic optimization strategy | |
CN109858626B (en) | Knowledge base construction method and device | |
CN112035675A (en) | Medical text labeling method, device, equipment and storage medium | |
CN109492105B (en) | Text emotion classification method based on multi-feature ensemble learning | |
CN112069312B (en) | Text classification method based on entity recognition and electronic device | |
CN115599902B (en) | Oil-gas encyclopedia question-answering method and system based on knowledge graph | |
CN113505200A (en) | Sentence-level Chinese event detection method combining document key information | |
CN111026884A (en) | Dialog corpus generation method for improving quality and diversity of human-computer interaction dialog corpus | |
CN111143531A (en) | Question-answer pair construction method, system, device and computer readable storage medium | |
CN110781673B (en) | Document acceptance method and device, computer equipment and storage medium | |
CN112667806B (en) | Text classification screening method using LDA | |
CN114611491A (en) | Intelligent government affair public opinion analysis research method based on text mining technology | |
CN110659365A (en) | Animal product safety event text classification method based on multi-level structure dictionary | |
CN109614490A (en) | Tendency analysis method for financial articles based on LSTM | |
US20220027748A1 (en) | Systems and methods for document similarity matching | |
CN115269816A (en) | Core personnel mining method and device based on information processing method and storage medium | |
CN114970554A (en) | Document checking method based on natural language processing | |
CN115146062A (en) | Intelligent event analysis method and system fusing expert recommendation and text clustering | |
CN112100368B (en) | Method and device for identifying dialogue interaction intention | |
CN112989049A (en) | Small sample text classification method and device, computer equipment and storage medium | |
CN110941713B (en) | Self-optimizing financial information block classification method based on topic model | |
CN118260385A (en) | Thesis duplicate checking system and method based on text feature extraction technology |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||