Disclosure of Invention
The embodiment of the invention provides a data cleaning method and a data cleaning device, which are used for overcoming the problems of low efficiency, no universality and usability of the existing data cleaning method.
An embodiment of the present invention provides a data cleaning method, including:
selecting a historical form which has the same description object as a current form from a historical form library, wherein the current form contains m title fields, the historical form contains n title fields, and m and n are positive integers;
calculating the similarity between each title field in the m title fields and each title field in the n title fields according to a first preset algorithm;
for any similarity SIM (i, j), if the ith header field is judged to be matched with the jth header field according to a preset matching rule, acquiring a constraint condition of the jth header field; wherein i represents the ith title field in the m title fields, j represents the jth title field in the n title fields, the value of i comprises all natural numbers not greater than m, and the value of j comprises all natural numbers not greater than n;
and performing data cleaning on data which does not meet the constraint condition in the data corresponding to the ith header field.
By combining the historical forms with the same description objects in the historical form library, the constraint condition of the title field of the historical form is adaptively applied to the title field of the current form, data corresponding to the title field of the current form is cleaned based on the constraint condition, and research personnel are not required to write and maintain a cleaning algorithm code program every time the data is cleaned, so that the use threshold of a user is reduced, the wide applicability is realized, and the working intensity of manually cleaning the data is reduced; the automatic cleaning of big data in the database is realized, the efficiency and the accuracy of data cleaning are improved, and the accuracy and the reliability of a data source are improved.
Further, for any similarity SIM (i, j), the determining that the ith header field and the jth header field match according to a preset matching rule includes:
and if the similarity SIM (i, j) is greater than a first preset value, judging that the ith header field in the m header fields is matched with the jth header field in the n header fields.
Further, for any similarity SIM (i, j), the determining that the ith header field and the jth header field match according to a preset matching rule includes:
if the similarity SIM (i, j) is not more than a first preset value and m and n are more than 1, determining k title fields according to the association relation of preset fields of the current form, and determining the maximum similarity SIM (subscriber identity Module) in the similarity between the s title field and each title field in the m title fields for the s title field in the k title fieldssThe value of s comprises all natural numbers not greater than k, k is the total number of the title fields associated with the ith title field determined according to the association relation of the preset fields of the current form, and k is smaller than m;
according to similarity SIMsCorrecting the SIM (i, j) by a second preset algorithm to obtain the corrected similarity SIM0(i,j);
If SIM0(i, j) is greater than the first preset value, it is determined that the ith header field of the m header fields matches the jth header field of the n header fields.
Further, for any similarity SIM (i, j), the determining that the ith header field and the jth header field match according to a preset matching rule includes:
if the similarity SIM (i, j) is not more than a first preset value and m and n are more than 1, determining k title fields according to the association relation of preset fields of the current form, and determining the maximum similarity SIM (subscriber identity Module) in the similarity between the s title field and each title field in the n title fields for the s title field in the k title fieldssThe value of s comprises all natural numbers not greater than k, k is the total number of the title fields associated with the ith title field determined according to the association relation of the preset fields of the current form, and k is smaller than m;
according to similarity SIMsCorrecting the SIM (i, j) by a second preset algorithm to obtainTo the corrected similarity SIM0(i,j);
If SIM0(i, j) is not more than the first preset value, then SIM is tested0(i, j) continuously correcting for Y times to obtain SIM0+Y(i, j) if SIM0+Y(i, j) when the number of the m header fields is greater than the first preset value, judging that the ith header field in the m header fields is matched with the jth header field in the n header fields;
wherein, in the y correction, SIM is selected according to the similaritysTo SIM by second preset algorithm0+y-1(i, j) correcting to obtain SIM0+y(i, j), the value of Y includes a positive integer not greater than Y.
Further, the second preset algorithm is a formula as follows:
wherein SIM is similarity to be corrected, SIM*For the corrected similarity, a is a preset weight coefficient.
In the process of determining the matched title fields, the similarity of the title fields is corrected by utilizing the incidence relation among the preset title fields in the form, so that the corrected more accurate similarity is obtained, more matched title fields can be determined, more constraint conditions can be obtained, and the efficiency of data cleaning is improved.
Further, calculating the similarity between each title field in the m title fields and each title field in the n title fields according to a first preset algorithm, including:
according to the coincidence degree of the title field names of the m title fields in the current form and the title field names of the n title fields in the historical form, obtaining the similarity between each title field in the m title fields and each title field in the n title fields; or
According to the title field names of m title fields in the current form and the title field names of n title fields in the historical form, acquiring the similarity between each title field in the m title fields and each title field in the n title fields according to a third-party knowledge base; or
And according to the coincidence degree of the field value sets of the m title fields in the current form and the field value sets of the n title fields in the historical form, acquiring the similarity between each title field in the m title fields and each title field in the n title fields.
Further, according to the coincidence degree of the field value sets of the m title fields in the current form and the field value sets of the n title fields in the history form, obtaining the similarity between each title field in the m title fields and each title field in the n title fields, including:
when the field value sets of the m title fields in the current form are discrete, the similarity is determined by the following formula:
wherein,the field representing the ith header field takes a value set,a field value set representing a jth header field; or
When the field value sets of the m title fields in the current form are continuous, the similarity is determined by the following formula:
therein, max1Denotes the maximum value, min, of the ith header field1Denotes the minimum value, max, of the ith header field2Denotes the maximum value, min, of the jth header field2Indicating the minimum value of the jth header field.
Further, after the data cleaning is performed on the current form, the method further includes:
and storing the current form and the constraint condition of the current form into a historical form library.
The following describes a data cleaning apparatus provided in an embodiment of the present invention, where the apparatus and the method correspond to each other one to one, so as to implement the data cleaning method in the above embodiment, and have the same technical features and technical effects, which are not described again in the present invention.
Another aspect of the embodiments of the present invention provides a data cleaning apparatus, including:
the history form acquisition module is used for selecting a history form which has the same description object as the current form from a history form library, wherein the current form contains m title fields, the history form contains n title fields, and m and n are positive integers;
the similarity calculation module is used for calculating the similarity between each title field in the m title fields and each title field in the n title fields acquired by the history form acquisition module according to a first preset algorithm;
the matching module is used for obtaining the constraint condition of the jth header field if the ith header field is judged to be matched with the jth header field according to the preset matching rule aiming at any similarity SIM (i, j) calculated by the similarity calculation module; wherein i represents the ith title field in the m title fields, j represents the jth title field in the n title fields, the value of i comprises all natural numbers not greater than m, and the value of j comprises all natural numbers not greater than n;
and the data cleaning module is used for cleaning the data which do not accord with the constraint condition obtained by the matching module in the data corresponding to the ith header field.
Further, the matching module is specifically configured to:
and aiming at any similarity SIM (i, j) obtained by calculation of the similarity calculation module, if the similarity SIM (i, j) is greater than a first preset value, judging that the ith header field in the m header fields is matched with the jth header field in the n header fields, and acquiring the constraint condition of the jth header field.
Further, the matching module is specifically configured to:
aiming at any similarity SIM (i, j) obtained by calculation of the similarity calculation module, if the similarity SIM (i, j) is not more than a first preset value and m and n are more than 1, k title fields are determined according to the association relation of preset fields of the current form, and for the s-th title field in the k title fields, the maximum similarity SIM (subscriber identity module) is determined in the similarity between the s-th title field and each title field in the n title fieldssThe value of s comprises all natural numbers not greater than k, k is the total number of the title fields associated with the ith title field determined according to the association relation of the preset fields of the current form, and k is smaller than m;
according to similarity SIMsCorrecting the SIM (i, j) by a second preset algorithm to obtain the corrected similarity SIM0(i,j);
If SIM0(i, j) is greater than the first preset value, it is determined that the ith header field of the m header fields matches the jth header field of the n header fields.
Further, the matching module is specifically configured to:
aiming at any similarity SIM (i, j) obtained by calculation of the similarity calculation module, if the similarity SIM (i, j) is not more than a first preset value and m and n are more than 1, determining k title fields according to the association relation of preset fields of the current form, and for the s title field in the k title fields, determining the similarity between the s title field and the n title fieldsDetermining the maximum similarity SIM in the similarity between each header field in the fieldssThe value of s comprises all natural numbers not greater than k, k is the total number of the title fields associated with the ith title field determined according to the association relation of the preset fields of the current form, and k is smaller than m;
according to similarity SIMsCorrecting the SIM (i, j) by a second preset algorithm to obtain the corrected similarity SIM0(i,j);
If SIM0(i, j) is not more than the first preset value, then SIM is tested0(i, j) continuously correcting for Y times to obtain SIM0+Y(i, j) if SIM0+Y(i, j) when the number of the m header fields is greater than the first preset value, judging that the ith header field in the m header fields is matched with the jth header field in the n header fields;
wherein, in the y correction, SIM is selected according to the similaritysTo SIM by second preset algorithm0+y-1(i, j) correcting to obtain SIM0+y(i, j), the value of Y includes a positive integer not greater than Y.
Further, the second preset algorithm is a formula as follows:
wherein SIM is similarity to be corrected, SIM*For the corrected similarity, a is a preset weight coefficient.
Further, the similarity calculation module includes:
the first calculation unit is used for acquiring the similarity between each title field in the m title fields and each title field in the n title fields according to the coincidence degree of the title field names of the m title fields in the current form and the title field names of the n title fields in the historical form; or
The second calculation unit is used for acquiring the similarity between each title field in the m title fields and each title field in the n title fields according to the title field names of the m title fields in the current form and the title field names of the n title fields in the historical form and according to a third-party knowledge base; or
And the third calculating unit is used for acquiring the similarity between each title field in the m title fields and each title field in the n title fields according to the coincidence degree of the field value sets of the m title fields in the current form and the field value sets of the n title fields in the history form.
Further, the third computing unit is specifically configured to:
when the field value sets of the m title fields in the current form are discrete, the similarity is determined by the following formula:
wherein,the field representing the ith header field takes a value set,a field value set representing a jth header field; or
When the field value sets of the m title fields in the current form are continuous, the similarity is determined by the following formula:
therein, max1Denotes the maximum value, min, of the ith header field1Denotes the minimum value of the ith header field, max2Denotes the maximum value, min, of the jth header field2Indicating the minimum value of the jth header field.
Further, on the basis of any one of the above embodiments of the apparatus, the apparatus further comprises:
and the storage module is used for storing the current form and the constraint condition of the current form into a historical form library.
Another aspect of the embodiments of the present invention further provides a data cleaning apparatus, including: memory, treater and bus, memory and treater are connected with the bus respectively, wherein:
the memory is used for storing data and program codes;
and the processor is used for reading the program codes stored in the memory and executing the data cleaning method.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In data storage, many statistical data are usually stored in the form of a form, such as a work type and content form, a staff basic information form, a floating population information form, and the like, and as the data in the form and the form are gradually increased, error data caused by the processes of acquisition, transmission, storage, processing, and the like may occur in the form, thereby affecting subsequent data query and analysis based on the form. The embodiment of the invention provides a data cleaning method aiming at possible error data, and considering that a plurality of historical forms subjected to data cleaning are stored in a form library, when the current form to be cleaned is consistent with the contents in the historical forms, the current form can be cleaned by referring to the constraint conditions of the data cleaning of the historical forms, and the data cleaning method can be widely applied to the data cleaning of the data stored in the form.
The embodiment of the invention provides a data cleaning method, which is used for automatically cleaning a large amount of data stored in a database, finding out possible error data and removing or correcting the error data.
Usually, a large amount of data is stored in a database in the form of a form, and a plurality of forms are stored in the database. A form stores a plurality of columns of data, and a column of data is typically composed of a header field and data corresponding to the header field. The table one shown below shows one common form. Illustratively, the table one includes a plurality of title fields with title field names "name", "ID", "gender", etc. Each title field has different attributes according to different data corresponding to the title field, and different association relations exist among the title fields. When data which does not meet the constraint condition exists in the data corresponding to the title field, the data which does not meet the constraint condition is error data, and the constraint condition comprises the attribute of the title field or the incidence relation among the title fields and the like. For example, the corresponding data in the header field "ID" has uniqueness, that is, the duplicate data "4" should not exist in the plurality of data corresponding to the header field "ID"; the title field 'city' has a specific value range attribute, namely, data of 'degree' which is not in a specific value range should not exist in the title field 'city'; there is a one-to-one correspondence between the title field "city" and the title field "area code", i.e., there is an error in the correspondence between "achievement" in the title field "city" in table one and "029" in the title field "area code". Several possible error data are exemplarily marked in table one in an underline manner.
Watch 1
In the embodiment of the present invention, a data cleaning method is provided for solving the problem that error data exists in a large amount of data as described above, where the method is used to perform data cleaning on a form, and for convenience of description, the form to be subjected to data cleaning is referred to as a current form. The method comprises the steps of firstly obtaining a history form similar to a current form, determining a title field matched with the title field of the current form in the history form, and then carrying out data cleaning on the corresponding matched title field in the current form according to a constraint condition of the matched title field in the history form to clean data which does not meet the constraint condition.
Fig. 1 is a schematic flow chart of a data cleaning method according to a first embodiment of the present invention. The execution subject of the embodiment is a data cleaning device, which may be provided in a processor. As shown in fig. 1, the method of this embodiment may include:
step 101, selecting a historical form which has the same description object as a current form from a historical form library, wherein the current form contains m title fields, the historical form contains n title fields, and m and n are positive integers;
102, calculating the similarity between each title field in the m title fields and each title field in the n title fields according to a first preset algorithm;
103, aiming at any similarity SIM (i, j), if the matching of the ith header field and the jth header field is judged according to a preset matching rule, acquiring a constraint condition of the jth header field; wherein i represents the ith title field in the m title fields, j represents the jth title field in the n title fields, the value of i comprises all natural numbers not greater than m, and the value of j comprises all natural numbers not greater than n;
and 104, performing data cleaning on data which does not meet the constraint condition in the data corresponding to the ith header field.
In order to distinguish the title fields in the history form from the title fields in the current form, one title field of m title fields in the current form is recorded as an ith title field, one title field of n title fields in the history form is recorded as a jth title field, where m and n are positive integers, a value of i includes all natural numbers not greater than m, a value of j includes all natural numbers not greater than n, and a similarity SIM (i, j) indicates a similarity between the ith title field and the jth title field.
Specifically, in step 101, for the current form, first, a history form having the same description object as the current form is obtained by comparison or query among a plurality of history forms stored in the history form library.
Specifically, in the history form library, there may be a history form having the same description object as the current form, that is, containing the same title field, and when the number of the same title field is greater and the number of the different title fields is smaller, it indicates that the current form is more similar to the history form, and the history form having the same description object as the current form may be screened out in the history form library by referring to the similarity principle.
Specifically, in step 102, the similarity between the m title fields in the current form and the n title fields in the history form is calculated. Illustratively, traversing all the title fields in the current form according to a first preset algorithm, and obtaining the similarity between each title field and each title field in the history form to obtain a similarity SIM (i, j), wherein i represents the ith title field in m title fields, j represents the jth title field in n title fields, the value of i includes all natural numbers not greater than m, and the value of j includes all natural numbers not greater than n.
In a specific implementation process, a specific implementation manner of obtaining the similarity SIM (i, j) may be any one or combination of the following implementation manners.
According to the overlap ratio of the title field names of the m title fields in the current form and the title field names of the n title fields in the historical form, the similarity between each title field in the m title fields and each title field in the n title fields is obtained.
For example, for a title field in the current form, if the title field name is "name", when a title field with the title field name of "name" also exists in the history form, the two title fields may be considered to be completely overlapped, and the similarity between the two title fields is 1; the coincidence degree between the title field of the title field name "in the current form and the title field of the title field name" monthly income "in the history form is 0, and therefore the similarity between these two title fields can be considered as 0.
According to another possible implementation manner, according to the title field names of the m title fields in the current form and the title field names of the n title fields in the historical form, the similarity between each title field in the m title fields and each title field in the n title fields is obtained according to a third-party knowledge base.
For example, when the field names "name" and "name" are stored in the third-party knowledge base as synonyms, the similarity between the field with the field name "in the current form and the field with the field name" in the history form may be considered to be 1.
In another possible implementation manner, the similarity between each title field in the m title fields and each title field in the n title fields is obtained according to the coincidence degree of the field value sets of the m title fields in the current form and the field value sets of the n title fields in the history form.
Specifically, for the difference of the field value set types of the title field in the form, the possible implementation manners include the following two cases:
in the first case, when the value set of the field of the title field is a discrete type, the similarity between the discrete type title field in the current form and each title field in the history form is obtained according to the following formula (1);
wherein, formula (1) is:
wherein,the field representing the ith header field takes a value set,indicates the j-th titleThe fields of the field take a set of values.
When the similarity is calculated by adopting the method, the similarity between the discrete title field in the current form and the non-discrete title field in the historical form can be considered as 0, and the calculation by adopting the formula (1) is not needed. As can be seen from equation (1), the more the same value in the field value sets of the two header fields, the higher the similarity.
For example, common discrete title fields are "city", "academic calendar", and the like. For example, when a value set for the "work city" title field includes: beijing, Shanghai and Shenzhen, and the value set of another title field 'work place' comprises: the similarity between the two fields is calculated by adopting the formula (1), and the similarity is 75% between the two fields, namely beijing, shanghai and shenzhen.
In the second case, when the value set of the fields of the title field is a continuous type, the similarity between the continuous type title field in the current form and all the title fields in the history form is obtained according to the following formula (2);
wherein, the formula (2) is:
therein, max1Denotes the maximum value, min, of the ith header field1Denotes the minimum value, max, of the ith header field2Denotes the maximum value, min, of the jth header field2Indicating the minimum value of the jth header field. For example, common consecutive header fields are "age", "payroll", and the like.
For the similarity between each title field in the current form and each title field in the history form, which is obtained by any one of the above implementation manners, the numeric value range of the similarity is a real number from 0 to 1. A similarity of 0 indicates that the two title fields from the current form and the history form, respectively, do not have any identical attributes or associations, and are considered to be invalid similarities. In the process of actually calculating the similarity, the similarity between any title field in the current form and each title field in the historical form usually has at most 1 valid similarity greater than 0, and the remaining similarity values are all 0.
Further, more accurate similarity can be obtained according to the similarity obtained by the different implementation modes and the preset weight of the similarity obtained by each implementation mode.
Specifically, in step 103, in all the similarities obtained in step 102, for any similarity SIM (i, j), if it is determined that the ith header field and the jth header field are matched according to a preset matching rule, the constraint condition of the jth header field is obtained.
For any form, according to the value set of each title field in the form and the relationship between multiple title fields, a constraint condition can be constructed for the form, and the constraint condition generally comprises the association relationship between the title field attribute and the title field.
Optionally, the title field attribute may be one or more of the following: reliability, uniqueness, labels, field synonyms, value ranges, and the like. Optionally, the association relationship may also be one or more of the following: correlation, order preservation, one-to-one mapping, etc. The constraint condition mainly limits the value range of each title field in the form and the interrelation of data corresponding to a plurality of title fields, and the data which do not meet the constraint condition can be regarded as error data which needs to be cleaned.
The history forms in the history form library respectively correspond to own constraint conditions, and when two title fields from the history forms and the current forms are matched title fields, the two title fields can be considered to have the same or similar title field attribute and the incidence relation between the title fields, namely, the data corresponding to the two title fields meet the same constraint conditions. Therefore, the constraint condition of the jth header field can be obtained after the ith header field and the jth header field are determined to be matched.
Specifically, in step 104, since the ith header field and the jth header field are matched, the ith header field is considered to be in accordance with the constraint condition of the jth header field, the constraint condition of the jth header field can be directly used to examine the data corresponding to the ith header field, the data which are not in accordance with the constraint condition in all the data corresponding to the ith header field are determined, and the data are cleaned. By determining as many pairs of title field matches as possible, more complete data cleaning of the title fields in the current form can be performed.
When the numerical values which do not accord with the attribute of the title field or the relation between the title fields in the current form are cleaned according to the constraint condition, the cleaning process comprises deleting the error value and providing a correction value. For example, when the title field is known to be "gender" according to the constraint condition, the value range of the title field can be "male, female, unknown", and when certain data corresponding to the "gender" title field in the current form is detected to be "beijing", the data is considered to be error data and needs to be cleaned, and after the "beijing" is deleted, the data can be corrected by the "unknown" according to the value range of the "gender" title field, so that the data cleaning of the current form is completed. Illustratively, when it is detected that certain data corresponding to the title field of "gender" in the current form is "male", it can be known that the synonym of "male" is "male" according to the value range and the synonym of the field, so that the "male" in the current form can be corrected to be "male", and data cleaning is completed.
Further, when data is cleaned in step 104, when data is modified and replaced, the source data and the replacement data to be modified can be displayed to the operator through the display screen, and the source data and the replacement data are corrected after the operator confirms the source data or are not corrected after the operator rejects the source data, so that the accuracy of data cleaning can be improved by adding the confirmation step.
In the specific implementation of the data cleaning process, a third-party knowledge base or an expert knowledge base can be called to inquire the title field and synonyms, near-synonyms, associated expanded words and the like of the data corresponding to the title field.
Optionally, when the history form with the same description object cannot be obtained in step 101, or when the history form with the same description object is obtained in step 101, the constraint conditions such as the attribute of the title field and the relationship between the title fields can be directly and manually input; or the data cleaning device automatically matches the title field according to the preset title field attribute and the relation between the title fields stored in the expert knowledge base, configures the constraint conditions stored in the expert knowledge base for the title field in the current form, and cleans the data.
Further, on the basis of the above embodiment, the data cleaning method provided in the embodiment of the present invention further includes:
and storing the current form and the constraint condition of the current form into a historical form library.
The current form after data cleaning each time and the constraint conditions of the current form are stored in the historical form library, so that the historical form library is expanded, and the data cleaning method provided by the embodiment of the invention is conveniently applied again for data cleaning in the follow-up process.
According to the data cleaning method provided by the embodiment of the invention, the constraint condition adaptability of the title field of the historical form is adaptively applied to the title field of the current form by combining the historical forms with the same description object in the historical form library, and the data corresponding to the title field of the current form is cleaned based on the constraint condition, so that the compiling and maintenance of a cleaning algorithm code program are not required when research personnel carry out data cleaning each time, the use threshold of a user is reduced, the data cleaning method has wide applicability, and the working intensity of manual data cleaning is reduced; the automatic cleaning of big data in the database is realized, the efficiency and the accuracy of data cleaning are improved, and the accuracy and the reliability of a data source are improved.
The following describes in detail the determination of matching between two header fields according to a preset matching rule, with reference to a specific embodiment, based on the above embodiment.
Fig. 2 is a schematic diagram of a form scene of a second embodiment of the data cleaning method of the present invention. Fig. 2 schematically shows a current form and a screened history form having the same description object, to which the data cleaning method of the present invention is applied, and indicates similarity between partial header fields in the current form and the history form, and fig. 2 also shows an association relationship between partial header fields preset in part of the current form and an association relationship between partial header fields in the history form.
After the similarity SIM (i, j) of each title field in the current form and each title field in the history form is obtained, whether the ith title field and the jth title field corresponding to the similarity SIM (i, j) are matched or not is judged according to a preset matching rule aiming at any similarity SIM (i, j).
Specifically, a specific implementation manner of the preset matching rule may be any one of the following implementation manners.
The first feasible implementation mode is as follows:
and if the similarity SIM (i, j) is greater than a first preset value, judging that the ith header field in the m header fields is matched with the jth header field in the n header fields.
For example, with reference to fig. 2, as to the obtained similarity 1 between the "professional" title field in the current form and the "professional" title field in the history form, it is determined whether the similarity 1 is greater than a first preset value of 0.9, and since the similarity 1 is greater than the first preset value of 0.9, it is determined that the ith title field and the jth title field are mutually matched title fields, and both have similar title field attributes and an association relationship between the title fields, that is, it is considered that the "professional" title field in the current form and the "professional" title field in the history form are mutually matched, and both should have the same constraint condition. For example, when the "job" title field in the history form has a similar title field of "work", the "job" title field in the current form also has a similar title field of "work". The data cleaning of the current form is realized by directly applying the constraint conditions of the 'occupation' title field of the historical form to the 'occupation' title field in the current form. The value range of the first preset value is a real number from 0 to 1, which can be preset in advance or modified appropriately in the matching process.
The second feasible implementation mode is as follows:
if the similarity SIM (i, j) is not more than a first preset value and m and n are more than 1, determining k title fields according to the association relation of preset fields of the current form, and determining the maximum similarity SIM (subscriber identity Module) in the similarity between the s title field and each title field in the m title fields for the s title field in the k title fieldssThe value of s comprises all natural numbers not greater than k, k is the total number of the title fields associated with the ith title field determined according to the association relation of the preset fields of the current form, and k is smaller than m;
according to similarity SIMsCorrecting the SIM (i, j) by a second preset algorithm to obtain the corrected similarity SIM0(i,j);
If SIM0(i, j) is greater than the first preset value, it is determined that the ith header field of the m header fields matches the jth header field of the n header fields.
Specifically, when the similarity SIM (i, j) is determined to be smaller than the first preset value, it cannot be directly determined that the ith header field and the jth header field are matched. When m and n are greater than 1, a plurality of title fields do exist in the current form and the historical form, the similarity SIM (i, j) can be corrected according to the similarity of other title fields which have incidence relations with the title field corresponding to the similarity SIM (i, j) to obtain more accurate similarity, and when the similarity SIM (i, j) is corrected0(i, j) is greater than the first preset value, it can be determined that the ith header field and the jth header field are matched as well.
Optionally, the first preset value may be a fixed value, or may be adaptively changed to another value when the similarity is corrected. Optionally, before comparing the similarity SIM (i, j) with the first preset value, the method further includes determining that the similarity SIM (i, j) is greater than 0, and for the ith header field and the jth header field of which the similarity SIM (i, j) is 0, directly determining that the ith header field and the jth header field are not matched, without performing correction.
Illustratively, the method for correcting the similarity SIM (i, j) is as follows:
and determining k title fields associated with the ith title field according to the association relationship of the preset fields of the current form, wherein k is the total number of the title fields associated with the ith title field determined according to the association relationship of the preset fields of the current form, and is less than m. For the s-th header field of the k header fields, determining the maximum similarity SIM among the similarities between the s-th header field and each of the m header fieldssWherein, the value of s includes all natural numbers not greater than K, namely, K similarity SIMs can be determined according to the association relation of the preset fields1、SIM2…SIMk-1、SIMk. Then, K similarity SIMs1、SIM2…SIMk-1、SIMkCorrecting the SIM (i, j) by a second preset algorithm to obtain the corrected similarity SIM0(i,j)。
Specifically, in the correction, the second preset algorithm is the following formula (3):
wherein SIM is similarity to be corrected, SIM*For the corrected similarity, a is a preset weight coefficient.
When the similarity pairs are corrected, all the similarity SIMs (i, j) smaller than the first preset value can be sorted, and the correction is performed in sequence from high to low. And correcting all the similarity SIMs (i, j) smaller than the first preset value to obtain the corrected similarity so as to improve the matching rate of the title fields in the current form and the historical form and carry out more sufficient data cleaning on the current form.
In conjunction with fig. 2, with reference to the multiple similarities in the current and historical forms that have been calculated in fig. 2, wherein, the similarity between the 'occupation' title field in the current form and the 'occupation' title field in the historical form is 1 and is more than a first preset value of 0.9, the two title fields can be directly determined to be matched with each other, thus, the data of the "professional" title field in the current form can be cleaned according to the constraint conditions of the "professional" title field in the historical form, for example, if the data corresponding to the "professional" title field in the current form contains data such as "Beijing", "male", "2000", and the like, according to the condition that the value-taking set defined in the constraint condition of "professional" title field in the history form is "official, programmer and none", the data "Beijing", "Man" and "2000" can be considered as error data and need to be modified. The similarity between the "monthly income" title field in the current form and the "monthly salary" title field in the history form is 0.7, and the similarity between the "academic calendar" title field in the current form and the "highest academic calendar" title field in the history form is 0.8. Meanwhile, the 'monthly income' title field in the current form is associated with the 'academic calendar' title field and the 'professional' title field, when the 'monthly salary' title field in the history form is associated with the 'highest academic calendar' title field and the 'professional' title field, the modified similarity SIM of the title field 'monthly income' and the title field 'monthly salary' is adopted0(i, j), which can be determined by the following equation:
when a is 0.4, the corrected similarity is 0.817, when the first preset value is still 0.9, the title field monthly income and the title field monthly salary are not matched, if the first preset value is modified to be 0.81 in the correction process, the title field monthly income and the title field monthly salary are matched with each other, and the constraint conditions related to the title field monthly salary in the history form can be applied to the current form for data cleaning of the current form.
The feasible implementation mode three is as follows:
if the similarity SIM (i, j) is not more than a first preset value and m and n are more than 1, determining k title fields according to the association relation of preset fields of the current form, and determining the maximum similarity SIM (subscriber identity Module) in the similarity between the s title field and each title field in the n title fields for the s title field in the k title fieldssThe value of s comprises all natural numbers not greater than k, k is the total number of the title fields associated with the ith title field determined according to the association relation of the preset fields of the current form, and k is smaller than m;
according to similarity SIMsCorrecting the SIM (i, j) by a second preset algorithm to obtain the corrected similarity SIM0(i,j);
If SIM0(i, j) is not more than the first preset value, then SIM is tested0(i, j) continuously correcting for Y times to obtain SIM0+Y(i, j) if SIM0+Y(i, j) when the number of the m header fields is greater than the first preset value, judging that the ith header field in the m header fields is matched with the jth header field in the n header fields;
wherein, in the y correction, SIM is selected according to the similaritysTo SIM by second preset algorithm0+y-1(i, j) correcting to obtain SIM0+y(i, j), the value of Y includes a positive integer not greater than Y.
Specifically, as with the modification method in the second feasible implementation manner, any similarity SIM (i, j) may be modified first to obtain a modified similarity SIM (i, j)0(i, j), the detailed modification method is not described herein again.
When the similarity SIM after correction0(i, j) is greater than the first preset value, the i-th header field is judged to be matched with the j-th header field, and when the similarity SIM after correction is detected0If (i, j) is still smaller than the first preset value, the corrected similarity SIM can be corrected0(i, j) performing a plurality of successive corrections. Optionally, in the multiple correction processes, the SIMs are all based on the similaritysAnd carrying out iterative correction on the similarity obtained by the last correction through a second preset algorithm. Particularly, in the iterative process, in the y correction, the SIM is corrected according to the similaritysTo SIM by second preset algorithm0+y-1(i, j) correcting to obtain SIM0+y(i, j), wherein the value of Y comprises a positive integer not greater than Y, and Y is a preset value. When the SIM is corrected for Y times0+YAnd (i, j) when the (i, j) is larger than the first preset value, judging that the ith header field is matched with the jth header field. Optionally, in the iterative correction, if the SIM is obtained0+XAnd (i, j) is greater than a first preset value, and X is smaller than Y, stopping iterative correction. The second preset algorithm is similar to the second implementation manner, and is not described in detail herein.
Optionally, when the continuous correction is performed for multiple times, the corrected similarity may be used to replace the similarity before the correction, so as to improve the correction efficiency. In the specific replacement process, the following two replacement modes are included.
One possible alternative mode is:
for all similarities, before the y-th correction, all similarity SIMs larger than the first preset value are determined0+y-1(i, j) replacing the similarity before correction with the similarity values; and determining all similarity SIMs smaller than the first preset value0+y-1(i, j) forming a similarity set E, and performing the y-th correction in the similarity set E according to the sequence of the similarity from high to low to obtain the corrected similarity SIM0+y(i, j), then determining all similarity SIMs greater than a first preset value0+y(i, j) using a similarity SIM greater than a first preset value0+y(i, j) substitution of similarity SIM before correction0+y-1(i, j). Before the y-th correction is carried out on the similarity to be corrected, the similarity before correction is replaced by the corrected similarity, and therefore correction efficiency can be improved.
Another possible alternative mode:
for all similarities, before correcting the similarity SIM (i, j), determining all the similarities SIM (i, j) smaller than a first preset value to form a similarity set F, correcting the similarities in the similarity set F according to the sequence from high to low, and obtaining each corrected similarity SIM0(i, j), the similarity SIM is judged0(i, j) whether the similarity is greater than a first preset value or not, if so, replacing the similarity before correction by the corrected similarity until the similarity is combined with the correction and replacement of all the similarities in F, and then, replacing all the corrected similarities SIM still smaller than the first preset value0(i, j) composition similarity set F*In the similarity set F*In the above-mentioned method, correction and replacement are carried out in the same manner as the first correction until the Y-times continuous correction is completed.
According to the data cleaning method provided by the embodiment of the invention, the constraint condition adaptability of the title field of the historical form is adaptively applied to the title field of the current form by combining the historical forms with the same description object in the historical form library, and the data corresponding to the title field of the current form is cleaned based on the constraint condition, so that the compiling and maintenance of a cleaning algorithm code program are not required when research personnel carry out data cleaning each time, the use threshold of a user is reduced, the data cleaning method has wide applicability, and the working intensity of manual data cleaning is reduced; the automatic cleaning of big data in the database is realized, the efficiency and the accuracy of data cleaning are improved, and the accuracy and the reliability of a data source are improved.
Fig. 3 is a schematic structural diagram of a data cleansing apparatus according to a first embodiment of the present invention. As shown in fig. 3, the apparatus of the present embodiment may include:
a history form obtaining module 301, configured to select a history form having the same description object as a current form from a history form library, where the current form includes m title fields, and the history form includes n title fields, where m and n are positive integers;
the similarity calculation module 302 is configured to calculate, according to a first preset algorithm, a similarity between each title field of the m title fields and each title field of the n title fields, which are acquired by the history form acquisition module 301;
a matching module 303, configured to, for any similarity SIM (i, j) calculated by the similarity calculation module 302, obtain a constraint condition of the jth header field if it is determined that the ith header field and the jth header field are matched according to a preset matching rule; wherein i represents the ith title field in the m title fields, j represents the jth title field in the n title fields, the value of i comprises all natural numbers not greater than m, and the value of j comprises all natural numbers not greater than n;
and the data cleaning module 304 is configured to perform data cleaning on data that does not meet the constraint condition obtained by the matching module 303 in the data corresponding to the ith header field.
The apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and the implementation principle and the technical effect are similar, which are not described herein again.
Further, on the basis of the above embodiment, the matching module 303 is specifically configured to:
and aiming at any similarity SIM (i, j) obtained by calculation of the similarity calculation module, if the similarity SIM (i, j) is greater than a first preset value, judging that the ith header field in the m header fields is matched with the jth header field in the n header fields, and acquiring the constraint condition of the jth header field.
Further, on the basis of the above embodiment, the matching module 303 is specifically configured to:
aiming at any similarity SIM (i, j) obtained by calculation of the similarity calculation module, if the similarity SIM (i, j) is not more than the second similarity SIMWhen m and n are greater than 1 and a preset value, k title fields are determined according to the incidence relation of the preset fields of the current form, and for the s-th title field in the k title fields, the SIM with the maximum similarity is determined in the similarity between the s-th title field and each title field in the n title fieldssThe value of s comprises all natural numbers not greater than k, k is the total number of the title fields associated with the ith title field determined according to the association relation of the preset fields of the current form, and k is smaller than m;
according to similarity SIMsCorrecting the SIM (i, j) by a second preset algorithm to obtain the corrected similarity SIM0(i,j);
If SIM0(i, j) is greater than the first preset value, it is determined that the ith header field of the m header fields matches the jth header field of the n header fields.
Further, on the basis of the above embodiment, the matching module 303 is specifically configured to:
aiming at any similarity SIM (i, j) obtained by calculation of the similarity calculation module, if the similarity SIM (i, j) is not more than a first preset value and m and n are more than 1, k title fields are determined according to the association relation of preset fields of the current form, and for the s-th title field in the k title fields, the maximum similarity SIM (subscriber identity module) is determined in the similarity between the s-th title field and each title field in the n title fieldssThe value of s comprises all natural numbers not greater than k, k is the total number of the title fields associated with the ith title field determined according to the association relation of the preset fields of the current form, and k is smaller than m;
according to similarity SIMsCorrecting the SIM (i, j) by a second preset algorithm to obtain the corrected similarity SIM0(i,j);
If SIM0(i, j) is not more than the first preset value, then SIM is tested0(i, j) continuously correcting for Y times to obtain SIM0+Y(i, j) if SIM0+Y(i, j) is greater than the first preset value, then judgeDetermining that the ith header field of the m header fields matches the jth header field of the n header fields;
wherein, in the y correction, SIM is selected according to the similaritysTo SIM by second preset algorithm0+y-1(i, j) correcting to obtain SIM0+y(i, j), the value of Y includes a positive integer not greater than Y.
Further, the second preset algorithm is a formula as follows:
wherein SIM is similarity to be corrected, SIM*For the corrected similarity, a is a preset weight coefficient.
Further, on the basis of any one of the above device embodiments, the similarity calculation module 302 includes:
the first calculation unit is used for acquiring the similarity between each title field in the m title fields and each title field in the n title fields according to the coincidence degree of the title field names of the m title fields in the current form and the title field names of the n title fields in the historical form; or
The second calculation unit is used for acquiring the similarity between each title field in the m title fields and each title field in the n title fields according to the title field names of the m title fields in the current form and the title field names of the n title fields in the historical form and according to a third-party knowledge base; or
And the third calculating unit is used for acquiring the similarity between each title field in the m title fields and each title field in the n title fields according to the coincidence degree of the field value sets of the m title fields in the current form and the field value sets of the n title fields in the history form.
Further, on the basis of the foregoing embodiment, the third calculating unit is specifically configured to:
when the field value sets of the m title fields in the current form are discrete, the similarity is determined by the following formula:
wherein,the field representing the ith header field takes a value set,a field value set representing a jth header field; or
When the field value sets of the m title fields in the current form are continuous, the similarity is determined by the following formula:
therein, max1Denotes the maximum value, min, of the ith header field1Denotes the minimum value, max, of the ith header field2Denotes the maximum value, min, of the jth header field2Indicating the minimum value of the jth header field.
Further, on the basis of any one of the above embodiments of the apparatus, the apparatus further comprises:
and the storage module is used for storing the current form and the constraint condition of the current form into a historical form library.
FIG. 4 is a schematic structural diagram of a data cleansing apparatus according to the present invention. The apparatus may be used to perform the data cleansing method as described in the above embodiments. As shown in fig. 4, the apparatus includes: a processor 401, a memory 402 and a bus 405, wherein the processor 401 and the memory 402 are respectively connected to the bus 405, and wherein:
memory 402 is used to store data and store program code;
the processor 401 is configured to read the program code stored in the memory 402 and execute the data cleaning method.
Specifically, a large amount of data and program codes are stored in the memory 402, the data is stored in a form, and the processor 401 cleans and corrects error data which may exist in the memory 402 by implementing the data cleaning method of the present invention; optionally, the user equipment further includes a display 403, where the display 403 is used to display the cleaning and correction results of the processor 401, and may also be used to display the intermediate processing procedure of the processor 401; optionally, the user equipment further includes a memory 404, and preset data such as a third-party database, a history list library, an expert knowledge library and the like are stored in the memory 404, so that the processor 401 can call the data cleaning method of the present invention conveniently, and optionally, the preset data such as the third-party database, the history list library, the expert knowledge library and the like can also be stored in the memory 402. The third-party knowledge base comprises a near-meaning word base of a preset title field, the historical form base comprises historical forms which are subjected to data cleaning, the data accuracy in the historical forms is high, the historical forms correspond to respective constraint conditions, and the expert knowledge base comprises constraint conditions preset based on expert knowledge. The device modules in fig. 4 may be integrated in the same computer or may be connected only through a network.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.