Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides the following technical solution: a quality evaluation and graded early warning method for public accumulation fund data, comprising the following steps:
collecting m groups of evaluation data from a public accumulation fund system, wherein m is an integer greater than 1;
performing data cleaning on the m groups of evaluation data;
performing quality evaluation on the m groups of evaluation data after data cleaning, and automatically identifying quality problems;
generating corresponding early warning instructions for the evaluation data according to the results of data cleaning and quality problem identification.
Further, the evaluation data includes at least personal information, transaction records, and balance records;
the personal information comprises the name, the identification card number, the telephone number and the address of the public accumulation fund account holder;
The transaction records comprise a payment record and an extraction record;
the payment record comprises a single payment amount, a single payment date and a payment unit;
The extraction record comprises a single extraction amount and a single extraction date;
the balance record is the balance information of the public accumulation fund account;
each group of evaluation data corresponds to one public accumulation fund account in the public accumulation fund system.
Further, the step of performing data cleansing on the m groups of evaluation data includes:
step 1: deleting repeated data in m groups of evaluation data;
step 2: deleting unreasonable data in the m groups of evaluation data.
Further, the method for deleting the repeated data in the m groups of evaluation data comprises the following steps:
Calculating a corresponding hash value of each data in each group of personal information in m groups of evaluation data by adopting a hash function, and marking the hash value as a personal hash value; comparing each personal hash value, and judging whether the same personal hash value exists or not; if the same personal hash value exists, marking the data with the same personal hash value as repeated data, reserving one repeated data in n repeated data with the same personal hash value, deleting the rest n-1 repeated data from m groups of evaluation data, wherein n is an integer larger than 1;
if both the name and the identification card number in the personal information of one group of evaluation data have been deleted, deleting the corresponding group of evaluation data from the m groups of evaluation data.
Further, the step of deleting unreasonable data in the m sets of evaluation data includes:
Step 201: judging whether the identification card number or the telephone number in each group of personal information in the m groups of evaluation data is marked as unreasonable data;
Obtaining, by means of a string function, the string length of the identification card number and of the telephone number in each group of personal information in the m groups of evaluation data, marking the string length of the identification card number as the identification card length, and marking the string length of the telephone number as the telephone length; presetting an identification card length threshold and a telephone length threshold; comparing each identification card length with the identification card length threshold, marking the corresponding identification card number as unreasonable data if the two are not equal, and not marking it as unreasonable data if they are equal; comparing each telephone length with the telephone length threshold, marking the corresponding telephone number as unreasonable data if the two are not equal, and not marking it as unreasonable data if they are equal;
Step 202: judging whether the single payment amount, the single extraction amount or the balance record in the m groups of evaluation data is marked as unreasonable data;
Analyzing the single payment amount, the single extraction amount and the balance record in the m groups of evaluation data; if the single payment amount is less than 0, marking the corresponding single payment amount as unreasonable data, and if the single payment amount is greater than or equal to 0, not marking the corresponding single payment amount as unreasonable data; if the single extraction amount is smaller than 0, marking the corresponding single extraction amount as unreasonable data, and if the single extraction amount is larger than or equal to 0, not marking the corresponding single extraction amount as unreasonable data; if the balance record is smaller than 0, marking the corresponding balance record as unreasonable data, and if the balance record is larger than or equal to 0, not marking the corresponding balance record as unreasonable data;
step 203: judging whether the single payment date or the single extraction date in the m groups of evaluation data is marked as unreasonable data;
Marking the single payment date and the single extraction date in the m groups of evaluation data as date data, inputting the date data into a trained date analysis model, and judging whether the date data is unreasonable or not;
The training process of the date analysis model comprises the following steps:
Setting corresponding judgment results for a pieces of date data in advance, wherein a is an integer larger than 1, the judgment results comprise reasonable date and unreasonable date, and different digital labels are set for the reasonable date and unreasonable date;
Marking the digital label of the judgment result as a judgment label, and converting the date data and the corresponding judgment label into a corresponding group of feature vectors;
Taking each group of feature vectors as the input of the date analysis model, wherein the date analysis model takes the predicted judgment label corresponding to each group of date data as the output and takes the actual judgment label corresponding to each group of date data as the prediction target, the actual judgment label being the preset digital label of the judgment result corresponding to the date data; taking minimization of the sum of the prediction errors of all date data as the training target; training the date analysis model until the sum of the prediction errors converges, and then stopping training; the date analysis model is a deep neural network model;
Obtaining a corresponding judgment result according to the predicted judgment label; if the judgment result is that the date is reasonable, the corresponding date data is not marked as unreasonable data; if the judgment result is that the date is unreasonable, marking the corresponding date data as unreasonable data;
step 204: unreasonable data is deleted from the m sets of evaluation data.
Further, the method for performing quality evaluation on the m groups of evaluation data after data cleaning comprises the following steps:
The identification card numbers in the m groups of evaluation data are sent to a query server, wherein the query server comprises a social security query server, a household registration query server and an operator query server; the social security inquiring server inquires according to the identity card number, generates a corresponding name and a payment unit and feeds back the name and the payment unit; the household registration inquiry server inquires according to the ID card number, generates a corresponding address and feeds back the address; the operator inquiry server inquires according to the ID card number, generates a corresponding telephone number and feeds back the telephone number;
The name, the payment unit, the address and the telephone number acquired through feedback are respectively calculated to corresponding hash values by adopting a hash function, and the hash values are marked as comparison hash values; calculating corresponding hash values of payment units in the m groups of evaluation data by adopting a hash function, and marking the corresponding hash values as personal hash values;
constructing an analysis set by comparing hash values corresponding to the identification card numbers; constructing an evaluation set by using the personal hash value corresponding to the identification card number; marking the name, the payment unit, the address and the telephone number as analysis data; comparing the analysis set and the evaluation set which correspond to the same identification card number, and comparing each comparison hash value in the analysis set with the corresponding personal hash value in the evaluation set;
if the comparison hash value is the same as the corresponding personal hash value, the analysis data corresponding to the comparison hash value and the personal hash value are not marked as problem data;
If the comparison hash value is different from the corresponding personal hash value, marking the analysis data corresponding to the comparison hash value and the personal hash value as problem data;
if the analysis data corresponding to each comparison hash value in one analysis set is marked as problem data, marking the corresponding analysis set as a problem set, and marking the identification card number corresponding to the problem set as problem data; then the telephone numbers corresponding to the problem set are sent to an operator inquiry server, and the operator inquiry server inquires according to the telephone numbers, generates corresponding names and feeds back; calculating a corresponding hash value of the fed back name by adopting a hash function, and marking the hash value as a secondary comparison hash value; marking names corresponding to the problem set as comparison names, and comparing the comparison hash values corresponding to the comparison names with the secondary comparison hash values; if the comparison hash value is consistent with the secondary comparison hash value, not marking the telephone number and the name corresponding to the problem set as problem data; if the comparison hash value is inconsistent with the secondary comparison hash value, marking the telephone number and the name corresponding to the problem set as problem data;
if the identification card number in a group of evaluation data has been deleted, the undeleted name, payment unit, address and telephone number in the corresponding evaluation data are marked as problem data.
Further, the early warning instructions comprise high-level early warning instructions and low-level early warning instructions; the high-level early warning instructions comprise a first-level early warning instruction and a second-level early warning instruction.
Further, the method for generating the corresponding early warning instructions for the evaluation data according to the results of data cleaning and quality problem identification comprises the following steps:
if the identification card number or the name in a group of evaluation data is marked as problem data or deleted, generating a key error instruction;
if the identification card number and the name in a group of evaluation data are both marked as problem data or deleted, generating a multi-key error instruction;
if any of the transaction records, the telephone number or the address in a group of evaluation data is marked as problem data or deleted, generating a general error instruction;
if the transaction records, the telephone number and the address in a group of evaluation data are all marked as problem data or deleted, generating a multi-general error instruction;
if a multi-key error instruction is generated for a group of evaluation data, generating a second-level early warning instruction;
if no multi-key error instruction is generated for a group of evaluation data, but a key error instruction is generated, generating a first-level early warning instruction;
if neither a multi-key error instruction nor a key error instruction is generated for a group of evaluation data, but a multi-general error instruction is generated, generating a first-level early warning instruction;
if no multi-key error instruction, key error instruction or multi-general error instruction is generated for a group of evaluation data, but a general error instruction is generated, generating a low-level early warning instruction.
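The grading rules above can be sketched as a small decision function. This is only an illustrative sketch: the field names, the `warning_level` function and the returned strings are hypothetical, and it assumes that a multi-key or multi-general error means that all of the corresponding fields are flagged (marked as problem data or deleted).

```python
# Hypothetical field names; "flagged" means marked as problem data or deleted.
KEY_FIELDS = ("id_number", "name")
GENERAL_FIELDS = ("transaction_records", "phone", "address")


def warning_level(flagged):
    """Map the set of flagged fields of one group of evaluation data to a warning level."""
    key_hits = sum(field in flagged for field in KEY_FIELDS)
    general_hits = sum(field in flagged for field in GENERAL_FIELDS)

    if key_hits == len(KEY_FIELDS):          # multi-key error instruction
        return "second-level"
    if key_hits >= 1:                        # key error instruction only
        return "first-level"
    if general_hits == len(GENERAL_FIELDS):  # multi-general error instruction
        return "first-level"
    if general_hits >= 1:                    # general error instruction only
        return "low-level"
    return None                              # nothing flagged, no warning


levels = [
    warning_level({"id_number", "name"}),
    warning_level({"name", "phone"}),
    warning_level({"transaction_records", "phone", "address"}),
    warning_level({"address"}),
    warning_level(set()),
]
```

The checks are ordered from most to least severe, so each group of evaluation data receives exactly one level even when several error instructions would apply.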
Further, analyzing the evaluation data corresponding to the names with the same personal hash value, and judging whether the evaluation data are marked as repeated data or not;
marking the evaluation data corresponding to the names with the same personal hash value as the same data;
if the identification card numbers in the b groups of same data are all marked as repeated data, the names in the b groups of same data are also marked as repeated data, wherein b is an integer greater than 1;
if the b groups of same data contain identification card numbers that are not marked as repeated data, taking the personal hash values corresponding to each group of same data as one test set, so that the test sets correspond one-to-one with the groups of same data; marking each group of same data whose identification card number is not marked as repeated data as different data; assigning sequentially increasing digital labels to the test sets corresponding to the same data and to the test sets corresponding to the different data, and marking the digital label of each test set as its test label, wherein the test labels range over $[1, b']$ with $b' = b$;
taking the test labels corresponding to the same data together with the test label corresponding to one group of different data as one total test set; sequentially inputting each total test set into a trained probability prediction model to predict the corresponding repetition probability;
The training process of the probability prediction model comprises the following steps:
collecting repetition probability corresponding to the b groups of total test sets in advance, and converting the total test sets and the corresponding repetition probability into a corresponding group of feature vectors;
taking each group of feature vectors as the input of the probability prediction model, wherein the probability prediction model takes the predicted repetition probability corresponding to each total test set as the output and takes the actual repetition probability corresponding to each total test set as the prediction target, the actual repetition probability being the repetition probability collected in advance for that total test set; taking minimization of the sum of the prediction errors over all total test sets as the training target; training the probability prediction model until the sum of the prediction errors converges, and then stopping training; the probability prediction model is a deep neural network model;
presetting a probability threshold $P_T$;
comparing the repetition probability $P_C$ with the probability threshold $P_T$;
if $P_C \le P_T$, the names in the different data are not marked as repeated data, and the names in the same data are marked as repeated data;
if $P_C > P_T$, both the names in the different data and the names in the same data are marked as repeated data.
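The threshold comparison above can be illustrated with a short sketch. The function name and the threshold value of 0.5 are hypothetical; the text does not specify a threshold, and the repetition probability would in practice come from the trained probability prediction model.

```python
def mark_names(repetition_probability, threshold):
    """Return (mark_same, mark_different): whether names in the same data and
    in the different data are marked as repeated data."""
    mark_same = True                                      # marked in both branches
    mark_different = repetition_probability > threshold   # only when above the threshold
    return mark_same, mark_different


P_T = 0.5  # hypothetical preset probability threshold
below = mark_names(0.3, P_T)
equal = mark_names(0.5, P_T)
above = mark_names(0.7, P_T)
```

Note that a repetition probability exactly equal to the threshold falls into the first branch, so the names in the different data are retained.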
A public accumulation fund data quality evaluation and graded early warning system, which implements the above public accumulation fund data quality evaluation and graded early warning method, comprises the following modules:
the data collection module, used for collecting m groups of evaluation data from the public accumulation fund system, wherein m is an integer greater than 1;
the data preprocessing module, used for performing data cleaning on the m groups of evaluation data;
the data quality evaluation module, used for performing quality evaluation on the m groups of evaluation data after data cleaning and automatically identifying quality problems;
the early warning module, used for generating corresponding early warning instructions for the evaluation data according to the results of data cleaning and quality problem identification.
The quality evaluation and graded early warning system and method for public accumulation fund data of the invention have the following technical effects and advantages:
1. The method cleans the public accumulation fund data and evaluates its quality, automatically generates early warning instructions of different levels according to the quality problems found, and sends the early warning instructions to the relevant personnel electronically; problem data is checked comprehensively through several data verification means, so that quality management of public accumulation fund data is automated and processed efficiently, the overall quality of the data is improved, the manual auditing burden is reduced, and business decisions become more scientific and accurate.
2. The evaluation data corresponding to names with the same personal hash value is analyzed, and a probability prediction model predicts the repetition probability, so that the judgment is quantified and the decision on whether a name is marked as repeated data is more accurate; the method balances the identification of repeated data against the retention of normal data, so that cleaning removes repeated data while minimizing information loss, which improves the efficiency and quality of cleaning the public accumulation fund data and the accuracy of its evaluation.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of protection of the present invention.
Example 1
Referring to FIG. 1, the public accumulation fund data quality evaluation and graded early warning system of this embodiment includes a data collection module, a data preprocessing module, a data quality evaluation module and an early warning module; the modules are connected in a wired and/or wireless manner so that data can be transmitted among them.
The data collection module is used for collecting m groups of evaluation data from the public accumulation fund system, wherein m is an integer greater than 1.
The evaluation data includes at least personal information, transaction records, and balance records.
The personal information includes the name, identification card number, telephone number, and address of the public accumulation fund account holder.
The transaction records include a payment record and an extraction record.
The payment record comprises a single payment amount, a single payment date and a payment unit.
The extraction record includes a single extraction amount and a single extraction date.
The balance record is the balance information of the public accumulation fund account.
Each group of evaluation data corresponds to one public accumulation fund account in the public accumulation fund system.
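The structure of one group of evaluation data described above might be represented as follows. This is a minimal sketch; the class and attribute names are hypothetical and the sample values are invented for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PaymentRecord:
    amount: float   # single payment amount
    date: str       # single payment date, kept as raw text until it is validated
    unit: str       # payment unit


@dataclass
class ExtractionRecord:
    amount: float   # single extraction amount
    date: str       # single extraction date


@dataclass
class EvaluationData:
    """One group of evaluation data, i.e. one public accumulation fund account."""
    name: Optional[str]        # None once a field has been deleted during cleaning
    id_number: Optional[str]
    phone: Optional[str]
    address: Optional[str]
    payments: List[PaymentRecord] = field(default_factory=list)
    extractions: List[ExtractionRecord] = field(default_factory=list)
    balance: Optional[float] = None


account = EvaluationData(
    name="Zhang San",
    id_number="110101199001011234",   # made-up 18-character value
    phone="13800000000",              # made-up 11-character value
    address="1 Example Road",
    payments=[PaymentRecord(500.0, "2023-01-15", "Example Co.")],
    extractions=[ExtractionRecord(200.0, "2023-03-01")],
    balance=300.0,
)
```

Dates are held as plain strings here because the cleaning step is what decides whether a date is reasonable.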
The data preprocessing module is used for performing data cleaning on the m groups of evaluation data.
The step of performing data cleaning on the m groups of evaluation data comprises:
Step 1: deleting repeated data in the m groups of evaluation data.
The method for deleting the repeated data in the m groups of evaluation data comprises the following steps:
Calculating a corresponding hash value for each item of data in each group of personal information in the m groups of evaluation data by means of a hash function, and marking the hash value as a personal hash value, wherein the hash function is, for example, SM3, SHA3-512 or Whirlpool; comparing the personal hash values and judging whether identical personal hash values exist; if identical personal hash values exist, marking the data with the identical personal hash value as repeated data, retaining one of the n items of repeated data with the identical personal hash value, and deleting the remaining n-1 items of repeated data from the m groups of evaluation data, wherein n is an integer greater than 1.
If both the name and the identification card number in the personal information of one group of evaluation data have been deleted, the corresponding group of evaluation data is deleted from the m groups of evaluation data. The reason is that when the name and the identification card number in several groups of evaluation data are identical, the public accumulation fund system holds repeated data for a single public accumulation fund account, and the redundant data needs to be deleted.
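The hash-based duplicate removal described above can be sketched as follows. The sketch uses SHA3-512 from Python's standard `hashlib`; the dictionary layout, field names and function names are assumptions made for illustration, and a deleted field is represented by `None`.

```python
import hashlib

PERSONAL_FIELDS = ("name", "id_number", "phone", "address")  # hypothetical field names


def personal_hash(value):
    """Personal hash value of one item of personal information (SHA3-512)."""
    return hashlib.sha3_512(value.encode("utf-8")).hexdigest()


def remove_duplicates(groups):
    """Keep the first of n items with the same personal hash value and delete the
    other n-1; then drop any group whose name and ID number are both deleted."""
    cleaned = [dict(group) for group in groups]   # leave the input untouched
    for field_name in PERSONAL_FIELDS:
        seen = set()
        for group in cleaned:
            value = group.get(field_name)
            if value is None:
                continue
            digest = personal_hash(value)
            if digest in seen:
                group[field_name] = None          # one of the n-1 repeated items
            else:
                seen.add(digest)
    return [
        group for group in cleaned
        if not (group.get("name") is None and group.get("id_number") is None)
    ]


groups = [
    {"name": "Zhang San", "id_number": "110101199001011234",
     "phone": "13800000000", "address": "1 Example Road"},
    {"name": "Zhang San", "id_number": "110101199001011234",
     "phone": "13800000000", "address": "1 Example Road"},
    {"name": "Li Si", "id_number": "110101199202022345",
     "phone": "13900000000", "address": "2 Example Road"},
]
result = remove_duplicates(groups)
```

In this example the second group is an exact copy of the first, so all of its personal fields are deleted and the group itself is then removed.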
Step 2: unreasonable data in the m sets of evaluation data is deleted.
The step of deleting unreasonable data in the m sets of evaluation data includes:
step 201: judging whether the identification card number or the telephone number in each group of personal information in the m groups of evaluation data is marked as unreasonable data.
For the identification card number and the telephone number in each group of personal information in the m groups of evaluation data, a string function is used to obtain the corresponding string length; the string length of the identification card number is marked as the identification card length, and the string length of the telephone number is marked as the telephone length; the string function is, for example, length or strlen. An identification card length threshold and a telephone length threshold are preset. Each identification card length is compared with the identification card length threshold: if the two are not equal, the corresponding identification card number is marked as unreasonable data; if they are equal, it is not marked as unreasonable data. Each telephone length is compared with the telephone length threshold: if the two are not equal, the corresponding telephone number is marked as unreasonable data; if they are equal, it is not marked as unreasonable data. The identification card length threshold and the telephone length threshold are preset by a person skilled in the art according to the actual situation; preferably, the identification card length threshold is 18 and the telephone length threshold is 11.
Step 202: judging whether the single payment amount, the single extraction amount or the balance record in the m groups of evaluation data is marked as unreasonable data.
Analyzing the single payment amount, the single extraction amount and the balance record in the m groups of evaluation data; if the single payment amount is less than 0, marking the corresponding single payment amount as unreasonable data, and if the single payment amount is greater than or equal to 0, not marking the corresponding single payment amount as unreasonable data; if the single extraction amount is smaller than 0, marking the corresponding single extraction amount as unreasonable data, and if the single extraction amount is larger than or equal to 0, not marking the corresponding single extraction amount as unreasonable data; if the balance record is less than 0, marking the corresponding balance record as unreasonable data, and if the balance record is greater than or equal to 0, not marking the corresponding balance record as unreasonable data.
It should be noted that a single payment amount, a single extraction amount and a balance record cannot legitimately be less than 0; therefore, any single payment amount, single extraction amount or balance record in the m groups of evaluation data that is less than 0 is unreasonable data.
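Steps 201 and 202 amount to simple length and sign checks, which can be sketched as below. The thresholds 18 and 11 are the preferred values given in the text; the dictionary keys and the function name are hypothetical.

```python
ID_LENGTH_THRESHOLD = 18      # preferred identification card length threshold
PHONE_LENGTH_THRESHOLD = 11   # preferred telephone length threshold


def find_unreasonable(group):
    """Return the names of the fields in one group that are unreasonable data."""
    flagged = []

    # Step 201: string length checks.
    id_number = group.get("id_number")
    if id_number is not None and len(id_number) != ID_LENGTH_THRESHOLD:
        flagged.append("id_number")
    phone = group.get("phone")
    if phone is not None and len(phone) != PHONE_LENGTH_THRESHOLD:
        flagged.append("phone")

    # Step 202: amounts and balance must not be less than 0.
    for key in ("payment_amounts", "extraction_amounts"):
        for index, amount in enumerate(group.get(key, [])):
            if amount < 0:
                flagged.append(f"{key}[{index}]")
    balance = group.get("balance")
    if balance is not None and balance < 0:
        flagged.append("balance")

    return flagged


example = {
    "id_number": "12345",               # too short
    "phone": "13800000000",
    "payment_amounts": [500.0, -20.0],  # second payment is negative
    "extraction_amounts": [0.0],
    "balance": -1.0,
}
flagged = find_unreasonable(example)
```

An amount of exactly 0 is accepted, matching the "greater than or equal to 0" condition in the text.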
Step 203: judging whether the single payment date or the single extraction date in the m groups of evaluation data is marked as unreasonable data.
Marking the single payment date and the single extraction date in the m groups of evaluation data as date data, inputting the date data into a trained date analysis model, and judging whether the date data is unreasonable.
The specific training process of the date analysis model comprises the following steps:
Setting corresponding judgment results for a pieces of date data in advance, wherein a is an integer greater than 1; the judgment results comprise reasonable date and unreasonable date, and different digital labels are set for them, illustratively 0 for a reasonable date and 1 for an unreasonable date. The a different pieces of date data are collected by a person skilled in the art during quality evaluation of historical public accumulation fund data; the person skilled in the art judges from practical experience whether each piece of date data is unreasonable (for example, January 32 or February 30) and sets the corresponding judgment result for each piece in turn.
Marking the digital label of the judgment result as a judgment label, and converting the date data and the corresponding judgment label into a corresponding group of feature vectors.
Taking each group of feature vectors as the input of the date analysis model, wherein the date analysis model takes the predicted judgment label corresponding to each group of date data as the output and takes the actual judgment label corresponding to each group of date data as the prediction target, the actual judgment label being the preset digital label of the judgment result corresponding to the date data; taking minimization of the sum of the prediction errors of all date data as the training target, wherein the prediction error is calculated as $Z_p = (\alpha_p - \mu_p)^2$, where $Z_p$ is the prediction error, $p$ is the group number of the feature vector corresponding to the date data, $\alpha_p$ is the predicted judgment label corresponding to the p-th group of date data, and $\mu_p$ is the actual judgment label corresponding to the p-th group of date data; training the date analysis model until the sum of the prediction errors converges, and then stopping training.
The date analysis model is specifically a deep neural network model.
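The labelling and the training objective described above can be illustrated without building the neural network itself. In this sketch, calendar validity checked with Python's `datetime.date` stands in for the expert judgment that the text relies on, which is an assumption; the (year, month, day) representation, the function names and the sample predictions are likewise hypothetical.

```python
from datetime import date


def judgment_label(year, month, day):
    """Digital label: 0 for a reasonable date, 1 for an unreasonable date.
    Calendar validity is used here as a stand-in for expert judgment."""
    try:
        date(year, month, day)   # raises ValueError for dates such as January 32
    except ValueError:
        return 1
    return 0


def prediction_error(predicted, actual):
    """Squared prediction error for one group of date data."""
    return (predicted - actual) ** 2


def total_error(predictions, labels):
    """Sum of the prediction errors over all date data (the training target)."""
    return sum(prediction_error(p, a) for p, a in zip(predictions, labels))


samples = [(2023, 1, 31), (2023, 1, 32), (2023, 2, 30)]
labels = [judgment_label(*sample) for sample in samples]
predictions = [0.1, 0.8, 0.6]   # hypothetical model outputs, not real results
loss = total_error(predictions, labels)
```

A training loop would adjust the model parameters to reduce `loss` and stop once it no longer decreases appreciably.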
Obtaining a corresponding judgment result according to the predicted judgment label; if the judgment result is that the date is reasonable, the corresponding date data is not marked as unreasonable data; if the judgment result is that the date is unreasonable, marking the corresponding date data as unreasonable data.
Step 204: unreasonable data is deleted from the m sets of evaluation data.
The data quality evaluation module is used for performing quality evaluation on m groups of evaluation data after data cleaning and automatically identifying quality problems.
The method for carrying out quality evaluation on m groups of evaluation data after data cleaning comprises the following steps:
The identification card numbers in the m groups of evaluation data are sent to query servers, wherein the query servers comprise a social security query server, a household registration query server and an operator query server, and the operator query server comprises a Unicom query server, a Telecom query server and a Mobile query server; the social security query server performs a query according to the identification card number, obtains the corresponding name and payment unit, and feeds them back; the household registration query server performs a query according to the identification card number, obtains the corresponding address, and feeds it back; the operator query server performs a query according to the identification card number, obtains the corresponding telephone number, and feeds it back.
The name, the payment unit, the address and the telephone number acquired through feedback are respectively calculated to corresponding hash values by adopting a hash function, and the hash values are marked as comparison hash values; and calculating corresponding hash values of the payment units in the m groups of evaluation data by adopting a hash function, and marking the hash values as personal hash values.
Constructing an analysis set from the comparison hash values corresponding to each identification card number; constructing an evaluation set from the personal hash values corresponding to each identification card number; marking the name, the payment unit, the address and the telephone number as analysis data; comparing the analysis set and the evaluation set that correspond to the same identification card number, and comparing each comparison hash value in the analysis set with the corresponding personal hash value in the evaluation set, wherein a comparison hash value is only compared with the personal hash value of the same item of analysis data, for example name with name and payment unit with payment unit.
If the comparison hash value is the same as the corresponding personal hash value, the analysis data corresponding to the comparison hash value and the personal hash value are not marked as problem data, and the corresponding data is indicated to have no quality problem.
If the comparison hash value is different from the corresponding personal hash value, the analysis data corresponding to the comparison hash value and the personal hash value is marked as problem data; this indicates that the corresponding analysis data has a quality problem, that is, the analysis data stored in the public accumulation fund system is wrong and needs to be corrected.
If the analysis data corresponding to each comparison hash value in one analysis set is marked as problem data, marking the corresponding analysis set as a problem set, and marking the identification card number corresponding to the problem set as problem data; then the telephone numbers corresponding to the problem set are sent to an operator inquiry server, and the operator inquiry server inquires according to the telephone numbers, generates corresponding names and feeds back; calculating a corresponding hash value of the fed back name by adopting a hash function, and marking the hash value as a secondary comparison hash value; marking names corresponding to the problem set as comparison names, and comparing the comparison hash values corresponding to the comparison names with the secondary comparison hash values; if the comparison hash value is consistent with the secondary comparison hash value, not marking the telephone number and the name corresponding to the problem set as problem data; if the comparison hash value is inconsistent with the secondary comparison hash value, telephone numbers and names corresponding to the problem set are still marked as problem data.
It should be noted that the name is queried by telephone number because, when all the analysis data corresponding to a problem set is marked as problem data, the cause may be an error in the identification card number itself. Querying the name by telephone number makes it possible to determine whether the telephone number and the name correspond to each other. If they do, the telephone number and the name were marked as problem data only because the identification card number is wrong, so they do not need to be marked as problem data.
If the identification card number in a group of evaluation data has been deleted, the undeleted name, payment unit, address and telephone number in the corresponding evaluation data are marked as problem data.
The early warning module is used for generating corresponding early warning instructions for the evaluation data according to the results of data cleaning and quality problem identification.
The early warning instructions comprise high-level early warning instructions and low-level early warning instructions; the high-level early warning instructions comprise a first-level early warning instruction and a second-level early warning instruction.
If the identification card number or the name in one group of evaluation data is marked as problem data or deleted, a key error instruction is generated, indicating that key data in the evaluation data is in error.
If the identification card number and the name in one group of evaluation data are both marked as problem data or deleted, a multi-key error instruction is generated, indicating that all key data in the evaluation data is in error.
If transaction records, telephone numbers or addresses in a set of assessment data are marked as problem data or deleted, a general error instruction is generated, which indicates that non-critical data in the assessment data have errors.
If the transaction records, telephone numbers and addresses in one group of evaluation data are all marked as problem data or deleted, a multi-general error instruction is generated, indicating that all non-key data in the evaluation data is in error.
If a multi-key error instruction is generated for a group of evaluation data, a second-level high-level early warning instruction is generated, indicating that all key data in the group of evaluation data is in error and that the modification priority is the highest.
If no multi-key error instruction is generated for a group of evaluation data but a key error instruction is generated, a first-level early warning instruction is generated, indicating that key data in the group of evaluation data is in error, but not all of it, and that the modification priority is higher.
If neither a multi-key error instruction nor a key error instruction is generated for a group of evaluation data but a multi-general error instruction is generated, a first-level early warning instruction is generated, indicating that all non-key data in the group of evaluation data is in error and that the modification priority is higher.
If none of a multi-key error instruction, a key error instruction or a multi-general error instruction is generated for a group of evaluation data but a general error instruction is generated, a low-level early warning instruction is generated, indicating that non-key data in the group of evaluation data is in error and that the modification priority is lower.
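The grading rules above can be sketched as a small decision function. This is only an illustrative sketch: the field names and the function `warning_level` are hypothetical, the multi-general case is assumed to mean that all three non-key items are flagged (by analogy with the multi-key case), and the key-error case is mapped to the first level, following the wording of the method embodiment.

```python
from typing import Optional, Set

# Hypothetical field names; the specification only names the data items.
KEY_FIELDS = ("id_number", "name")
GENERAL_FIELDS = ("transaction_records", "phone", "address")


def warning_level(flagged: Set[str]) -> Optional[str]:
    """Map the fields marked as problem data or deleted to a warning level."""
    key_hits = [f for f in KEY_FIELDS if f in flagged]
    general_hits = [f for f in GENERAL_FIELDS if f in flagged]

    multi_key = len(key_hits) == len(KEY_FIELDS)              # all key data wrong
    key = len(key_hits) >= 1                                  # some key data wrong
    multi_general = len(general_hits) == len(GENERAL_FIELDS)  # assumption: all three
    general = len(general_hits) >= 1

    # Rules are checked from highest to lowest priority.
    if multi_key:
        return "second-level"
    if key:
        return "first-level"
    if multi_general:
        return "first-level"
    if general:
        return "low-level"
    return None  # nothing flagged, no warning
```

Checking the rules in descending priority order means each group of evaluation data receives exactly one early warning instruction.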
The generated early warning instruction is sent to the relevant business personnel by e-mail, short message or other means, so that they can modify and process the evaluation data promptly and effectively according to the early warning instruction.
This embodiment cleans the accumulation fund data and evaluates its quality, automatically generates early warning instructions of different levels according to the quality problems found, and sends the early warning instructions to the relevant personnel electronically. Problem data is checked comprehensively through several data verification means, so that quality management of accumulation fund data is automated and handled efficiently, the overall quality of the accumulation fund data is improved, the manual auditing burden is reduced, and business decisions become more scientific and accurate.
Example 2
Referring to fig. 2, this embodiment further improves on embodiment 1. In embodiment 1, names that are repeated but have no quality problem are deleted from the m groups of evaluation data when repeated data is deleted, even though names in personal information may legitimately repeat, that is, several people may share the same name. This embodiment therefore provides an accumulation fund data quality evaluation and grading early warning system that further comprises a data analysis module.
The data analysis module is used for analyzing the evaluation data corresponding to names with the same personal hash value and judging whether the names are marked as repeated data.
The evaluation data corresponding to names with the same personal hash value is marked as same data.
If the identification card numbers in all b groups of same data are marked as repeated data, the names in the b groups of same data are also marked as repeated data, where b is an integer greater than 1; each of the b groups of same data corresponds to one accumulation fund account.
If identification card numbers that are not marked as repeated data exist in the b groups of same data, the personal hash value corresponding to each group of same data is taken as one test set, that is, the test sets correspond one-to-one with the same data; the same data corresponding to an identification card number that is not marked as repeated data is marked as different data; digital labels are assigned in increasing order to the test sets corresponding to the same data and the test sets corresponding to the different data, the digital label of each test set is marked as its test label, and the test labels range over [1, b'], where b' = b.
The test labels corresponding to the same data and the test labels corresponding to a group of different data are used as a group of total test sets; and sequentially inputting each group of total test set into a trained probability prediction model to predict corresponding repetition probability.
The specific training process of the probability prediction model comprises the following steps:
The repetition probabilities corresponding to the b groups of total test sets are collected in advance, and each total test set and its corresponding repetition probability are converted into a corresponding group of feature vectors.
Each group of feature vectors is taken as input of the probability analysis model; the probability analysis model takes the predicted repetition probability corresponding to each group of total test sets as output and the actual repetition probability corresponding to each group of total test sets as the prediction target, where the actual repetition probability is the repetition probability collected in advance for that total test set; minimizing the sum of the prediction errors over all total test sets is taken as the training target; the prediction error is calculated as $Z_k = (\alpha_k - \mu_k)^2$, where $Z_k$ is the prediction error, k is the group number of the feature vector corresponding to the total test set, $\alpha_k$ is the repetition probability predicted for the k-th group of total test sets, and $\mu_k$ is the actual repetition probability corresponding to the k-th group of total test sets; the probability analysis model is trained until the sum of the prediction errors converges, and training then stops.
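As a minimal illustration of the training target described above, the following sketch computes the per-group squared prediction error and the summed error that training would minimize. The function names are hypothetical, and the model that produces the predicted probabilities is not shown.

```python
from typing import List, Sequence


def prediction_errors(predicted: Sequence[float], actual: Sequence[float]) -> List[float]:
    """Squared error per total test set: (predicted - actual) ** 2."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return [(alpha - mu) ** 2 for alpha, mu in zip(predicted, actual)]


def training_loss(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Sum of the prediction errors over all total test sets."""
    return sum(prediction_errors(predicted, actual))
```

For example, predicted probabilities 0.5 and 0.25 against actual probabilities 0.0 and 0.75 give errors of 0.25 each and a summed error of 0.5.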
The probability analysis model is specifically a deep neural network model comprising an input layer, hidden layers and an output layer; each hidden layer comprises a plurality of neurons, each neuron is connected to the neurons of the next layer, and each connection carries a weight that determines the importance and influence of the data passed through the network; an activation function is applied to each neuron in the hidden layers and the output layer to introduce nonlinearity, allowing the network to learn more complex patterns and features.
It should be noted that, in the process of evaluating the quality of historical accumulation fund data, a person skilled in the art collects b groups of different total test sets and, for each group of total test sets, analyzes the corresponding repetition probability on the basis of practical experience.
A probability threshold P_T is preset. The probability threshold P_T is obtained as follows: in the process of evaluating the quality of historical accumulation fund data, a person skilled in the art collects c groups of different total test sets in which the names in the different data are consistent with the names in the same data but have no quality problem; the c groups of different total test sets are sequentially input into the probability analysis model to obtain the corresponding repetition probabilities, and the average of the c repetition probabilities is taken as the probability threshold P_T.
The repetition probability P_C is compared with the probability threshold P_T.
If P_C ≤ P_T, the names in the different data are not marked as repeated data, and the names in the same data are marked as repeated data; this indicates that the names in the different data are consistent with the names in the same data, but the names in the different data have no quality problem.
If P_C > P_T, the names in the different data and the names in the same data are all marked as repeated data; this indicates that the names in the different data have quality problems and need to be deleted.
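The threshold construction and comparison can be sketched as follows. The function names and the labels "same_data" and "different_data" are hypothetical, and the repetition probabilities are assumed to come from the trained model, which is not shown here.

```python
from statistics import fmean
from typing import Sequence, Set


def probability_threshold(reference_probs: Sequence[float]) -> float:
    """Threshold = mean repetition probability of the c reference groups."""
    return fmean(reference_probs)


def names_to_mark(p_c: float, p_t: float) -> Set[str]:
    """Return which names are marked as repeated data for one total test set."""
    if p_c <= p_t:
        # Names in the different data are legitimate repeats and are kept.
        return {"same_data"}
    # Names in the different data also have quality problems.
    return {"same_data", "different_data"}
```

Names in the same data are marked in both branches; only the treatment of names in the different data depends on the comparison.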
This embodiment analyzes the evaluation data corresponding to names with the same personal hash value and uses the probability prediction model to predict the repetition probability, so that the judgment is quantified and the decision on whether a name is marked as repeated data is more accurate. It balances the identification of repeated data against the retention of normal data, so that cleaning removes repeated data while minimizing information loss, which improves the efficiency and quality of accumulation fund data cleaning and the accuracy of accumulation fund data evaluation.
Example 3
Referring to fig. 3, for details not described in this embodiment, refer to embodiments 1 and 2. This embodiment provides an accumulation fund data quality evaluation and grading early warning method, which includes:
m sets of evaluation data are collected from the accumulation system, m being an integer greater than 1.
Data cleaning is performed on the m groups of evaluation data.
Quality evaluation is performed on the m groups of evaluation data after data cleaning, and quality problems are automatically identified.
Corresponding early warning instructions are generated for the evaluation data that has undergone data cleaning and quality problem identification.
In particular, the assessment data includes at least personal information, transaction records, and balance records.
The personal information includes the name, identification card number, telephone number and address of the public accumulation account holder.
The transaction records comprise a payment record and an extraction record.
The payment record comprises a single payment amount, a single payment date and a payment unit.
The withdrawal record includes a single withdrawal amount and a single withdrawal date.
The balance record is the balance information in the public accumulation account.
One set of evaluation data corresponds to one of the accumulation accounts in the accumulation system.
Specifically, the step of performing data cleaning on m groups of evaluation data includes:
Step 1: repeated data in the m groups of evaluation data is deleted.
Step 2: unreasonable data in the m sets of evaluation data is deleted.
Further, the method for deleting the repeated data in the m groups of evaluation data comprises the following steps:
Calculating a corresponding hash value of each data in each group of personal information in m groups of evaluation data by adopting a hash function, and marking the hash value as a personal hash value; comparing each personal hash value, and judging whether the same personal hash value exists or not; if the same personal hash value exists, marking the data with the same personal hash value as repeated data, reserving one repeated data in n repeated data with the same personal hash value, deleting the rest n-1 repeated data from m groups of evaluation data, wherein n is an integer larger than 1.
If both the name and the identification card number in the personal information of one group of evaluation data are deleted, the corresponding group of evaluation data is deleted from the m groups of evaluation data.
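The hash-based deletion of repeated data can be sketched as below. This is an illustration under stated assumptions: SHA-256 stands in for the unspecified hash function, "deleting" a repeated item is modeled as setting the field to None, and the field and function names are hypothetical.

```python
import hashlib
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def personal_hash(value: str) -> str:
    """Hash one personal-information item (SHA-256 is an assumed choice)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def delete_repeated(records: List[Record], field: str) -> List[Record]:
    """Keep the first of n items with the same hash; clear the other n-1."""
    seen = set()
    for record in records:
        value = record.get(field)
        if value is None:
            continue
        digest = personal_hash(value)
        if digest in seen:
            record[field] = None  # repeated data: delete this item
        else:
            seen.add(digest)
    return records


def drop_empty_identities(records: List[Record]) -> List[Record]:
    """Remove groups whose name and identification card number are both deleted."""
    return [
        r for r in records
        if not (r.get("name") is None and r.get("id_number") is None)
    ]
```

Applying `delete_repeated` to both "name" and "id_number" and then calling `drop_empty_identities` removes any group of evaluation data that duplicates an earlier group in both items.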
Specifically, the step of deleting unreasonable data in the m sets of evaluation data includes:
Step 201: judging whether the identification card number or the telephone number in each group of personal information in the m groups of evaluation data is marked as unreasonable data.
A string length function is used to obtain the string lengths of the identification card number and the telephone number in each group of personal information in the m groups of evaluation data; the string length of the identification card number is marked as the identification card length, and the string length of the telephone number is marked as the telephone length. An identification card length threshold and a telephone length threshold are preset. Each identification card length is compared with the identification card length threshold: if they are inconsistent, the corresponding identification card number is marked as unreasonable data; if they are consistent, it is not. Each telephone length is compared with the telephone length threshold: if they are inconsistent, the corresponding telephone number is marked as unreasonable data; if they are consistent, it is not.
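A minimal sketch of the length check in step 201 follows. The threshold values of 18 and 11 characters are assumptions chosen for illustration (the specification only says the thresholds are preset), and the field and function names are hypothetical.

```python
from typing import Dict, Set

ID_LENGTH_THRESHOLD = 18     # assumed value, not given in the text
PHONE_LENGTH_THRESHOLD = 11  # assumed value, not given in the text


def unreasonable_lengths(personal_info: Dict[str, str]) -> Set[str]:
    """Return the fields whose string length differs from its threshold."""
    flagged = set()
    if len(personal_info["id_number"]) != ID_LENGTH_THRESHOLD:
        flagged.add("id_number")
    if len(personal_info["phone"]) != PHONE_LENGTH_THRESHOLD:
        flagged.add("phone")
    return flagged
```

The check is an exact-length comparison, so values that are either too short or too long are flagged.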
Step 202: judging whether the single payment amount, the single extraction amount or the balance record in the m groups of evaluation data is marked as unreasonable data.
Analyzing the single payment amount, the single extraction amount and the balance record in the m groups of evaluation data; if the single payment amount is less than 0, marking the corresponding single payment amount as unreasonable data, and if the single payment amount is greater than or equal to 0, not marking the corresponding single payment amount as unreasonable data; if the single extraction amount is smaller than 0, marking the corresponding single extraction amount as unreasonable data, and if the single extraction amount is larger than or equal to 0, not marking the corresponding single extraction amount as unreasonable data; if the balance record is less than 0, marking the corresponding balance record as unreasonable data, and if the balance record is greater than or equal to 0, not marking the corresponding balance record as unreasonable data.
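Step 202 amounts to a sign check on three numeric items, which might be sketched as follows. The field names and the function name are hypothetical.

```python
from typing import Dict, List

AMOUNT_FIELDS = ("single_payment_amount", "single_withdrawal_amount", "balance")


def unreasonable_amounts(record: Dict[str, float]) -> List[str]:
    """Return the amount fields that are negative; zero is treated as reasonable."""
    return [field for field in AMOUNT_FIELDS if record[field] < 0]
```

A value of exactly 0 is not flagged, matching the "greater than or equal to 0" condition in the text.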
Step 203: judging whether the single payment date or the single extraction date in the m groups of evaluation data is marked as unreasonable data.
Marking the single payment date and the single extraction date in the m groups of evaluation data as date data, inputting the date data into a trained date analysis model, and judging whether the date data is unreasonable.
The training process of the date analysis model comprises the following steps:
Corresponding judging results are set for a pieces of date data in advance, a is an integer larger than 1, the judging results comprise reasonable date and unreasonable date, and different digital labels are set for the reasonable date and unreasonable date.
And marking the digital label of the judgment result as a judgment label, and converting the date data and the corresponding judgment label into a corresponding group of feature vectors.
Each group of feature vectors is taken as the input of the date analysis model; the date analysis model takes the predicted judgment label corresponding to each group of date data as output and the actual judgment label corresponding to each group of date data as the prediction target, where the actual judgment label is the preset digital label of the judgment result corresponding to the date data; minimizing the sum of the prediction errors over all date data is taken as the training target; the date analysis model is trained until the sum of the prediction errors converges, and training then stops; the date analysis model is a deep neural network model.
Obtaining a corresponding judgment result according to the predicted judgment label; if the judgment result is that the date is reasonable, the corresponding date data is not marked as unreasonable data; if the judgment result is that the date is unreasonable, marking the corresponding date data as unreasonable data.
Step 204: unreasonable data is deleted from the m sets of evaluation data.
Specifically, the method for performing quality evaluation on m groups of evaluation data after data cleaning comprises the following steps:
The identification card numbers in the m groups of evaluation data are sent to a query server, wherein the query server comprises a social security query server, a household registration query server and an operator query server; the social security inquiring server inquires according to the identity card number, generates a corresponding name and a payment unit and feeds back the name and the payment unit; the household registration inquiry server inquires according to the ID card number, generates a corresponding address and feeds back the address; the operator inquiry server inquires according to the identification card number, generates a corresponding telephone number and feeds back.
A hash function is used to calculate the hash value of each name, payment unit, address and telephone number acquired through feedback, and these hash values are marked as comparison hash values; a hash function is also used to calculate the hash values of the payment units in the m groups of evaluation data, and these hash values are marked as personal hash values.
An analysis set is constructed from the comparison hash values corresponding to each identification card number; an evaluation set is constructed from the personal hash values corresponding to each identification card number; the name, the payment unit, the address and the telephone number are marked as analysis data; the analysis set and the evaluation set corresponding to the same identification card number are compared, each comparison hash value in the analysis set being compared with the corresponding personal hash value in the evaluation set.
If the comparison hash value is the same as the corresponding personal hash value, the analysis data corresponding to the comparison hash value and the personal hash value are not marked as problem data.
And if the comparison hash value is different from the corresponding personal hash value, marking the analysis data corresponding to the comparison hash value and the personal hash value as problem data.
If the analysis data corresponding to each comparison hash value in one analysis set is marked as problem data, marking the corresponding analysis set as a problem set, and marking the identification card number corresponding to the problem set as problem data; then the telephone numbers corresponding to the problem set are sent to an operator inquiry server, and the operator inquiry server inquires according to the telephone numbers, generates corresponding names and feeds back; calculating a corresponding hash value of the fed back name by adopting a hash function, and marking the hash value as a secondary comparison hash value; marking names corresponding to the problem set as comparison names, and comparing the comparison hash values corresponding to the comparison names with the secondary comparison hash values; if the comparison hash value is consistent with the secondary comparison hash value, not marking the telephone number and the name corresponding to the problem set as problem data; if the comparison hash value is inconsistent with the secondary comparison hash value, telephone numbers and names corresponding to the problem set are still marked as problem data.
If the identification card number in the evaluation data is deleted, the undeleted name, payment unit, address and telephone number in the corresponding evaluation data are marked as problem data.
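The comparison of analysis set and evaluation set, including the secondary check by telephone number, can be sketched as below. Several things here are assumptions: SHA-256 stands in for the unspecified hash function, the operator query server is represented by a caller-supplied function `lookup_name_by_phone`, the secondary comparison is made against the hash of the stored name (following the explanatory note in embodiment 1), and all field and function names are hypothetical.

```python
import hashlib
from typing import Callable, Dict, Set

ANALYSIS_FIELDS = ("name", "payment_unit", "address", "phone")


def h(value: str) -> str:
    """Hash one item (SHA-256 is an assumed choice)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def find_problem_data(
    stored: Dict[str, str],
    fed_back: Dict[str, str],
    lookup_name_by_phone: Callable[[str], str],
) -> Set[str]:
    """Return the items of one group of evaluation data marked as problem data."""
    # Compare each comparison hash with the corresponding personal hash.
    problems = {f for f in ANALYSIS_FIELDS if h(stored[f]) != h(fed_back[f])}

    if len(problems) == len(ANALYSIS_FIELDS):
        # Every item differs: this is a problem set, so the ID number is suspect.
        problems.add("id_number")
        secondary = h(lookup_name_by_phone(stored["phone"]))
        if h(stored["name"]) == secondary:
            # Phone and name agree with each other; only the ID number was wrong.
            problems -= {"name", "phone"}
    return problems
```

The secondary query is made only for a problem set, so a group with a single mismatching item never triggers the telephone lookup.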
Specifically, the early warning instructions comprise high-level early warning instructions and low-level early warning instructions; the high-level early warning instructions comprise a first-level early warning instruction and a second-level early warning instruction.
Specifically, the method for generating corresponding early warning instructions for the evaluation data that has undergone data cleaning and quality problem identification comprises the following steps:
If the identification card number or the name in one group of evaluation data is marked as problem data or deleted, a key error instruction is generated.
If the identification card number and the name in one group of evaluation data are both marked as problem data or deleted, a multi-key error instruction is generated.
If the transaction records, telephone number or address in one group of evaluation data is marked as problem data or deleted, a general error instruction is generated.
If the transaction records, telephone number and address in one group of evaluation data are all marked as problem data or deleted, a multi-general error instruction is generated.
If a multi-key error instruction is generated for a group of evaluation data, a second-level high-level early warning instruction is generated.
If no multi-key error instruction is generated for a group of evaluation data but a key error instruction is generated, a first-level early warning instruction is generated.
If neither a multi-key error instruction nor a key error instruction is generated for a group of evaluation data but a multi-general error instruction is generated, a first-level early warning instruction is generated.
If none of a multi-key error instruction, a key error instruction or a multi-general error instruction is generated for a group of evaluation data but a general error instruction is generated, a low-level early warning instruction is generated.
Specifically, the evaluation data corresponding to names with the same personal hash value is analyzed to judge whether the names are marked as repeated data.
The evaluation data corresponding to names with the same personal hash value is marked as same data.
If the identification card numbers in all b groups of same data are marked as repeated data, the names in the b groups of same data are also marked as repeated data, where b is an integer greater than 1.
If identification card numbers that are not marked as repeated data exist in the b groups of same data, the personal hash value corresponding to each group of same data is taken as one test set, that is, the test sets correspond one-to-one with the same data; the same data corresponding to an identification card number that is not marked as repeated data is marked as different data; digital labels are assigned in increasing order to the test sets corresponding to the same data and the test sets corresponding to the different data, the digital label of each test set is marked as its test label, and the test labels range over [1, b'], where b' = b.
The test labels corresponding to the same data and the test labels corresponding to a group of different data are used as a group of total test sets; and sequentially inputting each group of total test set into a trained probability prediction model to predict corresponding repetition probability.
The training process of the probability prediction model comprises the following steps:
The repetition probabilities corresponding to the b groups of total test sets are collected in advance, and each total test set and its corresponding repetition probability are converted into a corresponding group of feature vectors.
Each group of feature vectors is taken as input of the probability analysis model; the probability analysis model takes the predicted repetition probability corresponding to each group of total test sets as output and the actual repetition probability corresponding to each group of total test sets as the prediction target, where the actual repetition probability is the repetition probability collected in advance for that total test set; minimizing the sum of the prediction errors over all total test sets is taken as the training target; the probability analysis model is trained until the sum of the prediction errors converges, and training then stops; the probability analysis model is a deep neural network model.
A probability threshold P_T is preset.
The repetition probability P_C is compared with the probability threshold P_T.
If P_C ≤ P_T, the names in the different data are not marked as repeated data, and the names in the same data are marked as repeated data.
If P_C > P_T, the names in the different data and the names in the same data are all marked as repeated data.
Example 4
Referring to fig. 4, according to yet another aspect of the present application, an electronic device 500 is also provided. The electronic device 500 may include one or more processors and one or more memories. The memory stores computer readable code which, when executed by the one or more processors, performs the accumulation fund data quality evaluation and grading early warning method described above.
The method or system according to embodiments of the application may also be implemented by means of the architecture of the electronic device shown in fig. 4. As shown in fig. 4, the electronic device 500 may include a bus 501, one or more CPUs 502, a ROM 503, a RAM 504, a communication port 505 connected to a network, an input/output 506, a hard disk 507, and the like. A storage device in the electronic device 500, such as the ROM 503 or the hard disk 507, may store the accumulation fund data quality evaluation and grading early warning method provided by the present application. Further, the electronic device 500 may also include a user interface 508. Of course, the architecture shown in fig. 4 is merely exemplary, and one or more components of the electronic device shown in fig. 4 may be omitted as required when implementing different devices.
Example 5
Referring to fig. 5, a computer readable storage medium 600 according to one embodiment of the application is shown. The computer readable storage medium 600 has computer readable instructions stored thereon. When the computer readable instructions are executed by a processor, the accumulation fund data quality evaluation and grading early warning method according to the embodiments of the application described with reference to the above figures may be performed. The storage medium 600 includes, but is not limited to, volatile memory and/or nonvolatile memory. Volatile memory may include, for example, random access memory (RAM), cache memory (cache), and the like. Nonvolatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, and the like.
In addition, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, the present application provides a non-transitory machine-readable storage medium storing machine-readable instructions executable by a processor to perform the instructions corresponding to the method steps provided by the present application, such as the accumulation fund data quality evaluation and grading early warning method. When the computer program is executed by a central processing unit (CPU), the functions defined in the method of the present application are performed.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto; any variation or substitution that a person skilled in the art can readily conceive falls within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Finally, the foregoing description of the preferred embodiments is not intended to limit the invention to the precise form disclosed; any modifications, equivalents and alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.