CN111639850A - Quality evaluation method and system for multi-source heterogeneous data - Google Patents
- Publication number
- CN111639850A (application CN202010463043.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- quality
- data quality
- rule
- source heterogeneous
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
- G06Q10/06395—Quality analysis or management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
- G06Q10/06393—Score-carding, benchmarking or key performance indicator [KPI] analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
Landscapes
- Business, Economics & Management (AREA)
- Human Resources & Organizations (AREA)
- Engineering & Computer Science (AREA)
- Economics (AREA)
- Strategic Management (AREA)
- Development Economics (AREA)
- Educational Administration (AREA)
- Entrepreneurship & Innovation (AREA)
- General Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- Theoretical Computer Science (AREA)
- Marketing (AREA)
- Tourism & Hospitality (AREA)
- Physics & Mathematics (AREA)
- Game Theory and Decision Science (AREA)
- Quality & Reliability (AREA)
- Operations Research (AREA)
- Health & Medical Sciences (AREA)
- Public Health (AREA)
- Water Supply & Treatment (AREA)
- General Health & Medical Sciences (AREA)
- Primary Health Care (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a quality evaluation method and model for multi-source heterogeneous data. A data set to be evaluated is obtained in real-time or offline mode, quality rule parameters are configured per data item, a weight matrix is constructed, and the passing rates for the data set are calculated; a comprehensive evaluation result for the quality of the data set is then obtained using a comprehensive data quality evaluation formula. The method is not limited to processing a single type of data, meets the requirements of multi-source heterogeneous data, and reduces the complexity of data quality evaluation calculations.
Description
Technical Field
The invention relates to the technical field of smart grid data management, and in particular to a quality evaluation method and system for multi-source heterogeneous data.
Background
With the deep integration of new information technologies and the smart grid, technologies such as intelligent sensing, automatic control systems and the Internet of Things are widely applied across the generation, transmission, transformation, distribution and consumption stages of power grid companies. In particular, new-generation communication technologies such as the mobile internet, the Internet of Things and 5G have greatly increased the data acquisition frequency and acquisition range of smart grid equipment. With the rapid construction of integrated energy systems and the energy internet, hundreds of millions of smart meters have been deployed in the power grid, and the power grid has become a core link in the integration of full-chain data acquisition and Internet of Things communication technologies. Smart meters support important activities of a power grid company such as production, operation, monitoring and management, and the mass data they collect is widely used in the company's core business areas. The quality of a smart meter is decisive for the quality of the data it collects: the accuracy and reliability of data produced by a low-quality smart meter cannot be guaranteed, which seriously affects the normal operation of the power grid company. In production practice, the quality of a smart meter is usually positively correlated with the quality of the data it collects, and the data is also affected by factors such as abnormalities and faults of the meter during operation.
Therefore, by collecting the various types of data generated over the operating life cycle of a smart meter, together with data from the different associated service systems it affects, a quality evaluation method for multi-source heterogeneous data can be used to comprehensively evaluate the quality of the smart meter under different operating states.
Existing data quality evaluation methods fall into two categories. The first evaluates the quality level of historical data by statistical analysis with database scripts, and is limited both technically and in implementation. The second evaluates data quality using traditional machine learning combined with neural network algorithms; it requires a sample data set to train the neural network into a data quality evaluation model, and a new model must be retrained whenever the data quality rules change, which is cumbersome. Both approaches rely on static structured data sets, and their ability to evaluate multi-source heterogeneous data is very limited.
A new data quality evaluation method and evaluation system are therefore required.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a quality evaluation method and a quality evaluation model for multi-source heterogeneous data, which evaluate the quality of the various data collected and generated over the operating life cycle of a smart meter, together with associated service system data, and reduce the complexity of data quality evaluation calculations. The method is applied specifically to quality evaluation and abnormality diagnosis of smart meters in the electric power marketing field: an evaluation model is built from the mass of electrical measurement data (current, voltage, electric energy, power and the like) and terminal event data that smart meters have generated over many years, so that the quality level of a smart meter module can be evaluated quantitatively as a decision aid and the cause of a smart meter abnormality can be located quickly.
Accordingly, one objective of the present invention is to provide a quality evaluation method for multi-source heterogeneous data, which includes the following steps: S1, acquiring multi-source heterogeneous mass data as a data set to be evaluated; S2, presetting data quality rules of multi-dimensional parameters according to the characteristics, associated services and data attribution of the data set to be evaluated, and presetting an evaluation value range for each dimensional parameter of each data quality rule; S3, constructing a weight matrix of the data quality rules using the preset dimensional parameter scores and importance weights of the data quality rules; and S4, calculating data quality evaluation scores for each of the dimensions of data integrity, accuracy, consistency, timeliness and normalization by calculating the passing rates of the data quality rules, and performing a weighted summation of the passing rates of all data quality rules using the weight matrix to obtain a comprehensive evaluation result for the data set to be evaluated.
Preferably, in S1, acquiring the multi-source heterogeneous mass data includes: quickly ingesting various types of data using a standardized acquisition task template; acquiring multi-source heterogeneous real-time data using a message queue technology; acquiring multi-source heterogeneous mass historical data using a data bus technology; and storing the multi-source heterogeneous data in an in-memory database or a parallel database to form the data set to be evaluated.
In any of the above embodiments, preferably, in S2, when the data quality rules of multi-dimensional parameters are preset, the dimensional parameters include the importance of the system to which the data item belongs, the number of references, the constraint type, the rule completeness, the evaluation object relevance, and the rule importance.
In any of the above embodiments, preferably, the data quality rule weight matrix in S3 is expressed by the following formula:
$$W_i = a\% \cdot W_{a(i)} + b\% \cdot W_{b(i)} + c\% \cdot W_{c(i)} + d\% \cdot W_{d(i)} + e\% \cdot W_{e(i)} + f\% \cdot W_{f(i)}$$
wherein: $W_i$ represents the weighted score of the $i$-th data quality rule; $W_{a(i)}$ represents the score of the $i$-th data quality rule in dimension "a"; $W_{b(i)}$, $W_{c(i)}$, $W_{d(i)}$, $W_{e(i)}$ and $W_{f(i)}$ are defined in the same way as $W_{a(i)}$ and represent the scores in their respective dimensions; $a\%$, $b\%$, $c\%$, $d\%$, $e\%$ and $f\%$ represent the proportion of each dimensional parameter in the weight matrix, with $a\% + b\% + c\% + d\% + e\% + f\% = 100\%$.
In any of the above embodiments, preferably, when the data quality is comprehensively evaluated in S4, the following formula is used:
wherein: $S$ represents the comprehensive data quality score; $W_i$ represents the weighted score of the $i$-th data quality rule; $Re_i$ represents the passing rate of the $i$-th data quality rule; and $n$ represents the total number of data quality rules.
The invention also provides a quality evaluation system for multi-source heterogeneous data, comprising a data acquisition module, a data quality rule presetting module, a data quality rule weight matrix, and a comprehensive data quality evaluation model. The data acquisition module is used for acquiring multi-source heterogeneous mass data as the data set to be evaluated. The data quality rule presetting module is used for presetting data quality rules of multi-dimensional parameters according to the characteristics, associated services and data attribution of the data set to be evaluated, and for presetting an evaluation value range for each dimensional parameter of each data quality rule. The data quality rule weight matrix is constructed from the preset dimensional parameter scores and importance weights of the data quality rules, and is used to assign a weight to each data quality rule. The comprehensive data quality evaluation model calculates data quality evaluation scores for each of the dimensions of data integrity, accuracy, consistency, timeliness and normalization by calculating the passing rates of the data quality rules, and performs a weighted summation of the passing rates of all data quality rules using the weight matrix to obtain a comprehensive evaluation result for the data set to be evaluated.
Preferably, when the data acquisition module acquires multi-source heterogeneous mass data, various types of data are quickly ingested using a standardized acquisition task template; multi-source heterogeneous real-time data is acquired using a message queue technology; multi-source heterogeneous mass historical data is acquired using a data bus technology; and the multi-source heterogeneous data acquired by the data acquisition module is stored in an in-memory database or a parallel database to form the data set to be evaluated.
In any of the above embodiments, preferably, when the data quality rule presetting module presets the data quality rules of multi-dimensional parameters, the dimensional parameters include the importance of the system to which the data item belongs, the number of references, the constraint type, the rule completeness, the evaluation object relevance, and the rule importance.
In any one of the above embodiments, preferably, the data quality rule weight matrix is expressed by the following formula:
$$W_i = a\% \cdot W_{a(i)} + b\% \cdot W_{b(i)} + c\% \cdot W_{c(i)} + d\% \cdot W_{d(i)} + e\% \cdot W_{e(i)} + f\% \cdot W_{f(i)}$$
wherein: $W_i$ represents the weighted score of the $i$-th data quality rule; $W_{a(i)}$ represents the score of the $i$-th data quality rule in dimension "a"; $W_{b(i)}$, $W_{c(i)}$, $W_{d(i)}$, $W_{e(i)}$ and $W_{f(i)}$ are defined in the same way as $W_{a(i)}$ and represent the scores in their respective dimensions; $a\%$, $b\%$, $c\%$, $d\%$, $e\%$ and $f\%$ represent the proportion of each dimensional parameter in the weight matrix, with $a\% + b\% + c\% + d\% + e\% + f\% = 100\%$.
In any one of the above embodiments, preferably, the data quality comprehensive assessment model adopts the following formula when comprehensively assessing data quality:
wherein: $S$ represents the comprehensive data quality score; $W_i$ represents the weighted score of the $i$-th data quality rule; $Re_i$ represents the passing rate of the $i$-th data quality rule; and $n$ represents the total number of data quality rules.
Compared with the prior art, the quality evaluation method and system for multi-source heterogeneous data provided by the invention have at least the following advantages: real-time data and historical data are obtained using a message queue technology and a data bus technology respectively, so that, unlike conventional techniques, the method is no longer limited to processing a single type of data and meets the requirements of multi-source heterogeneous data; and quality evaluation is achieved for the various data collected and generated over the operating life cycle of a smart meter, together with associated service system data, while the complexity of data quality evaluation calculations is reduced.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart of a quality evaluation method for multi-source heterogeneous data according to the present invention;
FIG. 2 is a schematic structural diagram of a quality evaluation system for multi-source heterogeneous data provided by the present invention.
Detailed Description
The present invention is described in detail below with reference to the embodiments and the accompanying drawings. It should be noted that, where no conflict arises, the embodiments in the present application and the features of those embodiments may be combined with each other.
The following detailed description is exemplary in nature and is intended to provide further details of the invention. Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention.
As shown in FIG. 1, the present invention provides a quality evaluation method for multi-source heterogeneous data, including the following steps:
S1, acquiring a data set to be evaluated from multi-source heterogeneous mass data in real-time or offline mode. When the data set is established, a multi-source heterogeneous data acquisition technology is adopted and various types of data are quickly ingested using a standardized acquisition task template. The acquisition of mass real-time and historical data has two aspects: on the one hand, multi-source heterogeneous real-time data is acquired using a message queue technology; on the other hand, multi-source heterogeneous mass historical data is acquired using a data bus technology. According to the timeliness requirement of the evaluation, the multi-source heterogeneous data is stored in an in-memory database or a parallel database to form the data set to be evaluated.
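As a rough illustration of the two acquisition paths in S1, the sketch below routes real-time records through an in-process queue (standing in for a message queue) and historical records through a plain iterable (standing in for a data bus), then chooses a storage target from the timeliness requirement. The names `Record`, `collect` and `choose_store`, and the one-second threshold, are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from queue import Queue, Empty
from typing import Iterable, List


@dataclass
class Record:
    source: str      # originating system (hypothetical field)
    payload: dict    # heterogeneous content, kept opaque here


def drain_realtime(q: Queue) -> List[Record]:
    """Pull everything currently waiting in the stand-in message queue."""
    out = []
    while True:
        try:
            out.append(q.get_nowait())
        except Empty:
            return out


def collect(realtime: Queue, historical: Iterable[Record]) -> List[Record]:
    """Merge both acquisition paths into one data set to be evaluated."""
    return drain_realtime(realtime) + list(historical)


def choose_store(max_latency_s: float, threshold_s: float = 1.0) -> str:
    """ASSUMED policy: tight timeliness -> in-memory DB, otherwise parallel DB."""
    return "in_memory_db" if max_latency_s <= threshold_s else "parallel_db"


rt = Queue()
rt.put(Record("meter_stream", {"voltage": 229.8}))
hist = [Record("billing_archive", {"kwh": 1234.5})]
dataset = collect(rt, hist)
```

A production system would replace the queue and the list with real message queue and data bus clients; the point here is only that both paths converge on a single data set before evaluation.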
S2, presetting data quality rules of multi-dimensional parameters according to the characteristics, associated services and data attribution of the data set to be evaluated, and presetting an evaluation value range for each dimensional parameter of each data quality rule;
Based on the characteristics, associated services and data attribution of the data set to be evaluated, each rule takes into account the configuration of six dimensional parameters: the importance of the system to which the data item belongs, the number of references, the constraint type, the rule completeness, the evaluation object relevance, and the rule importance, so that the data quality rules are evaluated comprehensively. The evaluation objects include application systems and data subjects. Each parameter is explained below:
1) Importance of the system: the importance of the system to which the data item belongs. Systems are generally divided into core information systems, important information systems and non-important information systems, and each category can be further subdivided.
2) Number of references: the number of times each data item is referenced by other systems, which can be obtained from data lineage analysis of the metadata. The more often a data item is referenced in the data flow, the higher the score of the data quality rules under that data item.
3) Constraint type: if the data item is a primary key or a foreign key, a higher score is recommended for its data quality rules; if it is neither but has other constraints or indexes, a medium score is recommended; if it has none, a relatively low score is set.
4) Rule completeness: whether a relatively comprehensive set of data quality rules has been formulated for the data item. The more data quality rules there are across the data quality measurement attributes, the higher the rule completeness and the higher the recommended score for the data item.
5) Evaluation object relevance: different evaluation objects have different points of focus, which is considered in terms of the application scope of the data item. Data items that receive more attention are given higher scores in the scoring model.
6) Rule importance: configured according to the measurement attribute of the data quality rule. Among the data quality measurement attributes, integrity and accuracy are the most important, followed by consistency, and then timeliness and normalization.
Data quality rules are subordinate to data items, so when weighting a data quality rule the importance of each dimensional parameter can be assessed from the perspective of the data item. The six dimensional parameters differ in their influence on and importance to a data quality rule, and the weight proportion of each dimension can be determined according to the needs of the data quality evaluation. In view of the emphasis of data quality evaluation in the power grid field, the weights of the six dimensional parameters (importance of the system, number of references, constraint type, rule completeness, evaluation object relevance and rule importance) are specified, in sequence, as 20%, 10%, 20% and 30% respectively.
An evaluation score range is given for each of the six dimensional parameters of each data quality rule. On the principle that dimensional parameter scores should not fluctuate too widely, and in view of the emphasis of data quality evaluation in the power grid field, the recommended range for each dimensional parameter is [80, 120]; the average score range is [96, 105], the recommended range for above-average scores is [106, 120], and the recommended range for below-average scores is [80, 95]. The score for each dimensional parameter is assigned according to the parameter descriptions above; a typical configuration of dimensional parameter scores is shown in the following table:
Table 1: Dimension parameter score configuration table
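The score bands quoted above can be written down as a small lookup. The sketch below is illustrative only: it assumes integer scores, and it assumes that the "higher / medium / relatively low" guidance for the constraint type maps onto the above-average, average and below-average bands. The function names `band_of` and `constraint_band` are hypothetical.

```python
# Score bands quoted in the text: overall range [80, 120], average [96, 105].
BANDS = {
    "below_average": (80, 95),
    "average": (96, 105),
    "above_average": (106, 120),
}


def band_of(score: int) -> str:
    """Return the band an integer dimension score falls into."""
    for name, (lo, hi) in BANDS.items():
        if lo <= score <= hi:
            return name
    raise ValueError(f"score {score} is outside the recommended range [80, 120]")


def constraint_band(is_key: bool, has_other_constraint: bool) -> str:
    """ASSUMED mapping: keys -> high, other constraints/indexes -> medium, none -> low."""
    if is_key:
        return "above_average"
    if has_other_constraint:
        return "average"
    return "below_average"
```

For example, a primary-key data item would be steered towards a constraint-type score somewhere in [106, 120], with the exact value left to the business expert.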
S3, constructing a weight matrix of the data quality rule by using the preset dimension parameter score and importance weight of the data quality rule;
A weight matrix model is constructed from the scores of the six dimensional parameters and the importance weights of the data quality rules, so that the data quality rules can be evaluated comprehensively and objectively. The data quality rule weight matrix model is given by (formula 1):
$$W_i = a\% \cdot W_{a(i)} + b\% \cdot W_{b(i)} + c\% \cdot W_{c(i)} + d\% \cdot W_{d(i)} + e\% \cdot W_{e(i)} + f\% \cdot W_{f(i)} \quad \text{(formula 1)}$$
wherein: $W_i$ represents the weighted score of the $i$-th data quality rule; $W_{a(i)}$ represents the score of the $i$-th data quality rule in the "importance of the system" dimension, which is assigned by a business expert with reference to the weight matrix; $W_{b(i)}$, $W_{c(i)}$, $W_{d(i)}$, $W_{e(i)}$ and $W_{f(i)}$ are defined in the same way as $W_{a(i)}$ and represent the scores in their respective dimensions; $a\%$, $b\%$, $c\%$, $d\%$, $e\%$ and $f\%$ represent the proportions of the six dimensional parameters in the weight matrix (for example, $a\%$ is the share of the "importance of the system" parameter among all dimensions), with $a\% + b\% + c\% + d\% + e\% + f\% = 100\%$.
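Formula 1 is a plain weighted sum, which the following sketch computes for one rule. The dimension keys, the proportions and the scores are hypothetical values chosen only so that the proportions sum to 100% and the scores lie in [80, 120]; they are not the patent's configuration.

```python
DIMENSIONS = ("system_importance", "reference_count", "constraint_type",
              "rule_completeness", "object_relevance", "rule_importance")


def rule_weight(scores: dict, proportions: dict) -> float:
    """W_i = sum over the six dimensions of proportion * score (formula 1)."""
    total = sum(proportions[d] for d in DIMENSIONS)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("dimension proportions must sum to 100%")
    return sum(proportions[d] * scores[d] for d in DIMENSIONS)


# Hypothetical proportions (a%..f%) and expert scores, for illustration only.
proportions = {"system_importance": 0.20, "reference_count": 0.10,
               "constraint_type": 0.10, "rule_completeness": 0.10,
               "object_relevance": 0.20, "rule_importance": 0.30}
scores = {"system_importance": 110, "reference_count": 100,
          "constraint_type": 115, "rule_completeness": 90,
          "object_relevance": 100, "rule_importance": 120}
w = rule_weight(scores, proportions)
```

With these inputs the weighted score works out to 22 + 10 + 11.5 + 9 + 20 + 36 = 108.5. Because every dimension score lies in [80, 120] and the proportions sum to 1, the result also stays in [80, 120].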
S4, calculating data quality evaluation scores for each of the dimensions of data integrity, accuracy, consistency, timeliness and normalization by calculating the passing rates of the data quality rules, and performing a weighted summation of the passing rates of all data quality rules using the weight matrix to obtain a comprehensive evaluation result for the data set to be evaluated.
The model is composed of a series of scoring formulas. It takes into account the check passing rate, weight, analysis dimension and other properties of the data quality rules under an evaluation object, and produces a quantifiable comprehensive score that measures the data quality level of the evaluation object. The comprehensive data quality evaluation model is implemented in three steps:
(1) Data quality rule score
The score is measured by the check passing rate, defined as the ratio of the number of records that pass the check rule to the total number of records checked by the rule, expressed as a percentage value. The calculation formula is as follows (formula 2):
wherein: re (rule estimation) which represents the score of the data quality rule, and the value range of Re is [0, 100']To (c) to (d); radoptThe number of data set records representing the correct result of the data quality rule checked; rtotalIndicating the total number of data set records that the data quality check rule uses for checking.
The calculation of Re also takes into account the following special cases: when R istotalWhen it is 0, it meansNo record exists in the database table to be evaluated, and in this case, the data quality rule does not participate in calculation; the pair R is triggered by the assessment model listener when the dataset dynamically changesadopt、RtotalThe number is adjusted.
If the check passing rate of a rule remains 100 over a certain period, the check can be adjusted or cancelled according to the needs of data quality evaluation, so as to improve the efficiency of the evaluation calculation.
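The passing rate and its empty-table special case can be sketched as below. Returning `None` to signal that a rule is left out of the calculation is a choice made for this example, as is the function name `pass_rate`.

```python
from typing import Optional


def pass_rate(r_adopt: int, r_total: int) -> Optional[float]:
    """Re = R_adopt / R_total * 100, or None when there is nothing to check."""
    if r_total == 0:
        return None          # empty table: the rule does not take part
    if not 0 <= r_adopt <= r_total:
        raise ValueError("r_adopt must lie between 0 and r_total")
    return 100.0 * r_adopt / r_total
```

For instance, 950 passing records out of 1000 checked gives a passing rate of 95.0, while a rule evaluated against an empty table yields `None` and is skipped by downstream scoring.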
(2) Data quality evaluation score for each dimension
Data quality evaluation scores are calculated separately for dimensions such as data integrity, accuracy, consistency, timeliness and normalization, and the scores are used to locate the main dimensions that cause data quality problems. The calculation formula for each dimension score is as follows (formula 3):
wherein: $S_k$ represents the data quality score for the $k$-th data quality dimension; $W_i$ represents the weighted score of the $i$-th data quality rule; $Re_{ik}$ represents the passing rate of the $i$-th data quality rule in dimension $k$; and $n$ represents the number of data quality rules in dimension $k$.
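The exact form of formula 3 is not reproduced in this text, so the sketch below rests on an assumption: each dimension score is taken as the weight-normalised average of the passing rates of the rules in that dimension, i.e. the sum of $W_i \cdot Re_{ik}$ divided by the sum of $W_i$. The function name and the sample rules are hypothetical.

```python
from collections import defaultdict


def dimension_scores(rules):
    """rules: iterable of (dimension, weight W_i, passing rate Re_ik or None)."""
    num = defaultdict(float)
    den = defaultdict(float)
    for dim, weight, rate in rules:
        if rate is None:      # rule skipped because R_total was 0
            continue
        num[dim] += weight * rate
        den[dim] += weight
    # ASSUMPTION: normalise by the summed weights so S_k stays in [0, 100].
    return {dim: num[dim] / den[dim] for dim in num}


rules = [
    ("integrity", 110.0, 100.0),
    ("integrity", 90.0, 80.0),
    ("accuracy", 100.0, 50.0),
    ("timeliness", 100.0, None),
]
s_k = dimension_scores(rules)
worst = min(s_k, key=s_k.get)   # dimension most likely to be causing problems
```

Under this assumption the integrity score is (110·100 + 90·80) / 200 = 91 and the accuracy score is 50, so accuracy is flagged as the weakest dimension; the timeliness rule is ignored because it had no records to check.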
(3) Data quality comprehensive assessment score
The comprehensive data quality evaluation score is calculated as a weighted summation of the passing rates of all data quality rules, which gives the comprehensive evaluation result for the data set to be evaluated. The calculation formula is as follows (formula 4):
wherein: $S$ represents the comprehensive data quality score; $W_i$ represents the weighted score of the $i$-th data quality rule; $Re_i$ represents the passing rate of the $i$-th data quality rule; and $n$ represents the total number of data quality rules.
As the evaluation model formulas show, the comprehensive evaluation score of a data set is not equal to the sum of the evaluation scores of the individual dimensions; the result depends on both the number of data quality rules and the score of each dimension.
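To make this remark concrete, the sketch below computes an overall score over all rules at once. Formula 4 is not reproduced in this text, so the normalisation used here (sum of $W_i \cdot Re_i$ divided by the sum of $W_i$) is an assumption, and `overall_score` and the sample rules are hypothetical.

```python
def overall_score(rules):
    """rules: iterable of (weight W_i, passing rate Re_i or None), all dimensions."""
    active = [(w, re) for w, re in rules if re is not None]
    if not active:
        return None
    # ASSUMPTION: weighted average, i.e. sum(W_i * Re_i) / sum(W_i).
    return sum(w * re for w, re in active) / sum(w for w, _ in active)


# Two integrity rules followed by one accuracy rule (hypothetical values).
rules = [(110.0, 100.0), (90.0, 80.0), (100.0, 50.0)]
s = overall_score(rules)
```

Here the overall score is 23200 / 300, roughly 77.3. Taken per dimension under the same assumption, the two integrity rules average to 91 and the single accuracy rule to 50; the overall score is neither their sum (141) nor their simple mean (70.5), because the dimension with more rules and more total weight pulls the result towards itself.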
The method is applied specifically to quality evaluation and abnormality diagnosis of smart meters in the electric power marketing field: an evaluation model is built from the mass of electrical measurement data (current, voltage, electric energy, power and the like) and terminal event data that smart meters have generated over many years, so that the quality level of a smart meter module can be evaluated quantitatively as a decision aid and the cause of a smart meter abnormality can be located quickly.
As shown in FIG. 2, corresponding to the above embodiment, the present invention further provides a quality evaluation system for multi-source heterogeneous data, comprising a data acquisition module, a data quality rule presetting module, a data quality rule weight matrix, and a comprehensive data quality evaluation model. The data acquisition module is used for acquiring multi-source heterogeneous mass data as the data set to be evaluated. When acquiring the data, the module quickly ingests various types of data using a standardized acquisition task template; multi-source heterogeneous real-time data is acquired using a message queue technology, and multi-source heterogeneous mass historical data is acquired using a data bus technology. The acquired multi-source heterogeneous data is stored in an in-memory database or a parallel database to form the data set to be evaluated. The data quality rule presetting module is used for presetting data quality rules of multi-dimensional parameters according to the characteristics, associated services and data attribution of the data set to be evaluated, and for presetting an evaluation value range for each dimensional parameter of each data quality rule.
According to the characteristics, associated services and data attribution of the data set to be evaluated, each rule is configured with six dimensional parameters: the importance of the system to which the data item belongs, the number of references, the constraint type, the rule completeness, the evaluation object relevance, and the rule importance. The data quality rules are evaluated comprehensively on these six parameters. The evaluation objects include application systems and data themes. Each parameter is described below:
1) Importance of the system: the importance of the system to which the data item belongs. Systems are generally divided into core information systems, important information systems and non-important information systems, and each category is further subdivided.
2) Number of references: the number of times each data item is referenced by other systems, which can be obtained from data lineage analysis of the metadata. The more often a data item is referenced in the data flow, the higher the score of the data quality rules under that data item.
3) Constraint type: if the data item is a primary key or a foreign key, a higher score is recommended for its data quality rules; if it is neither but has other constraints or indexes, a medium score is recommended; otherwise a relatively low score is set.
4) Rule completeness: the more comprehensive the data quality rules formulated under a data item, that is, the more rules it has on each data quality measurement attribute, the higher its rule completeness and the higher the recommended score.
5) Evaluation object relevance: different evaluation objects have different points of focus, so this parameter is assessed from the application scope of the data item. Data items that receive more attention are given higher scores in the scoring model.
6) Rule importance: configured according to the measurement attribute of the data quality rule. Among the data quality measurement attributes, completeness and accuracy are the most important, followed by consistency, and then timeliness and normalization.
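As an illustration of parameter 2), the reference count of a data item could be derived from metadata lineage edges roughly as follows. The edge layout `(data_item, owning_system, referencing_system)` and the choice to count distinct referencing systems (rather than every individual reference) are assumptions made for this sketch; the text does not define the lineage format.

```python
from collections import defaultdict


def reference_counts(lineage_edges):
    """Count, per data item, the distinct *other* systems that reference it.

    lineage_edges: iterable of (data_item, owning_system, referencing_system)
    tuples; this layout is hypothetical.
    """
    referrers = defaultdict(set)
    for item, owner, referrer in lineage_edges:
        if referrer != owner:  # a system reading its own data is not a reference
            referrers[item].add(referrer)
    return {item: len(systems) for item, systems in referrers.items()}


# Hypothetical lineage edges for two data items owned by a "marketing" system.
edges = [
    ("meter_id", "marketing", "billing"),
    ("meter_id", "marketing", "dispatch"),
    ("meter_id", "marketing", "billing"),   # repeated edge, counted once
    ("voltage", "marketing", "marketing"),  # self-reference, ignored
]
counts = reference_counts(edges)
```

A data item with a higher count would then be assigned a higher score for the "number of references" parameter.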
Data quality rules are subordinate to data items, so when weighting the data quality rules, the importance of each dimensional parameter can be assessed from the perspective of the data item. The six dimensional parameters differ in their influence on and importance to a data quality rule, and the weight proportion of each dimension can be set according to the needs of the data quality evaluation. In view of the emphasis of data quality evaluation in the power grid field, the weights of the six dimensional parameters (importance of the system, number of references, constraint type, rule completeness, evaluation object relevance and rule importance) are specified, in sequence, as 20%, 10%, 20% and 30%.
An evaluation score range is given for each of the six dimensional parameters of every data quality rule. On the principle that dimensional parameter scores should not fluctuate too widely, and in view of the emphasis of data quality evaluation in the power grid field, the recommended range for each dimensional parameter is [80,120]: the average band is [96,105], the recommended range for above-average scores is [106,120], and the recommended range for below-average scores is [80,95]. The score of each dimensional parameter is assigned according to the parameter descriptions, and the specific parameter configuration is shown in Table 1 above.
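The recommended score bands can be checked mechanically when rules are configured. The sketch below assumes integer scores and uses hypothetical band names; only the numeric ranges come from the text.

```python
SCORE_MIN, SCORE_MAX = 80, 120  # recommended range for every dimensional parameter


def score_band(score: int) -> str:
    """Classify an integer dimension score into the bands given in the text."""
    if not SCORE_MIN <= score <= SCORE_MAX:
        raise ValueError(
            f"score {score} is outside the recommended range [{SCORE_MIN},{SCORE_MAX}]"
        )
    if score <= 95:
        return "below_average"   # [80,95]
    if score <= 105:
        return "average"         # [96,105]
    return "above_average"       # [106,120]
```

Rejecting out-of-range values at configuration time keeps a single mis-entered score from dominating a rule's weight.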
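The data quality rule weight matrix is constructed from the preset dimensional parameter scores and importance weights of the data quality rules, and is used to match a weight to each data quality rule. The data quality rule weight matrix is expressed by the following formula 3:

$$W_i = a\% \cdot W_{a(i)} + b\% \cdot W_{b(i)} + c\% \cdot W_{c(i)} + d\% \cdot W_{d(i)} + e\% \cdot W_{e(i)} + f\% \cdot W_{f(i)}$$

where $W_i$ is the weighted score of the i-th data quality rule; $W_{a(i)}, \dots, W_{f(i)}$ are the scores of the i-th data quality rule in dimensions a to f; and a%, b%, c%, d%, e%, f% are the proportions of the dimensional parameters in the weight matrix, with a% + b% + c% + d% + e% + f% = 100%.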
The data quality rule presetting module presets the data quality rules with multi-dimensional parameters, the dimensional parameters comprising the importance of the system, the number of references, the constraint type, the rule completeness, the evaluation object relevance and the rule importance.
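A rule's weighted score is a weighted sum of its six dimension scores, with the proportions adding up to 100%. A minimal sketch of that calculation follows. The dimension keys are hypothetical names for the six parameters, and the weight values are illustrative placeholders chosen only so that they sum to 1; they are not the proportions used in the patent.

```python
DIMENSIONS = (
    "system_importance",
    "reference_count",
    "constraint_type",
    "rule_completeness",
    "object_relevance",
    "rule_importance",
)

# Placeholder proportions (a% ... f%) for illustration only; assumed values.
ILLUSTRATIVE_WEIGHTS = {
    "system_importance": 0.25,
    "reference_count": 0.10,
    "constraint_type": 0.15,
    "rule_completeness": 0.10,
    "object_relevance": 0.15,
    "rule_importance": 0.25,
}


def rule_weight(scores, weights=ILLUSTRATIVE_WEIGHTS):
    """Weighted score W_i of one rule: sum of proportion * dimension score."""
    if set(scores) != set(DIMENSIONS) or set(weights) != set(DIMENSIONS):
        raise ValueError("scores and weights must cover exactly the six dimensions")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("dimension proportions must sum to 100%")
    return sum(weights[d] * scores[d] for d in DIMENSIONS)


# A rule that is average (100) everywhere except system importance (120).
example_scores = {d: 100 for d in DIMENSIONS}
example_scores["system_importance"] = 120
w_example = rule_weight(example_scores)
```

Because every dimension score lies in [80,120] and the proportions sum to 1, the resulting weight also stays within [80,120].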
The data quality comprehensive evaluation model calculates the pass rate of each data quality rule, computes data quality evaluation scores along the dimensions of data integrity, accuracy, consistency, timeliness and normalization, and then, using the data quality rule weight matrix, performs a weighted summation over the pass rates of all data quality rules to obtain the comprehensive evaluation result of the data set to be evaluated. The model uses formula 4 for this comprehensive evaluation.
As the evaluation model formula shows, the comprehensive evaluation score of the data set is not simply the sum of the per-dimension evaluation scores; the result depends on both the number of data quality rules and the scores in each dimension.
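The weighted summation of pass rates can be sketched as below. Formula 4 itself is not reproduced in the text, so the averaging form used here, S = (1/n) · Σ W_i · Re_i, is an assumption chosen to match the stated symbols (S, W_i, Re_i, n); the patent's actual formula may differ. The `RuleResult` type, the rule names and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleResult:
    """Outcome of checking one data quality rule (hypothetical structure)."""
    name: str
    dimension: str   # integrity, accuracy, consistency, timeliness, normalization
    weight: float    # W_i, the rule's weighted score
    passed: int      # records that satisfied the rule
    checked: int     # records the rule was evaluated on

    @property
    def pass_rate(self) -> float:
        """Re_i: fraction of checked records that passed."""
        if self.checked == 0:
            raise ValueError(f"rule {self.name!r} was not evaluated on any record")
        return self.passed / self.checked


def composite_score(results):
    """ASSUMED form of the composite score: S = (1/n) * sum(W_i * Re_i)."""
    if not results:
        raise ValueError("at least one rule result is required")
    return sum(r.weight * r.pass_rate for r in results) / len(results)


def dimension_scores(results):
    """Apply the same assumed formula separately to the rules of each dimension."""
    by_dim = {}
    for r in results:
        by_dim.setdefault(r.dimension, []).append(r)
    return {dim: composite_score(rs) for dim, rs in by_dim.items()}


results = [
    RuleResult("meter_id_not_null", "integrity", 110.0, 990, 1000),
    RuleResult("voltage_in_range", "accuracy", 100.0, 950, 1000),
    RuleResult("reading_on_time", "timeliness", 90.0, 800, 1000),
]
overall = composite_score(results)
per_dimension = dimension_scores(results)
```

Under this assumed form the overall score is an average over all rules, so it differs from the sum of the per-dimension scores and changes with the number of rules, which is consistent with the observation above.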
It will be appreciated by those skilled in the art that the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The embodiments disclosed above are therefore to be considered in all respects as illustrative and not restrictive. All changes which come within the scope of or equivalence to the invention are intended to be embraced therein.
Claims (10)
1. A quality evaluation method for multi-source heterogeneous data, characterized by comprising the following steps:
S1, acquiring multi-source heterogeneous mass data as a data set to be evaluated;
S2, presetting data quality rules with multi-dimensional parameters according to the characteristics, associated services and data attribution of the data set to be evaluated, and presetting an evaluation score range for each dimensional parameter of each data quality rule;
S3, constructing a data quality rule weight matrix from the preset dimensional parameter scores and importance weights of the data quality rules;
S4, calculating the pass rate of each data quality rule, computing data quality evaluation scores along the dimensions of data integrity, accuracy, consistency, timeliness and normalization, and performing a weighted summation over the pass rates of all data quality rules using the data quality rule weight matrix to obtain a comprehensive evaluation result of the data set to be evaluated.
2. The method for evaluating the quality of multi-source heterogeneous data according to claim 1, wherein: in S1, acquiring the multi-source heterogeneous mass data comprises: using standardized acquisition task templates to access various kinds of data quickly; acquiring multi-source heterogeneous real-time data by message queue technology; acquiring multi-source heterogeneous mass historical data by data bus technology; and storing the multi-source heterogeneous data in an in-memory database or a parallel database to form the data set to be evaluated.
3. The method for evaluating the quality of multi-source heterogeneous data according to claim 2, wherein: in S2, when the data quality rules with multi-dimensional parameters are preset, the dimensional parameters comprise the importance of the system, the number of references, the constraint type, the rule completeness, the evaluation object relevance, and the rule importance.
4. The method for evaluating the quality of multi-source heterogeneous data according to claim 1, wherein: in S3, the data quality rule weight matrix is expressed by the following formula:

$$W_i = a\% \cdot W_{a(i)} + b\% \cdot W_{b(i)} + c\% \cdot W_{c(i)} + d\% \cdot W_{d(i)} + e\% \cdot W_{e(i)} + f\% \cdot W_{f(i)}$$

wherein: $W_i$ represents the weighted score of the i-th data quality rule; $W_{a(i)}$ represents the score of the i-th data quality rule in dimension "a"; $W_{b(i)}$, $W_{c(i)}$, $W_{d(i)}$, $W_{e(i)}$ and $W_{f(i)}$ likewise represent the scores in their corresponding dimensions; a%, b%, c%, d%, e%, f% respectively represent the proportion of each dimensional parameter in the weight matrix, and a% + b% + c% + d% + e% + f% = 100%.
5. The method for evaluating the quality of multi-source heterogeneous data according to claim 1, wherein: in the comprehensive evaluation of the data quality in S4, the following formula is adopted:
wherein: $S$ represents the comprehensive data quality score; $W_i$ represents the weighted score of the i-th data quality rule; $Re_i$ represents the pass rate of the i-th data quality rule; and $n$ represents the total number of data quality rules.
6. A quality evaluation system for multi-source heterogeneous data, characterized in that: the system comprises a data acquisition module, a data quality rule presetting module, a data quality rule weight matrix and a data quality comprehensive evaluation model;
the data acquisition module is used for acquiring multi-source heterogeneous mass data as a data set to be evaluated;
the data quality rule presetting module is used for presetting data quality rules with multi-dimensional parameters according to the characteristics, associated services and data attribution of the data set to be evaluated, and presetting an evaluation score range for each dimensional parameter of each data quality rule;
the data quality rule weight matrix is constructed from the preset dimensional parameter scores and importance weights of the data quality rules, and is used for matching the weight of each data quality rule;
the data quality comprehensive evaluation model calculates the pass rate of each data quality rule, computes data quality evaluation scores along the dimensions of data integrity, accuracy, consistency, timeliness and normalization, and performs a weighted summation over the pass rates of all data quality rules using the data quality rule weight matrix to obtain the comprehensive evaluation result of the data set to be evaluated.
7. The system for quality assessment of multi-source heterogeneous data according to claim 6, wherein: when acquiring multi-source heterogeneous mass data, the data acquisition module uses standardized acquisition task templates to access various kinds of data quickly; multi-source heterogeneous real-time data is acquired by message queue technology; multi-source heterogeneous mass historical data is acquired by data bus technology; and the multi-source heterogeneous data acquired by the data acquisition module is stored in an in-memory database or a parallel database to form the data set to be evaluated.
8. The system for quality assessment of multi-source heterogeneous data according to claim 7, wherein: the data quality rule presetting module is used for presetting the data quality rules with multi-dimensional parameters, the dimensional parameters comprising the importance of the system, the number of references, the constraint type, the rule completeness, the evaluation object relevance and the rule importance.
9. The system for quality assessment of multi-source heterogeneous data according to claim 6, wherein: the data quality rule weight matrix is expressed by the following formula:

$$W_i = a\% \cdot W_{a(i)} + b\% \cdot W_{b(i)} + c\% \cdot W_{c(i)} + d\% \cdot W_{d(i)} + e\% \cdot W_{e(i)} + f\% \cdot W_{f(i)}$$

wherein: $W_i$ represents the weighted score of the i-th data quality rule; $W_{a(i)}$ represents the score of the i-th data quality rule in dimension "a"; $W_{b(i)}$, $W_{c(i)}$, $W_{d(i)}$, $W_{e(i)}$ and $W_{f(i)}$ likewise represent the scores in their corresponding dimensions; a%, b%, c%, d%, e%, f% respectively represent the proportion of each dimensional parameter in the weight matrix, and a% + b% + c% + d% + e% + f% = 100%.
10. The system for quality assessment of multi-source heterogeneous data according to claim 6, wherein: the data quality comprehensive evaluation model adopts the following formula when comprehensively evaluating the data quality:
wherein: $S$ represents the comprehensive data quality score; $W_i$ represents the weighted score of the i-th data quality rule; $Re_i$ represents the pass rate of the i-th data quality rule; and $n$ represents the total number of data quality rules.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010463043.9A CN111639850A (en) | 2020-05-27 | 2020-05-27 | Quality evaluation method and system for multi-source heterogeneous data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111639850A (en) | 2020-09-08 |
Family
ID=72328753
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010463043.9A Pending CN111639850A (en) | 2020-05-27 | 2020-05-27 | Quality evaluation method and system for multi-source heterogeneous data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111639850A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170169380A1 (en) * | 2015-12-14 | 2017-06-15 | Wipro Limited | Method and System for Determining Quality Level of Performance Data Associated With an Enterprise |
CN105976120A (en) * | 2016-05-17 | 2016-09-28 | 全球能源互联网研究院 | Electric power operation monitoring data quality assessment system and method |
CN108898311A (en) * | 2018-06-28 | 2018-11-27 | 国网湖南省电力有限公司 | A kind of data quality checking method towards intelligent distribution network repairing dispatching platform |
CN110210719A (en) * | 2019-05-10 | 2019-09-06 | 中国电力科学研究院有限公司 | A kind of power equipment static data method for evaluating quality and system |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113177688A (en) * | 2021-04-01 | 2021-07-27 | 柳城县迪森人造板有限公司 | Quality detection method and device for solid wood ecological plate |
CN112989827A (en) * | 2021-05-20 | 2021-06-18 | 江苏数兑科技有限公司 | Text data set quality evaluation method based on multi-source heterogeneous characteristics |
CN112989827B (en) * | 2021-05-20 | 2021-08-27 | 江苏数兑科技有限公司 | Text data set quality evaluation method based on multi-source heterogeneous characteristics |
CN113448955A (en) * | 2021-08-30 | 2021-09-28 | 上海观安信息技术股份有限公司 | Data set quality evaluation method and device, computer equipment and storage medium |
CN114034347A (en) * | 2021-11-30 | 2022-02-11 | 广东鑫光智能系统有限公司 | Plate quality detection method and terminal |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111639850A (en) | Quality evaluation method and system for multi-source heterogeneous data | |
CN113779496B (en) | Power equipment state evaluation method and system based on equipment panoramic data | |
CN108320043B (en) | Power distribution network equipment state diagnosis and prediction method based on electric power big data | |
CN109670676A (en) | Distributing net platform region method for prewarning risk and system based on Support Vector data description | |
CN105046591A (en) | Method for evaluating electricity utilization energy efficiency of power consumer | |
CN116681187B (en) | Enterprise carbon quota prediction method based on enterprise operation data | |
CN114638476A (en) | Water conservancy integrated operation and maintenance management method and system | |
CN116150897A (en) | Machine tool spindle performance evaluation method and system based on digital twin | |
CN111429016A (en) | Small and micro enterprise financing wind control method and system based on industrial internet platform | |
CN117560300B (en) | Intelligent internet of things flow prediction and optimization system | |
CN113435759B (en) | Primary equipment risk intelligent assessment method based on deep learning | |
CN116011827A (en) | Power failure monitoring analysis and early warning system and method for key cells | |
CN110781959A (en) | Power customer clustering method based on BIRCH algorithm and random forest algorithm | |
CN110738565A (en) | Real estate finance artificial intelligence composite wind control model based on data set | |
CN117311295B (en) | Production quality improving method and system based on wireless network equipment | |
CN118313519A (en) | Electromechanical full life cycle prediction modeling method and system | |
CN112348220A (en) | Credit risk assessment prediction method and system based on enterprise behavior pattern | |
CN116993380A (en) | Financial market relevance analysis method | |
CN115713027A (en) | Transformer state evaluation method, device and system | |
CN114997888A (en) | Food safety credit assessment method and system fusing multi-type big data | |
CN113886592A (en) | Quality detection method for operation and maintenance data of power information communication system | |
Zhang et al. | Data Cleaning for Prediction and its Evaluation of Building Energy Consumption | |
CN113592362A (en) | Urban power grid anti-disaster capability assessment method and related device | |
CN118211168B (en) | Water business checking and collecting list management system and method | |
CN118332413B (en) | Intelligent analysis method and system for electromechanical faults |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20200908 |