CN111639850A - Quality evaluation method and system for multi-source heterogeneous data - Google Patents

Quality evaluation method and system for multi-source heterogeneous data

Info

Publication number
CN111639850A
Authority
CN
China
Prior art keywords
data
quality
data quality
rule
source heterogeneous
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010463043.9A
Other languages
Chinese (zh)
Inventor
肖凯
季知祥
蔡常雨
徐永进
丁徐楠
叶莘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Zhejiang Electric Power Co Ltd
China Electric Power Research Institute Co Ltd CEPRI
Electric Power Research Institute of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Zhejiang Electric Power Co Ltd
China Electric Power Research Institute Co Ltd CEPRI
Electric Power Research Institute of State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Zhejiang Electric Power Co Ltd, China Electric Power Research Institute Co Ltd CEPRI, Electric Power Research Institute of State Grid Zhejiang Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202010463043.9A priority Critical patent/CN111639850A/en
Publication of CN111639850A publication Critical patent/CN111639850A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06395 Quality analysis or management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Energy or water supply

Landscapes

  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Engineering & Computer Science (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • Game Theory and Decision Science (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a quality evaluation method and model for multi-source heterogeneous data. A data set to be evaluated is acquired in real-time or offline mode, quality rule parameters are configured per data item, a weight matrix is constructed, and the pass rate of the data quality rules over the data set is calculated; a comprehensive data quality evaluation formula then yields the overall quality result for the data set. The approach is not limited to processing a single type of data, so it meets the requirements of multi-source heterogeneous data, and it reduces the complexity of data quality evaluation calculations.

Description

Quality evaluation method and system for multi-source heterogeneous data
Technical Field
The invention relates to the technical field of intelligent power grid data management, in particular to a quality evaluation method and system for multi-source heterogeneous data.
Background
With the deep integration of new information technology and the smart grid, technologies such as intelligent sensing, automatic control systems, and the Internet of Things are widely applied across the generation, transmission, transformation, distribution, and consumption links of power grid companies. In particular, new-generation communication technologies such as the mobile internet, the Internet of Things, and 5G have greatly increased the data acquisition frequency and acquisition range of smart grid equipment. With the rapid construction of integrated energy systems and the energy internet, hundreds of millions of smart meters have been deployed in the power grid, making the grid a core link for integrating full-chain data acquisition with new Internet of Things communication technologies. Smart meters support important activities such as production, operation, monitoring, and management, and the mass data they collect is widely used in the core business of power grid companies. The quality of a smart meter plays a decisive role in the quality of the data it collects: the accuracy and reliability of data generated by a low-quality meter cannot be guaranteed, which seriously affects the normal operation of a power grid company. In production practice, smart meter quality is usually positively correlated with the quality of the collected data, and data quality is also affected by factors such as abnormalities and faults during meter operation.
Therefore, by collecting the various types of data generated over the operating life cycle of a smart meter, combining them with data from the associated business systems, and applying a quality evaluation method for multi-source heterogeneous data, the quality of the smart meter can be comprehensively evaluated under different operating states.
Existing data quality evaluation methods fall into two categories. The first evaluates the quality level of historical data by statistical analysis with database scripts, which has clear limitations in both technique and implementation. The second evaluates data quality using traditional machine learning combined with neural network algorithms; it requires a sample data set to train the neural network into a data quality evaluation model, and a new model must be retrained whenever the data quality rules change, which makes the process cumbersome. Both approaches assume a static structured data set, so their ability to evaluate multi-source heterogeneous data is very limited.
A data quality evaluation method and evaluation system that overcome these limitations are therefore needed.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a quality evaluation method and a quality evaluation model for multi-source heterogeneous data. It evaluates the quality of the various data collected and generated over the operating life cycle of a smart meter, together with data from associated business systems, and reduces the complexity of data quality evaluation calculations. The method is applied specifically to quality evaluation and abnormality diagnosis of smart meters in the field of electric power marketing: an evaluation model is built from the massive volumes of electrical measurement data (such as current, voltage, energy, and power) and terminal event data that smart meters have generated over many years, so that the quality level of a smart meter module can be assessed quantitatively and the cause of a smart meter abnormality can be located quickly.
Therefore, one objective of the present invention is to provide a quality evaluation method for multi-source heterogeneous data, which includes the following steps:
S1, acquiring multi-source heterogeneous mass data as a data set to be evaluated;
S2, presetting data quality rules with multi-dimensional parameters according to the characteristics, associated services, and data attribution of the data set to be evaluated, and presetting an evaluation score range for each dimension parameter of each data quality rule;
S3, constructing a weight matrix of the data quality rules from the preset dimension parameter scores and importance weights of the data quality rules;
S4, calculating the pass rate of each data quality rule, calculating data quality evaluation scores along the dimensions of data integrity, accuracy, consistency, timeliness, and normalization, and combining the pass rates of all data quality rules with the weight matrix in a weighted summation to obtain a comprehensive evaluation result for the data set to be evaluated.
Preferably, in S1, acquiring the multi-source heterogeneous mass data includes quickly accessing the various types of data through a standardized acquisition task template; acquiring multi-source heterogeneous real-time data using message queue technology; acquiring multi-source heterogeneous mass historical data using data bus technology; and storing the multi-source heterogeneous data in an in-memory database or a parallel database to form the data set to be evaluated.
In any of the above embodiments, in S2, when the data quality rules with multi-dimensional parameters are preset, the dimension parameters include the importance of the system to which the data item belongs, the number of references, the constraint type, rule completeness, evaluation object relevance, and rule importance.
In any of the above embodiments, preferably, the data quality rule weight matrix in S3 is expressed by the following formula:
$$W_i = a\% \cdot W_{a(i)} + b\% \cdot W_{b(i)} + c\% \cdot W_{c(i)} + d\% \cdot W_{d(i)} + e\% \cdot W_{e(i)} + f\% \cdot W_{f(i)}$$
wherein: $W_i$ represents the weighted score of the $i$-th data quality rule; $W_{a(i)}$ represents the score of the $i$-th data quality rule in the "a" dimension; $W_{b(i)}$, $W_{c(i)}$, $W_{d(i)}$, $W_{e(i)}$, $W_{f(i)}$ likewise represent the scores in the corresponding dimensions; $a\%$, $b\%$, $c\%$, $d\%$, $e\%$, $f\%$ represent the proportion of each dimension parameter in the weight matrix, with $a\% + b\% + c\% + d\% + e\% + f\% = 100\%$.
In any of the above embodiments, preferably, when the data quality is comprehensively evaluated in S4, the following formula is used:
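[Formula image not reproduced in this text: it gives $S$ as a weighted summation of the pass rates $Re_i$, weighted by $W_i$, over the $n$ data quality rules.]
wherein: $S$ represents the comprehensive data quality score; $W_i$ represents the weighted score of the $i$-th data quality rule; $Re_i$ represents the pass rate of the $i$-th data quality rule; $n$ denotes the total number of data quality rules.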
The invention also provides a quality evaluation system for multi-source heterogeneous data, comprising a data acquisition module, a data quality rule presetting module, a data quality rule weight matrix, and a data quality comprehensive evaluation model. The data acquisition module acquires multi-source heterogeneous mass data as the data set to be evaluated. The data quality rule presetting module presets data quality rules with multi-dimensional parameters according to the characteristics, associated services, and data attribution of the data set to be evaluated, and presets an evaluation score range for each dimension parameter of each data quality rule. The data quality rule weight matrix is constructed from the preset dimension parameter scores and importance weights of the data quality rules and is used to match a weight to each data quality rule. The data quality comprehensive evaluation model calculates the pass rate of each data quality rule, calculates data quality evaluation scores along the dimensions of data integrity, accuracy, consistency, timeliness, and normalization, and combines the pass rates of all data quality rules with the data quality rule weight matrix in a weighted summation to obtain the comprehensive evaluation result for the data set to be evaluated.
Preferably, when the data acquisition module acquires multi-source heterogeneous mass data, the various types of data are quickly accessed through a standardized acquisition task template; multi-source heterogeneous real-time data is acquired using message queue technology; multi-source heterogeneous mass historical data is acquired using data bus technology; and the multi-source heterogeneous data acquired by the data acquisition module is stored in an in-memory database or a parallel database to form the data set to be evaluated.
In any of the above embodiments, preferably, when the data quality rule presetting module presets the data quality rules with multi-dimensional parameters, the dimension parameters include the importance of the system to which the data item belongs, the number of references, the constraint type, rule completeness, evaluation object relevance, and rule importance.
In any one of the above embodiments, preferably, the data quality rule weight matrix is expressed by the following formula:
$$W_i = a\% \cdot W_{a(i)} + b\% \cdot W_{b(i)} + c\% \cdot W_{c(i)} + d\% \cdot W_{d(i)} + e\% \cdot W_{e(i)} + f\% \cdot W_{f(i)}$$
wherein: $W_i$ represents the weighted score of the $i$-th data quality rule; $W_{a(i)}$ represents the score of the $i$-th data quality rule in the "a" dimension; $W_{b(i)}$, $W_{c(i)}$, $W_{d(i)}$, $W_{e(i)}$, $W_{f(i)}$ likewise represent the scores in the corresponding dimensions; $a\%$, $b\%$, $c\%$, $d\%$, $e\%$, $f\%$ represent the proportion of each dimension parameter in the weight matrix, with $a\% + b\% + c\% + d\% + e\% + f\% = 100\%$.
In any one of the above embodiments, preferably, the data quality comprehensive assessment model adopts the following formula when comprehensively assessing data quality:
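[Formula image not reproduced in this text: it gives $S$ as a weighted summation of the pass rates $Re_i$, weighted by $W_i$, over the $n$ data quality rules.]
wherein: $S$ represents the comprehensive data quality score; $W_i$ represents the weighted score of the $i$-th data quality rule; $Re_i$ represents the pass rate of the $i$-th data quality rule; $n$ denotes the total number of data quality rules.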
Compared with the prior art, the quality evaluation method and system for multi-source heterogeneous data provided by the invention have at least the following advantages. Real-time data and historical data are acquired by the message queue method and the data bus method respectively, so that, unlike the traditional technology, the method is no longer limited to processing a single type of data and meets the requirements of multi-source heterogeneous data. Quality evaluation is realized for the various data collected and generated over the operating life cycle of a smart meter and for data from associated business systems, and the complexity of data quality evaluation calculations is reduced.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart of a quality evaluation method for multi-source heterogeneous data according to the present invention;
FIG. 2 is a schematic structural diagram of a quality evaluation system for multi-source heterogeneous data provided by the present invention.
Detailed Description
The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The following detailed description is exemplary in nature and is intended to provide further details of the invention. Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention.
As shown in fig. 1, the present invention provides a quality evaluation method for multi-source heterogeneous data, including the following steps:
S1, acquiring a data set to be evaluated from multi-source heterogeneous mass data in real-time or offline mode. When the data set is built, a multi-source heterogeneous data acquisition technology is used and the various types of data are quickly accessed through a standardized acquisition task template. Acquisition of mass real-time and historical data is handled in two ways: multi-source heterogeneous real-time data is acquired using message queue technology, while multi-source heterogeneous mass historical data is acquired using data bus technology. The multi-source heterogeneous data is then stored in an in-memory database or a parallel database, according to the timeliness requirement of the evaluation, to form the data set to be evaluated.
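The routing described in S1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `AcquisitionTask`, `choose_channel`, and `choose_store` are hypothetical names, and the rule that a prompt evaluation requirement maps to the in-memory database (and otherwise to the parallel database) is an assumption; the text only says the store is chosen according to the evaluation timeliness requirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcquisitionTask:
    """Hypothetical standardized acquisition task template."""
    source: str        # name of the upstream system
    realtime: bool     # True for streaming data, False for historical bulk data
    low_latency: bool  # assumed flag: evaluation results are needed promptly


def choose_channel(task: AcquisitionTask) -> str:
    # Real-time data goes through a message queue, historical data through a data bus.
    return "message_queue" if task.realtime else "data_bus"


def choose_store(task: AcquisitionTask) -> str:
    # Assumption: prompt evaluation uses the in-memory database, otherwise the parallel database.
    return "in_memory_db" if task.low_latency else "parallel_db"


tasks = [
    AcquisitionTask("meter_events", realtime=True, low_latency=True),
    AcquisitionTask("archived_load_curves", realtime=False, low_latency=False),
]
plan = {t.source: (choose_channel(t), choose_store(t)) for t in tasks}
```

Keeping the channel and store decisions in separate functions means either rule can be changed without touching the task template.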
S2, presetting data quality rules of multi-dimensional parameters according to the characteristics, associated services and data attribution of the data set to be evaluated, and presetting an evaluation value range for each dimensional parameter of each data quality rule;
In view of the characteristics, associated services, and data attribution of the data set to be evaluated, each rule is configured with six dimension parameters of its data item (importance of the system to which it belongs, number of references, constraint type, rule completeness, evaluation object relevance, and rule importance), and the data quality rules are evaluated comprehensively on that basis. The evaluation objects include application systems and data subjects. Each parameter is described below:
1) Importance of the system: the importance of the system to which the data item belongs. Systems are generally divided into core information systems, important information systems, and non-important information systems, and each category is further subdivided.
2) Number of references: the number of times each data item is referenced by other systems, which can be obtained from metadata lineage analysis. The more often a data item is referenced in the data flow, the higher the score of the data quality rules under that data item.
3) Constraint type: if the data item is a primary key or a foreign key, a higher score is recommended for its data quality rules; if it is neither a primary key nor a foreign key but has other constraints or indexes, a medium score is recommended; if it has none, a relatively low score is set.
4) Rule completeness: the more comprehensive the data quality rules formulated under the data item, that is, the more rules covering each data quality measurement attribute, the higher the rule completeness and the higher the recommended score for the data item.
5) Evaluation object relevance: different evaluation objects have different points of focus, which are judged from the application scope of the data item. Data items that receive more attention score higher in the scoring model.
6) Rule importance: configured according to the measurement attribute of the data quality rule. Among the data quality measurement attributes, integrity and accuracy are the most important, followed by consistency, and then timeliness and normalization.
The data quality rules are subordinate to data items, so when a data quality rule is weighted, the importance of each dimension parameter can be assessed from the perspective of the data item. The six dimension parameters differ in their influence on and importance to a data quality rule, and the weight proportion of each dimension can be set according to the needs of the data quality evaluation. In line with the emphasis of data quality evaluation in the power grid field, the six dimension parameters (importance of the system, number of references, constraint type, rule completeness, evaluation object relevance, and rule importance) are assigned weights of 20%, 10%, 20% and 30% in sequence.
An evaluation score range is given for each of the six dimension parameters of every data quality rule. Because dimension parameter scores should not fluctuate too widely, and in line with the emphasis of data quality evaluation in the power grid field, the recommended range for each dimension parameter is [80, 120]: the average band is [96, 105], the recommended band for above-average scores is [106, 120], and the recommended band for below-average scores is [80, 95]. The score for each dimension parameter is assigned according to the parameter descriptions; a typical configuration of dimension parameter scores is shown in the following table:
Table 1: Dimension parameter score configuration table
[Table 1 is an image in the original publication and is not reproduced in this text.]
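The recommended score bands can be checked mechanically. The sketch below assumes integer scores (the bands [80, 95], [96, 105], and [106, 120] leave no gaps only for integers); the function name `score_band` and the band labels are hypothetical.

```python
def score_band(score: int) -> str:
    """Classify a dimension-parameter score into the recommended bands."""
    if not 80 <= score <= 120:
        raise ValueError(f"score {score} is outside the recommended range [80, 120]")
    if score <= 95:
        return "below_average"   # band [80, 95]
    if score <= 105:
        return "average"         # band [96, 105]
    return "above_average"       # band [106, 120]
```

A configuration tool could call such a check when an expert enters a score, rejecting values outside [80, 120] before they reach the weight matrix.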
S3, constructing a weight matrix of the data quality rule by using the preset dimension parameter score and importance weight of the data quality rule;
A weight matrix model is constructed from the scores of the six dimension parameters and the importance weights of the data quality rules, so that the data quality rules can be weighted comprehensively and objectively. The data quality rule weight matrix model is given by (formula 1):
$$W_i = a\% \cdot W_{a(i)} + b\% \cdot W_{b(i)} + c\% \cdot W_{c(i)} + d\% \cdot W_{d(i)} + e\% \cdot W_{e(i)} + f\% \cdot W_{f(i)} \qquad \text{(formula 1)}$$
wherein: $W_i$ represents the weighted score of the $i$-th data quality rule; $W_{a(i)}$ represents the score of the $i$-th data quality rule in the "importance of the system" dimension, the specific score being given by a business expert with reference to the weight matrix; $W_{b(i)}$, $W_{c(i)}$, $W_{d(i)}$, $W_{e(i)}$, $W_{f(i)}$ likewise represent the scores in the corresponding dimensions; $a\%$, $b\%$, $c\%$, $d\%$, $e\%$, $f\%$ represent the proportions of the six dimension parameters in the weight matrix (for example, $a\%$ is the share of the "importance of the system" dimension parameter among all dimensions), with $a\% + b\% + c\% + d\% + e\% + f\% = 100\%$.
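Formula 1 can be sketched in a few lines. The dimension keys, the function name `rule_weight`, and all numeric values below are hypothetical and chosen only so that the percentages sum to 100; they are not taken from the source.

```python
DIMENSIONS = (
    "system_importance", "reference_count", "constraint_type",
    "rule_completeness", "object_relevance", "rule_importance",
)


def rule_weight(scores: dict, pct: dict) -> float:
    """Compute W_i as the percentage-weighted sum of six dimension scores (formula 1)."""
    if set(scores) != set(DIMENSIONS) or set(pct) != set(DIMENSIONS):
        raise ValueError("scores and pct must cover exactly the six dimensions")
    if abs(sum(pct.values()) - 100.0) > 1e-9:
        raise ValueError("dimension percentages must sum to 100")
    return sum(pct[d] * scores[d] for d in DIMENSIONS) / 100.0


# Illustrative values only (assumed, not from the patent).
pct = {"system_importance": 25, "reference_count": 15, "constraint_type": 15,
       "rule_completeness": 15, "object_relevance": 10, "rule_importance": 20}
scores = {"system_importance": 110, "reference_count": 100, "constraint_type": 120,
          "rule_completeness": 90, "object_relevance": 100, "rule_importance": 115}
w = rule_weight(scores, pct)  # 107.0 for these illustrative inputs
```

Because the percentages sum to 100%, $W_i$ is a convex combination of the dimension scores and therefore stays inside [80, 120] whenever every dimension score does.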
S4, calculating the pass rate of each data quality rule, calculating data quality evaluation scores along the dimensions of data integrity, accuracy, consistency, timeliness, and normalization, and combining the pass rates of all data quality rules with the weight matrix in a weighted summation to obtain a comprehensive evaluation result for the data set to be evaluated.
The model consists of a series of scoring formulas. It comprehensively considers the check pass rate, weight, analysis dimension, and other attributes of the data quality rules under an evaluation object, and produces a quantifiable comprehensive score that measures the data quality level of that evaluation object. The data quality comprehensive evaluation model is implemented in three steps:
(1) Calculating the data quality rule score
The score is measured by the check pass rate index, defined as the ratio of the number of records that pass the check rule to the total number of records that participate in the check rule, converted into a percentage value. The calculation is given by (formula 2):
$$Re = \frac{R_{adopt}}{R_{total}} \times 100 \qquad \text{(formula 2)}$$
wherein: $Re$ (rule estimation) represents the score of the data quality rule, with a value range of [0, 100]; $R_{adopt}$ represents the number of data set records for which the data quality rule check returns a correct result; $R_{total}$ represents the total number of data set records checked by the data quality rule.
The calculation of $Re$ also takes the following special cases into account: when $R_{total}$ is 0, no record exists in the database table to be evaluated, and the data quality rule does not participate in the calculation; when the data set changes dynamically, a listener in the evaluation model triggers an adjustment of $R_{adopt}$ and $R_{total}$.
If the check pass rate of a rule stays at 100 over a certain period, the check can be adjusted or cancelled according to the needs of the data quality evaluation, which improves the efficiency of the evaluation calculation.
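A minimal sketch of the pass-rate calculation and its special cases follows. Returning `None` for an empty table is one way to express "does not participate in the calculation", and `RuleCounter` is a hypothetical stand-in for the listener-driven adjustment of $R_{adopt}$ and $R_{total}$; neither is specified by the source.

```python
from typing import Optional


def pass_rate(passed: int, total: int) -> Optional[float]:
    """Return Re as a percentage in [0, 100], or None when no records were checked."""
    if passed < 0 or total < 0 or passed > total:
        raise ValueError("require 0 <= passed <= total")
    if total == 0:
        return None  # empty table: the rule is excluded from later scoring
    return 100.0 * passed / total


class RuleCounter:
    """Hypothetical counter updated by a listener as records arrive."""

    def __init__(self) -> None:
        self.passed = 0  # R_adopt
        self.total = 0   # R_total

    def on_record(self, ok: bool) -> None:
        # Called once per newly checked record.
        self.total += 1
        if ok:
            self.passed += 1

    def rate(self) -> Optional[float]:
        return pass_rate(self.passed, self.total)
```

Keeping running counts avoids rescanning the whole data set each time new records arrive.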
(2) Evaluation score for each data quality dimension
Data quality evaluation scores are calculated separately along dimensions such as data integrity, accuracy, consistency, timeliness, and normalization, and the main dimensions causing data quality problems are located according to these scores. The score of each dimension is calculated by (formula 3):
[Image for formula 3 not reproduced in this text: it gives $S_k$ in terms of the rule weights $W_i$ and the pass rates $Re_{ik}$ of the $n$ rules in dimension $k$.]
wherein: $S_k$ represents the data quality score in the $k$-th data quality dimension; $W_i$ represents the weighted score of the $i$-th data quality rule; $Re_{ik}$ represents the pass rate of the $i$-th data quality rule in dimension $k$; $n$ represents the number of data quality rules in dimension $k$.
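The image for formula 3 is not available in this text, so the sketch below rests on an explicit assumption: that the dimension score is the weight-normalized sum $S_k = \sum_i W_i Re_{ik} / \sum_i W_i$ over the rules in dimension $k$. The function name, the tuple layout, and the sample rules are hypothetical.

```python
from collections import defaultdict


def dimension_scores(rules):
    """rules: iterable of (dimension, weight W_i, pass rate Re_i or None).

    Assumed form (not confirmed by the source): weight-normalized average per dimension.
    """
    num = defaultdict(float)
    den = defaultdict(float)
    for dim, weight, rate in rules:
        if rate is None:  # rule checked no records, so it is excluded
            continue
        num[dim] += weight * rate
        den[dim] += weight
    return {dim: num[dim] / den[dim] for dim in num if den[dim] > 0}


# Hypothetical rules for illustration.
rules = [
    ("integrity", 110.0, 90.0),
    ("integrity", 90.0, 100.0),
    ("accuracy", 100.0, 80.0),
    ("timeliness", 100.0, None),  # empty table: no score for this dimension
]
by_dim = dimension_scores(rules)
```

Sorting `by_dim` by value then points at the dimension with the lowest score, which is how the text proposes locating the main source of data quality problems.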
(3) Comprehensive data quality evaluation score
The comprehensive data quality evaluation score is calculated by a weighted summation of the pass rates of all data quality rules, which gives the comprehensive evaluation result for the data set to be evaluated. The calculation is given by (formula 4):
[Image for formula 4 not reproduced in this text: it gives $S$ as a weighted summation of the pass rates $Re_i$, weighted by $W_i$, over the $n$ data quality rules.]
wherein: $S$ represents the comprehensive data quality score; $W_i$ represents the weighted score of the $i$-th data quality rule; $Re_i$ represents the pass rate of the $i$-th data quality rule; $n$ denotes the total number of data quality rules.
As the evaluation model formulas show, the comprehensive evaluation score of a data set is not equal to the sum of the evaluation scores of the individual dimensions; the result depends on the number of data quality rules as well as on the score of each dimension.
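As a worked illustration of this remark, suppose (as an assumption, since the formula images are not reproduced in this text) that both the per-dimension score and the comprehensive score are weight-normalized sums of pass rates. The three rules, their weights, and their pass rates below are hypothetical.

```latex
% Assumed form (not confirmed by the source): S = \sum_i W_i Re_i / \sum_i W_i
% Hypothetical rules: two integrity rules (W = 110, Re = 90; W = 90, Re = 100)
% and one accuracy rule (W = 100, Re = 80).
\begin{align*}
S_{\text{integrity}} &= \frac{110 \cdot 90 + 90 \cdot 100}{110 + 90} = \frac{18900}{200} = 94.5 \\
S_{\text{accuracy}}  &= \frac{100 \cdot 80}{100} = 80 \\
S &= \frac{110 \cdot 90 + 90 \cdot 100 + 100 \cdot 80}{110 + 90 + 100} = \frac{26900}{300} \approx 89.67
\end{align*}
```

Under this assumption, $S \approx 89.67$ differs both from the sum $94.5 + 80 = 174.5$ and from the plain average $87.25$ of the two dimension scores, because the integrity dimension contributes two rules and therefore more total weight.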
The method is applied specifically to quality evaluation and abnormality diagnosis of smart meters in the field of electric power marketing: an evaluation model is built from the massive volumes of electrical measurement data (such as current, voltage, energy, and power) and terminal event data that smart meters have generated over many years, so that the quality level of a smart meter module can be assessed quantitatively and the cause of a smart meter abnormality can be located quickly.
As shown in FIG. 2, corresponding to the above embodiment, the present invention further provides a quality evaluation system for multi-source heterogeneous data, comprising a data acquisition module, a data quality rule presetting module, a data quality rule weight matrix, and a data quality comprehensive evaluation model. The data acquisition module acquires multi-source heterogeneous mass data as the data set to be evaluated. When it does so, the various types of data are quickly accessed through a standardized acquisition task template; multi-source heterogeneous real-time data is acquired using message queue technology; multi-source heterogeneous mass historical data is acquired using data bus technology; and the acquired multi-source heterogeneous data is stored in an in-memory database or a parallel database to form the data set to be evaluated. The data quality rule presetting module presets data quality rules with multi-dimensional parameters according to the characteristics, associated services, and data attribution of the data set to be evaluated, and presets an evaluation score range for each dimension parameter of each data quality rule.
Based on the characteristics, associated services and data attribution of the data set to be evaluated, each rule is configured across six dimensional parameters: the importance of the system to which the data item belongs, the number of references, the constraint type, the rule completeness, the evaluation object relevance and the rule importance. The data quality rules are evaluated comprehensively on this basis. The evaluation objects include application systems and data themes. Each parameter is described below:
1) Importance of the system: the importance of the system to which the data item belongs. Systems are generally divided into core information systems, important information systems and non-important information systems, and each type is further subdivided.
2) Number of references: the number of times each data item is referenced by other systems. How a data item is referenced can be obtained from lineage analysis of the metadata; the more often a data item is referenced in the data flow, the higher the score of the data quality rules under that data item.
3) Constraint type: if the data item is a primary key or a foreign key, a higher score is recommended for its data quality rules; if it is neither a primary key nor a foreign key but has other constraints or indexes, a medium score is recommended; if it has none, a relatively low score is set.
4) Rule completeness: the more comprehensively data quality rules are formulated under the data item, that is, the more rules exist on each data quality measurement attribute, the higher the rule completeness and the higher the recommended score.
5) Evaluation object relevance: different evaluation objects have different focuses, which are considered in terms of the application range of the data item. Data items that receive more attention are given higher scores in the scoring model.
6) Rule importance: configured according to the measurement attribute of the data quality rule. Among the data quality measurement attributes, completeness and accuracy are the most important, followed by consistency, and then timeliness and normalization.
The data quality rules are subordinate to the data items, so the weight of a data quality rule can be evaluated, from the perspective of the data item, by the importance of each dimensional parameter. The six dimensional parameters differ in their influence on and importance to the data quality rule, and the weight proportion of each dimension can be determined according to the requirements of the data quality evaluation. In view of the emphasis of data quality evaluation in the power grid field, the weights of the six dimensional parameters (importance of the system, number of references, constraint type, rule completeness, evaluation object relevance and rule importance) are specified, in sequence, as 20%, 10%, 20% and 30%.
An evaluation score range is given for the six dimensional parameters of each data quality rule. On the principle that dimensional parameter scores should not fluctuate too widely, and in view of the emphasis of data quality evaluation in the power grid field, the recommended range for each dimensional parameter is [80, 120]: the average score range is [96, 105], the recommended range for above-average scores is [106, 120], and the recommended range for below-average scores is [80, 95]. The score for each dimensional parameter is assigned according to the parameter descriptions, and the specific parameter configuration is shown in Table 1 above.
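The recommended score bands can be captured in a small lookup. The following Python sketch is illustrative only: the names `SCORE_BANDS`, `band_for_constraint` and `classify_score` are hypothetical, scores are assumed to be integers, and the mapping of the constraint-type description onto the bands is an assumption drawn from the parameter descriptions, not the configuration of Table 1.

```python
# Recommended score bands quoted from the description; overall range [80, 120].
SCORE_BANDS = {
    "low": (80, 95),       # below-average band
    "average": (96, 105),  # average band
    "high": (106, 120),    # above-average band
}


def band_for_constraint(is_key: bool, has_other_constraint: bool) -> tuple[int, int]:
    """Return the score band suggested by the constraint-type description.

    Assumption: primary/foreign keys map to the high band, other constraints
    or indexes to the average ("medium") band, and everything else to the
    low band.
    """
    if is_key:
        return SCORE_BANDS["high"]
    if has_other_constraint:
        return SCORE_BANDS["average"]
    return SCORE_BANDS["low"]


def classify_score(score: int) -> str:
    """Return the band name for an integer score, or raise if out of range."""
    for name, (lower, upper) in SCORE_BANDS.items():
        if lower <= score <= upper:
            return name
    raise ValueError(f"score {score} is outside the recommended range [80, 120]")
```

A check such as `classify_score` could be run when dimension scores are configured, so that values outside [80, 120] are rejected before any weights are computed.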
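The data quality rule weight matrix is constructed from the preset dimensional parameter scores and importance weights of the data quality rules, and is used for matching a weight to each data quality rule. The data quality rule weight matrix is expressed by formula 3:

$$W_i = a\% \cdot W_{a(i)} + b\% \cdot W_{b(i)} + c\% \cdot W_{c(i)} + d\% \cdot W_{d(i)} + e\% \cdot W_{e(i)} + f\% \cdot W_{f(i)}$$

where $W_i$ represents the weighted score of the ith data quality rule; $W_{a(i)}$ through $W_{f(i)}$ represent the scores of the ith data quality rule in the corresponding dimensions; and a%, b%, c%, d%, e%, f% represent the proportion of each dimensional parameter in the weight matrix, with a% + b% + c% + d% + e% + f% = 100%.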
The data quality rule presetting module presets the data quality rules with multi-dimensional parameters, where the dimensional parameters comprise the importance of the system, the number of references, the constraint type, the rule completeness, the evaluation object relevance and the rule importance.
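As a minimal sketch of the weight calculation stated in the claims (W_i = a%·W_a(i) + ... + f%·W_f(i), with the six ratios summing to 100%), the Python below computes the weighted score of one rule. The dimension keys, the function name `rule_weight` and the ratios in `EXAMPLE_RATIOS` are hypothetical placeholders chosen for illustration; they are not the ratios specified in the description.

```python
# Hypothetical keys for the six dimensional parameters (a..f in the formula).
DIMENSIONS = (
    "system_importance",
    "reference_count",
    "constraint_type",
    "rule_completeness",
    "object_relevance",
    "rule_importance",
)

# Placeholder ratios in percent, for illustration only; they must sum to 100.
EXAMPLE_RATIOS = {
    "system_importance": 15.0,
    "reference_count": 15.0,
    "constraint_type": 15.0,
    "rule_completeness": 15.0,
    "object_relevance": 20.0,
    "rule_importance": 20.0,
}


def rule_weight(scores: dict[str, float], ratios: dict[str, float]) -> float:
    """Compute W_i as the ratio-weighted sum of the six dimension scores."""
    if abs(sum(ratios[d] for d in DIMENSIONS) - 100.0) > 1e-9:
        raise ValueError("dimension ratios must sum to 100%")
    return sum(ratios[d] / 100.0 * scores[d] for d in DIMENSIONS)


# Example: dimension scores for one rule, each within [80, 120].
example_scores = {
    "system_importance": 120,
    "reference_count": 100,
    "constraint_type": 110,
    "rule_completeness": 90,
    "object_relevance": 100,
    "rule_importance": 105,
}
example_weight = rule_weight(example_scores, EXAMPLE_RATIOS)
```

Because every dimension score lies in [80, 120] and the ratios sum to 100%, the resulting W_i also lies in [80, 120].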
The data quality comprehensive evaluation model calculates the pass rate of each data quality rule, derives data quality evaluation scores for the dimensions of data integrity, accuracy, consistency, timeliness and normalization, and performs a weighted summation of the pass rates of all the data quality rules using the data quality rule weight matrix to obtain the comprehensive evaluation result of the data set to be evaluated. When comprehensively evaluating the data quality, the model performs the calculation using formula 4.
As can be seen from the evaluation model calculation formula, the comprehensive evaluation score of the data set is not equal to the sum of the evaluation scores of all the dimensions, and the evaluation result is related to the number of the data quality rules and the score of all the dimensions.
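The evaluation formula itself is not reproduced in this text, so the following Python sketch rests on an assumed form: the comprehensive score S is taken to be the sum of W_i·Re_i over all rules divided by the number of rules n, which is consistent with the symbol definitions (S, W_i, Re_i, n) but may differ from the actual formula. The function names and the definition of the pass rate as passed records over checked records are likewise assumptions.

```python
def pass_rate(passed: int, checked: int) -> float:
    """Assumed definition of Re_i: share of checked records passing a rule."""
    if checked <= 0:
        raise ValueError("at least one record must be checked")
    if not 0 <= passed <= checked:
        raise ValueError("passed must lie between 0 and checked")
    return passed / checked


def composite_score(weights: list[float], pass_rates: list[float]) -> float:
    """Assumed form of the score: S = (sum of W_i * Re_i) / n."""
    if len(weights) != len(pass_rates):
        raise ValueError("one pass rate is required per rule weight")
    if not weights:
        raise ValueError("at least one data quality rule is required")
    if any(not 0.0 <= r <= 1.0 for r in pass_rates):
        raise ValueError("pass rates must lie in [0, 1]")
    n = len(weights)
    return sum(w * r for w, r in zip(weights, pass_rates)) / n


# Example: two rules with weighted scores 100 and 110.
example = composite_score([100.0, 110.0], [pass_rate(45, 50), 1.0])
```

Under this assumed form the result depends on both the number of rules and their weighted scores, which matches the observation that the comprehensive score is not a plain sum of the per-dimension scores.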
It will be appreciated by those skilled in the art that the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The embodiments disclosed above are therefore to be considered in all respects as illustrative and not restrictive. All changes which come within the scope of or equivalence to the invention are intended to be embraced therein.

Claims (10)

1. A quality evaluation method for multi-source heterogeneous data, characterized in that the method comprises the following steps:
S1, acquiring multi-source heterogeneous mass data as a data set to be evaluated;
S2, presetting data quality rules with multi-dimensional parameters according to the characteristics, associated services and data attribution of the data set to be evaluated, and presetting an evaluation value range for each dimensional parameter of each data quality rule;
S3, constructing a weight matrix of the data quality rules using the preset dimensional parameter scores and importance weights of the data quality rules;
S4, calculating the pass rate of each data quality rule, deriving data quality evaluation scores for the dimensions of data integrity, accuracy, consistency, timeliness and normalization, and performing a weighted summation of the pass rates of all the data quality rules using the weight matrix of the data quality rules to obtain a comprehensive evaluation result of the data set to be evaluated.
2. The method for evaluating the quality of multi-source heterogeneous data according to claim 1, wherein: in S1, acquiring the multi-source heterogeneous mass data comprises accessing various types of data quickly using standardized acquisition task templates; acquiring multi-source heterogeneous real-time data using message queue technology; acquiring multi-source heterogeneous mass historical data using data bus technology; and storing the multi-source heterogeneous data in an in-memory database or a parallel database to form the data set to be evaluated.
3. The method for evaluating the quality of multi-source heterogeneous data according to claim 2, wherein: in S2, when the data quality rule of the multidimensional parameter is preset, the dimensional parameter includes the importance of the system, the number of references, the constraint type, the completeness of the rule, the relevance of the evaluation object, and the importance of the rule.
4. The method for evaluating the quality of multi-source heterogeneous data according to claim 1, wherein: in S3, the data quality rule weight matrix is expressed by the following formula:
$$W_i = a\% \cdot W_{a(i)} + b\% \cdot W_{b(i)} + c\% \cdot W_{c(i)} + d\% \cdot W_{d(i)} + e\% \cdot W_{e(i)} + f\% \cdot W_{f(i)}$$
wherein: $W_i$ represents the weighted score of the ith data quality rule; $W_{a(i)}$ represents the score of the ith data quality rule in dimension "a", and $W_{b(i)}$, $W_{c(i)}$, $W_{d(i)}$, $W_{e(i)}$, $W_{f(i)}$ likewise represent the scores in the corresponding dimensions; a%, b%, c%, d%, e%, f% respectively represent the proportion of each dimensional parameter in the weight matrix, and a% + b% + c% + d% + e% + f% = 100%.
5. The method for evaluating the quality of multi-source heterogeneous data according to claim 1, wherein: in the comprehensive evaluation of the data quality in S4, the following formula is adopted
Figure FDA0002511689930000021
wherein: $S$ represents the comprehensive data quality score; $W_i$ represents the weighted score of the ith data quality rule; $Re_i$ represents the pass rate of the ith data quality rule; $n$ represents the total number of data quality rules.
6. A quality evaluation system of multi-source heterogeneous data is characterized in that: the system comprises a data acquisition module, a data quality rule presetting module, a data quality rule weight matrix and a data quality comprehensive evaluation model;
the data acquisition module is used for acquiring multi-source heterogeneous mass data as a data set to be evaluated;
the data quality rule presetting module is used for presetting data quality rules of multidimensional parameters according to the characteristics, associated services and data attribution of the data set to be evaluated, and presetting an evaluation value range for each dimensional parameter of each data quality rule;
the data quality rule weight matrix is constructed by using preset dimension parameter values and importance weights of the data quality rules; the data quality rule weight matrix is used for matching the weight of each data quality rule;
the data quality comprehensive evaluation model respectively calculates data quality evaluation scores from multiple dimensions of data integrity, accuracy, consistency, timeliness and normalization by calculating the passing rate of the data quality rules, and performs weighted summation on the passing rates of all the data quality rules by combining with the weight matrix model of the data quality rules to obtain the comprehensive evaluation result of the data set to be evaluated.
7. The system for quality assessment of multi-source heterogeneous data according to claim 6, wherein: when acquiring multi-source heterogeneous mass data, the data acquisition module accesses various types of data quickly using standardized acquisition task templates; multi-source heterogeneous real-time data is acquired using message queue technology; multi-source heterogeneous mass historical data is acquired using data bus technology; and the multi-source heterogeneous data acquired by the data acquisition module is stored in an in-memory database or a parallel database to form the data set to be evaluated.
8. The system for quality assessment of multi-source heterogeneous data according to claim 7, wherein: the data quality rule presetting module is used for presetting the data quality rules of the multidimensional parameters, wherein the dimensional parameters comprise the importance of the system, the reference times, the constraint types, the rule completeness, the evaluation object relevance and the rule importance.
9. The system for quality assessment of multi-source heterogeneous data according to claim 6, wherein: the data quality rule weight matrix is expressed by the following formula:
$$W_i = a\% \cdot W_{a(i)} + b\% \cdot W_{b(i)} + c\% \cdot W_{c(i)} + d\% \cdot W_{d(i)} + e\% \cdot W_{e(i)} + f\% \cdot W_{f(i)}$$
wherein: $W_i$ represents the weighted score of the ith data quality rule; $W_{a(i)}$ represents the score of the ith data quality rule in dimension "a", and $W_{b(i)}$, $W_{c(i)}$, $W_{d(i)}$, $W_{e(i)}$, $W_{f(i)}$ likewise represent the scores in the corresponding dimensions; a%, b%, c%, d%, e%, f% respectively represent the proportion of each dimensional parameter in the weight matrix, and a% + b% + c% + d% + e% + f% = 100%.
10. The system for quality assessment of multi-source heterogeneous data according to claim 6, wherein: the data quality comprehensive evaluation model adopts the following formula when comprehensively evaluating the data quality:
Figure FDA0002511689930000031
wherein: $S$ represents the comprehensive data quality score; $W_i$ represents the weighted score of the ith data quality rule; $Re_i$ represents the pass rate of the ith data quality rule; $n$ represents the total number of data quality rules.
CN202010463043.9A 2020-05-27 2020-05-27 Quality evaluation method and system for multi-source heterogeneous data Pending CN111639850A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010463043.9A CN111639850A (en) 2020-05-27 2020-05-27 Quality evaluation method and system for multi-source heterogeneous data

Publications (1)

Publication Number Publication Date
CN111639850A true CN111639850A (en) 2020-09-08

Family

ID=72328753

Country Status (1)

Country Link
CN (1) CN111639850A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105976120A (en) * 2016-05-17 2016-09-28 全球能源互联网研究院 Electric power operation monitoring data quality assessment system and method
US20170169380A1 (en) * 2015-12-14 2017-06-15 Wipro Limited Method and System for Determining Quality Level of Performance Data Associated With an Enterprise
CN108898311A (en) * 2018-06-28 2018-11-27 国网湖南省电力有限公司 A kind of data quality checking method towards intelligent distribution network repairing dispatching platform
CN110210719A (en) * 2019-05-10 2019-09-06 中国电力科学研究院有限公司 A kind of power equipment static data method for evaluating quality and system

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177688A (en) * 2021-04-01 2021-07-27 柳城县迪森人造板有限公司 Quality detection method and device for solid wood ecological plate
CN112989827A (en) * 2021-05-20 2021-06-18 江苏数兑科技有限公司 Text data set quality evaluation method based on multi-source heterogeneous characteristics
CN112989827B (en) * 2021-05-20 2021-08-27 江苏数兑科技有限公司 Text data set quality evaluation method based on multi-source heterogeneous characteristics
CN113448955A (en) * 2021-08-30 2021-09-28 上海观安信息技术股份有限公司 Data set quality evaluation method and device, computer equipment and storage medium
CN114034347A (en) * 2021-11-30 2022-02-11 广东鑫光智能系统有限公司 Plate quality detection method and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200908