A power information data quality analysis system
Technical field
The present invention relates to the fields of data quality and big data, and specifically to a power information data quality analysis system.
Background art
The quality analysis method for collected power information data is built on a Hadoop-based big data analysis platform. It makes full use of technologies such as HDFS distributed storage, the Hive data warehouse, the HBase database, and the Spark in-memory computing framework, and combines data mining algorithms with an R-language parallel computing platform to analyze the quality of massive volumes of collected power information data.
Hadoop is an open-source project of the well-known open-source organization Apache that focuses on distributed storage and computing, and it is receiving increasing attention. It can centrally process and analyze large-scale data in a systematic way, manage huge data sets as a whole, and process massive data efficiently. It is now widely used in fields such as distributed storage, web search, log analysis, advertising computation, distributed computing, and data mining.
HDFS (Hadoop Distributed File System) provides the underlying storage support for distributed computing. HDFS offers a massive data storage solution with high fault tolerance and high throughput. It is widely used in large-scale online services and large storage systems and has become the de facto standard for massive data storage. HDFS splits data into data blocks and distributes the blocks across a large number of worker nodes, achieving fault tolerance and high performance. HDFS adopts a master/slave architecture: an HDFS cluster consists of a single NameNode and a number of DataNodes. The NameNode maintains the file system namespace of the whole system. DataNodes are typically deployed one per node in the cluster and manage the storage attached to the node they run on. Internally, a file is divided into blocks, and these blocks are stored across a set of DataNodes.
Spark computing framework. Spark is a general-purpose parallel computing framework, similar to MapReduce, that was open-sourced by the UC Berkeley AMPLab. Spark's distributed computing follows the MapReduce model and retains the advantages of Hadoop MapReduce. Unlike MapReduce, however, the intermediate output and results of a job can be kept in memory, so they no longer need to be written to and read back from HDFS. Spark is therefore better suited to algorithms that require iterative MapReduce-style computation, such as those used in data mining and machine learning.
R is an integrated statistical analysis software system that combines data processing, statistical analysis, and graphical visualization. It was created by Ross Ihaka and Robert Gentleman. The R language can be regarded as a dialect of the S language, which was developed at AT&T Bell Labs. R is therefore both a piece of software and a language. It is free and open source, offers excellent statistical analysis functions and powerful statistical graphics, and its concise command parameters make it easy to understand and use. Its programmable functional language environment is also very convenient for users who need custom definitions.
SparkR is an R package released by AMPLab that provides a lightweight front end to Apache Spark. SparkR exposes the API of Spark's resilient distributed dataset (RDD). With these APIs, users can run jobs on a cluster interactively from the R shell.
K-means is a typical distance-based clustering algorithm that uses distance as the measure of similarity. Taking k as a parameter, it partitions n objects into k clusters so that similarity is high within a cluster and low between clusters. The K-means algorithm proceeds as follows. First, k objects are selected at random, each initially representing the mean or center of one cluster. Each remaining object is assigned to the nearest cluster according to its distance from each cluster center. The mean of each cluster is then recalculated. This process repeats until the criterion function converges.
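The K-means procedure described above can be illustrated with a minimal single-machine sketch in Python. This is not the parallelized SparkR implementation the invention refers to; the function name `kmeans`, the use of squared Euclidean distance, and the stopping rule (assignments and centers unchanged) are assumptions made for illustration.

```python
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    # Step 1: randomly choose k objects as the initial cluster centers.
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest center
        # (squared Euclidean distance is an assumption of this sketch).
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Step 3: recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        # Step 4: stop once neither assignments nor centers change.
        if new_labels == labels and new_centers == centers:
            break
        labels, centers = new_labels, new_centers
    return labels, centers


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
```

On the six sample points, the two well-separated groups end up in different clusters; which group receives label 0 depends on the random initial choice.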
Therefore, there is an urgent need to design a quality analysis system for collected power information data based on Hadoop, the Spark computing framework, and the R language.
Summary of the invention
In view of this, the present invention provides a power information data quality analysis system. The system performs quality analysis of collected power information data based on Hadoop, the Spark computing framework, and the R language, and improves the effectiveness of that analysis. It uses big data technology to support quality analysis of massive collected power information data, greatly improving the efficiency and speed of the analysis; it prepares massive collected power information data quickly and efficiently; and it simplifies the data mining workflow, substantially improving the speed and efficiency of data mining.
The object of the present invention is achieved through the following technical solutions:
A power information data quality analysis system, comprising a data preparation module, a data integration module, and a data analysis module;
The data preparation module is used to collect and store power information data;
The data integration module establishes, based on the power information data in the data preparation module, data tables for querying and calculation;
The data analysis module calculates the index set of the power information data according to the data tables in the data integration module, and obtains the validity rate of the power information data.
Preferably, the data preparation module includes a data acquisition unit, a data exchange unit, and a data storage unit;
The data acquisition unit is used to collect power information data and store the power information data in relational database form;
The data exchange unit uses a data exchange tool to import the power information data in relational database form into the data storage unit;
The data storage unit is a distributed storage system, and the data storage unit stores the power information data in text form.
Preferably, the power information data include electricity consumption data and auxiliary data;
The electricity consumption data include electricity customer information, electric energy meter information, metering point information, real-time acquisition information, and historical power consumption information, wherein the real-time acquisition information includes power load, voltage, and current;
The auxiliary data include classification standard data and coding standard data.
Preferably, the data integration module, based on the power information data in the data preparation module, establishes Hive data tables using Hive table-creation statements and establishes Hive data summary tables according to association relationships.
Preferably, the Hive data tables include an electricity customer Hive data table, an electric energy meter Hive data table, a metering point Hive data table, a power load Hive data table, a voltage Hive data table, a current Hive data table, and a historical power consumption Hive data table.
Preferably, the Hive data summary tables include a data summary table based on electricity customer information, a data summary table based on metering point information, and an HBase-based Hive summary table;
The data summary table based on electricity customer information includes electric energy meter information and metering point information, and is stored in an HBase data table;
The data summary table based on metering point information includes real-time acquisition information and historical power consumption information, and is stored in an HBase data table, wherein the real-time acquisition information includes power load, voltage, and current data;
The HBase-based Hive summary table is created through Hive.
Preferably, calculating the index set includes calculating a consistency index, a completeness index, an accuracy index, and a validity index respectively;
The consistency index and the completeness index of the power information data are obtained using a quality analysis method based on query statistics;
The accuracy index and the validity index of the power information data are obtained using a quality analysis method based on data mining.
Preferably, obtaining the consistency index and the completeness index of the power information data using the quality analysis method based on query statistics includes:
Calculating the consistency index includes constructing an SQL statement, querying the associated data tables, obtaining the query result, and calculating the consistency rate of the power information data with formula 1, wherein formula 1 is: number of consistent records / total number of records × 100%;
Calculating the completeness index includes constructing an SQL statement, querying the Hive data summary table that has been established, counting the records in the Hive data summary table that were automatically filled with null, and calculating the completeness rate of the power information data with formula 2, wherein formula 2 is: (1 - number of null records / total number of records) × 100%.
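Formulas 1 and 2 reduce to simple ratios once the record counts have been obtained from the queries. A minimal Python sketch follows; the function names are hypothetical, and the expressions are algebraically equivalent rearrangements of the two formulas chosen to avoid rounding noise.

```python
def consistency_rate(consistent_records, total_records):
    """Formula 1: consistent records / total records * 100%."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return 100.0 * consistent_records / total_records


def completeness_rate(null_records, total_records):
    """Formula 2: (1 - null records / total records) * 100%.

    Written as 100 * (total - null) / total, which is the same quantity.
    """
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return 100.0 * (total_records - null_records) / total_records


# Example with made-up counts: 950 of 1000 records consistent, 20 null-filled.
print(consistency_rate(950, 1000))   # 95.0
print(completeness_rate(20, 1000))   # 98.0
```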
Preferably, obtaining the accuracy index and the validity index of the power information data using the quality analysis method based on data mining includes:
The index items of the accuracy index and the validity index are set for the load curve data, voltage curve data, and current curve data of electricity customers in power information acquisition; the quality analysis method based on data mining clusters the power information data to be verified to obtain the accuracy index and the validity index of the power information data.
Preferably, clustering the power information data to be verified to obtain the accuracy index and the validity index of the power information data includes:
Calculating the accuracy index includes: clustering the curve features of the curve data of electricity customers with a parallelized K-means algorithm, obtaining from the clustering result the clusters whose curve variation features are abnormal, and calculating the accuracy rate of the curve data with formula 3, wherein formula 3 is: (1 - number of records with abnormal variation features / total number of records) × 100%;
Calculating the validity index includes: clustering the value range features of the data with a parallelized K-means algorithm, obtaining the distribution of abnormal values from the clustering result, and calculating the validity rate of the power information data with formula 4, wherein formula 4 is: (1 - number of abnormal-value records / total number of records) × 100%.
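Formulas 3 and 4 share the same shape: the share of records that do not fall into an abnormal cluster. The Python sketch below computes that rate from a list of cluster labels. The text does not state how abnormal clusters are recognized, so the rule used here (a cluster holding less than `min_share` of all records is treated as abnormal) and both function names are assumptions for illustration only.

```python
from collections import Counter


def abnormal_clusters(labels, min_share=0.05):
    """Return the cluster ids holding less than min_share of all records.

    The small-cluster rule is an assumption of this sketch, not the patent's rule.
    """
    counts = Counter(labels)
    total = len(labels)
    return {cluster for cluster, n in counts.items() if n / total < min_share}


def rate_from_clusters(labels, abnormal):
    """Formulas 3 and 4: (1 - abnormal records / total records) * 100%."""
    total = len(labels)
    if total == 0:
        raise ValueError("labels must not be empty")
    bad = sum(1 for lab in labels if lab in abnormal)
    return 100.0 * (total - bad) / total


# Made-up clustering result: 100 records in three clusters of size 60, 37 and 3.
labels = [0] * 60 + [1] * 37 + [2] * 3
abnormal = abnormal_clusters(labels)
print(abnormal)                              # {2}
print(rate_from_clusters(labels, abnormal))  # 97.0
```

The same function yields the accuracy rate when the labels come from clustering curve variation features and the validity rate when they come from clustering value range features.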
As can be seen from the above technical solution, the present invention provides a power information data quality analysis system comprising: a data preparation module for collecting and storing power information data; a data integration module that establishes, based on the power information data in the data preparation module, data tables for querying and calculation; and a data analysis module that calculates the index set of the power information data according to the data tables in the data integration module and obtains the validity rate of the power information data. The present invention performs quality analysis of collected power information data based on Hadoop, the Spark computing framework, and the R language, and improves the effectiveness of that analysis. It supports quality analysis of massive collected power information data, greatly improving the efficiency and speed of the analysis; it prepares massive collected power information data quickly and efficiently; and it simplifies the data mining workflow, substantially improving the speed and efficiency of data mining.
Compared with the closest prior art, the technical solution provided by the present invention has the following beneficial effects:
1. In the technical solution provided by the present invention, the quality analysis method for collected power information data introduces big data analysis technology and parallel computing and combines query statistics with data mining. This improves the effectiveness of the quality analysis, and big data technology provides support for analyzing massive collected power information data, greatly improving the efficiency and speed of the analysis.
2. In the technical solution provided by the present invention, the data preparation process uses the distributed file system (HDFS) of the big data platform as the storage medium. HBase and Hive are integrated to combine the low-latency access of HBase's distributed column-oriented storage with Hive's strong SQL support, so that massive collected power information data are prepared quickly and efficiently.
3. In the technical solution provided by the present invention, data analysis adopts a quality analysis method that combines query statistics with data mining. The quality analysis based on query statistics uses the Hive data warehouse and the Spark computing framework; by exploiting Spark's in-memory computation, analysis speed is greatly increased. The quality analysis based on data mining uses the R language and the SparkR library to parallelize the K-means mining algorithm, which simplifies the data mining workflow and substantially improves the speed and efficiency of data mining.
4. The technical solution provided by the present invention is widely applicable and has significant social and economic benefits.
Brief description of the drawings
Fig. 1 is a schematic diagram of a power information data quality analysis system of the present invention;
Fig. 2 is a schematic diagram of the data preparation module in the power information data quality analysis system of the present invention;
Fig. 3 is a schematic diagram of the data integration module in the power information data quality analysis system of the present invention;
Fig. 4 is a schematic diagram of the calculation performed by the data analysis module in the power information data quality analysis system of the present invention;
Fig. 5 is a schematic diagram of a specific application example of a power information data quality analysis system of the present invention.
Detailed description of the invention
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
As shown in Fig. 1, the present invention provides a power information data quality analysis system, comprising a data preparation module, a data integration module, and a data analysis module;
The data preparation module is used to collect and store power information data;
The data integration module establishes, based on the power information data in the data preparation module, data tables for querying and calculation;
The data analysis module calculates the index set of the power information data according to the data tables in the data integration module, and obtains the validity rate of the power information data.
As shown in Fig. 2, the data preparation module includes a data acquisition unit, a data exchange unit, and a data storage unit;
The data acquisition unit is used to collect power information data and store the power information data in relational database form;
The data exchange unit uses a data exchange tool to import the power information data in relational database form into the data storage unit;
The data storage unit is a distributed storage system, and the data storage unit stores the power information data in text form.
The power information data include electricity consumption data and auxiliary data;
The electricity consumption data include electricity customer information, electric energy meter information, metering point information, real-time acquisition information, and historical power consumption information, wherein the real-time acquisition information includes power load, voltage, and current;
The auxiliary data include classification standard data and coding standard data.
As shown in Fig. 3, the data integration module, based on the power information data in the data preparation module, establishes Hive data tables using Hive table-creation statements and establishes Hive data summary tables according to association relationships.
The Hive data tables include an electricity customer Hive data table, an electric energy meter Hive data table, a metering point Hive data table, a power load Hive data table, a voltage Hive data table, a current Hive data table, and a historical power consumption Hive data table.
The Hive data summary tables include a data summary table based on electricity customer information, a data summary table based on metering point information, and an HBase-based Hive summary table;
The data summary table based on electricity customer information includes electric energy meter information and metering point information, and is stored in an HBase data table;
The data summary table based on metering point information includes real-time acquisition information and historical power consumption information, and is stored in an HBase data table, wherein the real-time acquisition information includes power load, voltage, and current data;
The HBase-based Hive summary table is created through Hive.
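One way an HBase-backed Hive table of this kind might be declared is through Hive's HBase storage handler. The Python sketch below only assembles the HiveQL text and does not connect to a cluster. The table names, column names, and the column family `info` are hypothetical, and the text does not specify the actual table-creation statement used.

```python
def hive_over_hbase_ddl(hive_table, hbase_table, key_column, columns, family="info"):
    """Build a HiveQL statement that maps a Hive table onto an existing HBase table.

    key_column is a (name, hive_type) pair mapped to the HBase row key;
    columns is a list of (name, hive_type) pairs mapped to family:name.
    """
    col_defs = ", ".join(f"{name} {hive_type}" for name, hive_type in [key_column] + columns)
    # The mapping must list one entry per Hive column, in order, without spaces.
    mapping = ",".join([":key"] + [f"{family}:{name}" for name, _ in columns])
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({col_defs})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}')"
    )


# Hypothetical summary table keyed by metering point.
ddl = hive_over_hbase_ddl(
    hive_table="metering_point_summary",
    hbase_table="metering_point_summary_hb",
    key_column=("point_id", "string"),
    columns=[("meter_id", "string"), ("load_kw", "double")],
)
print(ddl)
```

Declaring the table as external leaves the HBase table in place if the Hive table is later dropped, which suits data already stored in HBase.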
As shown in Fig. 4, the index set calculated by the data analysis module includes a consistency index, a completeness index, an accuracy index, and a validity index, each calculated separately;
The consistency index and the completeness index of the power information data are obtained using a quality analysis method based on query statistics;
The accuracy index and the validity index of the power information data are obtained using a quality analysis method based on data mining.
Obtaining the consistency index and the completeness index of the power information data using the quality analysis method based on query statistics includes:
Calculating the consistency index includes constructing an SQL statement, querying the associated data tables, obtaining the query result, and calculating the consistency rate of the power information data with the formula "number of consistent records / total number of records × 100%";
Calculating the completeness index includes constructing an SQL statement, querying the Hive data summary table that has been established, counting the records in the Hive data summary table that were automatically filled with null, and calculating the completeness rate of the power information data with the formula "(1 - number of null records / total number of records) × 100%".
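The query-statistics step can be sketched end to end with an in-memory SQLite database standing in for Hive, so that the example runs without a cluster. The two tables, their columns, the sample rows, and the definition of a "consistent" record (the customer table and the meter table agree on the customer-meter pairing) are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (cust_id INTEGER, meter_id INTEGER);
CREATE TABLE meter (meter_id INTEGER, cust_id INTEGER);
INSERT INTO customer VALUES (1, 101), (2, 102), (3, 103), (4, NULL);
INSERT INTO meter VALUES (101, 1), (102, 2), (103, 9);
""")


def quality_rates(conn):
    """Return (consistency rate, completeness rate) in percent."""
    total = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
    # Consistent: both associated tables agree on which meter belongs to which customer.
    consistent = conn.execute(
        "SELECT COUNT(*) FROM customer c JOIN meter m "
        "ON c.meter_id = m.meter_id AND c.cust_id = m.cust_id"
    ).fetchone()[0]
    # Summary-table view: a LEFT JOIN fills unmatched meter columns with NULL,
    # and those null-filled rows are counted as incomplete records.
    nulls = conn.execute(
        "SELECT COUNT(*) FROM customer c LEFT JOIN meter m "
        "ON c.meter_id = m.meter_id WHERE m.meter_id IS NULL"
    ).fetchone()[0]
    return 100.0 * consistent / total, 100.0 * (total - nulls) / total


consistency, completeness = quality_rates(conn)
print(consistency, completeness)  # 50.0 75.0
```

Here customers 1 and 2 are consistent, customer 3's meter is registered to a different customer, and customer 4 has no meter, so the joined row for customer 4 is the only null-filled record.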
Obtaining the accuracy index and the validity index of the power information data using the quality analysis method based on data mining includes:
The index items of the accuracy index and the validity index are set for the load curve data, voltage curve data, and current curve data of electricity customers in power information acquisition; the quality analysis method based on data mining clusters the power information data to be verified to obtain the accuracy index and the validity index of the power information data.
Clustering the power information data to be verified to obtain the accuracy index and the validity index of the power information data includes:
Calculating the accuracy index includes: clustering the curve features of the curve data of electricity customers with a parallelized K-means algorithm, obtaining from the clustering result the clusters whose curve variation features are abnormal, and calculating the accuracy rate of the curve data with the formula "(1 - number of records with abnormal variation features / total number of records) × 100%";
Calculating the validity index includes: clustering the value range features of the data with a parallelized K-means algorithm, obtaining the distribution of abnormal values from the clustering result, and calculating the validity rate of the power information data with the formula "(1 - number of abnormal-value records / total number of records) × 100%".
As shown in Fig. 5, the present invention provides a specific application example of a power information data quality analysis system, as follows:
The quality analysis system for collected power information data, based on Hadoop, the Spark computing framework, and the R language, comprises three parts: data preparation, data integration, and data analysis.
The data preparation part is mainly responsible for obtaining the raw collected power information data. At present, collected power information data are mainly stored in Oracle databases. The data involved in this method mainly include electricity customer information, electric energy meter information, metering point information, real-time acquisition information such as power load, voltage, and current, and historical power consumption information. For these data, which exist in relational database form, the Sqoop data exchange tool is used to import the relevant data tables into the distributed storage system (HDFS), where they are stored in text form. This method also involves some auxiliary data, such as classification standards and coding standards. These data already exist in text form and are uploaded directly to the distributed storage system (HDFS) using an FTP tool.
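As an illustration of the Sqoop import step, the Python sketch below assembles the argument list for a `sqoop import` command that copies one relational table into HDFS as text files. It only builds the command and does not run it. The JDBC URL, user name, password file, table name, and target directory are hypothetical placeholders, and the text does not state which Sqoop options were actually used.

```python
def sqoop_import_args(jdbc_url, username, password_file, table, target_dir, mappers=4):
    """Build the argument list for importing one table into HDFS as plain text."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC connection string of the source database
        "--username", username,
        "--password-file", password_file,  # file holding the password
        "--table", table,                  # source table to import
        "--target-dir", target_dir,        # HDFS directory that receives the files
        "--as-textfile",                   # store the imported rows in text form
        "--num-mappers", str(mappers),     # number of parallel import tasks
    ]


# Hypothetical connection details for an Oracle source table.
args = sqoop_import_args(
    jdbc_url="jdbc:oracle:thin:@//db.example.com:1521/ORCL",
    username="etl_user",
    password_file="/user/etl/.oracle_password",
    table="METER_INFO",
    target_dir="/data/raw/meter_info",
)
print(" ".join(args))
```

The resulting list could be passed to `subprocess.run(args, check=True)` on a host where Sqoop is installed.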
The data integration part mainly establishes the Hive data tables and the Hive data summary tables. Using Hive table-creation statements, an electricity customer Hive data table, an electric energy meter Hive data table, a metering point Hive data table, a power load Hive data table, a voltage Hive data table, a current Hive data table, and a historical power consumption Hive data table are established respectively. Through association relationships, a data summary table based on electricity customer information is established, which includes electric energy meter information and metering point information and is stored in an HBase data table. A data summary table based on metering point information is also established, which includes real-time acquisition information such as power load, voltage, and current as well as historical power consumption information and is stored in an HBase data table. An HBase-based Hive summary table is then created through Hive.
The data analysis part performs the quality analysis of the collected power information data. The quality analysis involved in this method mainly covers the following indexes: consistency, completeness, accuracy, and validity. Consistency and completeness are analyzed with the quality analysis method based on query statistics, while accuracy and validity are analyzed with the quality analysis method based on data mining. For consistency, an SQL statement is constructed to query the associated data tables, and the consistency rate is obtained from the query result with the formula "number of consistent records / total number of records × 100%". For completeness, an SQL statement is constructed to query the Hive data summary table that has been established, the records automatically filled with null in the summary table are counted, and the completeness rate is calculated with the formula "(1 - number of null records / total number of records) × 100%". The accuracy and validity indexes in this method are set mainly for the load curve data, voltage curve data, and current curve data of electricity customers in power information acquisition, and their quality analysis is carried out by clustering the data to be verified. For accuracy, a parallelized K-means algorithm clusters the curve features of the curve data of electricity customers, the clusters whose curve variation features are abnormal are obtained from the clustering result, and the accuracy rate of the curve data is calculated with the formula "(1 - number of records with abnormal variation features / total number of records) × 100%". For validity, a parallelized K-means algorithm clusters the value range features of the data, the distribution of abnormal values is obtained from the clustering result, and the validity rate of the data is calculated with the formula "(1 - number of abnormal-value records / total number of records) × 100%".
The above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art may still modify the specific embodiments of the present invention or make equivalent replacements. Any such modification or equivalent replacement that does not depart from the spirit and scope of the present invention falls within the scope of the pending claims of the present invention.