
CN105786996A - Electricity information data quality analyzing system - Google Patents

Info

Publication number
CN105786996A
Authority
CN
China
Prior art keywords
data
power information
hive
information
tables
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610091425.7A
Other languages
Chinese (zh)
Inventor
潘森
朱力鹏
胡斌
周爱华
杨佩
裘洪彬
乔俊峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
Smart Grid Research Institute of SGCC
Original Assignee
State Grid Corp of China SGCC
Smart Grid Research Institute of SGCC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC and Smart Grid Research Institute of SGCC
Priority to CN201610091425.7A
Publication of CN105786996A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • Computational Linguistics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention provides an electricity information data quality analyzing system comprising a data preparation module, a data integration module and a data analysis module. The data preparation module collects and stores electricity information data; the data integration module builds data tables for query and calculation from the electricity information data in the data preparation module; and the data analysis module calculates an index set of the electricity information data from the data tables in the data integration module so as to obtain the effective rate of the electricity information data. The system achieves quality analysis of electricity information collection data based on the Hadoop and Spark computing frameworks and the R language, improves the effect of that analysis, supports quality analysis of massive electricity information collection data and greatly increases its efficiency and speed, prepares massive electricity information collection data rapidly and efficiently, simplifies the data mining process, and greatly improves the speed and efficiency of data mining.

Description

A power information data quality analysis system
Technical field
The present invention relates to the fields of data quality and big data, and in particular to a power information data quality analysis system.
Background
The quality analysis method for power information acquisition data is built on a Hadoop-based big data analysis platform. It makes full use of technologies such as HDFS distributed storage, the Hive data warehouse, the HBase database and the Spark in-memory computing framework, and combines data mining algorithms with a parallelized R computing platform to carry out quality analysis of massive power information acquisition data.
Hadoop is an open-source project of the well-known Apache open-source organization that focuses on distributed storage and computing, and it has attracted growing attention. It can process and analyze large-scale data centrally and systematically and manage huge data sets as a whole, enabling efficient processing of massive data. It is now widely used in fields such as distributed storage, web search, log analysis, advertising computation, distributed computing and data mining.
HDFS (Hadoop Distributed File System) provides the underlying storage support for distributed computing. It offers a highly fault-tolerant, high-throughput solution for storing massive data, is widely used in large-scale online services and large storage systems, and has become a de facto standard for mass data storage. HDFS splits data into blocks and spreads the blocks across a large number of worker nodes to achieve fault tolerance and high performance. HDFS adopts a master/slave architecture: an HDFS cluster consists of a single NameNode and a number of DataNodes. The NameNode maintains the file system of the whole system. DataNodes are configured individually on the nodes of the cluster and manage the storage units of the nodes they run on. Internally, a file is divided into blocks, and these blocks are stored on a set of DataNodes.
Spark computing framework. Spark is a general-purpose parallel computing framework, similar to Hadoop MapReduce, open-sourced by the UC Berkeley AMPLab. Spark's distributed computing follows the MapReduce model and retains the advantages of Hadoop MapReduce. Unlike MapReduce, however, the intermediate output and results of a job can be kept in memory, so that HDFS no longer has to be read and written between steps. Spark is therefore better suited to algorithms that require iteration, such as those used in data mining and machine learning.
R is an integrated statistical analysis software system comprising data processing, statistical analysis and graphical visualization functions, founded jointly by Ross Ihaka and Robert Gentleman. The R language can be regarded as a dialect derived from the S language created at AT&T Bell Laboratories. R is therefore both a piece of software and a language. It is free of charge, freely usable and open source, and it offers excellent statistical analysis functions and powerful statistical graphics. Its concise command parameters are easy for users to understand and operate, and its programmable functional language environment is a great convenience for users who need custom definitions.
SparkR is an R package released by the AMPLab that provides a lightweight front end to Apache Spark. SparkR exposes the API of Spark's Resilient Distributed Dataset (RDD); with this API, users can run jobs on a cluster interactively from the R shell.
K-means is a typical distance-based clustering algorithm that uses distance as the measure of similarity. Taking k as a parameter, it divides n objects into k clusters so that similarity within a cluster is high and similarity between clusters is low. The algorithm proceeds as follows: first, k objects are selected at random, each initially representing the mean, or center, of a cluster. Each remaining object is assigned to the nearest cluster according to its distance from each cluster center. The mean of each cluster is then recalculated. This process is repeated until the criterion function converges.
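The K-means procedure described above can be sketched in a few lines of Python. This is a minimal, single-machine illustration and not the parallelized implementation the invention refers to; the function names, the use of squared Euclidean distance, and the stopping rule (stop when cluster assignments no longer change, used here in place of an explicit criterion function) are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of tuples) into k clusters.

    Returns (centers, assignments), where assignments[i] is the index of
    the cluster that points[i] belongs to.
    """
    rng = random.Random(seed)
    # Step 1: pick k objects at random as the initial cluster centers.
    centers = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest cluster center.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Assumed stopping rule: no assignment changed since the last pass.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: recompute each center as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old center if a cluster became empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignments


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, labels = kmeans(data, k=2)
    print(centers, labels)
```

Because the initial centers are chosen at random, different seeds can give different cluster numbering, and on less well-separated data they can also give different clusterings.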
Therefore, there is an urgent need to design a quality analysis system for power information acquisition data based on Hadoop, the Spark computing framework and the R language.
Summary of the invention
In view of this, the present invention provides a power information data quality analysis system. The system implements quality analysis of power information acquisition data based on Hadoop, the Spark computing framework and the R language and improves the effect of that analysis. It also uses big data technology to support quality analysis of massive power information acquisition data, greatly increasing the efficiency and speed of the analysis; prepares massive power information acquisition data quickly and efficiently; and simplifies the data mining workflow, substantially improving the speed and efficiency of data mining.
The object of the invention is achieved through the following technical solution:
A power information data quality analysis system, the system including a data preparation module, a data integration module and a data analysis module;
The data preparation module is used for acquiring and storing power information data;
The data integration module, based on the power information data in the data preparation module, creates data tables for query and calculation;
The data analysis module, according to the data tables in the data integration module, calculates the index set of the power information data and obtains the effective rate of the power information data.
Preferably, the data preparation module includes a data acquisition unit, a data exchange unit and a data storage unit;
The data acquisition unit is used for acquiring power information data and stores the power information data in relational database form;
The data exchange unit uses a data exchange tool to import the power information data in relational database form into the data storage unit;
The data storage unit is a distributed storage system, and the data storage unit stores the power information data in text form.
Preferably, the power information data include electricity consumption data and auxiliary data;
The electricity consumption data include electricity customer information, electric energy meter information, metering point information, real-time acquisition information and historical electricity consumption information, wherein the real-time acquisition information includes power load, voltage and current;
The auxiliary data include classification standard data and coding standard data.
Preferably, the data integration module, based on the power information data in the data preparation module, creates Hive data tables with Hive CREATE TABLE statements and creates Hive data summary tables according to the association relationships between them.
Preferably, the Hive data tables include an electricity customer Hive data table, an electric energy meter Hive data table, a metering point Hive data table, a power load Hive data table, a voltage Hive data table, a current Hive data table and a historical electricity consumption Hive data table.
Preferably, the Hive data summary tables include a data summary table based on electricity customer information, a data summary table based on metering point information, and a Hive summary table based on HBase;
The data summary table based on electricity customer information includes electric energy meter information and metering point information, and is stored in an HBase data table;
The data summary table based on metering point information includes real-time acquisition information and historical electricity consumption information, and is stored in an HBase data table, wherein the real-time acquisition information includes power load, voltage and current data;
The Hive summary table based on HBase is created through Hive.
Preferably, calculating the index set includes calculating a consistency index, an integrity index, an accuracy index and a validity index respectively;
The consistency index and the integrity index of the power information data are both obtained with a quality analysis method based on query statistics;
The accuracy index and the validity index of the power information data are both obtained with a quality analysis method based on data mining.
Preferably, obtaining the consistency index and the integrity index of the power information data with the quality analysis method based on query statistics includes:
Calculating the consistency index includes building SQL statements, querying the associated data tables, obtaining the query results, and calculating the consistency rate of the power information data with formula 1, wherein formula 1 is: number of consistent records / total number of records * 100%;
Calculating the integrity index includes building SQL statements, querying the Hive data summary tables that have been created, counting the records in the Hive data summary tables that were automatically filled with null, and calculating the integrity rate of the power information data with formula 2, wherein formula 2 is: (1 - number of null records / total number of records) * 100%.
Preferably, obtaining the accuracy index and the validity index of the power information data with the quality analysis method based on data mining includes:
The index items of the accuracy index and the validity index are defined for the load curve data, voltage curve data and current curve data of electricity customers in power information acquisition; the quality analysis method based on data mining obtains the accuracy index and the validity index of the power information data by clustering the power information data to be verified.
Preferably, clustering the power information data to be verified to obtain the accuracy index and the validity index of the power information data includes:
Calculating the accuracy index includes: clustering the curve features of the electricity customers' curve data with a parallelized K-means algorithm, obtaining from the clustering result the clusters whose curve variation features are abnormal, and calculating the accuracy rate of the curve data with formula 3, wherein formula 3 is: (1 - number of records with abnormal variation features / total number of records) * 100%;
Calculating the validity index includes: clustering the value-range features of the data with a parallelized K-means algorithm, obtaining the distribution of abnormal values from the clustering result, and calculating the effective rate of the power information data with formula 4, wherein formula 4 is: (1 - number of abnormal-value records / total number of records) * 100%.
As can be seen from the above technical solution, the invention provides a power information data quality analysis system that includes a data preparation module for acquiring and storing power information data; a data integration module that, based on the power information data in the data preparation module, creates data tables for query and calculation; and a data analysis module that, according to the data tables in the data integration module, calculates the index set of the power information data and obtains the effective rate of the power information data. The invention implements quality analysis of power information acquisition data based on Hadoop, the Spark computing framework and the R language, improves the effect of that analysis, supports quality analysis of massive power information acquisition data and greatly increases its efficiency and speed, prepares massive power information acquisition data quickly and efficiently, and simplifies the data mining workflow, substantially improving the speed and efficiency of data mining.
Compared with the closest prior art, the technical solution provided by the invention has the following beneficial effects:
1. In the technical solution provided by the invention, the quality analysis method for power information acquisition data introduces big data analysis technology and parallel computing and adopts a quality analysis method that combines query statistics with data mining. This improves the effect of quality analysis of power information acquisition data, while big data technology supports quality analysis of massive power information acquisition data and greatly increases its efficiency and speed.
2. In the technical solution provided by the invention, the data preparation process uses the distributed file system (HDFS) of the big data platform as the storage medium and takes advantage of the low-latency access offered by the column-oriented storage of the HBase distributed database and of the strong SQL support of Hive tables. HBase and Hive are integrated so that massive power information acquisition data is prepared quickly and efficiently.
3. In the technical solution provided by the invention, data analysis adopts a quality analysis method that combines query statistics with data mining. Quality analysis based on mathematical statistics uses the Hive data warehouse and the Spark computing framework; because Spark works in memory, analysis speed increases substantially. Quality analysis based on data mining uses the R language and the SparkR library to parallelize the K-means mining algorithm, which simplifies the data mining workflow and substantially improves the speed and efficiency of data mining.
4. The technical solution provided by the invention is widely applicable and has significant social and economic benefits.
Brief description of the drawings
Fig. 1 is a schematic diagram of a power information data quality analysis system of the present invention;
Fig. 2 is a schematic diagram of the data preparation module in the power information data quality analysis system of the present invention;
Fig. 3 is a schematic diagram of the data integration module in the power information data quality analysis system of the present invention;
Fig. 4 is a schematic diagram of the calculations of the data analysis module in the power information data quality analysis system of the present invention;
Fig. 5 is a schematic diagram of a concrete application example of a power information data quality analysis system of the present invention.
Detailed description of the invention
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the invention without creative work fall within the scope of protection of the invention.
As shown in Fig. 1, the present invention provides a power information data quality analysis system including a data preparation module, a data integration module and a data analysis module;
The data preparation module is used for acquiring and storing power information data;
The data integration module, based on the power information data in the data preparation module, creates data tables for query and calculation;
The data analysis module, according to the data tables in the data integration module, calculates the index set of the power information data and obtains the effective rate of the power information data.
As shown in Fig. 2, the data preparation module includes a data acquisition unit, a data exchange unit and a data storage unit;
The data acquisition unit is used for acquiring power information data and stores the power information data in relational database form;
The data exchange unit uses a data exchange tool to import the power information data in relational database form into the data storage unit;
The data storage unit is a distributed storage system, and the data storage unit stores the power information data in text form.
The power information data include electricity consumption data and auxiliary data;
The electricity consumption data include electricity customer information, electric energy meter information, metering point information, real-time acquisition information and historical electricity consumption information, wherein the real-time acquisition information includes power load, voltage and current;
The auxiliary data include classification standard data and coding standard data.
As shown in Fig. 3, the data integration module, based on the power information data in the data preparation module, creates Hive data tables with Hive CREATE TABLE statements and creates Hive data summary tables according to the association relationships between them.
The Hive data tables include an electricity customer Hive data table, an electric energy meter Hive data table, a metering point Hive data table, a power load Hive data table, a voltage Hive data table, a current Hive data table and a historical electricity consumption Hive data table.
The Hive data summary tables include a data summary table based on electricity customer information, a data summary table based on metering point information, and a Hive summary table based on HBase;
The data summary table based on electricity customer information includes electric energy meter information and metering point information, and is stored in an HBase data table;
The data summary table based on metering point information includes real-time acquisition information and historical electricity consumption information, and is stored in an HBase data table, wherein the real-time acquisition information includes power load, voltage and current data;
The Hive summary table based on HBase is created through Hive.
As shown in Fig. 4, calculating the index set in the data analysis module includes calculating a consistency index, an integrity index, an accuracy index and a validity index respectively;
The consistency index and the integrity index of the power information data are both obtained with a quality analysis method based on query statistics;
The accuracy index and the validity index of the power information data are both obtained with a quality analysis method based on data mining.
Obtaining the consistency index and the integrity index of the power information data with the quality analysis method based on query statistics includes:
Calculating the consistency index includes building SQL statements, querying the associated data tables, obtaining the query results, and calculating the consistency rate of the power information data with the formula "number of consistent records / total number of records * 100%";
Calculating the integrity index includes building SQL statements, querying the Hive data summary tables that have been created, counting the records in the Hive data summary tables that were automatically filled with null, and calculating the integrity rate of the power information data with the formula "(1 - number of null records / total number of records) * 100%".
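The two query-statistics formulas above can be illustrated with a short Python sketch. It works on in-memory rows rather than Hive query results, and the function names, the sample rows and the rule that a record counts as a null record when any of its fields is None are assumptions made for the example, not details taken from the invention.

```python
def consistency_rate(consistent_records, total_records):
    """Consistency rate in percent: consistent records / total records * 100."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return 100.0 * consistent_records / total_records


def count_null_records(rows):
    """Count rows in which at least one field is None (assumed null rule)."""
    return sum(1 for row in rows if any(v is None for v in row.values()))


def integrity_rate(null_records, total_records):
    """Integrity rate in percent: (1 - null records / total records) * 100."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    # Algebraically the same as (1 - null/total) * 100, written to avoid
    # an intermediate rounding step.
    return 100.0 * (total_records - null_records) / total_records


if __name__ == "__main__":
    # Hypothetical rows standing in for a joined summary table.
    rows = [
        {"customer_id": "C1", "meter_id": "M1", "voltage": 220.1},
        {"customer_id": "C2", "meter_id": None, "voltage": 219.8},
        {"customer_id": "C3", "meter_id": "M3", "voltage": 221.0},
        {"customer_id": "C4", "meter_id": "M4", "voltage": 220.5},
    ]
    print(integrity_rate(count_null_records(rows), len(rows)))
    print(consistency_rate(9, 10))
```

In the system described here the two counts would come from SQL queries against the associated Hive tables; the arithmetic on the counts is the same.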
Obtaining the accuracy index and the validity index of the power information data with the quality analysis method based on data mining includes:
The index items of the accuracy index and the validity index are defined for the load curve data, voltage curve data and current curve data of electricity customers in power information acquisition; the quality analysis method based on data mining obtains the accuracy index and the validity index of the power information data by clustering the power information data to be verified.
Clustering the power information data to be verified to obtain the accuracy index and the validity index of the power information data includes:
Calculating the accuracy index includes: clustering the curve features of the electricity customers' curve data with a parallelized K-means algorithm, obtaining from the clustering result the clusters whose curve variation features are abnormal, and calculating the accuracy rate of the curve data with the formula "(1 - number of records with abnormal variation features / total number of records) * 100%";
Calculating the validity index includes: clustering the value-range features of the data with a parallelized K-means algorithm, obtaining the distribution of abnormal values from the clustering result, and calculating the effective rate of the power information data with the formula "(1 - number of abnormal-value records / total number of records) * 100%".
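Both data-mining formulas above have the same shape: one minus the share of records that fall into clusters judged abnormal. The Python sketch below shows that calculation on a list of cluster labels. How abnormal clusters are identified is not specified in the text, so the set of abnormal cluster ids is simply passed in as a parameter; the function name and sample labels are hypothetical.

```python
def rate_from_clusters(labels, abnormal_clusters):
    """Return (1 - abnormal records / total records) * 100.

    labels            -- cluster id assigned to each record by K-means
    abnormal_clusters -- ids of clusters judged abnormal (assumed to be
                         supplied by a separate inspection step)
    """
    total = len(labels)
    if total == 0:
        raise ValueError("labels must not be empty")
    abnormal = sum(1 for label in labels if label in abnormal_clusters)
    # Same value as (1 - abnormal/total) * 100, computed in one division.
    return 100.0 * (total - abnormal) / total


if __name__ == "__main__":
    # Hypothetical labels for eight load curves; clusters 1 and 2 are
    # taken to hold curves with abnormal variation features.
    curve_labels = [0, 0, 0, 1, 0, 2, 0, 0]
    print(rate_from_clusters(curve_labels, abnormal_clusters={1, 2}))
```

Applied to clusters of curve features, the result corresponds to the accuracy rate; applied to clusters of value-range features, it corresponds to the effective rate.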
As shown in Fig. 5, the present invention provides a concrete application example of a power information data quality analysis system, as follows:
The quality analysis system for power information acquisition data based on Hadoop, the Spark computing framework and the R language includes three parts: data preparation, data integration and data analysis.
The data preparation part is mainly responsible for obtaining the raw power information acquisition data. At present, power information acquisition data is mainly stored in Oracle databases. The data involved in this method mainly include electricity customer information, electric energy meter information, metering point information, real-time acquisition information such as power load, voltage and current, and historical electricity consumption information. For data that exists in relational database form, the Sqoop data exchange tool is used to import the relevant tables into the distributed storage system (HDFS), where they are stored in text form. This method also involves some auxiliary data, such as classification standards and coding standards; this data already exists in text form and is uploaded directly to the distributed storage system (HDFS) with an FTP tool.
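The import step can be pictured with the following Python sketch, which only assembles a Sqoop import command line and does not run it. The JDBC URL, user name, password file, table name and HDFS directory are hypothetical placeholders; the options used (`--connect`, `--username`, `--password-file`, `--table`, `--target-dir`, `--as-textfile`, `--num-mappers`) are standard Sqoop import arguments, but the actual settings used in the application example are not given in the text.

```python
def sqoop_import_command(jdbc_url, username, password_file, table,
                         target_dir, mappers=4):
    """Build the argument list for importing one relational table into HDFS.

    The table is written as plain text files, matching the text-form
    storage described for the data storage unit.
    """
    return [
        "sqoop", "import",
        "--connect", jdbc_url,              # JDBC URL of the source database
        "--username", username,
        "--password-file", password_file,   # file holding the password
        "--table", table,                   # source table to import
        "--target-dir", target_dir,         # destination directory in HDFS
        "--as-textfile",                    # store the rows as text
        "--num-mappers", str(mappers),      # number of parallel map tasks
    ]


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    cmd = sqoop_import_command(
        jdbc_url="jdbc:oracle:thin:@//db.example.com:1521/ORCL",
        username="collector",
        password_file="/user/collector/.oracle_password",
        table="METERING_POINT",
        target_dir="/data/power_info/metering_point",
    )
    print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run` on a machine where Sqoop is installed; one such command would be issued per source table.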
The data integration part mainly creates the Hive data tables and the Hive data summary tables. Hive CREATE TABLE statements are used to create, respectively, the electricity customer Hive data table, the electric energy meter Hive data table, the metering point Hive data table, the power load Hive data table, the voltage Hive data table, the current Hive data table and the historical electricity consumption Hive data table. Using the association relationships, a data summary table based on electricity customer information is created, including electric energy meter information and metering point information, and stored in an HBase data table. A data summary table based on metering point information is also created, including real-time acquisition information such as power load, voltage and current as well as historical electricity consumption information, and stored in an HBase data table. A Hive summary table based on HBase is then created through Hive.
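One common way to create a Hive table on top of an HBase table is Hive's HBase storage handler. The Python sketch below generates such a statement as a string. It is an illustration under assumptions: the table names, column names and the column family `info` are hypothetical, all columns are typed as STRING for brevity, and the text does not say which mechanism or schema the application example actually uses.

```python
def hbase_backed_hive_ddl(hive_table, hbase_table, key_column, columns,
                          family="info"):
    """Return HiveQL that exposes an existing HBase table as a Hive table.

    The first Hive column is mapped to the HBase row key; every other
    column is mapped to `family:<column name>`.
    """
    col_defs = ",\n  ".join(f"{name} STRING" for name in [key_column, *columns])
    mapping = ",".join([":key", *(f"{family}:{name}" for name in columns)])
    return (
        f"CREATE EXTERNAL TABLE {hive_table} (\n  {col_defs}\n)\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f'WITH SERDEPROPERTIES ("hbase.columns.mapping" = "{mapping}")\n'
        f'TBLPROPERTIES ("hbase.table.name" = "{hbase_table}");'
    )


if __name__ == "__main__":
    # Hypothetical summary table keyed by electricity customer.
    print(hbase_backed_hive_ddl(
        hive_table="customer_summary",
        hbase_table="customer_summary_hbase",
        key_column="customer_id",
        columns=["meter_id", "metering_point_id"],
    ))
```

With a table defined this way, the SQL statements used for the consistency and integrity checks can query the HBase-stored summary data through Hive.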
The data analysis part performs the quality analysis of the power information acquisition data. The quality analysis involved in this method mainly covers the following indexes: consistency, integrity, accuracy and validity. Consistency and integrity are assessed with the quality analysis method based on query statistics, while accuracy and validity are assessed with the quality analysis method based on data mining. For consistency, SQL statements are built to query the associated data tables, and the consistency rate is obtained from the query results with the formula "number of consistent records / total number of records * 100%". For integrity, SQL statements are built to query the Hive data summary tables that have been created, the records in the summary tables that were automatically filled with null are counted, and the integrity rate is calculated with the formula "(1 - number of null records / total number of records) * 100%". The accuracy and validity index items in this method are defined mainly for the load curve data, voltage curve data and current curve data of electricity customers in power information acquisition, and their quality analysis is carried out by clustering the data to be verified. For accuracy, the curve features of the electricity customers' curve data are clustered with a parallelized K-means algorithm, the clusters whose curve variation features are abnormal are obtained from the clustering result, and the accuracy rate of the curve data is calculated with the formula "(1 - number of records with abnormal variation features / total number of records) * 100%". For validity, the value-range features of the data are clustered with a parallelized K-means algorithm, the distribution of abnormal values is obtained from the clustering result, and the effective rate of the data is calculated with the formula "(1 - number of abnormal-value records / total number of records) * 100%".
The above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art may still modify the specific embodiments of the invention or substitute equivalents for them; any such modification or equivalent substitution that does not depart from the spirit and scope of the invention falls within the scope of the pending claims of this application.

Claims (10)

1. An electricity information data quality analysis system, characterized in that the system comprises a data preparation module, a data integration module, and a data analysis module;
The data preparation module is configured to collect and store electricity information data;
The data integration module establishes, based on the electricity information data in the data preparation module, data tables for querying and calculation;
The data analysis module calculates, from the data tables in the data integration module, an index set of the electricity information data and obtains the validity rate of the electricity information data.
2. The system as claimed in claim 1, characterized in that the data preparation module comprises a data acquisition unit, a data exchange unit, and a data storage unit;
The data acquisition unit is configured to collect electricity information data and to store the electricity information data in relational database form;
The data exchange unit uses a data exchange tool to import the electricity information data in relational database form into the data storage unit;
The data storage unit is a distributed storage system, and the data storage unit stores the electricity information data in text form.
3. The system as claimed in claim 1 or 2, characterized in that the electricity information data comprise electricity consumption data and auxiliary data;
The electricity consumption data comprise electricity customer information, electric energy meter information, metering point information, real-time collected information, and historical power consumption information, wherein the real-time collected information comprises power load, voltage, and current;
The auxiliary data comprise classification standard data and coding standard data.
4. The system as claimed in claim 1, characterized in that the data integration module, based on the electricity information data in the data preparation module, establishes Hive data tables using Hive CREATE TABLE statements and establishes Hive wide tables according to association relationships.
5. The system as claimed in claim 4, characterized in that the Hive data tables comprise an electricity customer Hive data table, an electric energy meter Hive data table, a metering point Hive data table, a power load Hive data table, a voltage Hive data table, a current Hive data table, and a historical power consumption Hive data table.
6. The system as claimed in claim 4, characterized in that the Hive wide tables comprise a wide table based on electricity customer information, a wide table based on metering point information, and a Hive wide table based on HBase;
The wide table based on electricity customer information comprises electric energy meter information and metering point information, and is stored in an HBase data table;
The wide table based on metering point information comprises real-time collected information and historical power consumption information, and is stored in an HBase data table, wherein the real-time collected information comprises power load, voltage, and current data;
The Hive wide table based on HBase is established through Hive.
7. The system as claimed in claim 1, characterized in that calculating the index set comprises separately calculating a consistency index, a completeness index, an accuracy index, and a validity index;
The consistency index and the completeness index of the electricity information data are obtained with a quality analysis method based on query statistics;
The accuracy index and the validity index of the electricity information data are obtained with a quality analysis method based on data mining.
8. The system as claimed in claim 7, characterized in that obtaining the consistency index and the completeness index of the electricity information data with the quality analysis method based on query statistics comprises:
Calculating the consistency index comprises constructing SQL statements, querying the associated data tables, obtaining the query results, and calculating the consistency rate of the electricity information data with formula 1, wherein formula 1 is: consistent records / total records * 100%;
Calculating the completeness index comprises constructing SQL statements, querying the Hive wide tables that have been established, counting the records that were automatically filled with null in the Hive wide tables, and calculating the completeness rate of the electricity information data with formula 2, wherein formula 2 is: (1 - null records / total records) * 100%.
9. The system as claimed in claim 7, characterized in that obtaining the accuracy index and the validity index of the electricity information data with the quality analysis method based on data mining comprises:
The index items of the accuracy index and the validity index are index items defined for the load curve data, voltage curve data, and current curve data of electricity customers in the collected electricity information; the quality analysis method based on data mining clusters the electricity information data to be verified to obtain the accuracy index and the validity index of the electricity information data.
10. The system as claimed in claim 9, characterized in that clustering the electricity information data to be verified to obtain the accuracy index and the validity index of the electricity information data comprises:
Calculating the accuracy index comprises: clustering the curve features of the electricity customers' curve data with a parallelized K-means algorithm, obtaining from the clustering result the clusters whose curve change features are abnormal, and calculating the accuracy rate of the curve data with formula 3, wherein formula 3 is: (1 - records with abnormal change features / total records) * 100%;
Calculating the validity index comprises: clustering the value range features of the data with a parallelized K-means algorithm, obtaining the distribution of outliers from the clustering result, and calculating the validity rate of the electricity information data with formula 4, wherein formula 4 is: (1 - outlier records / total records) * 100%.
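The clustering-based indicators of formulas 3 and 4 can be sketched as below. This is an illustrative, sequential K-means on toy data, not the parallelized implementation the claims refer to. The farthest-point initialisation, the rule that a cluster holding less than a `min_share` fraction of the records is treated as abnormal, the first-difference curve features, and all sample values are assumptions of this sketch; the patent does not specify them.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    # Deterministic farthest-point initialisation (assumption of this sketch).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def rate_from_clusters(labels, min_share=0.2):
    """(1 - abnormal records / total records) * 100.

    Clusters holding less than min_share of all records count as abnormal.
    """
    total = len(labels)
    sizes = {lab: labels.count(lab) for lab in set(labels)}
    abnormal = sum(n for n in sizes.values() if n / total < min_share)
    return (1 - abnormal / total) * 100


def change_features(curve):
    """Curve change features, taken here as first differences."""
    return [b - a for a, b in zip(curve, curve[1:])]


# Accuracy (formula 3): cluster the change features of load curves.
curves = [[1.0 + 0.01 * i, 2.0, 3.0, 2.0] for i in range(7)] + [[1.0, 9.0, 0.5, 9.0]]
accuracy = rate_from_clusters(kmeans([change_features(c) for c in curves], k=2))

# Validity (formula 4): cluster the (min, max) value range of each record.
ranges = [[218.0 + 0.1 * i, 222.0] for i in range(9)] + [[0.0, 500.0]]
validity = rate_from_clusters(kmeans(ranges, k=2))

print(accuracy, validity)
```

In the toy data one of eight curves and one of ten value ranges end up in a small, isolated cluster, so the rates reflect one abnormal record in each case. A production system would replace `kmeans` with a distributed implementation and choose the abnormal-cluster rule from domain knowledge.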
CN201610091425.7A 2016-02-18 2016-02-18 Electricity information data quality analyzing system Pending CN105786996A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610091425.7A CN105786996A (en) 2016-02-18 2016-02-18 Electricity information data quality analyzing system

Publications (1)

Publication Number Publication Date
CN105786996A true CN105786996A (en) 2016-07-20

Family

ID=56402264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610091425.7A Pending CN105786996A (en) 2016-02-18 2016-02-18 Electricity information data quality analyzing system

Country Status (1)

Country Link
CN (1) CN105786996A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
CB02 Change of applicant information

Address after: 102209 Beijing City, Changping District science and Technology Park in the future smart grid research institute hospital

Applicant after: GLOBAL ENERGY INTERCONNECTION RESEARCH INSTITUTE

Applicant after: State Grid Corporation of China

Address before: 102211 Beijing city Changping District Xiaotangshan town big East Village Road No. 270 (future technology city)

Applicant before: State Grid Smart Grid Institute

Applicant before: State Grid Corporation of China

SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160720