CN112783883A - Power data standardized cleaning method and device under multi-source data access

Info

Publication number: CN112783883A
Application number: CN202110094083.5A
Authority: CN
Inventors: 周立德; 黎鸣; 陈凤超; 梅傲琪; 胡润锋; 钟志明; 邱泽坚; 何毅鹏; 黄达区; 饶欢; 张锐; 刘沛林; 徐睿烽; 鲁承波
Original assignee: Dongguan Power Supply Bureau of Guangdong Power Grid Co Ltd
Current assignee: Dongguan Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date: 2021-01-22
Filing date: 2021-01-22
Publication date: 2021-05-11
Anticipated expiration: 2041-01-22
Also published as: CN112783883B

Abstract

The invention relates to a standardized cleaning method for power data under multi-source data access, comprising the following steps. S10: perform preliminary clustering on the data by reading the collected data with a K-means algorithm and classifying it according to the attribute value characteristics of the data. S20: perform multi-source data cleaning by using the clustered data as the data source for cleaning, setting the processed data into a database form, and completing the multi-source data cleaning with an existing data cleaning tool. The beneficial effects of the invention are as follows: because the collected data are classified by attribute value characteristics, and the clustered data are put into a database form and cleaned with an existing data cleaning tool, the accuracy of the database processing results is improved, which in turn improves the accuracy of the cleaned data.

Description

Power data standardized cleaning method and device under multi-source data access

Technical Field

The invention relates to the technical field of data processing, in particular to a method and a device for standardized cleaning of electric power data under multi-source data access.

Background

More and more data resources are available, but massive data does not necessarily have real value. The value of data comes from its quality, and the quality of data mining directly influences the quality of decision making. Manually processing such large and disordered data is very difficult, and data quality has become one of the bottlenecks restricting the application and processing of data. Correcting quality problems in the data avoids decision-making errors and reduces decision-making risk, and is therefore an important link in data processing. In previous studies, data cleaning was accomplished with a standardized data cleaning system. However, as data volume grows, the appearance of multi-source data affects the performance of such a system, and this is particularly evident for power data.

Disclosure of Invention

The invention aims to overcome the problems in the prior art and provides a method and a device for standardized cleaning of power data under multi-source data access.

To achieve the above technical purpose and technical effect, the invention is realized by the following technical scheme:

S10, preliminary clustering of the electric power data: the acquired data are read using a K-means algorithm and classified according to the attribute value characteristics of the data, completing the clustering of the electric power data;

S20, multi-source electric power data cleaning: the clustered data are used as the data source for data cleaning, completing the standardized cleaning of the electric power data.
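The clustering in step S10 can be illustrated with a minimal sketch, which is not the claimed implementation. It assumes that each power record has already been reduced to a numeric feature vector; the two features used here (daily production and daily consumption), the cluster count k = 2, and the deterministic choice of the first k points as initial centroids are hypothetical choices made only so that the example is reproducible.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points: Sequence[Point]) -> Point:
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def k_means(points: Sequence[Point], k: int,
            max_iter: int = 100) -> Tuple[List[int], List[Point]]:
    """Plain K-means: alternate assignment and centroid update until stable."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: the first k points serve as initial centroids.
    centroids = list(points[:k])
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change, so the result is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = mean_point(members)
    return labels, centroids


# Hypothetical records: (daily production, daily consumption) in MWh.
records = [(10.0, 9.5), (10.2, 9.7), (50.0, 48.0), (51.0, 49.5)]
labels, centroids = k_means(records, k=2)
print(labels, centroids)
```

In practice, a library implementation with randomized or k-means++ initialization would usually be preferred; the fixed initialization above only keeps the sketch deterministic.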

Further, in S10, the complexity of the data classification work can be expressed as A(n), where the collected data are represented as character strings, and the complexity of the data clustering calculation can be expressed as A(m·I), where m represents the number of data items with different attributes and I represents the number of data items with the same attribute. To ensure the feasibility of data clustering, the constraint condition in the preliminary clustering process is set as:

where S represents the clustering core distance. Transforming the formula gives the clustering constraint condition applicable to multi-source data:

A(m·I)＝A(n)

The calculation precision of the clustering core distance is controlled by computing a similarity, expressed by the formula:

J(A,B)＝|A∩B|/|A∪B|

where J represents the similarity of the calculated core distance and B is a calculation result. This formula keeps the error of the calculation result within 0.5% to ensure the validity of subsequent calculations.

Letting G denote the number of times a data type occurs and H denote the weight of occurrence of that data type, the frequency R with which data of this type occur in the calculation can be expressed as:

This formula is incorporated into a data clustering module to complete the data clustering processing.
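The similarity J(A,B) = |A∩B|/|A∪B| used in S10 is the Jaccard similarity of two sets, and can be sketched as follows. The text does not specify how the sets A and B are built from the core distance, so this sketch assumes only that they are finite sets. Because the collected data are represented as character strings, a hypothetical helper that turns a string into its set of character bigrams is included for illustration; treating two empty sets as identical is likewise an assumption.

```python
from typing import Hashable, Iterable, Set


def jaccard_similarity(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """Return |A ∩ B| / |A ∪ B| for two finite collections treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


def character_bigrams(text: str) -> Set[str]:
    """Hypothetical helper: the set of adjacent character pairs in a string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


# Sets sharing two of four distinct elements.
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))

# String-valued data compared through their character bigrams.
print(jaccard_similarity(character_bigrams("load"), character_bigrams("loan")))
```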

S20, multi-source data cleaning: the clustered data are used as the data source for data cleaning, and the processed data are set into a database form. Two groups of data are set in a database Y: one group is the data set that does not need cleaning, and the other is the data that needs cleaning. The data that do not need cleaning are denoted C, and the data contained in C consist entirely of elements of the database Y. With f_c(a) denoting the number of times C appears in the database and Q_c(a) denoting the similarity between the data to be cleaned and the data not to be cleaned in the database, we have:

Letting Q_c(v) represent the similarity of the subset v in the database, we have:

Setting v₁ ∈ C and v₂ ∈ C, the relationship between data v₂ and v₁ is expressed as:

The data that need to be split in the database can be processed using the above formula, and the multi-source data cleaning work is completed with an existing data cleaning tool.
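The split of database Y into data that need cleaning and data that do not can be sketched as follows. The formulas for f_c(a) and Q_c(a) are not reproduced in this text, so the sketch substitutes stand-ins that are assumptions, not the patented formulas: an occurrence count per distinct value, and a Jaccard similarity over character bigrams measured against a reference set of known-good values. The function name, the threshold of 0.6, and the sample values are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def bigrams(text: str) -> Set[str]:
    """Set of adjacent character pairs in a string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity; two empty sets are treated as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def split_for_cleaning(database: Iterable[str],
                       clean_reference: Iterable[str],
                       threshold: float = 0.6
                       ) -> Tuple[List[str], List[str], Counter]:
    """Partition the distinct values of the database into two groups.

    Values whose best similarity to the reference set reaches the threshold
    are kept as-is; all others are routed to the group that needs cleaning.
    The returned Counter holds how often each value occurs (a stand-in
    for the occurrence count described in S20).
    """
    counts = Counter(database)
    reference = [bigrams(ref) for ref in set(clean_reference)]
    keep: List[str] = []
    to_clean: List[str] = []
    for value in counts:
        grams = bigrams(value)
        score = max((jaccard(grams, ref) for ref in reference), default=0.0)
        (keep if score >= threshold else to_clean).append(value)
    return keep, to_clean, counts


# Hypothetical voltage-level field gathered from several sources.
database_y = ["220kV", "220kV", "220 kV", "n/a"]
keep, to_clean, counts = split_for_cleaning(database_y, ["220kV"])
print(keep, to_clean, counts["220kV"])
```

The group routed to cleaning would then be handed to an existing data cleaning tool, as described above; that hand-off is outside the scope of this sketch.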

The power data includes annual, monthly and single day power production data and electricity usage data.

The data cleaning tool is at least one of IDCENTRIC, PUREINTEGRATE, TRILLIUM, DATACLEANSER, and MATCHIT.

The device comprises a development board, a direct-current power supply, a voltage stabilizing circuit and a communication interface, wherein the direct-current power supply supplies power to the development board through the voltage stabilizing circuit, the direct-current power supply directly supplies power to the communication interface, a USB interface and a network port of the development board are connected with the communication interface, an ARM chip, a clock circuit, a reset circuit, a communication module and a timer are installed on the development board, and the ARM chip is respectively connected with the clock circuit, the reset circuit, the communication module and the timer.

The model of the ARM chip is S3C2440.

The beneficial effects of the invention are as follows: the collected data are classified according to the attribute value characteristics of the data, the clustered data are used as the data source for data cleaning, the processed data are set into a database form, and the multi-source data cleaning work is completed with an existing data cleaning tool. This improves the accuracy of the database processing results and, in turn, the accuracy of the cleaned data.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a schematic flow chart of the cleaning method in an embodiment of the present invention;

FIG. 2 is a block diagram of the cleaning device in an embodiment of the present invention.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

As shown in FIG. 1, a method for standardized cleaning of power data under multi-source data access includes:

S10, preliminary clustering of the electric power data: the acquired data are read using a K-means algorithm and classified according to the attribute value characteristics of the data. The complexity of this work can be expressed as A(n), where the acquired data are represented as character strings, and the complexity of the data clustering calculation can be expressed as A(m·I), where m represents the number of data items with different attributes and I represents the number of data items with the same attribute. To ensure the feasibility of data clustering, the constraint conditions in the preliminary clustering process are set as follows:

A(m·I)＝A(n)

J(A,B)＝|A∩B|/|A∪B|

The formula is integrated into a data clustering module to complete the data clustering processing, wherein the power data comprise annual, monthly and single-day power production data and electricity usage data.

The data cleaning tool is at least one of IDCENTRIC, PUREINTEGRATE, TRILLIUM, DATACLEANSER, and MATCHIT.

As shown in FIG. 2, a standardized cleaning device for power data under multi-source data access is used to run the above standardized cleaning method for power data under multi-source data access. The device comprises a development board, a direct-current power supply, a voltage stabilizing circuit and a communication interface. The direct-current power supply supplies power to the development board through the voltage stabilizing circuit and supplies power directly to the communication interface. A USB interface and a network port of the development board are connected with the communication interface. An ARM chip, a clock circuit, a reset circuit, a communication module and a timer are installed on the development board, and the ARM chip is connected with the clock circuit, the reset circuit, the communication module and the timer respectively.

The model of the ARM chip is S3C2440.

The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed.

Claims

1. A standardized cleaning method for power data under multi-source data access is characterized by comprising the following steps:

2. The standardized cleaning method for power data under multi-source data access of claim 1, wherein in S10, the complexity of the data classification work can be expressed as A(n), the collected data are represented as character strings, and the complexity of the data clustering calculation can be expressed as A(m·I), where m represents the number of data items with different attributes and I represents the number of data items with the same attribute, and in order to ensure the feasibility of data clustering, the constraint conditions in the preliminary clustering process are set as:

A(m·I)＝A(n)

J(A,B)＝|A∩B|/|A∪B|

and the formula is integrated into a data clustering module to complete the data clustering processing.

3. The method for standardized cleaning of power data under multi-source data access according to claim 1, wherein the specific step of S20 is: the clustered data are used as the data source for data cleaning, and the processed data are set into a database form; two groups of data are set in a database Y, one group being the data set that does not need cleaning and the other being the data that needs cleaning; the data that do not need cleaning are denoted C, and the data contained in C consist entirely of elements of the database Y; with f_c(a) denoting the number of times C appears in the database and Q_c(a) denoting the similarity between the data to be cleaned and the data not to be cleaned in the database, we have:

4. The method of claim 1, wherein the power data comprises annual, monthly and single-day power production data and electricity usage data.

5. The standardized cleaning method for power data under multi-source data access according to claim 1, characterized in that: the data cleaning tool is at least one of IDCENTRIC, PUREINTEGRATE, TRILLIUM, DATACLEANSER, and MATCHIT.

6. A standardized cleaning device for power data under multi-source data access, used to run the standardized cleaning method for power data under multi-source data access of any one of claims 1 to 5, characterized in that: the device comprises a development board, a direct-current power supply, a voltage stabilizing circuit and a communication interface; the direct-current power supply supplies power to the development board through the voltage stabilizing circuit and supplies power directly to the communication interface; a USB interface and a network port of the development board are connected with the communication interface; an ARM chip, a clock circuit, a reset circuit, a communication module and a timer are installed on the development board; and the ARM chip is connected with the clock circuit, the reset circuit, the communication module and the timer respectively.

7. The standardized cleaning device for power data under multi-source data access of claim 6, characterized in that: the model of the ARM chip is S3C2440.