CN112783883A - Power data standardized cleaning method and device under multi-source data access - Google Patents
- Publication number
- CN112783883A (application number CN202110094083.5A)
- Authority
- CN
- China
- Prior art keywords
- data
- cleaning
- source
- clustering
- database
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Abstract
The invention relates to a standardized cleaning method for power data under multi-source data access, comprising the following steps. S10: perform preliminary clustering on the data by reading the collected data with a K-means algorithm and classifying it according to the attribute value characteristics of the data. S20: perform multi-source data cleaning by using the clustered data as the data source for cleaning, putting the processed data into database form, and completing the multi-source data cleaning with an existing data cleaning tool. The beneficial effects of the invention are as follows: the collected data are classified by their attribute value characteristics, the clustered data serve as the data source for cleaning, and the cleaning is carried out on database-form data with an existing data cleaning tool, which improves the accuracy of the database processing results and, in turn, the accuracy of the cleaned data.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for standardized cleaning of electric power data under multi-source data access.
Background
More and more data resources are available, but massive data does not necessarily have real value. The value of data comes from its quality, and the quality of data mining directly influences the quality of decision making. Manually processing such huge and cluttered data is very difficult, and data quality has become one of the bottlenecks restricting the application and processing of data. Correcting quality problems in the data avoids decision-making errors and reduces decision-making risk, so it is an important link in data processing. In previous studies, data cleaning was accomplished using a standardized data cleaning system. However, as data volume grows, the emergence of multi-source data degrades the performance of such systems, and this is particularly evident for power data.
Disclosure of Invention
The invention aims to overcome the problems in the prior art and provides a method and a device for standardized cleaning of power data under multi-source data access.
To achieve the above technical purpose and technical effect, the invention is realized by the following technical scheme:
S10, preliminary clustering of the power data: the collected data is read using a K-means algorithm and classified according to the attribute value characteristics of the data, completing the clustering of the power data;
S20, multi-source power data cleaning: the clustered data is used as the data source for data cleaning, completing the standardized cleaning of the power data.
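As an illustration of step S10, the following minimal Python sketch runs a plain K-means (Lloyd's algorithm) over a handful of made-up power records. The record layout, the value of k, the iteration cap and the function name `kmeans` are assumptions for illustration only; the patent does not specify them.

```python
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct records as starting centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return labels, centroids


# Hypothetical single-day records: (production in MWh, usage in MWh).
readings = [(10.0, 9.5), (10.4, 9.8), (9.8, 9.9),
            (50.2, 48.0), (49.7, 47.5), (51.0, 49.1)]
labels, centroids = kmeans(readings, k=2)
```

Here the two groups of readings are far apart, so the small-station and large-station records end up under different labels. Real multi-source data would normally need feature scaling and a considered choice of k before this step.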
Further, in S10, the complexity of the data classification work can be expressed as A(n). The collected data is represented in the form of character strings, and the complexity of the data clustering calculation can be expressed as A(m·I), where m is the number of data items with different attributes and I is the number of data items with the same attribute. To ensure the feasibility of data clustering, the constraint condition in the preliminary clustering process is set as:
where S represents the clustering core distance. Transforming this formula gives the clustering constraint condition applicable to multi-source data:
A(m·I)=A(n)
The calculation precision of the clustering core distance is controlled by calculating a similarity, expressed by the formula:
J(A,B)=|A∩B|/|A∪B|
where J represents the similarity of the calculated core distance and B is a calculation result. This formula keeps the error of the calculation result within 0.5% to ensure the validity of subsequent calculations;
Setting G to represent the number of times a data type occurs and H to represent the weight of that data type's occurrence, the frequency R with which data of this type occurs in the calculation can be expressed as:
This formula is incorporated into the data clustering module to complete the data clustering processing;
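The similarity J(A,B) used above is the Jaccard coefficient. A minimal sketch of how it could be computed for two records held as character strings follows. Splitting a record on commas, the helper names `jaccard` and `tokens`, and the sample records are assumptions, not details given in the patent.

```python
def jaccard(a, b):
    """Jaccard similarity J(A, B) = |A ∩ B| / |A ∪ B| for two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def tokens(record):
    """Split a character-string record into a set of fields (assumed comma-separated)."""
    return {field.strip() for field in record.split(",")}


# Two hypothetical records that differ only in the case of the status field.
r1 = "station-07, 2021-01-22, 10.4MWh, ok"
r2 = "station-07, 2021-01-22, 10.4MWh, OK"
similarity = jaccard(tokens(r1), tokens(r2))  # 3 shared fields out of 5 distinct ones
```

A value of 1 means the two field sets are identical and 0 means they share nothing. How the patent derives its 0.5% error bound from J is not stated in the text, so that part is not modeled here.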
S20, multi-source data cleaning: the clustered data is used as the data source for data cleaning, and the processed data is put into database form. Two groups of data are defined in a database Y: one group is the data set that does not need cleaning, and the other is the data that needs cleaning. The data that does not need cleaning is denoted C, and all data contained in C consists of elements of the database Y. f_c(a) denotes the number of times C appears in the database, and Q_c(a) is the similarity between the data to be cleaned and the data not to be cleaned in the database, so that:
Letting Q_c(v) represent the similarity of the subset v in the database, then:
Setting v1∈C and v2∈C, the relationship between the data v2 and v1 is expressed as:
The data that needs to be split in the database can be processed through the above formulas, and the multi-source data cleaning work is completed using an existing data cleaning tool.
The power data includes annual, monthly and single-day power production data and electricity usage data.
The data cleaning tool is at least one of IDCENTRIC, PUREINTEGRATE, TRILLIUM, DATACLEANSER, MATCHIT.
The device comprises a development board, a direct-current power supply, a voltage stabilizing circuit and a communication interface. The direct-current power supply powers the development board through the voltage stabilizing circuit and powers the communication interface directly. The USB interface and the network port of the development board are connected with the communication interface. An ARM chip, a clock circuit, a reset circuit, a communication module and a timer are installed on the development board, and the ARM chip is connected to the clock circuit, the reset circuit, the communication module and the timer respectively.
The model of the ARM chip is S3C2440.
The beneficial effects of the invention are as follows: the collected data are classified by their attribute value characteristics, the clustered data serve as the data source for cleaning, and the cleaning is carried out on database-form data with an existing data cleaning tool, which improves the accuracy of the database processing results and, in turn, the accuracy of the cleaned data.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic flow chart of the cleaning method in an embodiment of the present invention;
FIG. 2 is a block diagram of the cleaning device in an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
As shown in FIG. 1, a method for standardized cleaning of power data under multi-source data access includes:
S10, preliminary clustering of the power data: the collected data is read using a K-means algorithm and classified according to the attribute value characteristics of the data. The complexity of this classification work can be expressed as A(n). The collected data is represented in the form of character strings, and the complexity of the data clustering calculation can be expressed as A(m·I), where m is the number of data items with different attributes and I is the number of data items with the same attribute. To ensure the feasibility of data clustering, the constraint condition in the preliminary clustering process is set as:
where S represents the clustering core distance. Transforming this formula gives the clustering constraint condition applicable to multi-source data:
A(m·I)=A(n)
The calculation precision of the clustering core distance is controlled by calculating a similarity, expressed by the formula:
J(A,B)=|A∩B|/|A∪B|
where J represents the similarity of the calculated core distance and B is a calculation result. This formula keeps the error of the calculation result within 0.5% to ensure the validity of subsequent calculations;
Setting G to represent the number of times a data type occurs and H to represent the weight of that data type's occurrence, the frequency R with which data of this type occurs in the calculation can be expressed as:
This formula is incorporated into the data clustering module to complete the data clustering processing, wherein the power data comprises annual, monthly and single-day power production data and electricity usage data;
S20, multi-source data cleaning: the clustered data is used as the data source for data cleaning, and the processed data is put into database form. Two groups of data are defined in a database Y: one group is the data set that does not need cleaning, and the other is the data that needs cleaning. The data that does not need cleaning is denoted C, and all data contained in C consists of elements of the database Y. f_c(a) denotes the number of times C appears in the database, and Q_c(a) is the similarity between the data to be cleaned and the data not to be cleaned in the database, so that:
Letting Q_c(v) represent the similarity of the subset v in the database, then:
Setting v1∈C and v2∈C, the relationship between the data v2 and v1 is expressed as:
The data that needs to be split in the database can be processed through the above formulas, and the multi-source data cleaning work is completed using an existing data cleaning tool.
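One possible reading of step S20 is sketched below: the records are loaded into a database table and each record is flagged according to its similarity to the reference set C of data that needs no cleaning. Because the similarity formulas are not reproduced in the text, the sketch substitutes a Jaccard similarity with a fixed threshold. The table layout, the threshold of 0.5 and all names (`build_database`, `power_data`, `needs_cleaning`) are hypothetical, and the hand-off to an existing cleaning tool is not modeled.

```python
import sqlite3


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def build_database(records, reference, threshold=0.5):
    """Store records in an in-memory SQLite table and flag those needing cleaning.

    A record is flagged when its best similarity to any reference record
    (the assumed stand-in for set C) falls below the threshold.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE power_data ("
        "id INTEGER PRIMARY KEY, raw TEXT, needs_cleaning INTEGER)"
    )
    ref_sets = [set(r.split(",")) for r in reference]
    for raw in records:
        best = max(jaccard(set(raw.split(",")), ref) for ref in ref_sets)
        conn.execute(
            "INSERT INTO power_data (raw, needs_cleaning) VALUES (?, ?)",
            (raw, int(best < threshold)),
        )
    conn.commit()
    return conn


# Hypothetical data: one trusted record in C and three incoming records.
reference_c = ["s07,2021-01-22,10.4"]
incoming = ["s07,2021-01-22,10.4", "s07,2021/01/22,??", "s08,2021-01-22,10.4"]
conn = build_database(incoming, reference_c)
to_clean = [row[0] for row in conn.execute(
    "SELECT raw FROM power_data WHERE needs_cleaning = 1 ORDER BY id")]
```

Only the rows flagged with `needs_cleaning = 1` would then be passed on to the cleaning tool, which keeps the already clean group out of the more expensive cleaning pass.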
The data cleaning tool is at least one of IDCENTRIC, PUREINTEGRATE, TRILLIUM, DATACLEANSER, MATCHIT.
As shown in FIG. 2, a standardized cleaning device for power data under multi-source data access is used to run the above standardized cleaning method for power data under multi-source data access. The device comprises a development board, a direct-current power supply, a voltage stabilizing circuit and a communication interface. The direct-current power supply powers the development board through the voltage stabilizing circuit and powers the communication interface directly. The USB interface and the network port of the development board are connected with the communication interface. An ARM chip, a clock circuit, a reset circuit, a communication module and a timer are installed on the development board, and the ARM chip is connected to the clock circuit, the reset circuit, the communication module and the timer respectively.
The model of the ARM chip is S3C2440.
The foregoing shows and describes the general principles, essential features and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and the specification merely illustrate the principle of the invention. Various changes and modifications may be made without departing from the spirit and scope of the invention, and all such changes and modifications fall within the scope of the invention as claimed.
Claims (7)
1. A standardized cleaning method for power data under multi-source data access is characterized by comprising the following steps:
S10, preliminary clustering of the power data: the collected data is read using a K-means algorithm and classified according to the attribute value characteristics of the data, completing the clustering of the power data;
S20, multi-source power data cleaning: the clustered data is used as the data source for data cleaning, completing the standardized cleaning of the power data.
2. The standardized cleaning method for power data under multi-source data access of claim 1, wherein in S10 the complexity of the data classification work can be expressed as A(n), the collected data is represented in the form of character strings, and the complexity of the data clustering calculation can be expressed as A(m·I), where m is the number of data items with different attributes and I is the number of data items with the same attribute; to ensure the feasibility of data clustering, the constraint condition in the preliminary clustering process is set as:
where S represents the clustering core distance; transforming this formula gives the clustering constraint condition applicable to multi-source data:
A(m·I)=A(n)
the calculation precision of the clustering core distance is controlled by calculating a similarity, expressed by the formula:
J(A,B)=|A∩B|/|A∪B|
where J represents the similarity of the calculated core distance and B is a calculation result, and this formula keeps the error of the calculation result within 0.5% to ensure the validity of subsequent calculations;
setting G to represent the number of times a data type occurs and H to represent the weight of that data type's occurrence, the frequency R with which data of this type occurs in the calculation can be expressed as:
this formula is incorporated into the data clustering module to complete the data clustering processing.
3. The method for standardized cleaning of power data under multi-source data access according to claim 1, wherein the specific steps of S20 are: the clustered data is used as the data source for data cleaning, and the processed data is put into database form; two groups of data are defined in a database Y, one group being the data set that does not need cleaning and the other being the data that needs cleaning; the data that does not need cleaning is denoted C, and all data contained in C consists of elements of the database Y; f_c(a) denotes the number of times C appears in the database, and Q_c(a) is the similarity between the data to be cleaned and the data not to be cleaned in the database, so that:
letting Q_c(v) represent the similarity of the subset v in the database, then:
setting v1∈C and v2∈C, the relationship between the data v2 and v1 is expressed as:
the data that needs to be split in the database can be processed through the above formulas, and the multi-source data cleaning work is completed using an existing data cleaning tool.
4. The method of claim 1, wherein the power data comprises annual, monthly and single-day power production data and electricity usage data.
5. The standardized cleaning method for power data under multi-source data access according to claim 1, characterized in that: the data cleaning tool is at least one of IDCENTRIC, PUREINTEGRATE, TRILLIUM, DATACLEANSER, MATCHIT.
6. A standardized cleaning device for power data under multi-source data access, for running the standardized cleaning method for power data under multi-source data access of any one of claims 1 to 5, characterized in that: the device comprises a development board, a direct-current power supply, a voltage stabilizing circuit and a communication interface; the direct-current power supply powers the development board through the voltage stabilizing circuit and powers the communication interface directly; the USB interface and the network port of the development board are connected with the communication interface; an ARM chip, a clock circuit, a reset circuit, a communication module and a timer are installed on the development board, and the ARM chip is connected to the clock circuit, the reset circuit, the communication module and the timer respectively.
7. The standardized cleaning device for power data under multi-source data access of claim 6, characterized in that: the model of the ARM chip is S3C2440.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110094083.5A CN112783883B (en) | 2021-01-22 | 2021-01-22 | Data standardized cleaning method and device under multi-source data access |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110094083.5A CN112783883B (en) | 2021-01-22 | 2021-01-22 | Data standardized cleaning method and device under multi-source data access |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112783883A (en) | 2021-05-11 |
CN112783883B (en) | 2024-09-06 |
Family
ID=75758820
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110094083.5A Active CN112783883B (en) | 2021-01-22 | 2021-01-22 | Data standardized cleaning method and device under multi-source data access |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112783883B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101706791A (en) * | 2009-09-17 | 2010-05-12 | 成都康赛电子科大信息技术有限责任公司 | User preference based data cleaning method |
CN103714154A (en) * | 2013-12-26 | 2014-04-09 | 西安理工大学 | Method for determining optimum cluster number |
CN104317801A (en) * | 2014-09-19 | 2015-01-28 | 东北大学 | Data cleaning system and method for aiming at big data |
CN106021452A (en) * | 2016-05-16 | 2016-10-12 | 南方电网科学研究院有限责任公司 | Electromagnetic environment measurement data cleaning method |
CN107679089A (en) * | 2017-09-05 | 2018-02-09 | 全球能源互联网研究院 | A kind of cleaning method for electric power sensing data, device and system |
CN109993234A (en) * | 2019-04-10 | 2019-07-09 | 百度在线网络技术(北京)有限公司 | A kind of unmanned training data classification method, device and electronic equipment |
WO2019137185A1 (en) * | 2018-01-09 | 2019-07-18 | 美的集团股份有限公司 | Image screening method and apparatus, storage medium and computer device |
CN110209658A (en) * | 2019-06-04 | 2019-09-06 | 北京字节跳动网络技术有限公司 | Data cleaning method and device |
CN110674120A (en) * | 2019-08-09 | 2020-01-10 | 国电新能源技术研究院有限公司 | Wind power plant data cleaning method and device |
CN110928862A (en) * | 2019-10-23 | 2020-03-27 | 深圳市华讯方舟太赫兹科技有限公司 | Data cleaning method, data cleaning apparatus, and computer storage medium |
CN111597178A (en) * | 2020-05-18 | 2020-08-28 | 山东浪潮通软信息科技有限公司 | Method, system, equipment and medium for cleaning repeating data |
Non-Patent Citations (1)
Title |
---|
Han Shuai et al.: "Data cleaning method based on improved K-Means clustering and error feedback", Power System and Clean Energy (《电网与清洁能源》), vol. 36, no. 7, pages 9-14 * |
Also Published As
Publication number | Publication date |
---|---|
CN112783883B (en) | 2024-09-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||