
CN112783883A - Power data standardized cleaning method and device under multi-source data access - Google Patents

Info

Publication number
CN112783883A
Authority
CN
China
Prior art keywords
data
cleaning
source
clustering
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110094083.5A
Other languages
Chinese (zh)
Other versions
CN112783883B (en)
Inventor
周立德
黎鸣
陈凤超
梅傲琪
胡润锋
钟志明
邱泽坚
何毅鹏
黄达区
饶欢
张锐
刘沛林
徐睿烽
鲁承波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongguan Power Supply Bureau of Guangdong Power Grid Co Ltd
Original Assignee
Dongguan Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongguan Power Supply Bureau of Guangdong Power Grid Co Ltd filed Critical Dongguan Power Supply Bureau of Guangdong Power Grid Co Ltd
Priority to CN202110094083.5A priority Critical patent/CN112783883B/en
Publication of CN112783883A publication Critical patent/CN112783883A/en
Application granted granted Critical
Publication of CN112783883B publication Critical patent/CN112783883B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a standardized cleaning method for power data under multi-source data access, comprising the following steps: S10, performing preliminary clustering on the data, in which the collected data are read with a K-means algorithm and classified according to their attribute value characteristics; and S20, performing multi-source data cleaning, in which the clustered data serve as the data source for cleaning, the processed data are arranged in database form, and the cleaning is completed with an existing data cleaning tool. The beneficial effects of the invention are as follows: because the collected data are classified by attribute value characteristics before cleaning, and the clustered data are arranged in database form and cleaned with an existing data cleaning tool, the accuracy of the database processing results is improved, which in turn improves the accuracy of the cleaned data.

Description

Power data standardized cleaning method and device under multi-source data access
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for standardized cleaning of electric power data under multi-source data access.
Background
More and more data resources are available, but a large volume of data does not necessarily carry real value. The value of data comes from its quality, and the quality of data mining directly affects the quality of decision making. Manually processing such large and disordered data is very difficult, and data quality has become one of the bottlenecks restricting the application and processing of data. Correcting quality problems in the data, so as to avoid decision errors and reduce decision risk, is therefore an important step in data processing. In previous studies, data cleaning was accomplished with a standardized data cleaning system. However, as data volumes grow, the emergence of multi-source data degrades the performance of such systems, and this is particularly evident for power data.
Disclosure of Invention
The invention aims to overcome the problems in the prior art and provides a method and a device for standardized cleaning of power data under multi-source data access.
In order to achieve the technical purpose and achieve the technical effect, the invention is realized by the following technical scheme:
S10, performing preliminary clustering processing on the electric power data: reading the acquired data by using a K-means algorithm and classifying the acquired data according to attribute value characteristics of the data, thereby completing the clustering of the electric power data;
and S20, electric power multi-source data cleaning, wherein the data after clustering processing is used as a data source for data cleaning, and electric power data standardization cleaning is completed.
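As an illustration of step S10 only, the following minimal Python sketch runs a plain K-means (Lloyd's algorithm) over a handful of records. The record layout (daily generation and consumption in MWh), the function name k_means, the evenly spaced initial centroids and the choice k=2 are assumptions made for this example; the patent does not specify them.

```python
from math import dist

def k_means(points, k, iterations=20):
    """Plain Lloyd's K-means on tuples of floats; returns (centroids, labels)."""
    # Deterministic start (an assumption): take k points spread evenly through the input.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

# Hypothetical records: (daily generation in MWh, daily consumption in MWh)
readings = [(120.0, 95.0), (118.5, 97.2), (121.3, 94.1),
            (40.2, 38.9), (41.0, 37.5), (39.8, 40.1)]
centroids, labels = k_means(readings, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Records that share a label can then be passed together to the cleaning step S20.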
Further, in S10, the complexity of the data classification work can be expressed as A(n); the collected data are represented as character strings, and the complexity of the data clustering calculation can be expressed as A(m·I), where m represents the number of data with different attributes and I represents the number of data with the same attribute. To ensure the feasibility of data clustering, the constraint conditions in the preliminary clustering process are set as:
Figure BDA0002912607040000021
in the formula, S represents a clustering core distance, and the formula is transformed to obtain a clustering constraint condition applicable to multi-source data, which includes:
Figure BDA0002912607040000022
A(m·I)=A(n)
the method adopts a form of calculating the similarity to control the calculation precision of the clustering core distance, and is expressed as follows through a formula:
J(A,B)=|A∩B|/|A∪B|
in the formula, J represents the similarity of the calculated core distance, B is a calculation result, and the error value of the calculation result is controlled within 0.5% through the formula so as to ensure the effectiveness of subsequent calculation;
setting G to represent the number of times of data type occurrence and H to represent the weight of the occurrence of the data type, the frequency R of occurrence of data of this type in the calculation can be represented as:
Figure BDA0002912607040000023
the formula is merged into a data clustering module to complete data clustering processing;
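The similarity J(A,B) above is the Jaccard coefficient of two sets. A minimal sketch of how it might be computed is given below; the attribute names and the treatment of two empty sets as fully similar are assumptions of this example, and the sketch does not reproduce the 0.5% error control, whose procedure is not detailed in the text.

```python
def jaccard(a, b):
    """Jaccard similarity J(A, B) = |A ∩ B| / |A ∪ B| for two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)

# Hypothetical attribute-name sets from two data sources
source_a = {"meter_id", "timestamp", "kwh", "voltage"}
source_b = {"meter_id", "timestamp", "kwh", "region"}
similarity = jaccard(source_a, source_b)
print(similarity)  # 3 shared attributes out of 5 distinct ones: 0.6
```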
S20, multi-source data cleaning: the clustered data are used as the data source for data cleaning, and the processed data are arranged in database form. Two groups of data are set in a database Y: one group is the data set that does not need cleaning, and the other is the data that needs cleaning. The data that does not need cleaning is denoted C, and all data contained in C consist of elements of the database Y. f_c(a) denotes the number of times C appears in the database, and Q_c(a) denotes the similarity between the data to be cleaned and the data not to be cleaned in the database, so that:
Figure BDA0002912607040000031
Let Q_c(v) represent the similarity of the subset v in the database; then:
Figure BDA0002912607040000032
Let v1 ∈ C and v2 ∈ C; then the relationship between data v2 and v1 is expressed as:
Figure BDA0002912607040000033
the data which needs to be split in the database can be processed through the formula, and the multi-source data cleaning work is completed by adopting the existing data cleaning tool.
The power data includes annual, monthly and single day power production data and electricity usage data.
The data cleaning tool is at least one of IDCENTRIC, PUREINTEGRATE, TRILLIUM, DATACLEANSESER and MATCHIT.
The device comprises a development board, a direct-current power supply, a voltage stabilizing circuit and a communication interface, wherein the direct-current power supply supplies power to the development board through the voltage stabilizing circuit, the direct-current power supply directly supplies power to the communication interface, a USB interface and a network port of the development board are connected with the communication interface, an ARM chip, a clock circuit, a reset circuit, a communication module and a timer are installed on the development board, and the ARM chip is respectively connected with the clock circuit, the reset circuit, the communication module and the timer.
The model of the ARM chip is S3C2440.
The invention has the beneficial effects that: the collected data are classified according to the attribute value characteristics of the data, the data after clustering processing are used as data sources for data cleaning, the processed data are set to be in a database form, multi-source data cleaning work is completed by adopting an existing data cleaning tool, the accuracy of database data processing results is improved, and the accuracy of the cleaning data is further improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic diagram of a flow chart of a cleaning method in an embodiment of the present invention;
FIG. 2 is a frame diagram of a cleaning apparatus in an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
As shown in fig. 1, a method for standardized cleaning of power data under multi-source data access includes:
S10, performing preliminary clustering processing on the electric power data: the acquired data are read by using a K-means algorithm and classified according to attribute value characteristics of the data. The complexity of this classification work can be expressed as A(n); the acquired data are represented as character strings, and the complexity of the data clustering calculation can be expressed as A(m·I), where m represents the number of data with different attributes and I represents the number of data with the same attribute. To ensure the feasibility of data clustering, the constraint conditions in the preliminary clustering processing are set as:
Figure BDA0002912607040000041
in the formula, S represents a clustering core distance, and the formula is transformed to obtain a clustering constraint condition applicable to multi-source data, which includes:
Figure BDA0002912607040000042
A(m·I)=A(n)
the method adopts a form of calculating the similarity to control the calculation precision of the clustering core distance, and is expressed as follows through a formula:
J(A,B)=|A∩B|/|A∪B|
in the formula, J represents the similarity of the calculated core distance, B is a calculation result, and the error value of the calculation result is controlled within 0.5% through the formula so as to ensure the effectiveness of subsequent calculation;
setting G to represent the number of times of data type occurrence and H to represent the weight of the occurrence of the data type, the frequency R of occurrence of data of this type in the calculation can be represented as:
Figure BDA0002912607040000051
integrating the formula into a data clustering module to finish data clustering processing, wherein the power data comprises annual, monthly and single-day power production data and power utilization data;
S20, multi-source data cleaning: the clustered data are used as the data source for data cleaning, and the processed data are arranged in database form. Two groups of data are set in a database Y: one group is the data set that does not need cleaning, and the other is the data that needs cleaning. The data that does not need cleaning is denoted C, and all data contained in C consist of elements of the database Y. f_c(a) denotes the number of times C appears in the database, and Q_c(a) denotes the similarity between the data to be cleaned and the data not to be cleaned in the database, so that:
Figure BDA0002912607040000052
Let Q_c(v) represent the similarity of the subset v in the database; then:
Figure BDA0002912607040000053
Let v1 ∈ C and v2 ∈ C; then the relationship between data v2 and v1 is expressed as:
Figure BDA0002912607040000061
the data which needs to be split in the database can be processed through the formula, and the multi-source data cleaning work is completed by adopting the existing data cleaning tool.
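The similarity formulas of step S20 are given only as figures, so they are not reproduced here. Purely as a hypothetical stand-in, the sketch below uses the Jaccard similarity from step S10 to partition records into a group that resembles a trusted reference set C and a group that would be handed to a cleaning tool. The function names, the 0.6 threshold and the sample records are all assumptions, not the patented formulas.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections; two empty sets count as identical."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def split_records(records, reference, threshold=0.6):
    """Partition records into (keep, needs_cleaning) by best similarity to the reference set."""
    keep, needs_cleaning = [], []
    for rec in records:
        # Best match of this record against any trusted reference record.
        best = max((jaccard(rec, ref) for ref in reference), default=0.0)
        (keep if best >= threshold else needs_cleaning).append(rec)
    return keep, needs_cleaning

# Hypothetical reference set C and incoming records from database Y
reference = [("substation_a", "2021-01-22", "kwh")]
records = [
    ("substation_a", "2021-01-22", "kwh"),   # identical to a reference record
    ("substation_a", "2021-01-22", "KWH?"),  # 2 of 4 distinct tokens shared
    ("???", "n/a", "kwh"),                   # 1 of 5 distinct tokens shared
]
keep, needs_cleaning = split_records(records, reference)
print(len(keep), len(needs_cleaning))  # 1 2
```

In such a setup, only the records in needs_cleaning would be forwarded to the external data cleaning tool.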
The data cleaning tool is at least one of IDCENTRIC, PUREINTEGRATE, TRILLIUM, DATACLEANSESER and MATCHIT.
As shown in fig. 2, a standardized cleaning device of power data under multi-source data access is used for operating the standardized cleaning method of power data under multi-source data access, and comprises a development board, a direct-current power supply, a voltage stabilizing circuit and a communication interface, wherein the direct-current power supply supplies power to the development board through the voltage stabilizing circuit, the direct-current power supply directly supplies power to the communication interface, a USB interface and a network port of the development board are connected with the communication interface, an ARM chip, a clock circuit, a reset circuit, a communication module and a timer are installed on the development board, and the ARM chip is respectively connected with the clock circuit, the reset circuit, the communication module and the timer.
The model of the ARM chip is S3C2440.
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed.

Claims (7)

1. A standardized cleaning method for power data under multi-source data access is characterized by comprising the following steps:
S10, performing preliminary clustering processing on the electric power data: reading the acquired data by using a K-means algorithm and classifying the acquired data according to attribute value characteristics of the data, thereby completing the clustering of the electric power data;
and S20, electric power multi-source data cleaning, wherein the data after clustering processing is used as a data source for data cleaning, and electric power data standardization cleaning is completed.
2. The standardized cleaning method for power data under multi-source data access of claim 1, wherein in S10, the complexity of the data classification work can be expressed as A(n), the collected data are represented as character strings, and the complexity of the data clustering calculation can be expressed as A(m·I), where m represents the number of data with different attributes and I represents the number of data with the same attribute, and in order to ensure the feasibility of data clustering, the constraint conditions in the preliminary clustering process are set as:
Figure FDA0002912607030000011
in the formula, S represents a clustering core distance, and the formula is transformed to obtain a clustering constraint condition applicable to multi-source data, which includes:
Figure FDA0002912607030000012
A(m·I)=A(n)
the method adopts a form of calculating the similarity to control the calculation precision of the clustering core distance, and is expressed as follows through a formula:
J(A,B)=|A∩B|/|A∪B|
in the formula, J represents the similarity of the calculated core distance, B is a calculation result, and the error value of the calculation result is controlled within 0.5% through the formula so as to ensure the effectiveness of subsequent calculation;
setting G to represent the number of times of data type occurrence and H to represent the weight of the occurrence of the data type, the frequency R of occurrence of data of this type in the calculation can be represented as:
Figure FDA0002912607030000021
the formula is integrated into a data clustering module to complete data clustering processing.
3. The method for standardized cleaning of power data under multi-source data access according to claim 1, wherein the specific step of S20 is: the clustered data are used as the data source for data cleaning, and the processed data are arranged in database form; two groups of data are set in a database Y, one group being the data set that does not need cleaning and the other being the data that needs cleaning; the data that does not need cleaning is denoted C, and all data contained in C consist of elements of the database Y; f_c(a) denotes the number of times C appears in the database, and Q_c(a) denotes the similarity between the data to be cleaned and the data not to be cleaned in the database, so that:
Figure FDA0002912607030000022
Let Q_c(v) represent the similarity of the subset v in the database; then:
Figure FDA0002912607030000023
Let v1 ∈ C and v2 ∈ C; then the relationship between data v2 and v1 is expressed as:
Figure FDA0002912607030000024
the data which needs to be split in the database can be processed through the formula, and the multi-source data cleaning work is completed by adopting the existing data cleaning tool.
4. The method of claim 1, wherein the power data comprises annual, monthly and single-day power production data and electricity usage data.
5. The standardized cleaning method for power data under multi-source data access according to claim 1, characterized in that: the data cleaning tool is at least one of IDCENTRIC, PUREINTEGRATE, TRILLIUM, DATACLEANSESER and MATCHIT.
6. A standardized cleaning device for power data under multi-source data access, used for running the standardized cleaning method for power data under multi-source data access of any one of claims 1 to 5, characterized in that: the device comprises a development board, a direct-current power supply, a voltage stabilizing circuit and a communication interface; the direct-current power supply supplies power to the development board through the voltage stabilizing circuit and directly supplies power to the communication interface; the USB interface and the network port of the development board are connected with the communication interface; an ARM chip, a clock circuit, a reset circuit, a communication module and a timer are installed on the development board, and the ARM chip is connected with the clock circuit, the reset circuit, the communication module and the timer respectively.
7. The standardized cleaning device for power data under multi-source data access of claim 6, characterized in that: the model of the ARM chip is S3C2440.
CN202110094083.5A 2021-01-22 2021-01-22 Data standardized cleaning method and device under multi-source data access Active CN112783883B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110094083.5A CN112783883B (en) 2021-01-22 2021-01-22 Data standardized cleaning method and device under multi-source data access

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110094083.5A CN112783883B (en) 2021-01-22 2021-01-22 Data standardized cleaning method and device under multi-source data access

Publications (2)

Publication Number Publication Date
CN112783883A true CN112783883A (en) 2021-05-11
CN112783883B CN112783883B (en) 2024-09-06

Family

ID=75758820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110094083.5A Active CN112783883B (en) 2021-01-22 2021-01-22 Data standardized cleaning method and device under multi-source data access

Country Status (1)

Country Link
CN (1) CN112783883B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101706791A (en) * 2009-09-17 2010-05-12 成都康赛电子科大信息技术有限责任公司 User preference based data cleaning method
CN103714154A (en) * 2013-12-26 2014-04-09 西安理工大学 Method for determining optimum cluster number
CN104317801A (en) * 2014-09-19 2015-01-28 东北大学 Data cleaning system and method for aiming at big data
CN106021452A (en) * 2016-05-16 2016-10-12 南方电网科学研究院有限责任公司 Electromagnetic environment measurement data cleaning method
CN107679089A (en) * 2017-09-05 2018-02-09 全球能源互联网研究院 A kind of cleaning method for electric power sensing data, device and system
CN109993234A (en) * 2019-04-10 2019-07-09 百度在线网络技术(北京)有限公司 A kind of unmanned training data classification method, device and electronic equipment
WO2019137185A1 (en) * 2018-01-09 2019-07-18 美的集团股份有限公司 Image screening method and apparatus, storage medium and computer device
CN110209658A (en) * 2019-06-04 2019-09-06 北京字节跳动网络技术有限公司 Data cleaning method and device
CN110674120A (en) * 2019-08-09 2020-01-10 国电新能源技术研究院有限公司 Wind power plant data cleaning method and device
CN110928862A (en) * 2019-10-23 2020-03-27 深圳市华讯方舟太赫兹科技有限公司 Data cleaning method, data cleaning apparatus, and computer storage medium
CN111597178A (en) * 2020-05-18 2020-08-28 山东浪潮通软信息科技有限公司 Method, system, equipment and medium for cleaning repeating data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
韩帅 等: "基于改进K-Means 聚类和误差反馈的数据清洗方法", 《电网与清洁能源》, vol. 36, no. 7, pages 9 - 14 *

Also Published As

Publication number Publication date
CN112783883B (en) 2024-09-06

Similar Documents

Publication Publication Date Title
WO2017024691A1 (en) Analogue circuit fault mode classification method
CN104572895B (en) MPP databases and Hadoop company-datas interoperability methods, instrument and implementation method
JP2019520615A (en) Character recognition method, device, server and storage medium of claim document for damages
CN110119948B (en) Power consumer credit evaluation method and system based on time-varying weight dynamic combination
WO2016165378A1 (en) Energy storage power station mass data cleaning method and system
CN108664635B (en) Method, device, equipment and storage medium for acquiring database statistical information
CN105184394A (en) On-line data mining optimized control method based on cyber physical system (CPS) of power distribution network
WO2019223145A1 (en) Electronic device, promotion list recommendation method and system, and computer-readable storage medium
CN110222176A (en) A kind of cleaning method of text data, system and readable storage medium storing program for executing
CN117131449A (en) Data management-oriented anomaly identification method and system with propagation learning capability
CN116821832A (en) Abnormal data identification and correction method for high-voltage industrial and commercial user power load
CN101789000A (en) Method for classifying modes in search engine
CN112783883A (en) Power data standardized cleaning method and device under multi-source data access
CN113408226B (en) Chip power supply network fast-convex current estimation method and system based on deep learning
CN105069574A (en) New method for analyzing business flow behavior similarity
CN108805204B (en) Electric energy quality disturbance analysis device based on deep neural network and use method thereof
WO2021128721A1 (en) Method and device for text classification
CN104090813A (en) Analysis modeling method for CPU (central processing unit) usage of virtual machines in cloud data center
CN115169426B (en) Anomaly detection method and system based on similarity learning fusion model
CN109978677A (en) A kind of power distribution network construction project investment statistics automatically generates calculation method and system
CN116089142A (en) Novel service fault root cause analysis method
CN112463643A (en) Software quality prediction method
CN115051363A (en) Distribution network area user change relation identification method and device and computer storage medium
CN115034511A (en) Enterprise life cycle determining method and device, electronic equipment and medium
CN114996930A (en) Modeling method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant