CN111046977A - Data preprocessing method based on EM algorithm and KNN algorithm - Google Patents

Data preprocessing method based on EM algorithm and KNN algorithm

Info

Publication number
CN111046977A
Authority
CN
China
Prior art keywords
algorithm
data
class
incomplete
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911392045.7A
Other languages
Chinese (zh)
Inventor
唐雪飞
黄永鑫
蒲高飞
胡茂秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Comsys Information Technology Co ltd
Original Assignee
Chengdu Comsys Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Comsys Information Technology Co ltd
Priority to CN201911392045.7A
Publication of CN111046977A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/10: Pre-processing; Data cleansing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147: Distances to closest patterns, e.g. nearest neighbour classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data preprocessing method based on an EM algorithm and a KNN algorithm, which comprises the following steps: S1, dividing the original data set into a complete data subset and an incomplete data subset according to whether attribute values are missing, taking the complete data subset as the training sample of the EM algorithm, and clustering using the EM algorithm; and S2, filling missing values on the clustering result using the KNN algorithm. Because the original data set is first clustered with the EM algorithm and the missing values are then filled with KNN on the obtained clustering result, the method is simple to operate and has high filling accuracy.

Description

Data preprocessing method based on EM algorithm and KNN algorithm
Technical Field
The invention belongs to the technical field of data mining, and particularly relates to a data preprocessing method based on an EM algorithm and a KNN algorithm.
Background
Financial statement analysis processes, analyzes, compares, evaluates, and interprets the data provided in an enterprise's financial statements. If bookkeeping and statement preparation belong to the reflecting function of accounting, then financial statement analysis belongs to its interpreting and evaluating function. The purpose of financial statement analysis is to judge the financial condition of the enterprise and to diagnose shortcomings in its operation and management. Through analysis, one can judge whether the financial condition of an enterprise is good, whether its operation and management are sound, and whether its business prospects are bright; the analysis can also uncover the root problems in the enterprise's operation and management and suggest ways to solve them. The main methods of financial statement analysis are trend analysis and ratio analysis. Trend analysis compares the direction and tendency of increases and decreases in each item between earlier and later periods, based on the financial statements of several consecutive periods, so as to reveal changes and trends in finance and operations.
Data mining requires a large amount of data. In practical applications, data drawn from different source databases contain a large amount of incomplete, noisy, heterogeneous, and erroneous data because the databases differ in their initial definitions or structures, yet most data mining algorithms assume clean and complete data sets. Therefore, data from an actual system cannot be used directly for data analysis; this increases the difficulty of data mining, and unprocessed data can seriously affect the results of knowledge discovery. Data preprocessing is thus critical to data mining. According to statistics, data preprocessing accounts for 60% of the whole data mining process, while the subsequent learning and training accounts for only 10% of the work. The quality of preprocessing directly determines the quality of the data and ultimately the results of subsequent data mining. Effective preprocessing improves overall data quality, saves space and time, and helps obtain good data mining results for decision guidance and value evaluation.
Various data quality problems are encountered in the data mining process, among which the incomplete data problem is particularly prominent. Missing data is common: for example, in the UCI repository commonly used in machine learning, more than 40% of the data sets contain missing data. Existing methods for handling incomplete data can be roughly divided into three types: deletion methods, filling methods, and no-processing methods that retain the original information. The deletion method has very limited applicability: handling incomplete data by deletion discards original information from the data set, easily wastes useful information, and to some extent affects the accuracy and objectivity of the data mining results. It is mainly suitable for data sets in which values are missing completely at random and the proportion of missing data is small. The filling method is a comparatively sound and effective approach: it makes full use of the information in the data so that each estimated fill value is as close as possible to the true value of the original data. In contrast to these two methods, which change the original data set, the no-processing method keeps the data set in its original state. It uses machine learning techniques to weaken the influence of missing data and learns directly from the incomplete data set; such learning methods include Bayesian belief networks, rough set methods, and artificial neural networks.
Clustering is a typical unsupervised learning method. Without the guidance of prior knowledge, it groups example objects into different categories by a static classification method, so that objects in the same category are as similar as possible and objects in different categories differ as much as possible. Applying the KNN algorithm to missing value filling yields K-nearest-neighbor filling, which searches the complete data set for the K complete objects closest to an incomplete object and fills the missing values using the information of these K neighbors. Compared with other missing value filling algorithms, K-nearest-neighbor filling is simple and has high filling accuracy; however, the K value must be set manually, and different training data require different K values, which makes the algorithm cumbersome to apply.
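As an illustration of plain K-nearest-neighbor filling as described above, the following is a minimal sketch, not the method of the invention. It assumes that missing values are encoded as `None`, that all attributes are numeric, that distances are measured only over the attributes observed in the incomplete object, and that the fill value is the unweighted mean of the K neighbors; the name `knn_fill` is hypothetical.

```python
import math

def knn_fill(incomplete, complete_records, k):
    """Fill None entries of `incomplete` with the mean of its k nearest complete records."""
    # Indices of the attributes that are present in the incomplete object.
    observed = [j for j, v in enumerate(incomplete) if v is not None]

    def dist(rec):
        # Euclidean distance restricted to the observed attributes (an assumption).
        return math.sqrt(sum((incomplete[j] - rec[j]) ** 2 for j in observed))

    # The k complete objects closest to the incomplete object.
    neighbors = sorted(complete_records, key=dist)[:k]

    filled = list(incomplete)
    for j, v in enumerate(incomplete):
        if v is None:
            filled[j] = sum(rec[j] for rec in neighbors) / len(neighbors)
    return filled

complete = [[1.0, 2.0, 3.0], [1.1, 2.1, 3.2], [9.0, 9.0, 9.0]]
print(knn_fill([1.0, 2.0, None], complete, k=2))
```

The manual choice of `k` in the call is exactly the drawback noted above: a value that suits one data set may suit another poorly.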
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a data preprocessing method based on an EM algorithm and a KNN algorithm, which is simple to operate and high in filling accuracy, wherein the EM algorithm is used for carrying out clustering analysis on an original data set before KNN is used for filling missing values, and then KNN is used for filling the missing values on the obtained clustering result.
The purpose of the invention is realized by the following technical scheme: the data preprocessing method based on the EM algorithm and the KNN algorithm comprises the following steps:
S1, collecting financial system data, dividing the collected data into a complete data subset and an incomplete data subset according to whether attribute values are missing, taking the complete data subset as the training sample of the EM (expectation-maximization) algorithm, and clustering using the EM algorithm;
S2, filling missing values on the clustering result using the KNN algorithm.
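The division in step S1 can be sketched as follows. This is only an illustrative sketch: it assumes each record is a list and that a missing attribute value is encoded as `None`; the function name `split_by_completeness` is hypothetical.

```python
def split_by_completeness(records):
    """Return (complete, incomplete) subsets of `records`.

    A record is incomplete if any of its attribute values is None.
    """
    complete, incomplete = [], []
    for rec in records:
        if any(value is None for value in rec):
            incomplete.append(rec)
        else:
            complete.append(rec)
    return complete, incomplete

records = [[120.0, 0.35], [None, 0.40], [98.5, None], [110.2, 0.31]]
complete_subset, incomplete_subset = split_by_completeness(records)
print(len(complete_subset), len(incomplete_subset))
```

Only `complete_subset` would then be passed to EM clustering; `incomplete_subset` is held back for the filling in step S2.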
Further, the step S1 includes the following sub-steps:
S11, denoting the complete data subset as $(x_1, x_2, \ldots, x_n)$, where the samples $x_1, x_2, \ldots, x_n$ are mutually independent and the class $z_i$ corresponding to each sample is unknown; the purpose of the clustering algorithm is to determine the class to which each sample belongs such that the joint distribution $p(x_i; z_i)$ of the sample and its class is maximized; the likelihood function of $p(x_i; z_i)$ is:
$$L(\theta) = \prod_{i=1}^{n} \sum_{z_i} p(x_i, z_i; \theta)$$
taking the logarithm of the above formula gives:
$$l(\theta) = \sum_{i=1}^{n} \log \sum_{z_i} p(x_i, z_i; \theta)$$
where $n$ is the number of samples, $\theta$ is the model parameter of the EM algorithm, and $p(x_i, z_i; \theta)$ is the joint distribution of sample $x_i$ and class $z_i$ when the model parameter is $\theta$;
S12, letting the class variable $z_i$ follow a distribution $Q_i$, where the distribution function $Q_i(z_i)$ satisfies:
$$\sum_{z_i} Q_i(z_i) = 1, \quad Q_i(z_i) \ge 0$$
transforming the expression for $l(\theta)$ in step S11 using Jensen's inequality gives:
$$l(\theta) = \sum_{i=1}^{n} \log \sum_{z_i} Q_i(z_i) \frac{p(x_i, z_i; \theta)}{Q_i(z_i)} \ge \sum_{i=1}^{n} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
since
$$\sum_{z_i} Q_i(z_i) \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
is the expectation of
$$\frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
with respect to $Q_i$, the following is derived from Jensen's inequality for the concave function $f = \log$:
$$\log \sum_{z_i} Q_i(z_i) \frac{p(x_i, z_i; \theta)}{Q_i(z_i)} \ge \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
that is, with $X = \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$, the function of the expectation, $f(E[X])$, is greater than or equal to the expectation of the function, $E[f(X)]$;
by Jensen's inequality, equality holds if and only if $X$ is a constant, so that:
$$\frac{p(x_i, z_i; \theta)}{Q_i(z_i)} = C$$
where $C$ is a constant; summing over all values of $z_i$ gives:
$$\sum_{z_i} p(x_i, z_i; \theta) = C \sum_{z_i} Q_i(z_i)$$
since
$$\sum_{z_i} Q_i(z_i) = 1$$
it follows that:
$$\sum_{z_i} p(x_i, z_i; \theta) = C$$
thus, $Q_i(z_i)$ is calculated as:
$$Q_i(z_i) = \frac{p(x_i, z_i; \theta)}{\sum_{z} p(x_i, z; \theta)} = \frac{p(x_i, z_i; \theta)}{p(x_i; \theta)} = p(z_i \mid x_i; \theta)$$
where $p(z_i \mid x_i; \theta)$ is the conditional probability that sample $x_i$ belongs to class $z_i$ when the model parameter is $\theta$;
S13, taking the $Q_i(z_i)$ obtained in step S12 as the distribution of the classes and then maximizing the likelihood function to obtain the final clustering result: given an initial value of $\theta$, the E step and the M step are repeated until convergence:
E step: for each $x_i$, compute $Q_i(z_i) = p(z_i \mid x_i; \theta)$;
M step: compute $\theta$:
$$\theta := \arg\max_{\theta} \sum_{i=1}^{n} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
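The E-step rule, in which each $Q_i(z_i)$ is the joint probability of the sample and the class divided by its sum over all classes, can be illustrated with a short sketch. The text does not name a concrete probability model, so a one-dimensional Gaussian mixture is assumed here purely for illustration; the names `gaussian_pdf` and `e_step` and the parameter layout are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution (assumed component model)."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def e_step(xs, weights, means, variances):
    """Return Q where Q[i][k] = p(z_i = k | x_i; theta)."""
    Q = []
    for x in xs:
        # Joint probability p(x_i, z_i = k; theta) for every class k.
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        # Marginal p(x_i; theta), the sum of the joint over all classes.
        total = sum(joint)
        Q.append([p / total for p in joint])
    return Q

Q = e_step([0.0, 10.0], weights=[0.5, 0.5], means=[0.0, 10.0], variances=[1.0, 1.0])
print(Q)
```

Each row of `Q` sums to 1, which is the normalization condition imposed on $Q_i$ in step S12.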
Further, step S2 includes the following sub-steps:
S21, sorting the records of the incomplete data subset $D_i$ in ascending order of the number of missing attribute values;
S22, calculating the distance from each record $r$ in the sorted incomplete data subset to each cluster center $c$ formed by EM clustering, and sorting the distances in ascending order;
S23, assigning each incomplete record to the class of the cluster center $c$ nearest to it;
S24, calculating the distance $dis$ between the incomplete record and the other training data in its class using the Euclidean distance formula; for the continuous attributes in the incomplete record, missing values are filled using the following formula:
Figure BDA0002345267280000041
where $v_n$ denotes the incomplete record, $\beta_i$ denotes a complete record in the class of cluster center $c$, $P_r$ denotes the continuous attribute of the incomplete record $v_n$ that contains the missing value, $n$ denotes the total number of complete records in the class of cluster center $c$, and $\alpha$ denotes the similarity of the two records, i.e., the calculated distance $dis$;
S25, filling each discrete attribute in the incomplete record with the mode of the values of that attribute among the other complete records in the class to which the record belongs.
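Sub-steps S21 to S23 can be sketched as below. The text does not say how the distance to a cluster center is computed when some attributes of a record are missing, so this sketch assumes the Euclidean distance over the observed attributes only. Missing values are assumed to be encoded as `None`, and the names `count_missing`, `partial_distance`, and `assign_to_clusters` are hypothetical.

```python
import math

def count_missing(rec):
    """Number of missing attribute values in a record."""
    return sum(v is None for v in rec)

def partial_distance(rec, center):
    # Euclidean distance over the observed attributes only (an assumption).
    return math.sqrt(sum((v - c) ** 2 for v, c in zip(rec, center) if v is not None))

def assign_to_clusters(incomplete, centers):
    """Return a list of (record, cluster_index) pairs."""
    # S21: fewest missing attribute values first.
    ordered = sorted(incomplete, key=count_missing)
    assignments = []
    for rec in ordered:
        # S22: distances to every cluster center, in ascending order.
        ranked = sorted((partial_distance(rec, c), idx) for idx, c in enumerate(centers))
        # S23: the record joins the class of the nearest center.
        assignments.append((rec, ranked[0][1]))
    return assignments

centers = [[0.0, 0.0, 1.0], [5.0, 5.0, 5.0]]
incomplete = [[None, None, 1.0], [0.5, None, 1.2], [5.1, None, None]]
for rec, idx in assign_to_clusters(incomplete, centers):
    print(rec, "-> cluster", idx)
```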
The invention has the beneficial effects that: according to the method, before missing value filling is carried out by using KNN, clustering analysis is carried out on an original data set by using an EM algorithm, and then missing value filling is carried out by using KNN on the obtained clustering result, so that the method is simple to operate and high in filling accuracy.
Drawings
Fig. 1 is a flow chart of a data preprocessing method based on the EM algorithm and the KNN algorithm.
Detailed Description
The method provided by the invention belongs to a filling method, and the background technology related to the method is explained as follows:
1. Expectation-Maximization (EM) algorithm
The Expectation-Maximization (EM) algorithm finds maximum likelihood or maximum a posteriori estimates of parameters in a probabilistic model that depends on unobservable latent variables. The algorithm alternates between two steps: the expectation (E) step uses the current estimates to compute the expectation of the likelihood with respect to the latent variables; the maximization (M) step maximizes the expected likelihood found in the E step to compute the parameter values. The parameter estimates found in the M step are used in the next E step, and the two steps alternate until convergence. The most direct application of the EM algorithm is parameter estimation, but if the latent classes are regarded as hidden variables and the samples as observations, the clustering problem becomes a parameter estimation problem; this is the principle of clustering with the EM algorithm. The main flow of the EM algorithm is as follows:
(1) Randomly initialize the model parameter $\theta$ to an initial value $\theta_0$.
(2) Start the EM iteration:
(a) First, compute the conditional expectation $L(\theta, \theta_j)$ of the joint distribution $P(x^{(i)}, z^{(i)}; \theta)$:
$$Q_i(z^{(i)}) = P(z^{(i)} \mid x^{(i)}; \theta_j)$$
$$L(\theta, \theta_j) = \sum_{i=1}^{n} \sum_{z^{(i)}} Q_i(z^{(i)}) \log P(x^{(i)}, z^{(i)}; \theta)$$
(b) Second, maximize $L(\theta, \theta_j)$ to obtain $\theta_{j+1}$:
$$\theta_{j+1} = \arg\max_{\theta} L(\theta, \theta_j)$$
(c) If $\theta_{j+1}$ has converged, the algorithm terminates; otherwise, continue iterating.
(3) Output the model parameter $\theta$.
The EM algorithm is guaranteed to converge to a stationary point but not to the global maximum, so it is a locally optimal algorithm. If the optimization of the objective $L(\theta, \theta_j)$ is a convex problem (that is, the objective is concave in $\theta$), the EM algorithm is guaranteed to converge to the global maximum, as is the case for iterative algorithms such as gradient descent.
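The flow (1) to (3) can be sketched for one concrete model. The text leaves the probability model open, so this sketch assumes a one-dimensional Gaussian mixture, for which the maximization in step (b) has a closed form. The initial values are passed in by the caller rather than drawn randomly so that the run is repeatable, and convergence is judged by the change in the component means. The names `normal_pdf` and `em_gmm_1d`, the tolerance, and the variance floor are all assumptions.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(xs, means, variances, weights, max_iter=200, tol=1e-8):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances)."""
    k = len(means)
    for _ in range(max_iter):
        # (a) E step: Q_i(z) = P(z | x_i; theta_j) for every sample and class.
        resp = []
        for x in xs:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # (b) M step: closed-form maximizer of L(theta, theta_j) for this model.
        nk = [sum(r[c] for r in resp) for c in range(k)]
        new_weights = [n / len(xs) for n in nk]
        new_means = [sum(r[c] * x for r, x in zip(resp, xs)) / nk[c] for c in range(k)]
        new_vars = [
            # Small floor keeps a variance from collapsing to zero.
            max(sum(r[c] * (x - new_means[c]) ** 2 for r, x in zip(resp, xs)) / nk[c], 1e-6)
            for c in range(k)
        ]
        # (c) Stop once the means no longer move.
        shift = max(abs(a - b) for a, b in zip(new_means, means))
        weights, means, variances = new_weights, new_means, new_vars
        if shift < tol:
            break
    # (3) Output the model parameters.
    return weights, means, variances

data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
print(em_gmm_1d(data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]))
```

Because EM only reaches a stationary point, a different starting point may end in a different solution; running the sketch from several initial values and keeping the best likelihood is a common remedy.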
2. K Nearest Neighbor (KNN) algorithm
The basic idea of KNN is as follows: if most of the K samples most similar to a sample to be classified (that is, its K nearest neighbors in the feature space) belong to a certain class, then the sample also belongs to that class. KNN is a supervised learning algorithm. Its general flow is:
(1) calculate the distance between the test data point and each training data point, and sort the training points by increasing distance;
(2) select the K points with the smallest distance to the current test data point;
(3) determine the frequency of each class among these K training data points;
(4) return the class with the highest frequency among the K training data points as the predicted class of the current test data point.
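The four steps of this flow can be sketched directly. The sketch assumes numeric feature vectors, training data given as `(features, label)` pairs, and Python 3.8 or later for `math.dist`; the name `knn_classify` is hypothetical.

```python
import math
from collections import Counter

def knn_classify(query, training, k):
    """Predict the class of `query` from `training`, a list of (features, label)."""
    # (1) Distance to every training point, sorted by increasing distance.
    ranked = sorted(training, key=lambda item: math.dist(query, item[0]))
    # (2) The K closest training points.
    top_k = ranked[:k]
    # (3) Frequency of each class among those K points.
    votes = Counter(label for _, label in top_k)
    # (4) The most frequent class is the prediction.
    return votes.most_common(1)[0][0]

train = [([0.0, 0.0], "a"), ([0.1, 0.0], "a"), ([0.0, 0.2], "a"),
         ([5.0, 5.0], "b"), ([5.1, 5.0], "b")]
print(knn_classify([0.05, 0.05], train, k=3))
```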
It is worth noting that the KNN algorithm uses the distance between objects as a measure of their dissimilarity, which avoids the problem of matching between objects. The distance is generally calculated using the Euclidean or Manhattan distance formula; in the present invention, the Euclidean distance is used to measure the distance between objects, as follows:
$$d(x, y) = \sqrt{\sum_{k=1}^{m} (x_k - y_k)^2}$$
where $x$ and $y$ are two objects and $m$ is the number of attributes.
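The two distance measures named above can be written out as a short sketch, assuming that the two objects are numeric sequences of equal length; the function names `euclidean` and `manhattan` are illustrative only.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared attribute differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute attribute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

print(euclidean([0, 0], [3, 4]))  # 5.0
print(manhattan([0, 0], [3, 4]))  # 7
```

Both measures are sensitive to attribute scale, so attributes with large numeric ranges dominate the result unless the data are normalized first; whether to normalize is left open here.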
The technical scheme of the invention is further explained below with reference to the drawings.
As shown in Fig. 1, the data preprocessing method based on the EM algorithm and the KNN algorithm of the present invention includes the following steps:
S1, collecting financial system data and dividing the collected data into a complete data subset and an incomplete data subset according to whether attribute values are missing: a record with any missing attribute value is incomplete data, otherwise it is complete data; taking the complete data subset as the training sample of the EM algorithm, and clustering using the EM algorithm; this step includes the following sub-steps:
S11, denoting the complete data subset as $(x_1, x_2, \ldots, x_n)$, where the samples $x_1, x_2, \ldots, x_n$ are mutually independent and the class $z_i$ corresponding to each sample is unknown; the purpose of the clustering algorithm is to determine the class to which each sample belongs such that the joint distribution $p(x_i; z_i)$ of the sample and its class is maximized; the likelihood function of $p(x_i; z_i)$ is:
$$L(\theta) = \prod_{i=1}^{n} \sum_{z_i} p(x_i, z_i; \theta)$$
taking the logarithm of the above formula gives:
$$l(\theta) = \sum_{i=1}^{n} \log \sum_{z_i} p(x_i, z_i; \theta)$$
where $n$ is the number of samples, $\theta$ is the model parameter of the EM algorithm, and $p(x_i, z_i; \theta)$ is the joint distribution of sample $x_i$ and class $z_i$ when the model parameter is $\theta$;
S12, letting the class variable $z_i$ follow a distribution $Q_i$, where the distribution function $Q_i(z_i)$ satisfies:
$$\sum_{z_i} Q_i(z_i) = 1, \quad Q_i(z_i) \ge 0$$
transforming the expression for $l(\theta)$ in step S11 using Jensen's inequality gives:
$$l(\theta) = \sum_{i=1}^{n} \log \sum_{z_i} Q_i(z_i) \frac{p(x_i, z_i; \theta)}{Q_i(z_i)} \ge \sum_{i=1}^{n} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
since
$$\sum_{z_i} Q_i(z_i) \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
is the expectation of
$$\frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
with respect to $Q_i$, the following is derived from Jensen's inequality for the concave function $f = \log$:
$$\log \sum_{z_i} Q_i(z_i) \frac{p(x_i, z_i; \theta)}{Q_i(z_i)} \ge \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
that is, with $X = \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$, the function of the expectation, $f(E[X])$, is greater than or equal to the expectation of the function, $E[f(X)]$;
by Jensen's inequality, equality holds if and only if $X$ is a constant, so that:
$$\frac{p(x_i, z_i; \theta)}{Q_i(z_i)} = C$$
where $C$ is a constant; summing over all values of $z_i$ gives:
$$\sum_{z_i} p(x_i, z_i; \theta) = C \sum_{z_i} Q_i(z_i)$$
since
$$\sum_{z_i} Q_i(z_i) = 1$$
it follows that:
$$\sum_{z_i} p(x_i, z_i; \theta) = C$$
thus, $Q_i(z_i)$ is calculated as:
$$Q_i(z_i) = \frac{p(x_i, z_i; \theta)}{\sum_{z} p(x_i, z; \theta)} = \frac{p(x_i, z_i; \theta)}{p(x_i; \theta)} = p(z_i \mid x_i; \theta)$$
where $p(z_i \mid x_i; \theta)$ is the conditional probability that sample $x_i$ belongs to class $z_i$ when the model parameter is $\theta$;
S13, taking the $Q_i(z_i)$ obtained in step S12 as the distribution of the classes and then maximizing the likelihood function to obtain the final clustering result: given an initial value of $\theta$, the E step and the M step are repeated until convergence:
E step: for each $x_i$, compute $Q_i(z_i) = p(z_i \mid x_i; \theta)$;
M step: compute $\theta$:
$$\theta := \arg\max_{\theta} \sum_{i=1}^{n} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
The E step computes, for each data point in the training sample (that is, each record in the complete data subset), the probability that it belongs to each cluster, and uses this probability as a weight. The M step estimates the parameters (mean and variance) of each cluster using the weights computed in the E step: taking the E-step probability as the weight of each data point, the mean and variance of each cluster are computed as in K-means, so as to maximize the overall likelihood of the clusters.
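The weighted estimate described for the M step can be sketched for a single cluster and a single numeric attribute. The function name `weighted_mean_variance` is hypothetical, and the weights are assumed to be the E-step membership probabilities of the data points for that cluster.

```python
def weighted_mean_variance(xs, weights):
    """Mean and variance of one cluster, weighting each point by its membership probability."""
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, xs)) / total
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, xs)) / total
    return mean, var

# Points 1.0 and 2.0 belong almost surely to the cluster; 9.0 almost surely does not.
print(weighted_mean_variance([1.0, 2.0, 9.0], [0.98, 0.95, 0.01]))
```

With weights restricted to 0 or 1 this reduces to the hard-assignment centroid update of K-means, which is the sense in which the computation resembles K-means.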
S2, filling missing values on the clustering result using the KNN algorithm; this step includes the following sub-steps:
S21, sorting the records of the incomplete data subset $D_i$ in ascending order of the number of missing attribute values;
S22, calculating the distance from each record $r$ in the sorted incomplete data subset to each cluster center $c$ formed by EM clustering, and sorting the distances in ascending order;
S23, assigning each incomplete record to the class of the cluster center $c$ nearest to it;
S24, calculating the distance $dis$ between the incomplete record and the other training data in its class using the Euclidean distance formula; for the continuous attributes in the incomplete record, missing values are filled using the following formula:
Figure BDA0002345267280000071
where $v_n$ denotes the incomplete record, $\beta_i$ denotes a complete record in the class of cluster center $c$, $P_r$ denotes the continuous attribute of the incomplete record $v_n$ that contains the missing value, $n$ denotes the total number of complete records in the class of cluster center $c$, and $\alpha$ denotes the similarity of the two records, i.e., the calculated distance $dis$;
S25, filling each discrete attribute in the incomplete record with the mode of the values of that attribute among the other complete records in the class to which the record belongs.
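Sub-steps S24 and S25 can be sketched as follows. The fill formula for continuous attributes is available here only as a figure reference, so the inverse-distance weighted average used below is an assumption and should not be read as the formula of the invention. Missing values are assumed to be encoded as `None`, the distance is taken over the observed continuous attributes only, and the names `fill_record`, `discrete_idx`, and `eps` are hypothetical.

```python
import math
from collections import Counter

def fill_record(record, class_records, discrete_idx, eps=1e-9):
    """Fill None entries of `record` from the complete records of its class."""
    # Attributes usable for the distance: observed in `record` and continuous.
    usable = [j for j, v in enumerate(record)
              if v is not None and j not in discrete_idx]

    def dis(other):
        # S24: Euclidean distance between the incomplete and a complete record.
        return math.sqrt(sum((record[j] - other[j]) ** 2 for j in usable))

    # Assumed weighting: closer complete records receive larger weights.
    weights = [1.0 / (dis(r) + eps) for r in class_records]

    filled = list(record)
    for j, v in enumerate(record):
        if v is not None:
            continue
        if j in discrete_idx:
            # S25: mode of the attribute among the complete records of the class.
            filled[j] = Counter(r[j] for r in class_records).most_common(1)[0][0]
        else:
            # Continuous attribute: weighted average (assumption, see above).
            filled[j] = sum(w * r[j] for w, r in zip(weights, class_records)) / sum(weights)
    return filled

cluster = [[1.0, 10.0, "a"], [1.0, 20.0, "a"], [3.0, 40.0, "b"]]
print(fill_record([1.0, None, None], cluster, discrete_idx={2}))
```

Because the candidate neighbors are restricted to the record's own cluster, no value of K has to be chosen in this sketch; every complete record of the class contributes according to its weight.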
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims (3)

1. The data preprocessing method based on the EM algorithm and the KNN algorithm is characterized by comprising the following steps of:
S1, collecting financial system data, dividing the collected data into a complete data subset and an incomplete data subset according to whether attribute values are missing, taking the complete data subset as the training sample of the EM (expectation-maximization) algorithm, and clustering using the EM algorithm;
S2, filling missing values on the clustering result using the KNN algorithm.
2. The method for preprocessing data based on EM algorithm and KNN algorithm as claimed in claim 1, wherein said step S1 comprises the following sub-steps:
S11, denoting the complete data subset as $(x_1, x_2, \ldots, x_n)$, where the samples $x_1, x_2, \ldots, x_n$ are mutually independent and the class $z_i$ corresponding to each sample is unknown; the purpose of the clustering algorithm is to determine the class to which each sample belongs such that the joint distribution $p(x_i; z_i)$ of the sample and its class is maximized; the likelihood function of $p(x_i; z_i)$ is:
$$L(\theta) = \prod_{i=1}^{n} \sum_{z_i} p(x_i, z_i; \theta)$$
taking the logarithm of the above formula gives:
$$l(\theta) = \sum_{i=1}^{n} \log \sum_{z_i} p(x_i, z_i; \theta)$$
where $n$ is the number of samples, $\theta$ is the model parameter of the EM algorithm, and $p(x_i, z_i; \theta)$ is the joint distribution of sample $x_i$ and class $z_i$ when the model parameter is $\theta$;
S12, letting the class variable $z_i$ follow a distribution $Q_i$, where the distribution function $Q_i(z_i)$ satisfies:
$$\sum_{z_i} Q_i(z_i) = 1, \quad Q_i(z_i) \ge 0$$
transforming the expression for $l(\theta)$ in step S11 using Jensen's inequality gives:
$$l(\theta) = \sum_{i=1}^{n} \log \sum_{z_i} Q_i(z_i) \frac{p(x_i, z_i; \theta)}{Q_i(z_i)} \ge \sum_{i=1}^{n} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
since
$$\sum_{z_i} Q_i(z_i) \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
is the expectation of
$$\frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
with respect to $Q_i$, the following is derived from Jensen's inequality for the concave function $f = \log$:
$$\log \sum_{z_i} Q_i(z_i) \frac{p(x_i, z_i; \theta)}{Q_i(z_i)} \ge \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
that is, with $X = \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$, the function of the expectation, $f(E[X])$, is greater than or equal to the expectation of the function, $E[f(X)]$;
by Jensen's inequality, equality holds if and only if $X$ is a constant, so that:
$$\frac{p(x_i, z_i; \theta)}{Q_i(z_i)} = C$$
where $C$ is a constant; summing over all values of $z_i$ gives:
$$\sum_{z_i} p(x_i, z_i; \theta) = C \sum_{z_i} Q_i(z_i)$$
since
$$\sum_{z_i} Q_i(z_i) = 1$$
it follows that:
$$\sum_{z_i} p(x_i, z_i; \theta) = C$$
thus, $Q_i(z_i)$ is calculated as:
$$Q_i(z_i) = \frac{p(x_i, z_i; \theta)}{\sum_{z} p(x_i, z; \theta)} = \frac{p(x_i, z_i; \theta)}{p(x_i; \theta)} = p(z_i \mid x_i; \theta)$$
where $p(z_i \mid x_i; \theta)$ is the conditional probability that sample $x_i$ belongs to class $z_i$ when the model parameter is $\theta$;
S13, taking the $Q_i(z_i)$ obtained in step S12 as the distribution of the classes and then maximizing the likelihood function to obtain the final clustering result: given an initial value of $\theta$, the E step and the M step are repeated until convergence:
E step: for each $x_i$, compute $Q_i(z_i) = p(z_i \mid x_i; \theta)$;
M step: compute $\theta$:
$$\theta := \arg\max_{\theta} \sum_{i=1}^{n} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}$$
3. The method for preprocessing data based on the EM algorithm and the KNN algorithm as claimed in claim 1, wherein said step S2 comprises the following sub-steps:
S21, sorting the records of the incomplete data subset $D_i$ in ascending order of the number of missing attribute values;
S22, calculating the distance from each record $r$ in the sorted incomplete data subset to each cluster center $c$ formed by EM clustering, and sorting the distances in ascending order;
S23, assigning each incomplete record to the class of the cluster center $c$ nearest to it;
S24, calculating the distance $dis$ between the incomplete record and the other training data in its class using the Euclidean distance formula; for the continuous attributes in the incomplete record, missing values are filled using the following formula:
Figure FDA0002345267270000026
where $v_n$ denotes the incomplete record, $\beta_i$ denotes a complete record in the class of cluster center $c$, $P_r$ denotes the continuous attribute of the incomplete record $v_n$ that contains the missing value, $n$ denotes the total number of complete records in the class of cluster center $c$, and $\alpha$ denotes the similarity of the two records, i.e., the calculated distance $dis$;
S25, filling each discrete attribute in the incomplete record with the mode of the values of that attribute among the other complete records in the class to which the record belongs.
CN201911392045.7A 2019-12-30 2019-12-30 Data preprocessing method based on EM algorithm and KNN algorithm Pending CN111046977A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911392045.7A CN111046977A (en) 2019-12-30 2019-12-30 Data preprocessing method based on EM algorithm and KNN algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911392045.7A CN111046977A (en) 2019-12-30 2019-12-30 Data preprocessing method based on EM algorithm and KNN algorithm

Publications (1)

Publication Number Publication Date
CN111046977A true CN111046977A (en) 2020-04-21

Family

ID=70241570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911392045.7A Pending CN111046977A (en) 2019-12-30 2019-12-30 Data preprocessing method based on EM algorithm and KNN algorithm

Country Status (1)

Country Link
CN (1) CN111046977A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111984626A (en) * 2020-08-25 2020-11-24 西安建筑科技大学 Statistical mode-based energy consumption data identification and restoration method
CN113065574A (en) * 2021-02-24 2021-07-02 同济大学 Data preprocessing method and device for semiconductor manufacturing system
CN113139570A (en) * 2021-03-05 2021-07-20 河海大学 Dam safety monitoring data completion method based on optimal hybrid valuation
CN113435536A (en) * 2021-07-15 2021-09-24 广东电网有限责任公司 Electricity charge data preprocessing method, device, terminal equipment and medium
CN114004266A (en) * 2020-07-27 2022-02-01 中国电信股份有限公司 Non-equilibrium industrial data classification method and device and computer readable storage medium
CN114168578A (en) * 2021-12-09 2022-03-11 国网江苏省电力有限公司 Daily load data missing value interpolation method based on clustering and neighbor algorithm
CN116739345A (en) * 2023-06-08 2023-09-12 南京工业大学 Real-time evaluation method for possibility of dangerous chemical road transportation accident

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103177088A (en) * 2013-03-08 2013-06-26 北京理工大学 Biomedicine missing data compensation method
US20150261846A1 (en) * 2014-03-11 2015-09-17 Sas Institute Inc. Computerized cluster analysis framework for decorrelated cluster identification in datasets
CN108363810A (en) * 2018-03-09 2018-08-03 南京工业大学 Text classification method and device
CN108932301A (en) * 2018-06-11 2018-12-04 天津科技大学 Data filling method and device
CN109446185A (en) * 2018-08-29 2019-03-08 广西大学 Collaborative filtering missing data processing method based on user's cluster
CN110275895A (en) * 2019-06-25 2019-09-24 广东工业大学 It is a kind of to lack the filling equipment of traffic data, device and method
CN114741457A (en) * 2022-04-14 2022-07-12 郑州轻工业大学 Data missing value filling method based on function dependence and clustering

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
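MICROSTRONG0305: "Detailed Explanation of the EM Algorithm" (EM算法详解), CSDN Blog *
Cao Jianjun et al.: "Introduction to Data Quality" (数据质量导论), 31 October 2017, Beijing: National Defense Industry Press *
Fan Donghui et al.: "Improvement of the KNN Algorithm Based on Clustering" (基于聚类的KNN算法改进), Computer Knowledge and Technology (电脑知识与技术) *
Zhao Xing et al.: "Filling Algorithm Based on Distance Maximization and Missing-Data Clustering" (基于距离最大化和缺失数据聚类的填充算法), Electronic Design Engineering (电子设计工程) *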

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114004266A (en) * 2020-07-27 2022-02-01 中国电信股份有限公司 Non-equilibrium industrial data classification method and device and computer readable storage medium
CN111984626A (en) * 2020-08-25 2020-11-24 西安建筑科技大学 Statistical mode-based energy consumption data identification and restoration method
CN113065574A (en) * 2021-02-24 2021-07-02 同济大学 Data preprocessing method and device for semiconductor manufacturing system
CN113139570A (en) * 2021-03-05 2021-07-20 河海大学 Dam safety monitoring data completion method based on optimal hybrid valuation
CN113435536A (en) * 2021-07-15 2021-09-24 广东电网有限责任公司 Electricity charge data preprocessing method, device, terminal equipment and medium
CN114168578A (en) * 2021-12-09 2022-03-11 国网江苏省电力有限公司 Daily load data missing value interpolation method based on clustering and neighbor algorithm
CN116739345A (en) * 2023-06-08 2023-09-12 南京工业大学 Real-time evaluation method for possibility of dangerous chemical road transportation accident
CN116739345B (en) * 2023-06-08 2024-03-22 南京工业大学 Real-time evaluation method for possibility of dangerous chemical road transportation accident

Similar Documents

Publication Publication Date Title
CN111046977A (en) Data preprocessing method based on EM algorithm and KNN algorithm
US10970650B1 (en) AUC-maximized high-accuracy classifier for imbalanced datasets
CN111324642A (en) Model algorithm type selection and evaluation method for power grid big data analysis
Antunes et al. Knee/elbow estimation based on first derivative threshold
CN111833175A (en) Internet financial platform application fraud behavior detection method based on KNN algorithm
Jabbar Local and global outlier detection algorithms in unsupervised approach: a review
CN112288465A (en) Client segmentation method based on semi-supervised clustering ensemble learning
CN112634022B (en) Credit risk assessment method and system based on unbalanced data processing
Barandela et al. Restricted decontamination for the imbalanced training sample problem
Sugiharto et al. Mall Customer Clustering Using Gaussian Mixture Model, K-Means, and BIRCH Algorithm
CN113052268A (en) Attribute reduction algorithm based on uncertainty measurement under interval set data type
Koch et al. Exploring the open world using incremental extreme value machines
Zhao et al. Outlier detection for partially labeled categorical data based on conditional information entropy
CN114529004A (en) Quantum clustering method based on nearest neighbor KNN and improved wave function
CN113792141A (en) Feature selection method based on covariance measurement factor
CN112288571A (en) Personal credit risk assessment method based on rapid construction of neighborhood coverage
CN117539920B (en) Data query method and system based on real estate transaction multidimensional data
Lemhadri et al. RbX: Region-based explanations of prediction models
Lumauag A Modified Memory-Based Collaborative Filtering Algorithm based on a New User Similarity Measure
Wang Automated shmoo data analysis: A machine learning approach
Takahashi et al. Change detection method using cluster transition probability
Li et al. Fuzzy clustering with automated model selection: entropy penalty approach
Ngo et al. Active Level Set Estimation for Continuous Search Space with Theoretical Guarantee
CN117708622B (en) Abnormal index analysis method and system of operation and maintenance system and electronic device
Volodymyr et al. Classification of Images of Visual Objects Based on Statistical Relevance Measures of Their Structural Descriptions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200421