CN111046977A - Data preprocessing method based on EM algorithm and KNN algorithm - Google Patents
- Publication number
- CN111046977A CN111046977A CN201911392045.7A CN201911392045A CN111046977A CN 111046977 A CN111046977 A CN 111046977A CN 201911392045 A CN201911392045 A CN 201911392045A CN 111046977 A CN111046977 A CN 111046977A
- Authority
- CN
- China
- Prior art keywords
- algorithm
- data
- class
- incomplete
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/10—Pre-processing; Data cleansing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24147—Distances to closest patterns, e.g. nearest neighbour classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data preprocessing method based on an EM algorithm and a KNN algorithm, which comprises the following steps: s1, dividing the original data set into a complete data subset and an incomplete data subset according to whether the attribute values are missing or not, taking the complete data subset as a training sample of the EM algorithm, and clustering by using the EM algorithm; and S2, filling missing values on the clustering result by using a KNN algorithm. According to the method, before missing value filling is carried out by using KNN, clustering analysis is carried out on an original data set by using an EM algorithm, and then missing value filling is carried out by using KNN on the obtained clustering result, so that the method is simple to operate and high in filling accuracy.
Description
Technical Field
The invention belongs to the technical field of data mining, and particularly relates to a data preprocessing method based on an EM algorithm and a KNN algorithm.
Background
Financial statement analysis processes, analyzes, compares, evaluates, and interprets the data provided in an enterprise's financial statements. If bookkeeping and statement preparation belong to the reflective function of accounting, then financial statement analysis serves its interpretive and evaluative functions. The purpose of financial statement analysis is to judge the financial condition of an enterprise and to diagnose problems in its operation and management. Through analysis, one can judge whether the enterprise's financial condition is good, whether its operation and management are sound, and whether its business prospects are bright; at the same time, analysis can uncover problems in operation and management and suggest ways to solve them. The main methods of financial statement analysis are trend analysis and ratio analysis. Trend analysis compares the direction and magnitude of increases and decreases in each item across the financial statements of several successive periods, so as to reveal changes and trends in finance and operations.
Data mining requires large amounts of data. In practical applications, because source databases differ in their initial definitions and structures, data drawn from different databases contain large amounts of incomplete data, noisy data, heterogeneous data, erroneous data, and the like; most data mining algorithms, however, assume clean and complete data sets. Data in a real system therefore cannot be applied directly to data analysis; this increases the difficulty of data mining, and unprocessed data can seriously distort the results of knowledge discovery. It follows that data preprocessing is critical to data mining. Statistically, data preprocessing accounts for about 60% of the whole data mining process, while the subsequent learning and training account for only about 10% of the work. The quality of data preprocessing directly influences the quality of the data and ultimately determines the result of the subsequent data mining. Effective data preprocessing improves the quality of the data as a whole: it saves space and time costs, and it helps produce good data mining results for decision guidance and value evaluation.
Various data quality problems arise in the data mining process, among which data incompleteness is particularly prominent. Missing data is common; for example, in the UCI repository widely used in machine learning, data sets containing missing data account for more than 40%. Existing approaches to data incompleteness can be roughly divided into three types: deletion methods, filling (imputation) methods, and no-processing methods that retain the original information. The applicability of deletion methods is very limited: deleting records to deal with incomplete data loses the original information of the data set, easily wastes useful information, and the discarded information affects the accuracy and objectivity of the data mining result to a certain extent. Deletion methods are mainly suitable for data sets in which data are missing completely at random and the proportion of missing data is small. Filling methods are comparatively scientific and effective: they make full use of the information in the data so that the estimated filling value is as close as possible to the true value. In contrast to the former two approaches, which alter the original data set, no-processing methods keep the data set in its original state; they use machine learning techniques to weaken the influence of missing data and learn directly from the incomplete data set, with methods such as Bayesian belief networks, rough sets, and artificial neural networks.
Clustering is a typical unsupervised learning method. Without the guidance of prior knowledge, it groups similar objects into categories so that objects within the same category are as similar as possible while the differences between categories are as large as possible. When the KNN algorithm is applied to missing-value filling, the result is K-nearest-neighbor filling: for each incomplete object, the K complete objects closest to it are found in the complete data set, and the information of these K neighbors is used to fill the missing values. Compared with other missing-value filling algorithms, K-nearest-neighbor filling is simple to operate and has high filling accuracy, but the value of K must be set manually, and different training data require different values of K, which makes the algorithm inconvenient in practice.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a data preprocessing method based on the EM algorithm and the KNN algorithm that is simple to operate and has high filling accuracy: before missing values are filled with KNN, the EM algorithm performs cluster analysis on the original data set, and KNN then fills the missing values on the resulting clusters.
The purpose of the invention is realized by the following technical scheme: the data preprocessing method based on the EM algorithm and the KNN algorithm comprises the following steps:
S1, collecting financial system data, dividing the collected data into a complete data subset and an incomplete data subset according to whether attribute values are missing, taking the complete data subset as the training sample of the EM (Expectation-Maximization) algorithm, and clustering with the EM algorithm;
S2, filling missing values on the clustering result using the KNN algorithm.
Further, the step S1 includes the following sub-steps:
S11, recording the complete data subset as (x_1, x_2, ..., x_n), where the samples x_1, x_2, ..., x_n are independent of each other and the class z_i to which each sample corresponds is unknown; the purpose of the clustering algorithm is to determine the class of each sample such that the joint distribution p(x_i; z_i) of sample and class is maximized, the likelihood function of p(x_i; z_i) being:

L(\theta) = \prod_{i=1}^{n} \sum_{z_i} p(x_i, z_i; \theta)

taking the logarithm of the above formula to obtain:

l(\theta) = \sum_{i=1}^{n} \log \sum_{z_i} p(x_i, z_i; \theta)

wherein n is the number of samples, θ is the model parameter of the EM algorithm, and p(x_i, z_i; θ) is the joint distribution of sample x_i and class z_i under model parameter θ;
S12, defining the class variable z_i to obey some distribution Q_i, where the distribution function Q_i(z_i) satisfies:

\sum_{z_i} Q_i(z_i) = 1, \quad Q_i(z_i) \ge 0

Transforming the expression for l(θ) in step S11 with Jensen's inequality gives:

l(\theta) = \sum_{i=1}^{n} \log \sum_{z_i} Q_i(z_i) \frac{p(x_i, z_i; \theta)}{Q_i(z_i)} \ge \sum_{i=1}^{n} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}

that is, because log is concave, the function of the expectation f(E[X]) is greater than or equal to the expectation of the function E[f(X)]. By Jensen's inequality, the inequality becomes an equality if and only if X is a constant, i.e.:

\frac{p(x_i, z_i; \theta)}{Q_i(z_i)} = C

where C is a constant; summing over the different values of z_i gives:

\sum_{z_i} p(x_i, z_i; \theta) = C \sum_{z_i} Q_i(z_i) = C

thus, the calculation formula of Q_i(z_i) is:

Q_i(z_i) = \frac{p(x_i, z_i; \theta)}{\sum_{z_i} p(x_i, z_i; \theta)} = \frac{p(x_i, z_i; \theta)}{p(x_i; \theta)} = p(z_i \mid x_i; \theta)

where p(z_i | x_i; θ) is the conditional probability that sample x_i belongs to class z_i under model parameter θ;
S13, taking Q_i(z_i) obtained in step S12 as the distribution of the classes, then maximizing the likelihood function to obtain the final clustering result: given an initial value of θ, repeat steps E and M until convergence:

Step E: for each x_i, calculate Q_i(z_i) = p(z_i | x_i; θ);

Step M: update the model parameter by maximizing the lower bound:

\theta := \arg\max_{\theta} \sum_{i=1}^{n} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}
Further, the step S2 includes the following sub-steps:
S21, sorting the records of the incomplete data subset D_i from small to large according to the number of missing attribute values;
S22, calculating the distance from each record r in the sorted incomplete data subset to each cluster center c formed by EM clustering, and sorting the distances from small to large;
S23, assigning each incomplete record to the class of the cluster center c nearest to it;
S24, calculating the distance dis between the incomplete record and the other training data in its class using the Euclidean distance formula; for the continuous attributes of the incomplete record, filling the missing values using the following formula:

v_n.P_r = \frac{\sum_{i=1}^{n} \beta_i.P_r / \alpha_i}{\sum_{i=1}^{n} 1 / \alpha_i}

wherein v_n is the incomplete record, β_i is a complete record of the class in which cluster center c is located, P_r is a continuous attribute of the incomplete record v_n that contains a missing value, n is the total number of complete records of the class in which cluster center c is located, and α_i is the similarity of the two records, i.e. the calculated distance dis;
S25, filling each discrete attribute of the incomplete record with the mode of the corresponding attribute among the other complete records of the class to which the record belongs.
The invention has the beneficial effects that: according to the method, before missing value filling is carried out by using KNN, clustering analysis is carried out on an original data set by using an EM algorithm, and then missing value filling is carried out by using KNN on the obtained clustering result, so that the method is simple to operate and high in filling accuracy.
Drawings
Fig. 1 is a flow chart of a data preprocessing method based on the EM algorithm and the KNN algorithm.
Detailed Description
The method provided by the invention belongs to a filling method, and the background technology related to the method is explained as follows:
1. Expectation-Maximization (EM) algorithm
The Expectation-Maximization (EM) algorithm finds maximum likelihood or maximum a posteriori estimates of parameters in a probabilistic model that depends on unobservable latent variables. The algorithm alternates between two steps. The first step, expectation (E), uses the current estimates of the parameters to compute the expected values of the latent variables, i.e. their posterior distribution given the data. The second step, maximization (M), maximizes the expected log-likelihood found in the E step to compute new parameter values. The parameter estimates found in the M step are then used in the next E step, and the two steps alternate until convergence. The most direct application of the EM algorithm is parameter estimation, but if we regard the latent classes as hidden variables and the samples as observations, the clustering problem can be converted into a parameter estimation problem; this is the principle of clustering with the EM algorithm. The main flow of the EM algorithm is as follows:
(1) Randomly initialize the model parameter θ with an initial value θ_0;
(2) Start iteration of the EM algorithm:
(a) First, compute L(θ, θ_j), the expectation of the log of the known joint distribution P(x^(i), z^(i); θ) under the conditional probability of the hidden variables:

Q_i(z^{(i)}) = P(z^{(i)} \mid x^{(i)}; \theta_j)

L(\theta, \theta_j) = \sum_{i=1}^{n} \sum_{z^{(i)}} Q_i(z^{(i)}) \log P(x^{(i)}, z^{(i)}; \theta)

(b) Second, maximize L(θ, θ_j) to obtain θ_{j+1}:

\theta_{j+1} = \arg\max_{\theta} L(\theta, \theta_j)
(c) If θ_{j+1} has converged, the algorithm ends; otherwise, continue iterating.
(3) Output the model parameter θ.
The EM algorithm is guaranteed to converge to a stationary point, but not to the global maximum, so it is a locally optimal algorithm. Of course, if the optimized objective L(θ, θ_j) is convex, the EM algorithm is guaranteed to converge to the global maximum; the same holds for iterative algorithms such as gradient descent.
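As an illustrative aside (not part of the patent's claimed method), the E and M steps described above can be sketched in Python for a one-dimensional, two-component Gaussian mixture; the function name and initialization scheme are assumptions of this sketch:

```python
import numpy as np

def em_gmm_1d(x, n_iter=200, tol=1e-8):
    """Minimal EM for a 1-D two-component Gaussian mixture (illustrative sketch)."""
    # Initialize theta = (weights w, means mu, variances var).
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()]) + 1e-6
    ll_old = -np.inf
    for _ in range(n_iter):
        # E step: responsibilities Q_i(z_i) = p(z_i | x_i; theta).
        pdf = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        joint = w * pdf                                  # p(x_i, z_i; theta)
        resp = joint / joint.sum(axis=1, keepdims=True)  # posterior over classes
        ll = np.log(joint.sum(axis=1)).sum()             # current log-likelihood
        # M step: re-estimate theta from the responsibility-weighted data.
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        if abs(ll - ll_old) < tol:                       # convergence check
            break
        ll_old = ll
    return w, mu, var, resp

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
w, mu, var, resp = em_gmm_1d(x)
```

Each sample's final responsibility vector `resp[i]` gives its probability of membership in each cluster, corresponding to Q_i(z_i) in the derivation above.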
2. K Nearest Neighbor (KNN) algorithm
The basic idea of the KNN algorithm is: if the majority of the K samples most similar to a sample to be classified (i.e., its K nearest neighbors in the feature space) belong to a certain class, then the sample also belongs to that class. KNN is a supervised learning algorithm. The general flow of the KNN algorithm is:
(1) calculate the distance between the test data point and each training data point, and sort in increasing order of distance;
(2) select the K points with the smallest distance to the current test data point;
(3) count the frequency of the classes among these K training data points;
(4) return the most frequent class among the K training data points as the predicted class of the current test data point.
It is worth noting that the KNN algorithm uses the distance between objects as a dissimilarity measure, which avoids the problem of matching between objects. The distance is generally calculated with the Euclidean or Manhattan distance formula; in the invention, the Euclidean distance is used to measure the distance between objects, with the distance formula:

d(x, y) = \sqrt{\sum_{k=1}^{m} (x_k - y_k)^2}

where m is the number of attributes.
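A minimal sketch of the KNN flow (1)-(4) with Euclidean distance might look like the following; the function name and toy data are assumptions:

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    # (1) distance from the query to every training point
    dist = np.sqrt(((train_x - query) ** 2).sum(axis=1))
    # (2) indices of the k smallest distances
    nearest = np.argsort(dist)[:k]
    # (3)-(4) return the most frequent class among those k neighbours
    return Counter(train_y[nearest]).most_common(1)[0][0]

train_x = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
train_y = np.array(["a", "a", "b", "b"])
label = knn_predict(train_x, train_y, np.array([0.2, 0.1]), k=3)
```

With k=3, two of the three nearest neighbours of the query belong to class "a", so the vote returns "a".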
the technical scheme of the invention is further explained by combining the attached drawings.
As shown in fig. 1, the data preprocessing method based on EM algorithm and KNN algorithm of the present invention includes the following steps:
S1, collecting financial system data and dividing the collected data into a complete data subset and an incomplete data subset according to whether attribute values are missing (a record with a missing attribute value is incomplete data; otherwise it is complete data), taking the complete data subset as the training sample of the EM algorithm, and clustering with the EM algorithm; this comprises the following sub-steps:
S11, recording the complete data subset as (x_1, x_2, ..., x_n), where the samples x_1, x_2, ..., x_n are independent of each other and the class z_i to which each sample corresponds is unknown; the purpose of the clustering algorithm is to determine the class of each sample such that the joint distribution p(x_i; z_i) of sample and class is maximized, the likelihood function of p(x_i; z_i) being:

L(\theta) = \prod_{i=1}^{n} \sum_{z_i} p(x_i, z_i; \theta)

taking the logarithm of the above formula to obtain:

l(\theta) = \sum_{i=1}^{n} \log \sum_{z_i} p(x_i, z_i; \theta)

wherein n is the number of samples, θ is the model parameter of the EM algorithm, and p(x_i, z_i; θ) is the joint distribution of sample x_i and class z_i under model parameter θ;
S12, defining the class variable z_i to obey some distribution Q_i, where the distribution function Q_i(z_i) satisfies:

\sum_{z_i} Q_i(z_i) = 1, \quad Q_i(z_i) \ge 0

Transforming the expression for l(θ) in step S11 with Jensen's inequality gives:

l(\theta) = \sum_{i=1}^{n} \log \sum_{z_i} Q_i(z_i) \frac{p(x_i, z_i; \theta)}{Q_i(z_i)} \ge \sum_{i=1}^{n} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}

that is, because log is concave, the function of the expectation f(E[X]) is greater than or equal to the expectation of the function E[f(X)]. By Jensen's inequality, the inequality becomes an equality if and only if X is a constant, i.e.:

\frac{p(x_i, z_i; \theta)}{Q_i(z_i)} = C

where C is a constant; summing over the different values of z_i gives:

\sum_{z_i} p(x_i, z_i; \theta) = C \sum_{z_i} Q_i(z_i) = C

thus, the calculation formula of Q_i(z_i) is:

Q_i(z_i) = \frac{p(x_i, z_i; \theta)}{\sum_{z_i} p(x_i, z_i; \theta)} = \frac{p(x_i, z_i; \theta)}{p(x_i; \theta)} = p(z_i \mid x_i; \theta)

where p(z_i | x_i; θ) is the conditional probability that sample x_i belongs to class z_i under model parameter θ;
S13, taking Q_i(z_i) obtained in step S12 as the distribution of the classes, then maximizing the likelihood function to obtain the final clustering result: given an initial value of θ, repeat steps E and M until convergence:

Step E: for each x_i, calculate Q_i(z_i) = p(z_i | x_i; θ);

Step M: update the model parameter by maximizing the lower bound:

\theta := \arg\max_{\theta} \sum_{i=1}^{n} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}
In step E, for each data point in the training sample (i.e., each record in the complete data subset), the probability that it belongs to each cluster is calculated and used as a weight. In step M, the parameters of each cluster (mean, variance) are re-estimated using the weights calculated in the previous step: taking the step-E probabilities as the weight of each data point, the mean and variance of each cluster are computed as in K-means, so as to maximize the overall likelihood of the clustering.
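The division of the collected data in step S1 into a complete subset (the EM training sample) and an incomplete subset (to be filled later by KNN) might look like the following sketch; the column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical financial-system records; NaN marks a missing attribute value.
df = pd.DataFrame({
    "revenue": [100.0, 250.0, np.nan, 80.0],
    "profit":  [10.0,  np.nan, 5.0,   8.0],
    "rating":  ["A",   "B",    "B",   np.nan],
})

complete = df.dropna()                   # complete data subset: EM training sample
incomplete = df[df.isna().any(axis=1)]   # incomplete data subset: filled later by KNN
```

Only the first record has no missing attribute, so it alone forms the complete subset here.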
S2, filling missing values on the clustering result using the KNN algorithm; this comprises the following sub-steps:
S21, sorting the records of the incomplete data subset D_i from small to large according to the number of missing attribute values;
S22, calculating the distance from each record r in the sorted incomplete data subset to each cluster center c formed by EM clustering, and sorting the distances from small to large;
S23, assigning each incomplete record to the class of the cluster center c nearest to it;
S24, calculating the distance dis between the incomplete record and the other training data in its class using the Euclidean distance formula; for the continuous attributes of the incomplete record, filling the missing values using the following formula:

v_n.P_r = \frac{\sum_{i=1}^{n} \beta_i.P_r / \alpha_i}{\sum_{i=1}^{n} 1 / \alpha_i}

wherein v_n is the incomplete record, β_i is a complete record of the class in which cluster center c is located, P_r is a continuous attribute of the incomplete record v_n that contains a missing value, n is the total number of complete records of the class in which cluster center c is located, and α_i is the similarity of the two records, i.e. the calculated distance dis;
S25, filling each discrete attribute of the incomplete record with the mode of the corresponding attribute among the other complete records of the class to which the record belongs.
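Step S24's similarity-weighted filling of a continuous attribute can be sketched as follows. The inverse-distance weighting is one plausible reading of the filling formula (closer, more similar records contribute more), not necessarily the patent's exact weighting; the function name and toy records are assumptions:

```python
import numpy as np

def fill_record(incomplete_rec, complete_recs, eps=1e-9):
    """Fill NaN continuous attributes of one record from the complete records
    of its class, weighting each complete record by 1/distance (sketch)."""
    rec = incomplete_rec.copy()
    observed = ~np.isnan(rec)
    # Euclidean distance dis computed over the attributes that are present
    dis = np.sqrt(((complete_recs[:, observed] - rec[observed]) ** 2).sum(axis=1))
    weights = 1.0 / (dis + eps)          # closer (more similar) records weigh more
    for j in np.where(~observed)[0]:     # each attribute with a missing value
        rec[j] = (weights * complete_recs[:, j]).sum() / weights.sum()
    # (Per S25, a discrete attribute would instead take the class mode.)
    return rec

complete_recs = np.array([[1.0, 10.0], [1.2, 12.0], [9.0, 90.0]])
filled = fill_record(np.array([1.1, np.nan]), complete_recs)
```

Here the two nearby records dominate the weighted mean, so the filled value lands close to their attribute values rather than the distant outlier's.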
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.
Claims (3)
1. A data preprocessing method based on the EM algorithm and the KNN algorithm, characterized by comprising the following steps:
S1, collecting financial system data, dividing the collected data into a complete data subset and an incomplete data subset according to whether attribute values are missing, taking the complete data subset as the training sample of the EM (Expectation-Maximization) algorithm, and clustering with the EM algorithm;
S2, filling missing values on the clustering result using the KNN algorithm.
2. The data preprocessing method based on the EM algorithm and the KNN algorithm according to claim 1, characterized in that said step S1 comprises the following sub-steps:
S11, recording the complete data subset as (x_1, x_2, ..., x_n), where the samples x_1, x_2, ..., x_n are independent of each other and the class z_i to which each sample corresponds is unknown; the purpose of the clustering algorithm is to determine the class of each sample such that the joint distribution p(x_i; z_i) of sample and class is maximized, the likelihood function of p(x_i; z_i) being:

L(\theta) = \prod_{i=1}^{n} \sum_{z_i} p(x_i, z_i; \theta)

taking the logarithm of the above formula to obtain:

l(\theta) = \sum_{i=1}^{n} \log \sum_{z_i} p(x_i, z_i; \theta)

wherein n is the number of samples, θ is the model parameter of the EM algorithm, and p(x_i, z_i; θ) is the joint distribution of sample x_i and class z_i under model parameter θ;
S12, defining the class variable z_i to obey some distribution Q_i, where the distribution function Q_i(z_i) satisfies:

\sum_{z_i} Q_i(z_i) = 1, \quad Q_i(z_i) \ge 0

Transforming the expression for l(θ) in step S11 with Jensen's inequality gives:

l(\theta) = \sum_{i=1}^{n} \log \sum_{z_i} Q_i(z_i) \frac{p(x_i, z_i; \theta)}{Q_i(z_i)} \ge \sum_{i=1}^{n} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}

that is, because log is concave, the function of the expectation f(E[X]) is greater than or equal to the expectation of the function E[f(X)]. By Jensen's inequality, the inequality becomes an equality if and only if X is a constant, i.e.:

\frac{p(x_i, z_i; \theta)}{Q_i(z_i)} = C

where C is a constant; summing over the different values of z_i gives:

\sum_{z_i} p(x_i, z_i; \theta) = C \sum_{z_i} Q_i(z_i) = C

thus, the calculation formula of Q_i(z_i) is:

Q_i(z_i) = \frac{p(x_i, z_i; \theta)}{\sum_{z_i} p(x_i, z_i; \theta)} = \frac{p(x_i, z_i; \theta)}{p(x_i; \theta)} = p(z_i \mid x_i; \theta)

where p(z_i | x_i; θ) is the conditional probability that sample x_i belongs to class z_i under model parameter θ;
S13, taking Q_i(z_i) obtained in step S12 as the distribution of the classes, then maximizing the likelihood function to obtain the final clustering result: given an initial value of θ, repeat steps E and M until convergence:

Step E: for each x_i, calculate Q_i(z_i) = p(z_i | x_i; θ);

Step M: update the model parameter by maximizing the lower bound:

\theta := \arg\max_{\theta} \sum_{i=1}^{n} \sum_{z_i} Q_i(z_i) \log \frac{p(x_i, z_i; \theta)}{Q_i(z_i)}
3. The data preprocessing method based on the EM algorithm and the KNN algorithm according to claim 1, characterized in that said step S2 comprises the following sub-steps:
S21, sorting the records of the incomplete data subset D_i from small to large according to the number of missing attribute values;
S22, calculating the distance from each record r in the sorted incomplete data subset to each cluster center c formed by EM clustering, and sorting the distances from small to large;
S23, assigning each incomplete record to the class of the cluster center c nearest to it;
S24, calculating the distance dis between the incomplete record and the other training data in its class using the Euclidean distance formula; for the continuous attributes of the incomplete record, filling the missing values using the following formula:

v_n.P_r = \frac{\sum_{i=1}^{n} \beta_i.P_r / \alpha_i}{\sum_{i=1}^{n} 1 / \alpha_i}

wherein v_n is the incomplete record, β_i is a complete record of the class in which cluster center c is located, P_r is a continuous attribute of the incomplete record v_n that contains a missing value, n is the total number of complete records of the class in which cluster center c is located, and α_i is the similarity of the two records, i.e. the calculated distance dis;
S25, filling each discrete attribute of the incomplete record with the mode of the corresponding attribute among the other complete records of the class to which the record belongs.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911392045.7A CN111046977A (en) | 2019-12-30 | 2019-12-30 | Data preprocessing method based on EM algorithm and KNN algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911392045.7A CN111046977A (en) | 2019-12-30 | 2019-12-30 | Data preprocessing method based on EM algorithm and KNN algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111046977A true CN111046977A (en) | 2020-04-21 |
Family
ID=70241570
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911392045.7A Pending CN111046977A (en) | 2019-12-30 | 2019-12-30 | Data preprocessing method based on EM algorithm and KNN algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111046977A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111984626A (en) * | 2020-08-25 | 2020-11-24 | 西安建筑科技大学 | Statistical mode-based energy consumption data identification and restoration method |
CN113065574A (en) * | 2021-02-24 | 2021-07-02 | 同济大学 | Data preprocessing method and device for semiconductor manufacturing system |
CN113139570A (en) * | 2021-03-05 | 2021-07-20 | 河海大学 | Dam safety monitoring data completion method based on optimal hybrid valuation |
CN113435536A (en) * | 2021-07-15 | 2021-09-24 | 广东电网有限责任公司 | Electricity charge data preprocessing method, device, terminal equipment and medium |
CN114004266A (en) * | 2020-07-27 | 2022-02-01 | 中国电信股份有限公司 | Non-equilibrium industrial data classification method and device and computer readable storage medium |
CN114168578A (en) * | 2021-12-09 | 2022-03-11 | 国网江苏省电力有限公司 | Daily load data missing value interpolation method based on clustering and neighbor algorithm |
CN116739345A (en) * | 2023-06-08 | 2023-09-12 | 南京工业大学 | Real-time evaluation method for possibility of dangerous chemical road transportation accident |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103177088A (en) * | 2013-03-08 | 2013-06-26 | 北京理工大学 | Biomedicine missing data compensation method |
US20150261846A1 (en) * | 2014-03-11 | 2015-09-17 | Sas Institute Inc. | Computerized cluster analysis framework for decorrelated cluster identification in datasets |
CN108363810A (en) * | 2018-03-09 | 2018-08-03 | 南京工业大学 | Text classification method and device |
CN108932301A (en) * | 2018-06-11 | 2018-12-04 | 天津科技大学 | Data filling method and device |
CN109446185A (en) * | 2018-08-29 | 2019-03-08 | 广西大学 | Collaborative filtering missing data processing method based on user's cluster |
CN110275895A (en) * | 2019-06-25 | 2019-09-24 | 广东工业大学 | It is a kind of to lack the filling equipment of traffic data, device and method |
CN114741457A (en) * | 2022-04-14 | 2022-07-12 | 郑州轻工业大学 | Data missing value filling method based on function dependence and clustering |
- 2019-12-30 CN CN201911392045.7A patent/CN111046977A/en active Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103177088A (en) * | 2013-03-08 | 2013-06-26 | 北京理工大学 | Biomedicine missing data compensation method |
US20150261846A1 (en) * | 2014-03-11 | 2015-09-17 | Sas Institute Inc. | Computerized cluster analysis framework for decorrelated cluster identification in datasets |
CN108363810A (en) * | 2018-03-09 | 2018-08-03 | 南京工业大学 | Text classification method and device |
CN108932301A (en) * | 2018-06-11 | 2018-12-04 | 天津科技大学 | Data filling method and device |
CN109446185A (en) * | 2018-08-29 | 2019-03-08 | 广西大学 | Collaborative filtering missing data processing method based on user's cluster |
CN110275895A (en) * | 2019-06-25 | 2019-09-24 | 广东工业大学 | It is a kind of to lack the filling equipment of traffic data, device and method |
CN114741457A (en) * | 2022-04-14 | 2022-07-12 | 郑州轻工业大学 | Data missing value filling method based on function dependence and clustering |
Non-Patent Citations (4)
Title |
---|
- MICROSTRONG0305: "EM算法详解" ("A Detailed Explanation of the EM Algorithm"), 《CSDN博客》 (CSDN Blog) *
- 曹建军等 (Cao Jianjun et al.): 《数据质量导论》 (An Introduction to Data Quality), 北京:国防工业出版社 (Beijing: National Defense Industry Press), 31 October 2017 *
- 樊东辉等 (Fan Donghui et al.): "基于聚类的KNN算法改进" ("An Improvement of the KNN Algorithm Based on Clustering"), 《电脑知识与技术》 (Computer Knowledge and Technology) *
- 赵星等 (Zhao Xing et al.): "基于距离最大化和缺失数据聚类的填充算法" ("A Filling Algorithm Based on Distance Maximization and Clustering of Missing Data"), 《电子设计工程》 (Electronic Design Engineering) *
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114004266A (en) * | 2020-07-27 | 2022-02-01 | 中国电信股份有限公司 | Non-equilibrium industrial data classification method and device and computer readable storage medium |
CN111984626A (en) * | 2020-08-25 | 2020-11-24 | 西安建筑科技大学 | Statistical mode-based energy consumption data identification and restoration method |
CN113065574A (en) * | 2021-02-24 | 2021-07-02 | 同济大学 | Data preprocessing method and device for semiconductor manufacturing system |
CN113139570A (en) * | 2021-03-05 | 2021-07-20 | 河海大学 | Dam safety monitoring data completion method based on optimal hybrid valuation |
CN113435536A (en) * | 2021-07-15 | 2021-09-24 | 广东电网有限责任公司 | Electricity charge data preprocessing method, device, terminal equipment and medium |
CN114168578A (en) * | 2021-12-09 | 2022-03-11 | 国网江苏省电力有限公司 | Daily load data missing value interpolation method based on clustering and neighbor algorithm |
CN116739345A (en) * | 2023-06-08 | 2023-09-12 | 南京工业大学 | Real-time evaluation method for possibility of dangerous chemical road transportation accident |
CN116739345B (en) * | 2023-06-08 | 2024-03-22 | 南京工业大学 | Real-time evaluation method for possibility of dangerous chemical road transportation accident |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111046977A (en) | Data preprocessing method based on EM algorithm and KNN algorithm | |
US10970650B1 (en) | AUC-maximized high-accuracy classifier for imbalanced datasets | |
CN111324642A (en) | Model algorithm type selection and evaluation method for power grid big data analysis | |
Antunes et al. | Knee/elbow estimation based on first derivative threshold | |
CN111833175A (en) | Internet financial platform application fraud behavior detection method based on KNN algorithm | |
Jabbar | Local and global outlier detection algorithms in unsupervised approach: a review | |
CN112288465A (en) | Client segmentation method based on semi-supervised clustering ensemble learning | |
CN112634022B (en) | Credit risk assessment method and system based on unbalanced data processing | |
Barandela et al. | Restricted decontamination for the imbalanced training sample problem | |
Sugiharto et al. | Mall Customer Clustering Using Gaussian Mixture Model, K-Means, and BIRCH Algorithm | |
CN113052268A (en) | Attribute reduction algorithm based on uncertainty measurement under interval set data type | |
Koch et al. | Exploring the open world using incremental extreme value machines | |
Zhao et al. | Outlier detection for partially labeled categorical data based on conditional information entropy | |
CN114529004A (en) | Quantum clustering method based on nearest neighbor KNN and improved wave function | |
CN113792141A (en) | Feature selection method based on covariance measurement factor | |
CN112288571A (en) | Personal credit risk assessment method based on rapid construction of neighborhood coverage | |
CN117539920B (en) | Data query method and system based on real estate transaction multidimensional data | |
Lemhadri et al. | RbX: Region-based explanations of prediction models | |
Lumauag | A Modified Memory-Based Collaborative Filtering Algorithm based on a New User Similarity Measure | |
Wang | Automated shmoo data analysis: A machine learning approach | |
Takahashi et al. | Change detection method using cluster transition probability | |
Li et al. | Fuzzy clustering with automated model selection: entropy penalty approach | |
Ngo et al. | Active Level Set Estimation for Continuous Search Space with Theoretical Guarantee | |
CN117708622B (en) | Abnormal index analysis method and system of operation and maintenance system and electronic device | |
Volodymyr et al. | Classification of Images of Visual Objects Based on Statistical Relevance Measures of Their Structural Descriptions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 2020-04-21 |