CN106156805A - A kind of classifier training method of sample label missing data - Google Patents
Info
- Publication number
- CN106156805A CN106156805A CN201610818737.3A CN201610818737A CN106156805A CN 106156805 A CN106156805 A CN 106156805A CN 201610818737 A CN201610818737 A CN 201610818737A CN 106156805 A CN106156805 A CN 106156805A
- Authority
- CN
- China
- Prior art keywords
- sample
- label
- theta
- data
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/2411: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches, based on the proximity to a decision surface, e.g. support vector machines
- G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
The invention discloses a classifier training method for data with missing sample labels, suited to two-class data in which the labels of one class are entirely missing. The invention provides an optimization-based solution technique: the reliability of each unlabeled sample is treated as a decision variable to be solved, and an optimization model is established based on the structural risk minimization principle. On small and medium-scale data sets the model can be solved by directly calling a nonlinear programming toolkit; on large-scale data sets an alternating search algorithm solves two convex programming subproblems in turn, iterating over the two blocks of model variables. The method is highly universal across different data sets and shows good generalization performance on independent test sets.
Description
Technical Field
The invention relates to a data analysis method, and in particular to a support-vector-machine-based classifier training method for data with missing sample labels.
Background
The support vector machine has become an important data processing and analysis method for supervised learning problems.
In many applications, the acquired sample data contain only two classes of samples, namely positive-class samples and negative-class samples.
Training a classifier from data in which all positive-sample labels are missing, and using it for classification and recognition, is a special semi-supervised learning problem. The invention published as CN105005790A trains several base classifiers on a labeled sample data set, classifies the samples of the unlabeled sample data set by a voting strategy, and iteratively updates the labeled and unlabeled sample sets; the resulting classifier is used for intelligent recognition of toxic gases in an electronic-nose chamber. The invention published as CN104992184A adopts a similar idea: several base classifiers are trained on a labeled sample data set, and a manual labeling technique is introduced for iterative classification; the training method is applied to image classification.
The basic idea of the above semi-supervised learning techniques in the current research literature is to predict or label the unlabeled samples of higher confidence by combining classical classification and clustering techniques, or by introducing external information or even manual labeling, and thereby to iteratively update the labeled and unlabeled sample sets. Such methods have two main limitations:
(1) Because the distribution of the data differs across data sets, the rules used by existing methods to update the labeled and unlabeled sample sets cannot be applied directly to different data sets, especially to data sets whose underlying probability distributions differ greatly.
(2) Although some application problems do not require classification and recognition on an independent test set, poor generalization performance means that even the classification and recognition results on the training set can deviate substantially.
Aiming at the above limitations, the invention discloses a processing method for data in which the labels of all positive samples, and of part of the negative samples, are missing.
Disclosure of Invention
In order to overcome the defects of the prior art, namely poor universality across data sets and poor generalization performance of the obtained classifier, the invention provides an optimization solution technique: all unlabeled samples are tentatively regarded as positive samples, the reliability of each such label is taken as a decision variable to be solved, an optimization model is established based on the structural risk minimization principle, and an effective algorithm is provided for solving it.
The technical scheme adopted by the invention to solve this problem is as follows. A method for training a classifier from data with missing positive-sample labels mainly comprises the following steps:
step 1, data preprocessing: convert each feature of the data into numerical form, remove redundant features, and normalize the data;
step 2, let the preprocessed training samples be {(x_i, y_i)}_{i=1}^N, where x_i ∈ R^d, y_i ∈ {-1, +1}, and N is the number of training samples; known negative-class samples are labeled "-1" and all unlabeled samples are labeled "+1". Write Ω_- = {i | y_i = -1} and Ω_+ = {i | y_i = +1}. Solve an adaptive semi-supervised learning model of the form

    min_{f ∈ H_k, θ}  (1/2)‖f‖²_{H_k} + c_1 Σ_{i∈Ω_-} L(y_i, f(x_i)) + c_2 Σ_{i∈Ω_+} θ_i L(y_i, f(x_i)) + μ g(θ)    (1)
    s.t.  θ_i = 1, i ∈ Ω_-,
          0 ≤ θ_i ≤ 1, i ∈ Ω_+,

where f is the classification function to be solved, H_k is the reproducing-kernel Hilbert space to which f belongs, θ = [θ_1, ..., θ_N]^T ∈ R^N is the decision variable of the model, θ_i ∈ [0, 1] characterizes the confidence of the i-th sample label, L(·, ·) is a loss function, g(θ) is a regularization function of θ, and c_1 > 0, c_2 > 0, μ > 0 are constants; c_1 is the weight of the loss on negative-class samples, and c_2 is the weight of the loss on unlabeled (tentatively positive) samples.
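As a concrete illustration of model (1), the following sketch evaluates its objective in kernelized form (f(x_i) = (Kβ)_i, ‖f‖² = β^T K β), using the hinge loss for L. The exact arrangement of the terms is an assumption reconstructed from the weights c1, c2 and μ described above, and the regularizer g(θ) is passed in as a user-supplied function:

```python
import numpy as np

def objective(beta, theta, K, y, neg_idx, pos_idx, c1, c2, mu, g):
    """Objective J(beta, theta) of the kernelized model.
    Assumption: J = (1/2) beta^T K beta            (RKHS norm of f)
                  + c1 * sum of hinge losses over negative samples
                  + c2 * sum of theta-weighted hinge losses over unlabeled samples
                  + mu * g(theta)."""
    f = K @ beta                              # f(x_i) = sum_j beta_j k(x_j, x_i)
    loss = np.maximum(0.0, 1.0 - y * f)       # hinge loss per sample
    reg = 0.5 * beta @ K @ beta               # (1/2) ||f||^2 in the RKHS
    return (reg
            + c1 * loss[neg_idx].sum()
            + c2 * (theta[pos_idx] * loss[pos_idx]).sum()
            + mu * g(theta))
```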
And 3, predicting the label of the unlabeled sample according to the classifier f obtained by training.
Detailed description of the steps:
step 1, preprocessing data.
Data normalization: if it is known from experience that some features play important roles, the corresponding features can be multiplied by suitable coefficients after the data normalization operation is completed.
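A minimal sketch of this preprocessing step; min-max scaling to [0, 1] is an assumed choice (the text does not fix the normalization method), and the optional per-feature coefficients implement the weighting suggestion above:

```python
import numpy as np

def preprocess(X, weights=None):
    """Min-max normalize each feature column to [0, 1]; optionally
    multiply selected columns by user-chosen importance coefficients.
    (Min-max scaling is an assumption; the text only says 'normalize'.)"""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero for constant columns
    Xn = (X - lo) / span
    if weights is not None:
        Xn = Xn * np.asarray(weights, dtype=float)
    return Xn
```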
Step 2.1. For nonlinear classification data, select a suitable kernel function to measure the similarity of samples; if there is no prior knowledge of the data set, use the Gaussian kernel function k(x_i, x_j) = exp(-‖x_i - x_j‖²/(2σ²)), where σ > 0 is a constant.
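The Gaussian kernel matrix can be computed as below; writing the kernel as exp(-‖x_i - x_j‖²/(2σ²)) is one common convention, since the text only states that σ > 0 is a constant:

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """K_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)) for rows x_i of X."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))
```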
Step 2.2. Computable form of the adaptive semi-supervised learning model (1).
2.2.1 By the representer theorem, the decision function in the adaptive semi-supervised learning model (1) has the form

    f(x) = Σ_{j=1}^N β_j k(x_j, x),    (2)

and therefore

    f(x_i) = K_i^T β,   ‖f‖²_{H_k} = β^T K β,    (3)

where the kernel matrix K = (K_ij), K_ij = k(x_i, x_j), K_j denotes the j-th column of K, and β = [β_1, ..., β_N]^T.
2.2.2 loss function
The invention explains the specific solving algorithm of the model by taking the classical hinge loss and square loss as examples. The hinge loss has the form

    L(y_i, f(x_i)) = max(0, 1 - y_i f(x_i)), i = 1, ..., N.    (4)

The square loss has the form

    L(y_i, f(x_i)) = {max(0, 1 - y_i f(x_i))}², i = 1, ..., N.    (5)
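The two losses (4) and (5) translate directly into code:

```python
import numpy as np

def hinge_loss(y, f):
    """Equation (4): L = max(0, 1 - y f(x))."""
    return np.maximum(0.0, 1.0 - y * f)

def square_loss(y, f):
    """Equation (5): L = {max(0, 1 - y f(x))}^2."""
    return np.maximum(0.0, 1.0 - y * f) ** 2
```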
2.2.3 Regularization term g(θ). The selection principles are as follows:
(1) g(θ) is a convex function of θ, with θ_i ∈ [0, 1], i = 1, ..., N;
(2) denote by θ*(μ, l) the componentwise minimizer of θ l + μ g(θ) over θ ∈ [0, 1]; then θ*(μ, l) is monotonically non-increasing with respect to l.
A regularization term g(θ) satisfying both conditions simultaneously can be given in various forms according to this principle. The invention takes two practical forms of g(θ) as examples to explain the solution scheme of model (1); for these forms the optimal θ can be computed in closed form by formula (6) or formula (7), respectively.
2.2.4 specific form of adaptive semi-supervised learning model
According to (2) and the properties of the reproducing-kernel Hilbert space, substituting (3) and the loss (4) or (5) into the adaptive semi-supervised learning model (1) yields the concrete form of the model:

    min_{β, θ}  (1/2) β^T K β + c_1 Σ_{i∈Ω_-} {max(0, 1 - y_i K_i^T β)}^p + c_2 Σ_{i∈Ω_+} θ_i {max(0, 1 - y_i K_i^T β)}^p + μ g(θ)    (8)
    s.t.  θ_i = 1, i ∈ Ω_-,
          0 ≤ θ_i ≤ 1, i ∈ Ω_+,

where K = (K_ij), K_ij = k(x_i, x_j), p = 1 or p = 2 corresponds to the hinge loss and the square loss respectively, and g(θ) is determined by formula (6) or (7).
2.2.5 solving method of self-adaptive semi-supervised learning model
The adaptive semi-supervised learning model (8) is a nonlinear programming problem comprising two blocks of variables to be solved, β ∈ R^N and θ ∈ R^N. For a data set of scale N ≤ 10000, an algorithm toolkit, such as fmincon of Matlab, is directly called to solve it.
For large-scale datasets, the present invention discloses the following iterative algorithm.
Denote the objective function of model (8) by J(β, θ).
Algorithm 1. Alternating search algorithm
Input: training samples {(x_i, y_i)}_{i=1}^N, constants c_1, c_2, μ;
Output: β* ∈ R^N, θ* ∈ R^N.
Step 1. Initialize: choose β^0 = [0, ..., 0]^T, θ^0 = [1, ..., 1]^T, and set k = 0;
Step 2. For fixed θ^k, with β^k as the initial point, approximately solve the convex optimization problem

    min_β J(β, θ^k),    (10)

and denote the optimal solution by β^{k+1};
Step 3. For fixed β^{k+1}, solve the convex optimization problem

    min_θ J(β^{k+1}, θ)
    s.t. θ_i = 1, i ∈ Ω_-,
         0 ≤ θ_i ≤ 1, i ∈ Ω_+;    (11)

denote the optimal solution by θ^{k+1}, set k = k + 1, and go to Step 2 until the termination criterion is met.
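The steps of Algorithm 1 can be sketched as follows. Two stand-ins are assumptions, since formulas (6)/(7) and the SMO subsolver are not reproduced in this text: the θ-update uses a hard self-paced rule (θ_i = 1 if c2·l_i < μ, else 0), and the β-subproblem (10) is solved approximately by plain subgradient descent on the weighted hinge objective:

```python
import numpy as np

def alternating_search(K, y, neg_idx, pos_idx, c1, c2, mu,
                       outer_iters=20, inner_iters=200, lr=1e-3):
    """Sketch of Algorithm 1 (alternating search).
    Assumptions: hard-threshold theta-update in place of formulas (6)/(7);
    subgradient descent in place of an SMO solver for subproblem (10)."""
    N = K.shape[0]
    beta = np.zeros(N)                       # Step 1: beta^0 = 0
    theta = np.ones(N)                       # Step 1: theta^0 = 1
    w = np.zeros(N)                          # per-sample loss weights
    w[neg_idx], w[pos_idx] = c1, c2
    for _ in range(outer_iters):
        # Step 2: approximately solve the beta-subproblem (10)
        for _ in range(inner_iters):
            f = K @ beta
            active = (1.0 - y * f) > 0.0     # samples inside the margin
            # subgradient of (1/2) b^T K b + sum_i w_i theta_i hinge_i
            grad = f - K @ (w * theta * y * active)
            beta -= lr * grad
        # Step 3: closed-form theta-subproblem (11), hard-threshold stand-in
        l = np.maximum(0.0, 1.0 - y * (K @ beta))
        theta[pos_idx] = (c2 * l[pos_idx] < mu).astype(float)
        theta[neg_idx] = 1.0                 # constraint theta_i = 1 on Omega-
    return beta, theta
```

On a small linearly separable toy set this recovers the correct labels on the training samples, although the update rules above are only stand-ins for the patent's own subsolvers.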
The flow chart of the algorithm is shown in FIG. 1. Specific implementation of the algorithm:
solving of sub-problem (11)
When in useIn a particular form, the subproblem (11) has an analytical solutioni=L(yi,f(xi) The optimal solution of the n. mnemonic problem (11) is θ ═ 1k+1Is provided with
If determined by the formula (6)Then
If determined by the formula (7)Then
Approximate solution of subproblem (10). Subproblem (10) is solved once in each iteration of Algorithm 1, and only an approximate solution is required. Subproblem (10) can be written as the minimization over β of

    (1/2) β^T K β + Σ_{i=1}^N w_i {max(0, 1 - y_i K_i^T β)}^p,

where the per-sample weights are w_i = c_1 for i ∈ Ω_- and w_i = c_2 θ_i^k for i ∈ Ω_+, with θ^k determined by subproblem (11). This is a weighted primal form of the support vector machine, and it is approximately solved with an algorithm for the support vector machine, such as the SMO algorithm.
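In practice the weighted subproblem (10) can be handed to an off-the-shelf SMO-based SVM solver by passing the per-sample weights w_i; using scikit-learn's SVC with a precomputed kernel is our choice for illustration, not the patent's (note that SVC also fits an intercept term, a slight deviation from the pure kernel expansion):

```python
import numpy as np
from sklearn.svm import SVC

def solve_beta_subproblem(K, y, weights, C=1.0):
    """Approximately solve the weighted SVM subproblem (10) with an
    SMO-based solver.  `weights` holds the per-sample factors
    (c1 for i in Omega-, c2 * theta_i^k for i in Omega+).
    Choosing sklearn's precomputed-kernel SVC is an assumption; the
    patent only says 'an algorithm that solves the SVM, such as SMO'."""
    clf = SVC(C=C, kernel="precomputed")
    clf.fit(K, y, sample_weight=weights)     # per-sample weights scale C_i
    return clf.decision_function(K)          # f(x_i) on the training samples
```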
Termination criterion of Algorithm 1. If the number of iterations exceeds a preset bound determined by the constant r_1 > 0, the algorithm terminates the iteration.
Step 3. Predict the labels of the unlabeled samples. Denote by β* ∈ R^N the optimal solution output by Algorithm 1. The classification function obtained by model training is

    f*(x) = Σ_{j=1}^N β*_j k(x_j, x).    (13)

For a sample x_k, its label is predicted as y_k = sign(f*(x_k)).
The beneficial effects of the invention are:
(1) The reliability of each sample label is taken as a decision variable solved by the model itself, which overcomes the large prediction error incurred when unlabeled samples are labeled by artificially introduced external data processing techniques.
(2) The method is highly universal across different data sets, in particular across data sets whose underlying probability distributions differ greatly.
(3) The trained classifier has good generalization performance, and its classification error rate can be tested on an independent test set, effectively avoiding overfitting.
Drawings
FIG. 1 is a flow chart of Algorithm 1, the main algorithm of the invention, which alternately computes β^k and θ^k by solving the two subproblems until the iteration terminates, then outputs β* and computes the classification function according to formula (13).
Detailed Description
The invention is further illustrated below with reference to the accompanying figures and examples. Three polypeptide identification data sets, yeast, ups1, and tal08, were selected to test the effectiveness of the disclosed method. Table 1 lists, for each data set, the total number of samples, the number of known negative-class samples, and the number of unlabeled samples. The disclosed method computes the classification function on the training set and tests its performance on an independent test set.
Unified parameter settings were used for all three tested data sets: in the adaptive semi-supervised learning model (8), μ = 5.0 and c_1 = c_2 = 1.0; the square loss function is used; g(θ) is determined by equation (6); and the termination constant r_1 of Algorithm 1 is set to 0.5.
Table 2 lists the numbers of true positives (TP) and false positives (FP) identified on the training set and the test set by the method of the invention, where the error rate FP/(TP + FP) is taken at the 0.025 level. As Table 2 shows, the numbers of TP and FP identified on the test set are nearly consistent with the ratio "test set samples / total samples = 50%", indicating that the classification function computed by the method has good predictive performance on the test set.
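The error rate used here is computed as FP/(TP + FP), a false-discovery rate over the positive identifications:

```python
def error_rate(tp, fp):
    """Error rate FP / (TP + FP) over the positive identifications."""
    return fp / (tp + fp)
```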
TABLE 1. Data sets

| Data set | Total | Negative samples | Unlabeled samples |
|---|---|---|---|
| yeast | 14892 | 8189 | 6703 |
| ups1 | 17335 | 8361 | 8974 |
| tal08 | 69560 | 27338 | 42222 |
Table 2 performance of the method of the invention on training and test sets (error rate 0.025)
Claims (6)
1. A classifier training method for data with missing sample labels, comprising the following steps:
step 1, preprocessing data;
and 2, solving an adaptive semi-supervised learning model of the following form:

    min_{f ∈ H_k, θ}  (1/2)‖f‖²_{H_k} + c_1 Σ_{i∈Ω_-} L(y_i, f(x_i)) + c_2 Σ_{i∈Ω_+} θ_i L(y_i, f(x_i)) + μ g(θ)
    s.t.  θ_i = 1, i ∈ Ω_-,
          0 ≤ θ_i ≤ 1, i ∈ Ω_+,

wherein {(x_i, y_i)}_{i=1}^N are the training samples, x_i ∈ R^d, y_i ∈ {-1, +1}, negative-class sample points are labeled "-1" and unlabeled samples are labeled "+1", Ω_- = {i | y_i = -1}, Ω_+ = {i | y_i = +1}, f is the classification function to be solved, H_k is the reproducing-kernel Hilbert space to which f belongs, θ = [θ_1, ..., θ_N]^T ∈ R^N is the decision variable of the model, L(·, ·) is the loss function, g(θ) is a regularization function of θ, and c_1 > 0, c_2 > 0, μ > 0 are constants;
and 3, predicting the label of the unlabeled sample according to the classifier f obtained by training.
2. The method as claimed in claim 1, wherein the regularization term g(θ) in step 2 satisfies the following two rules simultaneously:
(1) g(θ) is a convex function of θ, with θ_i ∈ [0, 1], i = 1, ..., N;
(2) denoting by θ*(μ, l) the componentwise minimizer of θ l + μ g(θ) over θ ∈ [0, 1], θ*(μ, l) is monotonically non-increasing with respect to l.
3. The method as claimed in claim 1, wherein, after a particular loss function and regularization term g(θ) are selected, the adaptive semi-supervised learning model can be written as a nonlinear programming problem, which on small and medium-scale data sets is solved by directly calling a nonlinear programming toolkit.
4. The method as claimed in claim 3, wherein the nonlinear programming problem comprises two blocks of variables to be solved, β ∈ R^N and θ ∈ R^N, and large-scale problems are solved by the following iterative algorithm:
step 1, initialize: choose β^0 and θ^0, and set k = 0;
step 2, for fixed θ^k, with β^k as the initial point, approximately solve the β-subproblem, and denote the optimal solution by β^{k+1};
step 3, for fixed β^{k+1}, solve the θ-subproblem, denote the optimal solution by θ^{k+1}, set k = k + 1, and go to step 2 until the termination criterion is met.
5. The method as claimed in claim 4, wherein, when g(θ) takes one of the two specific forms, the optimal solution θ^{k+1} in step 3 of the algorithm has an analytical form: each θ_i^{k+1} is computed from l_i by the closed-form formula corresponding to the chosen g(θ), where l_i = L(y_i, f(x_i)), i = 1, ..., N.
6. The method as claimed in claim 1, wherein in step 3 the label of an unlabeled sample is predicted as follows: for a sample x_k, its label is predicted as y_k = sign(f*(x_k)), where f*(x) is the classification function trained by the model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610818737.3A CN106156805A (en) | 2016-09-12 | 2016-09-12 | A kind of classifier training method of sample label missing data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610818737.3A CN106156805A (en) | 2016-09-12 | 2016-09-12 | A kind of classifier training method of sample label missing data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106156805A true CN106156805A (en) | 2016-11-23 |
Family
ID=57340820
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610818737.3A Pending CN106156805A (en) | 2016-09-12 | 2016-09-12 | A kind of classifier training method of sample label missing data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106156805A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107657282A (en) * | 2017-09-29 | 2018-02-02 | 中国石油大学(华东) | Peptide identification from step-length learning method |
CN108388774A (en) * | 2018-01-17 | 2018-08-10 | 中国石油大学(华东) | A kind of on-line analysis of polypeptide spectrum matched data |
CN108388774B (en) * | 2018-01-17 | 2021-07-23 | 中国石油大学(华东) | Online analysis method of polypeptide spectrum matching data |
CN108091397A (en) * | 2018-01-24 | 2018-05-29 | 浙江大学 | A kind of bleeding episode Forecasting Methodology for the Ischemic Heart Disease analyzed based on promotion-resampling and feature association |
CN109492695A (en) * | 2018-11-08 | 2019-03-19 | 北京字节跳动网络技术有限公司 | Sample processing method, device, electronic equipment and the readable medium of data modeling |
CN109492695B (en) * | 2018-11-08 | 2021-07-23 | 北京字节跳动网络技术有限公司 | Sample processing method and device for data modeling, electronic equipment and readable medium |
CN109543693B (en) * | 2018-11-28 | 2021-05-07 | 中国人民解放军国防科技大学 | Weak labeling data noise reduction method based on regularization label propagation |
CN109543693A (en) * | 2018-11-28 | 2019-03-29 | 中国人民解放军国防科技大学 | Weak labeling data noise reduction method based on regularization label propagation |
CN109960745B (en) * | 2019-03-20 | 2021-03-23 | 网易(杭州)网络有限公司 | Video classification processing method and device, storage medium and electronic equipment |
CN109960745A (en) * | 2019-03-20 | 2019-07-02 | 网易(杭州)网络有限公司 | Visual classification processing method and processing device, storage medium and electronic equipment |
CN111563721A (en) * | 2020-04-21 | 2020-08-21 | 上海爱数信息技术股份有限公司 | Mail classification method suitable for different label distribution occasions |
CN111563721B (en) * | 2020-04-21 | 2023-07-11 | 上海爱数信息技术股份有限公司 | Mail classification method suitable for different label distribution occasions |
CN112598265A (en) * | 2020-12-18 | 2021-04-02 | 武汉大学 | Decoupling risk estimation-based rapid detection method for hyperspectral pine nematode disease of unmanned aerial vehicle |
CN112598265B (en) * | 2020-12-18 | 2022-06-07 | 武汉大学 | Decoupling risk estimation-based rapid detection method for hyperspectral pine nematode disease of unmanned aerial vehicle |
CN112836750A (en) * | 2021-02-03 | 2021-05-25 | 中国工商银行股份有限公司 | System resource allocation method, device and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106156805A (en) | A kind of classifier training method of sample label missing data | |
CN106778832B (en) | The semi-supervised Ensemble classifier method of high dimensional data based on multiple-objection optimization | |
CN103617435B (en) | Image sorting method and system for active learning | |
CN106682696A (en) | Multi-example detection network based on refining of online example classifier and training method thereof | |
CN113887643B (en) | New dialogue intention recognition method based on pseudo tag self-training and source domain retraining | |
CN106203534A (en) | A kind of cost-sensitive Software Defects Predict Methods based on Boosting | |
CN113010683B (en) | Entity relationship identification method and system based on improved graph attention network | |
CN103617429A (en) | Sorting method and system for active learning | |
CN109492750B (en) | Zero sample image classification method based on convolutional neural network and factor space | |
CN104966105A (en) | Robust machine error retrieving method and system | |
CN112115993B (en) | Zero sample and small sample evidence photo anomaly detection method based on meta-learning | |
CN109583635A (en) | A kind of short-term load forecasting modeling method towards operational reliability | |
CN104657574A (en) | Building method and device for medical diagnosis models | |
CN109582974A (en) | A kind of student enrollment's credit estimation method and device based on deep learning | |
CN110659682A (en) | Data classification method based on MCWD-KSMOTE-AdaBoost-DenseNet algorithm | |
CN113095229B (en) | Self-adaptive pedestrian re-identification system and method for unsupervised domain | |
CN116152554A (en) | Knowledge-guided small sample image recognition system | |
CN108596204B (en) | Improved SCDAE-based semi-supervised modulation mode classification model method | |
CN106569954A (en) | Method based on KL divergence for predicting multi-source software defects | |
CN110796260B (en) | Neural network model optimization method based on class expansion learning | |
CN116361454A (en) | Automatic course teaching case assessment method based on Bloom classification method | |
CN115165366A (en) | Variable working condition fault diagnosis method and system for rotary machine | |
CN111191033A (en) | Open set classification method based on classification utility | |
CN103559510B (en) | Method for recognizing social group behaviors through related topic model | |
CN105787045A (en) | Precision enhancing method for visual media semantic indexing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20161123 |