
CN106156805A - A kind of classifier training method of sample label missing data - Google Patents

A kind of classifier training method of sample label missing data

Info

Publication number
CN106156805A
CN106156805A
Authority
CN
China
Prior art keywords
sample
label
theta
data
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610818737.3A
Other languages
Chinese (zh)
Inventor
梁锡军
夏重杭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China filed Critical China University of Petroleum East China
Priority to CN201610818737.3A priority Critical patent/CN106156805A/en
Publication of CN106156805A publication Critical patent/CN106156805A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a classifier training method for sample-label-missing data, suited to categorical data with two classes of samples in which the labels of one class are entirely missing. The invention provides an optimization-based solution technique: the reliability of each unlabeled sample is treated as a decision variable to be solved, and an optimization model is established based on the structural risk minimization principle. On small and medium-scale data sets the model can be solved by directly calling a nonlinear programming toolkit; on large-scale data sets, an alternate search algorithm solves two convex programming subproblems in turn, iterating over the two blocks of model variables. The method is highly versatile across different data sets and generalizes well on independent test sets.

Description

Classifier training method for sample label missing data
Technical Field
The invention relates to a data analysis method, in particular to a classifier training method based on a support vector machine for sample label missing data.
Background
The support vector machine has become an important data processing and analyzing method for processing supervised learning problems.
In many applications, the acquired data contain only two classes of samples, namely positive-class samples and negative-class samples.
Training a classifier from data in which all positive-class sample labels are missing, and using it for classification and recognition, is a special semi-supervised learning problem. The invention published as CN105005790A trains multiple base classifiers on a known-label sample data set, classifies samples of the unknown-label sample data set by a voting strategy, and iteratively updates the labeled and unlabeled sample sets; the resulting classifier is used for intelligent identification of toxic gases in an electronic-nose chamber. The invention published as CN104992184A adopts a similar idea, training multiple base classifiers on a known-label sample data set and introducing a manual labeling technique for iterative classification, with application to image classification.
The basic idea of the above semi-supervised learning techniques, and of methods in the current research literature, is to predict or label the higher-confidence unlabeled samples by integrating classical classification and clustering techniques, or by introducing external information or even manual labeling, so as to iteratively update the labeled and unlabeled sample sets. These approaches have two limitations:
(1) Because the distribution of data differs across data sets, the rules used by existing methods to update the labeled and unlabeled sample sets cannot be applied directly to different data sets, especially data sets whose underlying probability distributions differ greatly.
(2) Although some application problems do not require classification and recognition on an independent test set, poor generalization performance means that the classification and recognition results on the training set also have larger deviations.
Aiming at these limitations, the invention discloses a data processing method for data in which all labels of the positive samples and part of the labels of the negative samples are missing.
Disclosure of Invention
In order to overcome the defects of the prior art, namely poor universality on different data sets and poor generalization performance of the obtained classifier or classification method, the invention provides an optimization solution technique: all unlabeled samples are regarded as positive-class samples, the reliability of each label is used as a decision variable to be solved, an optimization model is established based on the structural risk minimization principle, and an effective algorithm is provided for solving it.
The technical scheme adopted by the invention to solve the technical problem is as follows. A method for training a classifier from data with missing positive-sample labels mainly comprises the following steps:
Step 1. Data preprocessing: convert each feature of the data into numerical data, remove redundant features, and normalize the data.
Step 2. Let the preprocessed training samples be {(x_i, y_i)}_{i=1}^N, where x_i ∈ R^d, y_i ∈ {-1, +1}, and N is the number of training samples. All known negative-class sample points are labeled "-1" and all unlabeled samples are labeled "+1". Let Ω- = {i | y_i = -1} and Ω+ = {i | y_i = +1}. Solve an adaptive semi-supervised learning model of the form

min_{f ∈ H_k, θ}  (1/2)||f||_{H_k}^2 + c_1 Σ_{i ∈ Ω-} L(y_i, f(x_i)) + c_2 Σ_{i ∈ Ω+} θ_i L(y_i, f(x_i)) + μ g(θ)
s.t.  θ_i = 1, i ∈ Ω-,
      0 ≤ θ_i ≤ 1, i ∈ Ω+,    (1)

where f ∈ H_k is the classification function to be solved, H_k is the reproducing kernel Hilbert space to which it belongs, θ = [θ_1, ..., θ_N]^T ∈ R^N is the decision variable of the model, θ_i ∈ [0, 1] characterizes the confidence of the i-th sample label, L(·,·) is a loss function, g(θ) is a regularization function of θ, and c_1 > 0, c_2 > 0, μ > 0 are constants; c_1 weights the loss of the negative-class samples and c_2 weights the loss of the unlabeled (putative positive-class) samples.
Step 3. Predict the labels of the unlabeled samples according to the classifier f obtained by training.
Detailed description of the steps:
Step 1. Preprocess the data.
Data normalization: if it is known from experience that certain features play an important role, the corresponding features can be multiplied by suitable coefficients after the normalization operation is completed.
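The preprocessing step can be sketched as follows. Min-max normalization to [0, 1] is assumed (the patent does not fix a particular normalization), and the `important`/`coef` parameters are illustrative names for the "multiply important features by suitable coefficients" option:

```python
import numpy as np

def preprocess(X, important=None, coef=2.0):
    """Min-max normalize each numerical feature to [0, 1]; features known
    from experience to be important (illustrative option, parameter names
    not taken from the patent) are then multiplied by a chosen coefficient."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    rng = np.where(hi > lo, hi - lo, 1.0)   # guard against constant features
    Xn = (X - lo) / rng
    if important is not None:
        Xn[:, important] = Xn[:, important] * coef
    return Xn
```

Redundant-feature removal (constant or duplicated columns) would precede this call in a full pipeline.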
Step 2.1. For nonlinear classification data, select a suitable kernel function to measure the similarity of samples. If there is no prior knowledge of the data set, the Gaussian kernel function k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2)) can be used, where σ > 0 is a constant.
Step 2.2. An executable form of the adaptive semi-supervised learning model (1).
2.2.1 According to the representer theorem, the decision function in the adaptive semi-supervised learning model (1) has the form

f(x) = Σ_{i=1}^N β_i k(x_i, x).    (2)

Therefore,

f(x_j) = Σ_{i=1}^N β_i k(x_i, x_j) = K_j^T β,  j = 1, ..., N,    (3)

where the kernel matrix K = (K_ij), K_ij = k(x_i, x_j), and K_j denotes the j-th column of K.
2.2.2 Loss function.
The invention explains the specific solution algorithm of the model using the classic hinge loss function and square loss function as examples. The hinge loss has the form

L(y_i, f(x_i)) = max(0, 1 - y_i f(x_i)),  i = 1, ..., N.    (4)

The square loss has the form

L(y_i, f(x_i)) = {max(0, 1 - y_i f(x_i))}^2,  i = 1, ..., N.    (5)
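The two losses (4) and (5) can be written directly as vectorized functions:

```python
import numpy as np

def hinge_loss(y, fx):
    """Hinge loss (4): max(0, 1 - y * f(x)), elementwise."""
    return np.maximum(0.0, 1.0 - np.asarray(y) * np.asarray(fx))

def square_loss(y, fx):
    """Square loss (5): the squared hinge, {max(0, 1 - y * f(x))}^2."""
    return hinge_loss(y, fx) ** 2
```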
2.2.3 The regularization term g(θ) is selected according to the following principles:
(1) g(θ) is a convex function of θ, with θ_i ∈ [0, 1], i = 1, ..., N;
(2) denoting by θ*(μ, l) the minimizer of the θ-subproblem for a given loss value l, θ*(μ, l) is monotonically nonincreasing with respect to l.
g(θ) should satisfy both principles simultaneously. Regularization terms of various forms can be constructed according to these principles. The invention takes two practical forms of g(θ), given by formulas (6) and (7), as examples to explain the solution scheme of model (1).
2.2.4 Specific form of the adaptive semi-supervised learning model.
By the expansion (2) and the properties of the reproducing kernel Hilbert space,

||f||_H^2 = < Σ_{i=1}^N β_i k(x_i, ·), Σ_{j=1}^N β_j k(x_j, ·) > = Σ_{i=1}^N Σ_{j=1}^N k(x_i, x_j) β_i β_j = β^T K β.

Substituting formula (3), together with the loss (4) or (5), into the adaptive semi-supervised learning model (1) yields the concrete form of the model:

min_{β, θ}  (1/2) β^T K β + c_1 Σ_{i ∈ Ω-} {max(0, 1 - y_i K_i^T β)}^p + c_2 Σ_{i ∈ Ω+} θ_i {max(0, 1 - y_i K_i^T β)}^p + μ g(θ)
s.t.  θ_i = 1, i ∈ Ω-,
      0 ≤ θ_i ≤ 1, i ∈ Ω+,    (8)

where K = (K_ij), K_ij = k(x_i, x_j); p = 1 or p = 2 corresponds to the hinge loss or the square loss respectively; and g(θ) is determined by formula (6) or (7).
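The objective of the concrete model (8) can be evaluated directly. In this sketch, g(θ) defaults to an assumed quadratic form Σ(θ_i^2/2 - θ_i), since the patent's formula (6) is not reproduced in this text; any other admissible g can be passed in:

```python
import numpy as np

def objective_F(beta, theta, K, y, neg, pos, c1=1.0, c2=1.0, mu=1.0, p=2):
    """Objective F(beta, theta) of model (8):
      0.5 * beta^T K beta
      + c1 * sum over Omega- of hinge^p
      + c2 * sum over Omega+ of theta_i * hinge^p
      + mu * g(theta),
    with g an ASSUMED quadratic regularizer (the patent's (6) is not shown)."""
    beta, theta, K, y = map(np.asarray, (beta, theta, K, y))
    h = np.maximum(0.0, 1.0 - y * (K @ beta)) ** p      # per-sample loss
    g = np.sum(theta[pos] ** 2 / 2.0 - theta[pos])      # assumed g(theta)
    return (0.5 * beta @ K @ beta + c1 * h[neg].sum()
            + c2 * (theta[pos] * h[pos]).sum() + mu * g)
```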
2.2.5 Solution method for the adaptive semi-supervised learning model.
The adaptive semi-supervised learning model (8) is a nonlinear programming problem comprising two blocks of variables to be solved, β ∈ R^N and θ ∈ R^N. For a data set of scale N ≤ 10000, a nonlinear programming toolkit, such as fmincon in Matlab, can be called directly to solve it.
For large-scale datasets, the present invention discloses the following iterative algorithm.
Denote the objective function of model (8) by F(β, θ).
Algorithm 1. Alternate search algorithm.
Input: training samples {(x_i, y_i)}_{i=1}^N; constants c_1, c_2, μ.
Output: β* ∈ R^N, θ* ∈ R^N.
Step 1. Initialize: choose β^0 = [0, ..., 0]^T, θ^0 = [1, ..., 1]^T, and set k = 0.
Step 2. For fixed θ^k, with β^k as the initial point, approximately solve the convex optimization problem

min_β F(β, θ^k)    (10)

and denote the optimal solution by β^{k+1}.
Step 3. For fixed β^{k+1}, solve the convex optimization problem

min_θ F(β^{k+1}, θ)
s.t.  θ_i = 1, i ∈ Ω-,
      0 ≤ θ_i ≤ 1, i ∈ Ω+.    (11)

Denote the optimal solution by θ^{k+1}, set k = k + 1, and go to Step 2 until the termination criterion is met.
The flow chart of the algorithm is shown in FIG. 1. Specific implementation of the algorithm:
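The steps of Algorithm 1 can be sketched as follows, under stated assumptions: the square loss (p = 2); a plain gradient-descent approximation for the β-subproblem (10), standing in for the SVM solvers the patent suggests; and an assumed closed-form θ update θ_i = clip(1 - l_i/μ, 0, 1), since formulas (6)/(7) are not reproduced in this text:

```python
import numpy as np

def alternate_search(K, y, neg, pos, c1=1.0, c2=1.0, mu=5.0,
                     outer=20, inner=200, lr=1e-3):
    """Sketch of Algorithm 1 with the square loss.
    K: N x N kernel matrix; y in {-1, +1}; neg/pos: index lists Omega-/Omega+.
    The beta-step (10) is approximated by gradient descent (the patent
    suggests SVM solvers such as SMO); the theta-step (11) uses an ASSUMED
    closed form, since the patent's formulas (6)/(7) are not reproduced."""
    N = K.shape[0]
    beta = np.zeros(N)                       # Step 1: beta^0 = 0
    theta = np.ones(N)                       #         theta^0 = 1
    w = np.full(N, c2)                       # per-sample loss weights
    w[neg] = c1
    for _ in range(outer):
        for _ in range(inner):               # Step 2: approx. min_beta F(beta, theta^k)
            m = np.maximum(0.0, 1.0 - y * (K @ beta))        # hinge margins
            grad = K @ beta - 2.0 * K @ (w * theta * y * m)  # grad of F in beta
            beta -= lr * grad
        l = np.maximum(0.0, 1.0 - y * (K @ beta)) ** 2       # Step 3: losses l_i
        theta = np.clip(1.0 - l / mu, 0.0, 1.0)              # assumed theta update
        theta[neg] = 1.0                     # constraint theta_i = 1 on Omega-
    return beta, theta
```

On a toy two-cluster problem this recovers the correct signs of f on the training points; a production implementation would replace the inner loop with a weighted SVM solver.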
solving of sub-problem (11)
When in useIn a particular form, the subproblem (11) has an analytical solutioni=L(yi,f(xi) The optimal solution of the n. mnemonic problem (11) is θ ═ 1k+1Is provided with
If determined by the formula (6)Then
If determined by the formula (7)Then
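The two closed-form updates can be sketched under the classic self-paced-learning choices of g(θ). These choices are assumptions here, since formulas (6) and (7) appear only as images in the source: a quadratic form g(θ) = Σ(θ_i^2/2 - θ_i), which gives θ_i = clip(1 - l_i/μ, 0, 1), and an entropic form g(θ) = Σ(θ_i ln θ_i - θ_i), which gives θ_i = min(1, exp(-l_i/μ)). Both are convex and nonincreasing in l_i, consistent with principles (1) and (2) of Section 2.2.3:

```python
import numpy as np

def theta_closed_form(losses, mu, form="quadratic"):
    """Closed-form minimizer of the theta-subproblem (11) on Omega+ for two
    ASSUMED regularizers (the patent's (6)/(7) are not reproduced):
      quadratic: g = sum(theta^2/2 - theta)     -> theta_i = clip(1 - l_i/mu, 0, 1)
      entropic:  g = sum(theta*ln(theta) - theta) -> theta_i = min(1, exp(-l_i/mu))
    In both cases theta_i is nonincreasing in the loss l_i."""
    l = np.asarray(losses, dtype=float)
    if form == "quadratic":
        return np.clip(1.0 - l / mu, 0.0, 1.0)
    return np.minimum(1.0, np.exp(-l / mu))
```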
Approximate solution of the sub-problem (10).
The sub-problem (10) arises once in each iteration of Algorithm 1 and only needs to be solved approximately. With the weights θ^k fixed by the preceding solution of (11), the sub-problem (10) is the primal form of a weighted support vector machine, and it can be solved approximately using a standard support vector machine algorithm such as the SMO algorithm.
Termination criterion of Algorithm 1.
The algorithm terminates when the number of iterations exceeds a bound determined by a constant r_1 > 0.
Step 3. Predict the labels of the unlabeled samples.
Let β* ∈ R^N be the optimal solution output by Algorithm 1. The classification function obtained by model training is

f̂(x) = Σ_{i=1}^N β_i* k(x_i, x).    (13)

For a sample x_k, its label is predicted as ŷ_k = sgn(f̂(x_k)).
The invention has the following beneficial effects:
(1) The reliability of each sample label is a decision variable solved by the model itself, which avoids the large prediction errors introduced by artificially applying other data processing techniques to the unlabeled samples.
(2) The method is highly versatile across different data sets, in particular across data sets whose internal probability distributions differ greatly.
(3) The classifier obtained by training generalizes well; the classification error rate can be tested on an independent test set, effectively avoiding overfitting of the classifier.
Drawings
FIG. 1 is a flow chart of Algorithm 1, the main algorithm disclosed in the invention. The algorithm alternately computes β^k and θ^k by solving the two sub-problems until the iteration terminates, outputs β*, and computes the classification function f̂ according to formula (13).
Detailed Description
The invention is further illustrated below with reference to the accompanying figures and examples. Three polypeptide identification data sets, yeast, ups1, and tal08, were selected to test the effectiveness of the disclosed method. Table 1 lists, for each data set, the total number of samples, the number of known negative-class samples, and the number of unlabeled samples. The disclosed data processing method computes the classification function on the training set, and its performance is tested on an independent test set.
Unified parameter settings were used for all three tested data sets: μ = 5.0 and c_1 = c_2 = 1.0 in the adaptive semi-supervised learning model (8), the square loss function, g(θ) determined by formula (6), and termination constant r_1 = 0.5 for Algorithm 1.
Table 2 lists the numbers of true positives (TP) and false positives (FP) identified by the method of the invention on the training set and the test set, at an error rate FP/(TP + FP) of 0.025. As can be seen from Table 2, the TP and FP counts on the training and test sets are nearly consistent with the ratio "test-set samples / total samples = 50%", indicating that the classification function computed by the method has good predictive performance on the test set.
TABLE 1. Data sets
Data set   Total   Negative samples   Unlabeled samples
yeast      14892   8189               6703
ups1       17335   8361               8974
tal08      69560   27338              42222
Table 2 performance of the method of the invention on training and test sets (error rate 0.025)

Claims (6)

1. A classifier training method for sample-label-missing data, comprising the following steps:
Step 1. Preprocess the data.
Step 2. Solve an adaptive semi-supervised learning model of the following form:

min_{f ∈ H_k, θ}  (1/2)||f||_{H_k}^2 + c_1 Σ_{i ∈ Ω-} L(y_i, f(x_i)) + c_2 Σ_{i ∈ Ω+} θ_i L(y_i, f(x_i)) + μ g(θ)
s.t.  θ_i = 1, i ∈ Ω-,
      0 ≤ θ_i ≤ 1, i ∈ Ω+,

where {(x_i, y_i)}_{i=1}^N are the training samples, x_i ∈ R^d, y_i ∈ {-1, +1}, the negative-class sample points are labeled "-1" and the unlabeled samples are labeled "+1", Ω- = {i | y_i = -1}, Ω+ = {i | y_i = +1}, f ∈ H_k is the classification function to be solved, H_k is the reproducing kernel Hilbert space to which it belongs, θ = [θ_1, ..., θ_N]^T ∈ R^N is the decision variable of the model, L(·,·) is a loss function, g(θ) is a regularization function of θ, and c_1 > 0, c_2 > 0, μ > 0 are constants.
Step 3. Predict the labels of the unlabeled samples according to the classifier f obtained by training.
2. The method as claimed in claim 1, wherein the regularization term g(θ) in Step 2 satisfies the following two rules simultaneously:
(1) g(θ) is a convex function of θ, with θ_i ∈ [0, 1], i = 1, ..., N;
(2) denoting by θ*(μ, l) the minimizer of the θ-subproblem for a given loss value l, θ*(μ, l) is monotonically nonincreasing with respect to l.
3. The method as claimed in claim 1, wherein, once a loss function and a regularization term g(θ) are selected, the adaptive semi-supervised learning model can be modeled as a nonlinear programming problem, which on small and medium-scale data sets is solved by directly calling a nonlinear programming toolkit.
4. The method as claimed in claim 3, wherein the nonlinear programming problem comprises two blocks of variables to be solved, β ∈ R^N and θ ∈ R^N, and the large-scale problem is solved by the following iterative algorithm:
Step 1. Initialize: choose β^0 and θ^0, and set k = 0.
Step 2. For fixed θ^k, with β^k as the initial point, approximately solve the β-subproblem; denote the optimal solution by β^{k+1}.
Step 3. For fixed β^{k+1}, solve the θ-subproblem; denote the optimal solution by θ^{k+1}, set k = k + 1, and go to Step 2 until the termination criterion is met.
5. The method as claimed in claim 4, wherein, when g(θ) takes one of the particular forms (6) or (7), θ^{k+1} in Step 3 of the algorithm can be calculated in closed form: for i ∈ Ω-, θ_i^{k+1} = 1; for i ∈ Ω+, θ_i^{k+1} is given by the minimizer associated with formula (6) or with formula (7), where l_i = L(y_i, f(x_i)), i = 1, ..., N.
6. The method as claimed in claim 1, wherein in Step 3 the label of an unlabeled sample is predicted as follows: for a sample x_k, its label is predicted as ŷ_k = sgn(f̂(x_k)), where f̂ is the classification function obtained by model training.
CN201610818737.3A 2016-09-12 2016-09-12 A kind of classifier training method of sample label missing data Pending CN106156805A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610818737.3A CN106156805A (en) 2016-09-12 2016-09-12 A kind of classifier training method of sample label missing data


Publications (1)

Publication Number Publication Date
CN106156805A true CN106156805A (en) 2016-11-23

Family

ID=57340820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610818737.3A Pending CN106156805A (en) 2016-09-12 2016-09-12 A kind of classifier training method of sample label missing data

Country Status (1)

Country Link
CN (1) CN106156805A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107657282A (en) * 2017-09-29 2018-02-02 中国石油大学(华东) Peptide identification from step-length learning method
CN108388774A (en) * 2018-01-17 2018-08-10 中国石油大学(华东) A kind of on-line analysis of polypeptide spectrum matched data
CN108388774B (en) * 2018-01-17 2021-07-23 中国石油大学(华东) Online analysis method of polypeptide spectrum matching data
CN108091397A (en) * 2018-01-24 2018-05-29 浙江大学 A kind of bleeding episode Forecasting Methodology for the Ischemic Heart Disease analyzed based on promotion-resampling and feature association
CN109492695A (en) * 2018-11-08 2019-03-19 北京字节跳动网络技术有限公司 Sample processing method, device, electronic equipment and the readable medium of data modeling
CN109492695B (en) * 2018-11-08 2021-07-23 北京字节跳动网络技术有限公司 Sample processing method and device for data modeling, electronic equipment and readable medium
CN109543693B (en) * 2018-11-28 2021-05-07 中国人民解放军国防科技大学 Weak labeling data noise reduction method based on regularization label propagation
CN109543693A (en) * 2018-11-28 2019-03-29 中国人民解放军国防科技大学 Weak labeling data noise reduction method based on regularization label propagation
CN109960745B (en) * 2019-03-20 2021-03-23 网易(杭州)网络有限公司 Video classification processing method and device, storage medium and electronic equipment
CN109960745A (en) * 2019-03-20 2019-07-02 网易(杭州)网络有限公司 Visual classification processing method and processing device, storage medium and electronic equipment
CN111563721A (en) * 2020-04-21 2020-08-21 上海爱数信息技术股份有限公司 Mail classification method suitable for different label distribution occasions
CN111563721B (en) * 2020-04-21 2023-07-11 上海爱数信息技术股份有限公司 Mail classification method suitable for different label distribution occasions
CN112598265A (en) * 2020-12-18 2021-04-02 武汉大学 Decoupling risk estimation-based rapid detection method for hyperspectral pine nematode disease of unmanned aerial vehicle
CN112598265B (en) * 2020-12-18 2022-06-07 武汉大学 Decoupling risk estimation-based rapid detection method for hyperspectral pine nematode disease of unmanned aerial vehicle
CN112836750A (en) * 2021-02-03 2021-05-25 中国工商银行股份有限公司 System resource allocation method, device and equipment

Similar Documents

Publication Publication Date Title
CN106156805A (en) A kind of classifier training method of sample label missing data
CN106778832B (en) The semi-supervised Ensemble classifier method of high dimensional data based on multiple-objection optimization
CN103617435B (en) Image sorting method and system for active learning
CN106682696A (en) Multi-example detection network based on refining of online example classifier and training method thereof
CN113887643B (en) New dialogue intention recognition method based on pseudo tag self-training and source domain retraining
CN106203534A (en) A kind of cost-sensitive Software Defects Predict Methods based on Boosting
CN113010683B (en) Entity relationship identification method and system based on improved graph attention network
CN103617429A (en) Sorting method and system for active learning
CN109492750B (en) Zero sample image classification method based on convolutional neural network and factor space
CN104966105A (en) Robust machine error retrieving method and system
CN112115993B (en) Zero sample and small sample evidence photo anomaly detection method based on meta-learning
CN109583635A (en) A kind of short-term load forecasting modeling method towards operational reliability
CN104657574A (en) Building method and device for medical diagnosis models
CN109582974A A kind of student enrollment's credit estimation method and device based on deep learning
CN110659682A (en) Data classification method based on MCWD-KSMOTE-AdaBoost-DenseNet algorithm
CN113095229B (en) Self-adaptive pedestrian re-identification system and method for unsupervised domain
CN116152554A (en) Knowledge-guided small sample image recognition system
CN108596204B (en) Improved SCDAE-based semi-supervised modulation mode classification model method
CN106569954A (en) Method based on KL divergence for predicting multi-source software defects
CN110796260B (en) Neural network model optimization method based on class expansion learning
CN116361454A (en) Automatic course teaching case assessment method based on Bloom classification method
CN115165366A (en) Variable working condition fault diagnosis method and system for rotary machine
CN111191033A (en) Open set classification method based on classification utility
CN103559510B (en) Method for recognizing social group behaviors through related topic model
CN105787045A (en) Precision enhancing method for visual media semantic indexing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20161123