
CN106156805A - A kind of classifier training method of sample label missing data - Google Patents

A kind of classifier training method of sample label missing data

Info

Publication number
CN106156805A
CN106156805A
Authority
CN
China
Prior art keywords
sample
label
theta
data
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610818737.3A
Other languages
Chinese (zh)
Inventor
梁锡军
夏重杭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China filed Critical China University of Petroleum East China
Priority to CN201610818737.3A priority Critical patent/CN106156805A/en
Publication of CN106156805A publication Critical patent/CN106156805A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a classifier training method for sample-label-missing data, suited to categorical data with two classes of samples in which the labels of one class are entirely missing. The invention provides an optimization-based solution technique: the reliability of each unlabeled sample is treated as a decision variable to be solved, and an optimization model is established based on the structural risk minimization principle. On small and medium-scale data sets the model can be solved by directly calling a nonlinear programming toolkit; on large-scale data sets, an alternate search algorithm solves two convex programming subproblems in turn, iterating over the two blocks of model variables. The method is highly versatile across different data sets and generalizes well on independent test sets.

Description

Classifier training method for sample label missing data
Technical Field
The invention relates to a data analysis method, in particular to a classifier training method based on a support vector machine for sample label missing data.
Background
The support vector machine has become an important data processing and analyzing method for processing supervised learning problems.
In many applications, the acquired data contain only two classes of samples, namely positive-class samples and negative-class samples.
Training a classifier from data in which all positive-class sample labels are missing, and using it for classification and recognition, is a special semi-supervised learning problem. The invention published as CN105005790A trains multiple base classifiers on a known-label sample data set, classifies samples of the unknown-label sample data set by a voting strategy, and iteratively updates the labeled and unlabeled sample sets; the resulting classifier is used for intelligent identification of toxic gases in an electronic-nose chamber. The invention published as CN104992184A adopts a similar idea, training multiple base classifiers on a known-label sample data set and introducing a manual labeling technique for iterative classification, with application to image classification.
The basic idea of the above semi-supervised learning techniques, and of methods in the current research literature, is to predict or label the higher-confidence unlabeled samples by integrating classical classification and clustering techniques, or by introducing external information or even manual labeling, so as to iteratively update the labeled and unlabeled sample sets. These approaches have two limitations:
(1) Because the distribution of data differs across data sets, the rules used by existing methods to update the labeled and unlabeled sample sets cannot be applied directly to different data sets, especially data sets whose underlying probability distributions differ greatly.
(2) Although some application problems do not require classification and recognition on an independent test set, poor generalization performance means that the classification and recognition results on the training set also have larger deviations.
Aiming at these limitations, the invention discloses a data processing method for data in which all labels of the positive samples and part of the labels of the negative samples are missing.
Disclosure of Invention
In order to overcome the defects of the prior art, namely poor universality on different data sets and poor generalization performance of the obtained classifier or classification method, the invention provides an optimization solution technique: all unlabeled samples are regarded as positive-class samples, the reliability of each label is used as a decision variable to be solved, an optimization model is established based on the structural risk minimization principle, and an effective algorithm is provided for solving it.
The technical scheme adopted by the invention to solve the technical problem is as follows. A method for training a classifier from data with missing positive-sample labels mainly comprises the following steps:
Step 1. Data preprocessing: convert each feature of the data into numerical data, remove redundant features, and normalize the data.
Step 2. Let the preprocessed training samples be {(x_i, y_i)}_{i=1}^N, where x_i ∈ R^d, y_i ∈ {-1, +1}, and N is the number of training samples. All known negative-class sample points are labeled "-1" and all unlabeled samples are labeled "+1". Let Ω- = {i | y_i = -1} and Ω+ = {i | y_i = +1}. Solve an adaptive semi-supervised learning model of the form

min_{f ∈ H_k, θ}  (1/2)||f||_{H_k}^2 + c_1 Σ_{i ∈ Ω-} L(y_i, f(x_i)) + c_2 Σ_{i ∈ Ω+} θ_i L(y_i, f(x_i)) + μ g(θ)
s.t.  θ_i = 1, i ∈ Ω-,
      0 ≤ θ_i ≤ 1, i ∈ Ω+,    (1)

where f ∈ H_k is the classification function to be solved, H_k is the reproducing kernel Hilbert space to which it belongs, θ = [θ_1, ..., θ_N]^T ∈ R^N is the decision variable of the model, θ_i ∈ [0, 1] characterizes the confidence of the i-th sample label, L(·,·) is a loss function, g(θ) is a regularization function of θ, and c_1 > 0, c_2 > 0, μ > 0 are constants; c_1 weights the loss of the negative-class samples and c_2 weights the loss of the unlabeled (putative positive-class) samples.
Step 3. Predict the labels of the unlabeled samples according to the classifier f obtained by training.
Detailed description of the steps:
Step 1. Preprocess the data.
Data normalization: if it is known from experience that certain features play an important role, the corresponding features can be multiplied by suitable coefficients after the normalization operation is completed.
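The preprocessing step can be sketched as follows. Min-max normalization to [0, 1] is assumed (the patent does not fix a particular normalization), and the `important`/`coef` parameters are illustrative names for the "multiply important features by suitable coefficients" option:

```python
import numpy as np

def preprocess(X, important=None, coef=2.0):
    """Min-max normalize each numerical feature to [0, 1]; features known
    from experience to be important (illustrative option, parameter names
    not taken from the patent) are then multiplied by a chosen coefficient."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    rng = np.where(hi > lo, hi - lo, 1.0)   # guard against constant features
    Xn = (X - lo) / rng
    if important is not None:
        Xn[:, important] = Xn[:, important] * coef
    return Xn
```

Redundant-feature removal (constant or duplicated columns) would precede this call in a full pipeline.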
Step 2.1. For nonlinear classification data, select a suitable kernel function to measure the similarity of samples. If there is no prior knowledge of the data set, the Gaussian kernel function k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2)) can be used, where σ > 0 is a constant.
Step 2.2. An executable form of the adaptive semi-supervised learning model (1).
2.2.1 According to the representer theorem, the decision function in the adaptive semi-supervised learning model (1) has the form

f(x) = Σ_{i=1}^N β_i k(x_i, x).    (2)

Therefore,

f(x_j) = Σ_{i=1}^N β_i k(x_i, x_j) = K_j^T β,  j = 1, ..., N,    (3)

where the kernel matrix K = (K_ij), K_ij = k(x_i, x_j), and K_j denotes the j-th column of K.
2.2.2 Loss function.
The invention explains the specific solution algorithm of the model using the classic hinge loss function and square loss function as examples. The hinge loss has the form

L(y_i, f(x_i)) = max(0, 1 - y_i f(x_i)),  i = 1, ..., N.    (4)

The square loss has the form

L(y_i, f(x_i)) = {max(0, 1 - y_i f(x_i))}^2,  i = 1, ..., N.    (5)
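The two losses (4) and (5) can be written directly as vectorized functions:

```python
import numpy as np

def hinge_loss(y, fx):
    """Hinge loss (4): max(0, 1 - y * f(x)), elementwise."""
    return np.maximum(0.0, 1.0 - np.asarray(y) * np.asarray(fx))

def square_loss(y, fx):
    """Square loss (5): the squared hinge, {max(0, 1 - y * f(x))}^2."""
    return hinge_loss(y, fx) ** 2
```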
2.2.3 The regularization term g(θ) is selected according to the following principles:
(1) g(θ) is a convex function of θ, with θ_i ∈ [0, 1], i = 1, ..., N;
(2) denoting by θ*(μ, l) the minimizer of the θ-subproblem for a given loss value l, θ*(μ, l) is monotonically nonincreasing with respect to l.
g(θ) should satisfy both principles simultaneously. Regularization terms of various forms can be constructed according to these principles. The invention takes two practical forms of g(θ), given by formulas (6) and (7), as examples to explain the solution scheme of model (1).
2.2.4 Specific form of the adaptive semi-supervised learning model.
By the expansion (2) and the properties of the reproducing kernel Hilbert space,

||f||_H^2 = < Σ_{i=1}^N β_i k(x_i, ·), Σ_{j=1}^N β_j k(x_j, ·) > = Σ_{i=1}^N Σ_{j=1}^N k(x_i, x_j) β_i β_j = β^T K β.

Substituting formula (3), together with the loss (4) or (5), into the adaptive semi-supervised learning model (1) yields the concrete form of the model:

min_{β, θ}  (1/2) β^T K β + c_1 Σ_{i ∈ Ω-} {max(0, 1 - y_i K_i^T β)}^p + c_2 Σ_{i ∈ Ω+} θ_i {max(0, 1 - y_i K_i^T β)}^p + μ g(θ)
s.t.  θ_i = 1, i ∈ Ω-,
      0 ≤ θ_i ≤ 1, i ∈ Ω+,    (8)

where K = (K_ij), K_ij = k(x_i, x_j); p = 1 or p = 2 corresponds to the hinge loss or the square loss respectively; and g(θ) is determined by formula (6) or (7).
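The objective of the concrete model (8) can be evaluated directly. In this sketch, g(θ) defaults to an assumed quadratic form Σ(θ_i^2/2 - θ_i), since the patent's formula (6) is not reproduced in this text; any other admissible g can be passed in:

```python
import numpy as np

def objective_F(beta, theta, K, y, neg, pos, c1=1.0, c2=1.0, mu=1.0, p=2):
    """Objective F(beta, theta) of model (8):
      0.5 * beta^T K beta
      + c1 * sum over Omega- of hinge^p
      + c2 * sum over Omega+ of theta_i * hinge^p
      + mu * g(theta),
    with g an ASSUMED quadratic regularizer (the patent's (6) is not shown)."""
    beta, theta, K, y = map(np.asarray, (beta, theta, K, y))
    h = np.maximum(0.0, 1.0 - y * (K @ beta)) ** p      # per-sample loss
    g = np.sum(theta[pos] ** 2 / 2.0 - theta[pos])      # assumed g(theta)
    return (0.5 * beta @ K @ beta + c1 * h[neg].sum()
            + c2 * (theta[pos] * h[pos]).sum() + mu * g)
```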
2.2.5 Solution method for the adaptive semi-supervised learning model.
The adaptive semi-supervised learning model (8) is a nonlinear programming problem comprising two blocks of variables to be solved, β ∈ R^N and θ ∈ R^N. For a data set of scale N ≤ 10000, a nonlinear programming toolkit, such as fmincon in Matlab, can be called directly to solve it.
For large-scale datasets, the present invention discloses the following iterative algorithm.
Denote the objective function of model (8) by F(β, θ).
Algorithm 1. Alternate search algorithm.
Input: training samples {(x_i, y_i)}_{i=1}^N; constants c_1, c_2, μ.
Output: β* ∈ R^N, θ* ∈ R^N.
Step 1. Initialize: choose β^0 = [0, ..., 0]^T, θ^0 = [1, ..., 1]^T, and set k = 0.
Step 2. For fixed θ^k, with β^k as the initial point, approximately solve the convex optimization problem

min_β F(β, θ^k)    (10)

and denote the optimal solution by β^{k+1}.
Step 3. For fixed β^{k+1}, solve the convex optimization problem

min_θ F(β^{k+1}, θ)
s.t.  θ_i = 1, i ∈ Ω-,
      0 ≤ θ_i ≤ 1, i ∈ Ω+.    (11)

Denote the optimal solution by θ^{k+1}, set k = k + 1, and go to Step 2 until the termination criterion is met.
The flow chart of the algorithm is shown in FIG. 1. Specific implementation of the algorithm:
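The steps of Algorithm 1 can be sketched as follows, under stated assumptions: the square loss (p = 2); a plain gradient-descent approximation for the β-subproblem (10), standing in for the SVM solvers the patent suggests; and an assumed closed-form θ update θ_i = clip(1 - l_i/μ, 0, 1), since formulas (6)/(7) are not reproduced in this text:

```python
import numpy as np

def alternate_search(K, y, neg, pos, c1=1.0, c2=1.0, mu=5.0,
                     outer=20, inner=200, lr=1e-3):
    """Sketch of Algorithm 1 with the square loss.
    K: N x N kernel matrix; y in {-1, +1}; neg/pos: index lists Omega-/Omega+.
    The beta-step (10) is approximated by gradient descent (the patent
    suggests SVM solvers such as SMO); the theta-step (11) uses an ASSUMED
    closed form, since the patent's formulas (6)/(7) are not reproduced."""
    N = K.shape[0]
    beta = np.zeros(N)                       # Step 1: beta^0 = 0
    theta = np.ones(N)                       #         theta^0 = 1
    w = np.full(N, c2)                       # per-sample loss weights
    w[neg] = c1
    for _ in range(outer):
        for _ in range(inner):               # Step 2: approx. min_beta F(beta, theta^k)
            m = np.maximum(0.0, 1.0 - y * (K @ beta))        # hinge margins
            grad = K @ beta - 2.0 * K @ (w * theta * y * m)  # grad of F in beta
            beta -= lr * grad
        l = np.maximum(0.0, 1.0 - y * (K @ beta)) ** 2       # Step 3: losses l_i
        theta = np.clip(1.0 - l / mu, 0.0, 1.0)              # assumed theta update
        theta[neg] = 1.0                     # constraint theta_i = 1 on Omega-
    return beta, theta
```

On a toy two-cluster problem this recovers the correct signs of f on the training points; a production implementation would replace the inner loop with a weighted SVM solver.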
solving of sub-problem (11)
When in useIn a particular form, the subproblem (11) has an analytical solutioni=L(yi,f(xi) The optimal solution of the n. mnemonic problem (11) is θ ═ 1k+1Is provided with
If determined by the formula (6)Then
If determined by the formula (7)Then
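The two closed-form updates can be sketched under the classic self-paced-learning choices of g(θ). These choices are assumptions here, since formulas (6) and (7) appear only as images in the source: a quadratic form g(θ) = Σ(θ_i^2/2 - θ_i), which gives θ_i = clip(1 - l_i/μ, 0, 1), and an entropic form g(θ) = Σ(θ_i ln θ_i - θ_i), which gives θ_i = min(1, exp(-l_i/μ)). Both are convex and nonincreasing in l_i, consistent with principles (1) and (2) of Section 2.2.3:

```python
import numpy as np

def theta_closed_form(losses, mu, form="quadratic"):
    """Closed-form minimizer of the theta-subproblem (11) on Omega+ for two
    ASSUMED regularizers (the patent's (6)/(7) are not reproduced):
      quadratic: g = sum(theta^2/2 - theta)     -> theta_i = clip(1 - l_i/mu, 0, 1)
      entropic:  g = sum(theta*ln(theta) - theta) -> theta_i = min(1, exp(-l_i/mu))
    In both cases theta_i is nonincreasing in the loss l_i."""
    l = np.asarray(losses, dtype=float)
    if form == "quadratic":
        return np.clip(1.0 - l / mu, 0.0, 1.0)
    return np.minimum(1.0, np.exp(-l / mu))
```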
Approximate solution of the sub-problem (10).
The sub-problem (10) arises once in each iteration of Algorithm 1 and only needs to be solved approximately. With the weights θ^k fixed by the preceding solution of (11), the sub-problem (10) is the primal form of a weighted support vector machine, and it can be solved approximately using a standard support vector machine algorithm such as the SMO algorithm.
Termination criterion of Algorithm 1.
The algorithm terminates when the number of iterations exceeds a bound determined by a constant r_1 > 0.
Step 3. Predict the labels of the unlabeled samples.
Let β* ∈ R^N be the optimal solution output by Algorithm 1. The classification function obtained by model training is

f̂(x) = Σ_{i=1}^N β_i* k(x_i, x).    (13)

For a sample x_k, its label is predicted as ŷ_k = sgn(f̂(x_k)).
The invention has the following beneficial effects:
(1) The reliability of each sample label is a decision variable solved by the model itself, which avoids the large prediction errors introduced by artificially applying other data processing techniques to the unlabeled samples.
(2) The method is highly versatile across different data sets, in particular across data sets whose internal probability distributions differ greatly.
(3) The classifier obtained by training generalizes well; the classification error rate can be tested on an independent test set, effectively avoiding overfitting of the classifier.
Drawings
FIG. 1 is a flow chart of Algorithm 1, the main algorithm disclosed in the invention. The algorithm alternately computes β^k and θ^k by solving the two sub-problems until the iteration terminates, outputs β*, and computes the classification function f̂ according to formula (13).
Detailed Description
The invention is further illustrated below with reference to the accompanying figures and examples. Three polypeptide identification data sets, yeast, ups1, and tal08, were selected to test the effectiveness of the disclosed method. Table 1 lists, for each data set, the total number of samples, the number of known negative-class samples, and the number of unlabeled samples. The disclosed data processing method computes the classification function on the training set, and its performance is tested on an independent test set.
Unified parameter settings were used for all three tested data sets: μ = 5.0 and c_1 = c_2 = 1.0 in the adaptive semi-supervised learning model (8), the square loss function, g(θ) determined by formula (6), and termination constant r_1 = 0.5 for Algorithm 1.
Table 2 lists the numbers of true positives (TP) and false positives (FP) identified by the method of the invention on the training set and the test set, at an error rate FP/(TP + FP) of 0.025. As can be seen from Table 2, the TP and FP counts on the training and test sets are nearly consistent with the ratio "test-set samples / total samples = 50%", indicating that the classification function computed by the method has good predictive performance on the test set.
TABLE 1. Data sets
Data set   Total   Negative samples   Unlabeled samples
yeast      14892   8189               6703
ups1       17335   8361               8974
tal08      69560   27338              42222
Table 2 performance of the method of the invention on training and test sets (error rate 0.025)

Claims (6)

1. A classifier training method for sample-label-missing data, comprising the following steps:
Step 1. Preprocess the data.
Step 2. Solve an adaptive semi-supervised learning model of the following form:

min_{f ∈ H_k, θ}  (1/2)||f||_{H_k}^2 + c_1 Σ_{i ∈ Ω-} L(y_i, f(x_i)) + c_2 Σ_{i ∈ Ω+} θ_i L(y_i, f(x_i)) + μ g(θ)
s.t.  θ_i = 1, i ∈ Ω-,
      0 ≤ θ_i ≤ 1, i ∈ Ω+,

where {(x_i, y_i)}_{i=1}^N are the training samples, x_i ∈ R^d, y_i ∈ {-1, +1}, the negative-class sample points are labeled "-1" and the unlabeled samples are labeled "+1", Ω- = {i | y_i = -1}, Ω+ = {i | y_i = +1}, f ∈ H_k is the classification function to be solved, H_k is the reproducing kernel Hilbert space to which it belongs, θ = [θ_1, ..., θ_N]^T ∈ R^N is the decision variable of the model, L(·,·) is a loss function, g(θ) is a regularization function of θ, and c_1 > 0, c_2 > 0, μ > 0 are constants.
Step 3. Predict the labels of the unlabeled samples according to the classifier f obtained by training.
2. The method as claimed in claim 1, wherein the regularization term g(θ) in Step 2 satisfies the following two rules simultaneously:
(1) g(θ) is a convex function of θ, with θ_i ∈ [0, 1], i = 1, ..., N;
(2) denoting by θ*(μ, l) the minimizer of the θ-subproblem for a given loss value l, θ*(μ, l) is monotonically nonincreasing with respect to l.
3. The method as claimed in claim 1, wherein, once a loss function and a regularization term g(θ) are selected, the adaptive semi-supervised learning model can be modeled as a nonlinear programming problem, which on small and medium-scale data sets is solved by directly calling a nonlinear programming toolkit.
4. The method as claimed in claim 3, wherein the nonlinear programming problem comprises two blocks of variables to be solved, β ∈ R^N and θ ∈ R^N, and the large-scale problem is solved by the following iterative algorithm:
Step 1. Initialize: choose β^0 and θ^0, and set k = 0.
Step 2. For fixed θ^k, with β^k as the initial point, approximately solve the β-subproblem; denote the optimal solution by β^{k+1}.
Step 3. For fixed β^{k+1}, solve the θ-subproblem; denote the optimal solution by θ^{k+1}, set k = k + 1, and go to Step 2 until the termination criterion is met.
5. The method as claimed in claim 4, wherein, when g(θ) takes one of the particular forms (6) or (7), θ^{k+1} in Step 3 of the algorithm can be calculated in closed form: for i ∈ Ω-, θ_i^{k+1} = 1; for i ∈ Ω+, θ_i^{k+1} is given by the minimizer associated with formula (6) or with formula (7), where l_i = L(y_i, f(x_i)), i = 1, ..., N.
6. The method as claimed in claim 1, wherein in Step 3 the label of an unlabeled sample is predicted as follows: for a sample x_k, its label is predicted as ŷ_k = sgn(f̂(x_k)), where f̂ is the classification function obtained by model training.
CN201610818737.3A 2016-09-12 2016-09-12 A kind of classifier training method of sample label missing data Pending CN106156805A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610818737.3A CN106156805A (en) 2016-09-12 2016-09-12 A kind of classifier training method of sample label missing data


Publications (1)

Publication Number Publication Date
CN106156805A true CN106156805A (en) 2016-11-23

Family

ID=57340820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610818737.3A Pending CN106156805A (en) 2016-09-12 2016-09-12 A kind of classifier training method of sample label missing data

Country Status (1)

Country Link
CN (1) CN106156805A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107657282A (en) * 2017-09-29 2018-02-02 中国石油大学(华东) Peptide identification from step-length learning method
CN108388774A (en) * 2018-01-17 2018-08-10 中国石油大学(华东) A kind of on-line analysis of polypeptide spectrum matched data
CN108388774B (en) * 2018-01-17 2021-07-23 中国石油大学(华东) Online analysis method of polypeptide spectrum matching data
CN108091397A (en) * 2018-01-24 2018-05-29 浙江大学 A kind of bleeding episode Forecasting Methodology for the Ischemic Heart Disease analyzed based on promotion-resampling and feature association
CN109492695A (en) * 2018-11-08 2019-03-19 北京字节跳动网络技术有限公司 Sample processing method, device, electronic equipment and the readable medium of data modeling
CN109492695B (en) * 2018-11-08 2021-07-23 北京字节跳动网络技术有限公司 Sample processing method and device for data modeling, electronic equipment and readable medium
CN109543693B (en) * 2018-11-28 2021-05-07 中国人民解放军国防科技大学 Weak labeling data noise reduction method based on regularization label propagation
CN109543693A (en) * 2018-11-28 2019-03-29 中国人民解放军国防科技大学 Weak labeling data noise reduction method based on regularization label propagation
CN109960745B (en) * 2019-03-20 2021-03-23 网易(杭州)网络有限公司 Video classification processing method and device, storage medium and electronic equipment
CN109960745A (en) * 2019-03-20 2019-07-02 网易(杭州)网络有限公司 Visual classification processing method and processing device, storage medium and electronic equipment
CN111563721A (en) * 2020-04-21 2020-08-21 上海爱数信息技术股份有限公司 Mail classification method suitable for different label distribution occasions
CN111563721B (en) * 2020-04-21 2023-07-11 上海爱数信息技术股份有限公司 Mail classification method suitable for different label distribution occasions
CN112598265A (en) * 2020-12-18 2021-04-02 武汉大学 Decoupling risk estimation-based rapid detection method for hyperspectral pine nematode disease of unmanned aerial vehicle
CN112598265B (en) * 2020-12-18 2022-06-07 武汉大学 Decoupling risk estimation-based rapid detection method for hyperspectral pine nematode disease of unmanned aerial vehicle
CN112836750A (en) * 2021-02-03 2021-05-25 中国工商银行股份有限公司 System resource allocation method, device and equipment

Similar Documents

Publication Publication Date Title
CN106156805A (en) A kind of classifier training method of sample label missing data
CN106778832B (en) The semi-supervised Ensemble classifier method of high dimensional data based on multiple-objection optimization
CN103617435B (en) Image sorting method and system for active learning
CN106682696A (en) Multi-example detection network based on refining of online example classifier and training method thereof
CN113887643B (en) New dialogue intention recognition method based on pseudo tag self-training and source domain retraining
CN106203534A (en) A kind of cost-sensitive Software Defects Predict Methods based on Boosting
CN113010683B (en) Entity relationship identification method and system based on improved graph attention network
CN103617429A (en) Sorting method and system for active learning
CN109492750B (en) Zero sample image classification method based on convolutional neural network and factor space
CN104966105A (en) Robust machine error retrieving method and system
CN112115993B (en) Zero sample and small sample evidence photo anomaly detection method based on meta-learning
CN109583635A (en) A kind of short-term load forecasting modeling method towards operational reliability
CN104657574A (en) Building method and device for medical diagnosis models
CN109582974A A kind of student enrollment's credit estimation method and device based on deep learning
CN110659682A (en) Data classification method based on MCWD-KSMOTE-AdaBoost-DenseNet algorithm
CN113095229B (en) Self-adaptive pedestrian re-identification system and method for unsupervised domain
CN116152554A (en) Knowledge-guided small sample image recognition system
CN108596204B (en) Improved SCDAE-based semi-supervised modulation mode classification model method
CN106569954A (en) Method based on KL divergence for predicting multi-source software defects
CN110796260B (en) Neural network model optimization method based on class expansion learning
CN116361454A (en) Automatic course teaching case assessment method based on Bloom classification method
CN115165366A (en) Variable working condition fault diagnosis method and system for rotary machine
CN111191033A (en) Open set classification method based on classification utility
CN103559510B (en) Method for recognizing social group behaviors through related topic model
CN105787045A (en) Precision enhancing method for visual media semantic indexing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20161123