CN108038511A

CN108038511A - Modified clustering hypothesis joint pairwise constraint semi-supervised classification method

Info

Publication number: CN108038511A
Application number: CN201711421475.8A
Authority: CN
Inventors: 钱鹏江; 邵袁; 黄华; 刘杰; 蒋亦樟; 陈爱国; 田爱平; 刘子扬
Original assignee: Jiangsu Jiangda Smart Technology Co Ltd
Current assignee: Jiangsu Jiangda Smart Technology Co Ltd
Priority date: 2017-12-25
Filing date: 2017-12-25
Publication date: 2018-05-15

Abstract

The invention discloses a modified clustering hypothesis joint pairwise constraint semi-supervised classification method and relates to a semi-supervised learning algorithm. The steps are: initialize the class membership of the unlabeled samples by the FCM method, select appropriate parameters λ₁ and λ₂, and calculate the initial α, the membership function v(x), and the new objective function M according to the formulas; judge whether the iteration termination condition is reached; if so, return the membership function v(x) and obtain the classification decision function f(x) from α; if not, recalculate α, the membership function v(x), and the objective function M, and judge again. The invention combines the modified clustering hypothesis's exploration of the unlabeled samples with the pairwise constraints' use of the supervision information, so that together they form a more complete empirical risk term. The knowledge contained in the supervision information is thereby mined further and algorithm performance is improved, giving higher effectiveness and correctness.

Description

Modified clustering hypothesis joint pairwise constraint semi-supervised classification method

Technical Field

The invention relates to a semi-supervised learning algorithm, in particular to a modified clustering hypothesis joint pairwise constraint semi-supervised classification method.

Background

Semi-supervised learning is a learning mode between supervised learning and unsupervised learning. Its basic setting is as follows: in addition to a large number of unlabeled samples, supervision information such as class labels is provided for the labeled samples. Semi-supervised learning differs from supervised learning in that it can augment the training data set with a large number of unlabeled samples. It is mainly approached from the perspective of supervised learning: when the labeled samples carrying supervision information are not sufficient to train a good model, the question is how to automatically use the information in a large number of unlabeled samples to help improve classifier performance.

Semi-supervised classification generally improves classifier performance from two directions. On one hand, for labeled samples, efficient learning techniques are used to mine the knowledge, such as supervision information, contained in the small number of labeled samples; this is done mainly by drawing on supervised learning methods. On the other hand, unsupervised learning methods are used to acquire the data distribution information contained in the large number of unlabeled samples.

From the perspective of exploiting supervision information, class labels are widely used as the most common and most direct form of prior knowledge. Pairwise constraints, also known as must-link and cannot-link constraints, are another type of supervision information and are more flexible and more practical than other forms. In some practical cases only pairwise constraints are given and the class labels of the samples are not; here, the pairwise constraints are converted from the data labels. The data distribution information contained in the unlabeled samples, in turn, is mined mainly by relying on the three basic assumptions of semi-supervised learning: the manifold assumption, the clustering assumption, and the smoothness assumption. The main idea of the clustering assumption is that "samples that are relatively close to each other have the same class". Under this assumption, the classification boundary should pass through sparse (low-density) regions as far as possible, so that densely packed sample points are not split onto both sides of the decision boundary. On this premise, the learning algorithm can use a large amount of unlabeled data to analyze how the samples are distributed in the sample space and adjust the classification boundary so that it passes through sparse regions as far as possible, which ultimately improves learning performance.
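The conversion from class labels to pairwise constraints mentioned above can be illustrated with a minimal sketch. The function name `labels_to_pairwise_constraints` and the representation of constraints as index pairs are assumptions made for illustration only; the source does not specify how the constraints are stored.

```python
from itertools import combinations


def labels_to_pairwise_constraints(labels):
    """Derive must-link and cannot-link index pairs from class labels.

    Hypothetical helper: two labeled samples with the same label form a
    must-link pair, and two with different labels form a cannot-link pair.
    """
    must_link, cannot_link = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            must_link.append((i, j))
        else:
            cannot_link.append((i, j))
    return must_link, cannot_link


# Three labeled samples: samples 0 and 2 share a class, sample 1 does not.
must, cannot = labels_to_pairwise_constraints(["a", "b", "a"])
```

Note that l labeled samples yield l(l-1)/2 pairs, so the constraint set grows quadratically with the number of labeled samples.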

The core question of semi-supervised learning is how to use the knowledge contained in a small number of labeled samples and a large number of unlabeled samples to improve the learning ability of the algorithm. Current mainstream semi-supervised learning algorithms mainly mine the data distribution information from the unlabeled samples to improve classifier performance, but neglect in-depth use of supervision information such as the labeled samples. They therefore lose, to some extent, important information contained in the labeled samples, do not make full use of the available knowledge, and suffer in effectiveness, correctness, and performance. For example, one improved approach modifies the clustering assumption by introducing the concept of membership: the usual clustering assumption that "samples in the same cluster are more likely to have the same class label" is changed to "samples in the same cluster have similar memberships", and on this basis a new semi-supervised classification method based on class membership (SSCCM) is proposed. However, the SSCCM algorithm relies mainly on the modified clustering assumption and does not further exploit the supervision information. For this reason, a new semi-supervised classification method that combines the modified clustering hypothesis with pairwise constraints is designed.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention aims to provide a modified clustering hypothesis joint pairwise constraint semi-supervised classification method that has higher effectiveness and correctness, further mines the knowledge contained in the supervision information, improves algorithm performance, is practical and reliable, and is easy to popularize and use.

To achieve this purpose, the invention is realized by the following technical scheme: a modified clustering hypothesis joint pairwise constraint semi-supervised classification method comprising the following steps:

Input: l labeled samples, u unlabeled samples, an iteration termination threshold ε, and a maximum number of iterations Maxiter;

Output: a classification decision function f(x) and a membership function v(x);

(1) Initializing the class membership of the unlabeled samples by the FCM method, selecting appropriate parameters λ₁ and λ₂, and calculating the initial α and M according to the formula;

(2) According to the formula

updating α;

(3) According to the formula

updating the membership function v(x);

(4) According to the formula

updating the objective function M;

(5) Judging whether the iteration termination condition is reached; if so, executing step (6), and if not, returning to step (2);

(6) Returning the membership function v(x), and obtaining the classification decision function f(x) from α.
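The control flow of steps (2) to (6) can be sketched as a generic alternating loop. This is only an illustration under stated assumptions: the update formulas are not reproduced here, so `update_alpha`, `update_v`, and `objective` are hypothetical callables supplied by the caller, and the stopping test (change in M below ε, or Maxiter reached) is an assumed reading of the "iteration termination condition".

```python
def alternate_until_converged(alpha, v, update_alpha, update_v, objective,
                              eps=1e-4, max_iter=100):
    """Alternate the alpha and v updates until M stops changing.

    The three callables are placeholders for the method's own formulas;
    the |M_prev - M| < eps test is an assumed termination criterion.
    """
    m_prev = objective(alpha, v)
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        alpha = update_alpha(v)        # step (2): update alpha with v fixed
        v = update_v(alpha)            # step (3): update v with alpha fixed
        m = objective(alpha, v)        # step (4): update the objective M
        if abs(m_prev - m) < eps:      # step (5): termination check
            break
        m_prev = m
    return alpha, v, n_iter            # step (6): return the final values


# Toy scalar problem (not the patent's objective): M = (a - v)^2 + (v - 1)^2.
alpha_star, v_star, n_iter = alternate_until_converged(
    alpha=0.0,
    v=0.0,
    update_alpha=lambda v: v,              # minimizes M over a for fixed v
    update_v=lambda a: (a + 1.0) / 2.0,    # minimizes M over v for fixed a
    objective=lambda a, v: (a - v) ** 2 + (v - 1.0) ** 2,
)
```

In the toy problem each update is the exact minimizer of M in one variable, so M decreases monotonically, which is the property that makes a threshold on the change in M a reasonable stopping rule.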

The beneficial effects of the invention are as follows: (1) It inherits the advantages of the modified clustering hypothesis model. After the concept of sample membership is introduced, the hard classification that general classification methods apply to data lying across the classification boundary is converted into a fuzzy problem, so that boundary-crossing data can be fuzzily classified more effectively.

(2) The pairwise constraints and the decision function's predictions on the labeled samples jointly form a more complete empirical risk term, so the supervision information is fully used. By converting the sample class labels into pairwise constraints and combining this expanded knowledge with the loss function of the modified clustering hypothesis framework, the knowledge contained in the supervision information is mined further, algorithm performance is improved, and higher effectiveness and correctness are achieved.

Detailed Description

To make the technical means, inventive features, objectives, and effects of the invention easy to understand, the invention is further described below with reference to a specific embodiment.

The technical scheme adopted by this embodiment is as follows: a modified clustering hypothesis joint pairwise constraint semi-supervised classification method comprising the following steps:

Input: l labeled samples, u unlabeled samples, an iteration termination threshold ε, and a maximum number of iterations Maxiter;

(1) initializing the class membership of the unlabeled samples by the FCM method, selecting appropriate parameters λ₁ and λ₂, and calculating the initial α and M according to the formula;

(2) according to the formula

updating α;

(3) according to the formula

updating the membership function v(x);

(4) according to the formula

updating the objective function M;

In this embodiment, the modified clustering hypothesis's exploration of the unlabeled samples is combined with the pairwise constraints' use of the supervision information, giving a modified clustering hypothesis joint pairwise constraint semi-supervised classification method that can improve classifier performance. Regarding the use of supervision information, like many mainstream semi-supervised methods, the SSCCM algorithm's optimization problem uses an empirical risk control term obtained from the data labels; combining this term with the pairwise constraints forms a new empirical risk term:

This allows the supervision information to be further utilized. Meanwhile, regarding the use of unsupervised information, the SSCCM algorithm introduces the concept of fuzzy membership and uses it as the weight coefficient of the label prediction difference in its optimization problem. The third term of the SSCCM objective can therefore be regarded as an unsupervised component whose form is similar to the FCM (fuzzy C-means) objective function. The difference measure in FCM is the weighted sum of the distances from the samples to the class centers, whereas SSCCM instead computes the weighted sum of the sample label prediction differences, which gives it the ability to fuzzily partition boundary-crossing data.
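For comparison, the standard FCM membership update, which weights each sample by its distance to the class centers, can be sketched as below. This is the textbook fuzzy C-means rule, not the method's own membership formula; the function name `fcm_memberships`, the fuzzifier default `m=2.0`, and the handling of a sample that coincides with a center are assumptions for illustration.

```python
def fcm_memberships(points, centers, m=2.0):
    """Standard fuzzy C-means memberships for fixed cluster centers.

    u_ik is proportional to (1 / d_ik^2)^(1 / (m - 1)), normalized so that
    the memberships of each point sum to 1 over the clusters.
    """
    power = 1.0 / (m - 1.0)
    memberships = []
    for p in points:
        # Squared Euclidean distance from the point to every center.
        d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        if any(d == 0.0 for d in d2):
            # The point sits exactly on a center: give it full membership
            # there (shared equally if several centers coincide).
            hits = [1.0 if d == 0.0 else 0.0 for d in d2]
            total = sum(hits)
            memberships.append([h / total for h in hits])
            continue
        inv = [(1.0 / d) ** power for d in d2]
        total = sum(inv)
        memberships.append([w / total for w in inv])
    return memberships


# One-dimensional toy data with two assumed centers at 0 and 4.
u = fcm_memberships([(0.0,), (1.0,), (2.0,)], [(0.0,), (4.0,)])
```

In a full FCM run this update alternates with recomputing the centers as membership-weighted means; memberships of this kind are what step (1) uses to initialize the unlabeled samples.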

Then, the specific optimization problem of the modified clustering hypothesis joint pairwise constraint semi-supervised classification method is as follows:

In equation (2), the first term mainly controls classifier complexity, the second term controls the empirical risk by learning from the labeled samples and the pairwise constraints, and the third term explores the unlabeled samples in an unsupervised way by introducing the concept of label membership.

Since the variables in equation (2) are mainly in vector form, the problem can be equivalently rewritten in matrix form:

where $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_n] \in \mathbb{R}^{C \times n}$ is the matrix of Lagrange multipliers; the kernel matrix has blocks $K_{ll} = \langle \phi(X_l), \phi(X_l) \rangle_H$, $K_{lu} = \langle \phi(X_l), \phi(X_u) \rangle_H$, $K_{ul} = \langle \phi(X_u), \phi(X_l) \rangle_H$, and $K_{uu} = \langle \phi(X_u), \phi(X_u) \rangle_H$; $L_k$ is a $C \times n_u$ matrix whose $k$-th row elements are all 1-n and whose remaining elements are 0; $V_k$ is the label membership matrix of the unlabeled samples for the $k$-th class; and the diagonal matrix of the membership matrix $V_k$ is also used.

Derivation and proof of the modified clustering hypothesis joint pairwise constraint semi-supervised classification algorithm:

Using the Lagrange multiplier method and setting the partial derivative of $M_1$ with respect to $\alpha$ to 0 gives the following form:

Solving formula (4) yields $\alpha$ in the following form:

This gives the corresponding expression. With it held fixed, the optimization problem of the modified clustering hypothesis joint pairwise constraint semi-supervised classification becomes:

$0 \le v_k(x_j) \le 1, \quad k = 1, \ldots, C, \quad j = n_l + 1, n_l + 2, \ldots, n. \qquad (6)$

Similarly, using the Lagrange multiplier method, the following equation is obtained:

Setting the partial derivative of $M_2$ in formula (7) with respect to $v_k(x_j)$ to 0, i.e.:

gives:

Applying the constraint condition then gives:

Thus, for any sample $x$, its membership function has the form:

From the solution of the optimization problem, the data prediction of the modified clustering hypothesis joint pairwise constraint semi-supervised classification algorithm can be obtained either from the decision function or from the membership function. Specifically, to obtain $x \in X_k$ from $f(x)$, the corresponding condition on $f(x)$ must be satisfied; to obtain $x \in X_k$ from $v(x)$, the corresponding condition on $v(x)$ must be satisfied.

The consistency of the data predictions of the decision function $f(x)$ and the membership function $v(x)$ is shown as follows. For an arbitrary sample $x_i$, its class prediction is obtained by solving the optimization over the classification decision function, and the result indicates that $x_i$ belongs to the $k$-th class. When the data label is instead predicted by the membership function, whose solution has the form of formula (11), $x_i$ belonging to class $k$ is equivalent to $\|f(x_i) - r_k\|^2 < \|f(x_i) - r_j\|^2$ for every $j \ne k$, i.e. $f(x_i)^T r_k > f(x_i)^T r_j$, which is equivalent to the preceding condition. Therefore, for the label prediction of any sample, the results obtained from the classification decision function and from the membership function are consistent.
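The step from the distance inequality to the inner-product inequality follows from expanding $\|f - r_k\|^2 = \|f\|^2 - 2 f^T r_k + \|r_k\|^2$: the $\|f\|^2$ term is common to all classes, so the two orderings agree whenever all class codes $r_k$ have the same norm. The sketch below checks this numerically. It assumes one-hot class codes, which is an illustrative choice and not taken from the source, and the two helper functions are hypothetical.

```python
def predict_by_decision(f, codes):
    """Index k that maximizes the inner product f^T r_k."""
    scores = [sum(a * b for a, b in zip(f, r)) for r in codes]
    return max(range(len(codes)), key=scores.__getitem__)


def predict_by_distance(f, codes):
    """Index k that minimizes the squared distance ||f - r_k||^2."""
    d2 = [sum((a - b) ** 2 for a, b in zip(f, r)) for r in codes]
    return min(range(len(codes)), key=d2.__getitem__)


# Assumed one-hot class codes r_k (equal norm), three classes.
codes = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
# Arbitrary decision-function outputs f(x_i) chosen for illustration.
outputs = [(0.2, 0.7, 0.1), (0.9, -0.3, 0.4), (-1.0, 0.1, 0.5)]

agree = all(
    predict_by_decision(f, codes) == predict_by_distance(f, codes)
    for f in outputs
)
```

Any membership function that decreases strictly with $\|f(x) - r_k\|^2$ then ranks the classes in the same order as the distance rule, which is why the two prediction routes coincide.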

This embodiment builds on the use of supervision information and on the modified clustering hypothesis model. By combining the two, it proposes a modified clustering hypothesis joint pairwise constraint semi-supervised classification algorithm. On one hand, the algorithm inherits from the modified clustering hypothesis the membership function's ability to fuzzily partition boundary-crossing data. On the other hand, by converting the sample class labels into pairwise constraints and combining this expanded knowledge with the loss function of the modified clustering hypothesis framework, the pairwise constraint information and the decision function's predictions on the labeled samples jointly form a more complete empirical risk term. The knowledge contained in the supervision information is thereby mined further and algorithm performance is improved; the algorithm has higher effectiveness and correctness and broad market application prospects.

The foregoing shows and describes the general principles, main features, and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and the description merely illustrate the principle of the invention, and various changes and modifications may be made without departing from its spirit and scope, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. A modified clustering hypothesis joint pairwise constraint semi-supervised classification method is characterized by comprising the following steps:

(2) According to the formula

updating α;

(3) According to the formula

updating the membership function v(x);

(4) According to the formula

updating the objective function M;