
CN105894032A - Method of extracting effective features based on sample properties - Google Patents


Info

Publication number
CN105894032A
CN105894032A (application CN201610202600.5A)
Authority
CN
China
Prior art keywords
sample
feature
features
extracted
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610202600.5A
Other languages
Chinese (zh)
Inventor
詹德川 (Zhan De-Chuan)
姜远 (Jiang Yuan)
周志华 (Zhou Zhi-Hua)
李静 (Li Jing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201610202600.5A
Publication of CN105894032A
Legal status: Pending (Current)

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method of extracting effective features based on sample properties, comprising a training sample feature serialization step, a step of training a sample feature selector and corresponding models, and a per-sample model classification step. At the start of classification, an initial feature set is extracted; for each sample to be classified, the feature set to extract next is determined from the features already extracted, and whether to stop extraction is then judged. If more features are needed, the previous step is repeated; if extraction stops, the sample is input into an appropriate classifier to obtain a prediction. Compared with the prior art, both the time cost of feature extraction and the confidence of classification are fully considered.

Description

Method of extracting effective features based on sample properties
Technical Field
The invention relates to a technique for extracting effective features from samples in pattern recognition, particularly suited to problems where feature-extraction cost and classification reliability must be considered at the same time.
Background
With the rapid development of the internet and of portable internet devices, the network has become an important part of people's lives and an important carrier for the spread and development of human civilization; more and more data propagates through it. To meet people's varied demands on the form of information, text, sound, images and other modalities are usually integrated, which makes the data formats on the network complex. Nowadays, ever more complex media data is produced and spread in large quantities over networks, and we face the problem of how to retrieve and classify such large-scale, complex data efficiently. It is therefore desirable to find an efficient and useful way of extracting features to process this vast amount of information.
Currently there are many online machine learning methods, such as online clustering and online classification; they all accelerate learning through sampling or optimization strategies. However, these methods do not take the feature-extraction overhead into account, that is, the cost of turning raw data into valid features. In fact, extracting valid features from raw data is a non-trivial cost in the operation of a whole classification system, and as data forms become more complex, feature extraction takes an ever larger share of the total system cost. How to extract useful features efficiently is therefore a problem that needs to be solved.
Medical diagnostic systems involve a series of tests, such as body-temperature measurement, blood tests, and blood-pressure measurement. A diagnosis is not made by first running every test, which would be too costly; instead, a preliminary examination is carried out, and based on its result it is decided whether a further examination is needed and, if so, which one; if not, a diagnostic conclusion is drawn. Inspired by this idea, we hope to extract, for each sample, the set of features that is most efficient for classifying that sample, rather than extracting all features, thereby reducing the feature-extraction overhead.
Summary of the Invention
Purpose of the invention: many existing machine learning algorithms improve learning efficiency through sampling or optimization, but few consider the overhead of extracting sample features, and that overhead grows as data forms become more complex. Addressing this problem, the invention provides a method of extracting effective features based on sample properties: for samples that are easy to classify, only simple features, i.e. features with small overhead, are extracted; for samples that are hard to classify, some complex features are extracted in addition to the simple ones to aid classification.
Technical scheme: in a method of extracting effective features based on sample properties, an initial feature set is extracted at the start; for each sample to be classified, the feature set to extract next is determined from the features already extracted, and then whether to stop extraction is judged. If more features are needed, the previous step is repeated; if extraction stops, the sample is input into an appropriate classifier to obtain a prediction. The method comprises a training sample feature serialization step, a step of training a sample feature selector and corresponding models, and a per-sample model classification step;
the training sample feature serialization comprises the following specific steps:
step 100, labeling training sample data and acquiring all features together with the time overhead of each feature;
step 101, computing Euclidean distances between pairs of training samples from the acquired features;
step 102, finding each training sample's neighbor set from the pairwise distances and the preset number of neighbors;
step 103, computing, within each training sample's neighbor set, the weight of every feature, i.e. how useful each group of features is for classifying the sample;
step 104, sorting the features by weight: the larger the weight, the greater the feature's contribution to classification and the earlier it should be extracted;
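The serialization steps above can be sketched as follows. The weight computation here is a deliberately simplified neighbor-based score standing in for optimization (1), and `serialize_features` is a hypothetical helper name, not something defined in the patent:

```python
import numpy as np

def serialize_features(X, y, k=5):
    """Order feature groups by estimated usefulness (steps 100-104).

    X : (n_samples, n_features) labeled training data
    y : (n_samples,) labels
    k : number of neighbors (step 102)
    Returns feature indices sorted by descending weight (step 104).
    """
    n = X.shape[0]
    # Step 101: Euclidean distances between all training-sample pairs.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Step 102: each sample's k nearest neighbors (position 0 is the
    # sample itself at distance 0, so it is skipped).
    neighbors = np.argsort(dist, axis=1)[:, 1:k + 1]
    # Step 103 (simplified stand-in for optimization (1)): reward features
    # on which same-label neighbors are close and other-label ones are far.
    weights = np.zeros(X.shape[1])
    for i in range(n):
        for j in neighbors[i]:
            sign = 1.0 if y[i] == y[j] else -1.0
            weights -= sign * np.abs(X[i] - X[j])
    # Step 104: larger weight means the feature is extracted earlier.
    return np.argsort(-weights)
```

On a toy two-class set where feature 0 separates the classes and feature 1 is noise, feature 0 is ranked first.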
the specific steps of training the sample feature selector and the corresponding models are as follows:
step 200, after the training data are serialized, splitting the data into pairs of (existing feature set, feature set to extract next) to obtain a group of feature-set pairs;
step 201, from the split feature-set pairs, training a feature selector G based on the currently extracted features, and training classifiers for the different feature combinations;
the specific steps of the per-sample model classification are as follows:
step 300, extracting an initial feature set from the test sample;
step 301, judging from the evaluation index whether a further feature set needs to be extracted; if so, jumping to step 302, otherwise jumping to step 303;
step 302, determining the feature set to extract next from the existing feature set and the feature selector G, merging the newly extracted feature set into the existing feature set, and jumping back to step 301;
step 303, looking up the trained classification model corresponding to the current feature set and classifying with it.
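Steps 300 to 303 amount to an anytime extraction loop at test time. A minimal sketch, where `extract`, `selector_G`, `classifiers`, and `should_stop` are hypothetical callables standing in for the patent's components:

```python
def classify_adaptively(sample, initial_set, extract, selector_G,
                        classifiers, should_stop):
    """Steps 300-303: extract feature sets one at a time until the
    stopping criterion (time budget or confidence) is met, then classify.

    extract(sample, fs)    -> feature values for feature set fs
    selector_G(extracted)  -> next feature set to extract (step 302)
    classifiers[frozenset] -> model trained for that feature combination
    should_stop(extracted) -> True once the evaluation index is satisfied
    """
    # Step 300: extract the initial feature set.
    extracted = {initial_set: extract(sample, initial_set)}
    # Step 301: decide whether more features are needed.
    while not should_stop(extracted):
        # Step 302: pick the next feature set given the current ones.
        next_fs = selector_G(extracted)
        extracted[next_fs] = extract(sample, next_fs)
    # Step 303: classify with the model matching the extracted combination.
    model = classifiers[frozenset(extracted)]
    return model(extracted)
```

With stub components (stop after two sets, a sign classifier on the feature sum), the loop extracts 'f1' then 'f2' and classifies.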
The neighbor set of a training sample in step 102 is found as follows: sort the computed Euclidean distances in ascending order and select the first k, where k is the preset number of neighbors.
The weight of a training sample's features in step 103 is computed by minimizing, over the sample's neighbors, a logistic loss on the weighted distances plus an L1 penalty:

arg min_{u_i} Σ_{j ∈ δ_i} log(1 + exp(r_{ij} (D_i(x_j) − c_i))) + λ‖u_i‖_1   s.t.  u_i ≥ 0   (1)

where x_i and x_j denote the i-th and j-th samples, D_i(x_j) is the weighted distance between x_i and x_j under the weight vector u_i, u_i holds the feature weights of the i-th sample, and δ_i is the set formed by the k neighbors of the i-th sample; y_i and y_j denote the labels of the i-th and j-th samples, with r_{ij} = 1 if y_i = y_j and r_{ij} = −1 otherwise; c_i and λ are preset parameters: c_i is the upper bound on the within-class sample distance and λ is a regularization parameter.
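As a sketch, objective (1) can be evaluated numerically for a given weight vector. The weighted distance D_i(x_j) is assumed here to be a u_i-weighted L1 distance, a choice the patent does not pin down, and the function name is illustrative:

```python
import numpy as np

def weight_objective(u, xi, neighbors, r, c_i, lam):
    """Objective (1) for one training sample x_i.

    u         : nonnegative feature weights u_i
    xi        : feature vector of sample i
    neighbors : list of neighbor feature vectors x_j, j in delta_i
    r         : r_ij in {+1, -1} per neighbor (same label -> +1)
    c_i       : upper bound on within-class distance
    lam       : L1 regularization strength lambda
    """
    loss = 0.0
    for xj, rij in zip(neighbors, r):
        d = np.dot(u, np.abs(xi - xj))   # assumed weighted distance D_i(x_j)
        loss += np.log1p(np.exp(rij * (d - c_i)))
    return loss + lam * np.abs(u).sum()
```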
The feature selector G in step 201 selects the feature set to extract next by maximizing a linear score over the candidate sets:

G(x^l) = arg max_c w^T f(x^l, c)

where x^l denotes the features already extracted in the first l rounds, c denotes a candidate feature set for the next round (ĉ denoting the candidates not selected), f is a function on the features, and w is a linear coefficient vector;
the function f on the features is expressed as:

f(x^l, c) = x^l 1^T C   (4)

where 1^T is a 1 × m vector of all ones, m being the number of feature groups to extract; C is a diagonal matrix whose k-th diagonal element C_kk equals 1 when c = k and −1 otherwise.
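A small sketch of equation (4): the map replicates x^l into m columns and flips the sign of every column except the one for the chosen candidate c (the function name is illustrative):

```python
import numpy as np

def feature_map(x_l, c, m):
    """Equation (4): f(x^l, c) = x^l 1^T C.

    Tiles x^l into m columns, then scales column k by C_kk,
    where C is diagonal with C_kk = 1 iff k == c, else -1.
    """
    F = np.tile(x_l[:, None], (1, m))   # x^l 1^T : shape (d, m)
    signs = -np.ones(m)                 # diagonal of C
    signs[c] = 1.0
    return F * signs                    # multiply column k by C_kk
```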
The linear coefficient w is obtained by solving:

arg min_w ‖w‖_2^2 + α Σ_{i,l} ξ_i^l   s.t.   w^T f(x_i^l, c_{l+1}) > Δ(c_{l+1}, ĉ_{l+1}) + w^T f(x_i^l, ĉ_{l+1}) − ξ_i^l   (5)

where x_i^l indicates that the i-th sample has had l groups of features extracted, c_{l+1} is the feature set the i-th sample should extract at step l + 1, ĉ_{l+1} ranges over the other candidate feature sets at step l + 1, Δ is defined by Δ(c_i, c_i) = 0 and Δ(c_i, c_j) = 1 for i ≠ j, ξ_i^l is a slack variable, and α is a regularization parameter.
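Problem (5) is a margin-based program in the structured-SVM family. A plain projected-subgradient sketch under the simplifying assumption of one competing candidate per sample; the step size, iteration count, and helper name are arbitrary choices, not from the patent:

```python
import numpy as np

def train_selector_w(F_pos, F_neg, alpha=1.0, lr=0.01, epochs=100):
    """Subgradient sketch for problem (5).

    F_pos : (n, d) rows f(x_i^l, c_{l+1}) for the correct next set
    F_neg : (n, d) rows f(x_i^l, c-hat_{l+1}) for a competing candidate
    Encourages w^T F_pos[i] > 1 + w^T F_neg[i] - xi_i  (Delta = 1),
    with L2 regularization on w.
    """
    n, d = F_pos.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = (F_pos - F_neg) @ w
        viol = margins < 1.0                 # constraints still violated
        # Subgradient of ||w||^2 + alpha * sum of hinge slacks.
        grad = 2 * w + alpha * (F_neg[viol] - F_pos[viol]).sum(axis=0)
        w -= lr * grad
    return w
```

After training on separable pairs, the correct candidate scores above the competitor for every sample.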
The classifier C_s in step 201 is given by:

C_s(x^s) = arg max_{y ∈ Z} V^T f(x^s, y)   (6)

where x^s denotes the extracted features, y denotes a label of the sample, Z denotes the label space, i.e. the set of all labels, and f is a function on the features; V is solved from the following optimization:

arg min_V ‖V‖_2^2 + D Σ_i ε_i   s.t.   V^T f(x_i^s, y_i) > Δ(y_i, ŷ) + V^T f(x_i^s, ŷ) − ε_i   (7)

where x_i^s denotes the extracted features of the i-th sample, y_i denotes the label of the i-th sample, ŷ denotes any label other than y_i, Δ is defined by Δ(y_i, y_i) = 0 and Δ(y_i, ŷ) = 1 otherwise, ε_i is a slack variable, and D is a regularization parameter.
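Decision rule (6) scores every candidate label with the same linear form and returns the arg max. A minimal sketch; the feature map `f` is passed in, and all shapes are illustrative:

```python
import numpy as np

def classify(x_s, V, labels, f):
    """Equation (6): C_s(x^s) = argmax over y in Z of V^T f(x^s, y)."""
    scores = {y: float(V @ f(x_s, y)) for y in labels}
    return max(scores, key=scores.get)
```

With a block-structured f that places x^s in the slot of label y, the label whose block aligns with V wins.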
The evaluation index of step 301 includes a time-budget threshold for feature extraction and a classification-accuracy requirement of the classifier.
Advantageous effects: compared with the prior art, the method fully accounts for both the time overhead of sample feature extraction and the confidence of classification. Exploiting the properties of each sample, it extracts only the features most useful for classifying that kind of sample: for simple samples, only some basic features are extracted; for complex samples, more features are extracted. Because different feature sets contribute differently even for the same sample, the invention supplies the features most beneficial to classification, which helps improve classification accuracy.
Drawings
FIG. 1 is a flow chart of the operation of the training sample feature serialization phase of the present invention;
FIG. 2 is a flowchart illustrating the operation of the sample feature selector and corresponding model training phase of the present invention;
FIG. 3 is a workflow diagram of the model classification phase for a sample of the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are purely exemplary and do not limit the scope of the invention; various equivalent modifications that occur to those skilled in the art upon reading this disclosure fall within the scope of the appended claims.
The workflow of the training sample feature serialization phase is shown in fig. 1. This phase requires a certain amount of training data that is labeled and has all features available. In actual use, a company can label a batch of data and obtain all its features together with the time overhead of each feature (step 10); then compute the Euclidean distances between the training samples from these features (step 11); select the preset number of neighbors for each sample (step 12); next, compute the weight of each feature of each training sample (step 13); finally, sort the features by weight, larger weights first. This yields the serialized training samples.
The workflow of the sample feature selector and model training phase is shown in fig. 2. First, the training samples serialized in the previous phase are split to obtain pairs of (existing feature set, feature set to extract next) (step 15); then, from these feature-set pairs, a feature selector G is trained. Meanwhile, a classification model C_s is trained for each feature combination of the training samples (step 16).
The per-sample model classification workflow is shown in fig. 3. First, an initial feature set is extracted from the test sample (step 18); then, it is judged whether the existing features satisfy the stopping criterion, which may be a time budget for feature extraction or an accuracy attainable by the classifier (chosen according to the actual requirements) (step 19); if the stopping criterion is met, the matching model is selected directly for classification to obtain the result (step 20a); otherwise, the next feature to extract is chosen via the feature selector and the process returns to step 19 (step 20b).

Claims (6)

1. A method of extracting effective features based on sample properties, comprising: a training sample feature serialization step, a step of training a sample feature selector and corresponding models, and a per-sample model classification step;
the training sample feature serialization comprises the following specific steps:
step 100, labeling training sample data and acquiring all features together with the time overhead of each feature;
step 101, computing Euclidean distances between pairs of training samples from the acquired features;
step 102, finding each training sample's neighbor set from the pairwise distances and the preset number of neighbors;
step 103, computing, within each training sample's neighbor set, the weight of every feature, i.e. how useful each group of features is for classifying the sample;
step 104, sorting the features by weight: the larger the weight, the greater the feature's contribution to classification and the earlier it should be extracted;
the specific steps of training the sample feature selector and the corresponding models are as follows:
step 200, after the training data are serialized, splitting the data into pairs of (existing feature set, feature set to extract next) to obtain a group of feature-set pairs;
step 201, from the split feature-set pairs, training a feature selector G based on the currently extracted features, and training classifiers for the different feature combinations;
the specific steps of the per-sample model classification are as follows:
step 300, extracting an initial feature set from the test sample;
step 301, judging from the evaluation index whether a further feature set needs to be extracted; if so, jumping to step 302, otherwise jumping to step 303;
step 302, determining the feature set to extract next from the existing feature set and the feature selector G, merging the newly extracted feature set into the existing feature set, and jumping back to step 301;
step 303, looking up the trained classification model corresponding to the current feature set and classifying with it.
2. The method of extracting effective features based on sample properties of claim 1, wherein the neighbor set of a training sample in step 102 is found as follows: sort the computed Euclidean distances in ascending order and select the first k, where k is the preset number of neighbors.
3. The method of extracting effective features based on sample properties of claim 1, wherein the weight of a training sample's features in step 103 is computed by minimizing, over the sample's neighbors, a logistic loss on the weighted distances plus an L1 penalty:

arg min_{u_i} Σ_{j ∈ δ_i} log(1 + exp(r_{ij} (D_i(x_j) − c_i))) + λ‖u_i‖_1   s.t.  u_i ≥ 0   (1)

where x_i and x_j denote the i-th and j-th samples, D_i(x_j) is the weighted distance between x_i and x_j under the weight vector u_i, u_i holds the feature weights of the i-th sample, and δ_i is the set formed by the k neighbors of the i-th sample; y_i and y_j denote the labels of the i-th and j-th samples, with r_{ij} = 1 if y_i = y_j and r_{ij} = −1 otherwise; c_i and λ are preset parameters: c_i is the upper bound on the within-class sample distance and λ is a regularization parameter.
4. The method of extracting effective features based on sample properties of claim 1, wherein the feature selector G in step 201 selects the feature set to extract next by maximizing a linear score over the candidate sets:

G(x^l) = arg max_c w^T f(x^l, c)

where x^l denotes the features already extracted in the first l rounds, c denotes a candidate feature set for the next round (ĉ denoting the candidates not selected), f is a function on the features, and w is a linear coefficient vector;
the function f of the features is expressed as:
f(xl,c)=xl1TC (4)
1Tis a vector with the size of 1 × m and all elements of 1, wherein m is the number of groups for extracting features; c denotes a diagonal matrix, CkkDenotes the element on the k-th main diagonal line, when C is k, Ckk1, otherwise Ckk=-1。
The linear coefficient w is obtained by solving:

arg min_w ‖w‖_2^2 + α Σ_{i,l} ξ_i^l   s.t.   w^T f(x_i^l, c_{l+1}) > Δ(c_{l+1}, ĉ_{l+1}) + w^T f(x_i^l, ĉ_{l+1}) − ξ_i^l   (5)

where x_i^l indicates that the i-th sample has had l groups of features extracted, c_{l+1} is the feature set the i-th sample should extract at step l + 1, ĉ_{l+1} ranges over the other candidate feature sets at step l + 1, Δ is defined by Δ(c_i, c_i) = 0 and Δ(c_i, c_j) = 1 for i ≠ j, ξ_i^l is a slack variable, and α is a regularization parameter.
5. The method of extracting effective features based on sample properties of claim 1, wherein the classifier C_s in step 201 is given by:

C_s(x^s) = arg max_{y ∈ Z} V^T f(x^s, y)   (6)

where x^s denotes the extracted features, y denotes a label of the sample, Z denotes the label space, i.e. the set of all labels, and f is a function on the features; V is solved from the following optimization:

arg min_V ‖V‖_2^2 + D Σ_i ε_i   s.t.   V^T f(x_i^s, y_i) > Δ(y_i, ŷ) + V^T f(x_i^s, ŷ) − ε_i   (7)

where x_i^s denotes the extracted features of the i-th sample, y_i denotes the label of the i-th sample, ŷ denotes any label other than y_i, Δ is defined by Δ(y_i, y_i) = 0 and Δ(y_i, ŷ) = 1 otherwise, ε_i is a slack variable, and D is a regularization parameter.
6. The method of extracting effective features based on sample properties of claim 1, wherein the evaluation index of step 301 includes a time-budget threshold for feature extraction and a classification-accuracy requirement of the classifier.
CN201610202600.5A 2016-04-01 2016-04-01 Method of extracting effective features based on sample properties Pending CN105894032A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610202600.5A CN105894032A (en) 2016-04-01 2016-04-01 Method of extracting effective features based on sample properties

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610202600.5A CN105894032A (en) 2016-04-01 2016-04-01 Method of extracting effective features based on sample properties

Publications (1)

Publication Number Publication Date
CN105894032A true CN105894032A (en) 2016-08-24

Family

ID=57012118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610202600.5A Pending CN105894032A (en) 2016-04-01 2016-04-01 Method of extracting effective features based on sample properties

Country Status (1)

Country Link
CN (1) CN105894032A (en)



Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095902A (en) * 2014-05-23 2015-11-25 华为技术有限公司 Method and apparatus for extracting image features

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI-PING LIU et al.: "TEFE: A Time-Efficient Approach to Feature Extraction", in Proceedings of the 8th IEEE International Conference on Data Mining *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845537A (en) * 2017-01-09 2017-06-13 北京邮电大学 Classifier radius determination method and device based on adaptive threshold
CN106845537B (en) * 2017-01-09 2020-12-04 北京邮电大学 Classifier radius determination method and device based on self-adaptive threshold
CN107358209A (en) * 2017-07-17 2017-11-17 成都通甲优博科技有限责任公司 Training method and device for a face detection model, and face detection method and device
CN107545274A (en) * 2017-07-18 2018-01-05 北京建筑大学 Semi-supervised label ratio learning method
CN111314691A (en) * 2018-12-11 2020-06-19 中国移动通信集团广东有限公司 Video call quality assessment method and device
CN111314691B (en) * 2018-12-11 2022-09-16 中国移动通信集团广东有限公司 Video call quality assessment method and device


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160824